From owner-pgsql-hackers@hub.org Sun Jun 14 18:45:04 1998
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id SAA03690
	for <maillist@candle.pha.pa.us>; Sun, 14 Jun 1998 18:45:00 -0400 (EDT)
Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id SAA28049; Sun, 14 Jun 1998 18:39:42 -0400 (EDT)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 14 Jun 1998 18:36:06 +0000 (EDT)
Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id SAA27943 for pgsql-hackers-outgoing; Sun, 14 Jun 1998 18:36:04 -0400 (EDT)
Received: from angular.illustra.com (ifmxoak.illustra.com [206.175.10.34]) by hub.org (8.8.8/8.7.5) with ESMTP id SAA27925 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 18:35:47 -0400 (EDT)
Received: from hawk.illustra.com (hawk.illustra.com [158.58.61.70]) by angular.illustra.com (8.7.4/8.7.3) with SMTP id PAA21293 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 15:35:12 -0700 (PDT)
Received: by hawk.illustra.com (5.x/smail2.5/06-10-94/S)
	id AA07922; Sun, 14 Jun 1998 15:35:13 -0700
From: dg@illustra.com (David Gould)
Message-Id: <9806142235.AA07922@hawk.illustra.com>
Subject: [HACKERS] performance tests, initial results
To: pgsql-hackers@postgreSQL.org
Date: Sun, 14 Jun 1998 15:35:13 -0700 (PDT)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@hub.org
Precedence: bulk
Status: RO

I have been playing a little with the performance tests found in
pgsql/src/test/performance and have a few observations that might be of
minor interest.

The tests themselves are simple enough, although the result parsing in the
driver did not work on Linux. I am enclosing a patch below to fix this. I
think it will also work better on the other systems.

A summary of the results from my testing is below. Details are at the bottom
of this message.

My test system is 'leslie':

 Linux 2.0.32, gcc version 2.7.2.3
 P133, HX chipset, 512K L2, 32MB mem
 NCR810 fast SCSI, Quantum Atlas 2GB drive (7200 rpm).


                     Results Summary (times in seconds)

                    Single txn 8K txn    Create 8K idx 8K random Simple
Case Description    8K insert  8K insert Index  Insert Scans     Orderby
=================== ========== ========= ====== ====== ========= =======
1 From Distribution
  P90 FreeBSD -B256      39.56   1190.98   3.69  46.65     65.49    2.27
  IDE

2 Running on leslie
  P133 Linux 2.0.32      15.48    326.75   2.99  20.69     35.81    1.68
  SCSI 32M

3 leslie, -o -F
  no forced writes       15.90     24.98   2.63  20.46     36.43    1.69

4 leslie, -o -F
  no ASSERTS             14.92     23.23   1.38  18.67     33.79    1.58

5 leslie, -o -F -B2048
  more buffers           21.31     42.28   2.65  25.74     42.26    1.72

6 leslie, -o -F -B2048
  more bufs, no ASSERT   20.52     39.79   1.40  24.77     39.51    1.55


                 Case to Case Difference Factors (+ is faster)

                    Single txn 8K txn    Create 8K idx 8K random Simple
Case Description    8K insert  8K insert Index  Insert Scans     Orderby
=================== ========== ========= ====== ====== ========= =======

leslie vs BSD P90.        2.56      3.65   1.23   2.25      1.83    1.35

(noflush -F) vs no -F    -1.03     13.08   1.14   1.01     -1.02    1.00

No Assert vs Assert       1.05      1.07   1.90   1.06      1.07    1.09

-B256 vs -B2048           1.34      1.69   1.01   1.26      1.16    1.02
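
For anyone who wants to recompute the difference factors from the summary
times, here is a rough Python sketch. The sign convention (a negative value
means the second case was slower) is inferred from the table rather than
stated anywhere, and the dictionary keys and function names are made up for
this illustration. Recomputed values can differ from the table in the last
digit because of rounding.

```python
# Times in seconds copied from the results summary above.
# Keys are hypothetical labels for cases 1, 2 and 3.
TIMES = {
    "bsd_p90":  [39.56, 1190.98, 3.69, 46.65, 65.49, 2.27],
    "leslie":   [15.48, 326.75, 2.99, 20.69, 35.81, 1.68],
    "leslie_F": [15.90, 24.98, 2.63, 20.46, 36.43, 1.69],
}


def factor(baseline, candidate):
    """Signed speed factor of candidate relative to baseline.

    Positive: candidate is faster by that factor.
    Negative: candidate is slower by that factor (assumed convention).
    """
    if candidate <= baseline:
        return round(baseline / candidate, 2)
    return -round(candidate / baseline, 2)


def compare(base_case, new_case):
    """Per-test factors for two rows of TIMES, in column order."""
    return [factor(b, n) for b, n in zip(TIMES[base_case], TIMES[new_case])]
```

Calling compare("leslie", "leslie_F") shows the 8K txn insert column
jumping to about 13, while the other columns stay close to 1.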


Observations:

 - leslie (P133 Linux) appears to be about 1.8 times faster than the
   P90 BSD system used for the test results distributed with the source, not
   counting the 8K txn insert case, which was completely disk bound.

 - SCSI disks make a big (factor of 3.6) difference. During this test the
   disk was hammering and CPU utilization was < 10%.

 - Assertion checking seems to cost about 7%, except for create index, where
   it costs 90%.

 - The -F option to avoid flushing buffers has a tremendous effect if there are
   many very small transactions. Put another way, flushing at the end of the
   transaction is a major disaster for performance.

 - Something is very wrong with our buffer cache implementation. Going from
   256 buffers to 2048 buffers costs an average of 25%. In the 8K txn case
   it costs about 70%. From reading the code and profiling, I see that in the
   8K txn case the time goes to BufferSync(), which examines all the buffers
   at commit time. I don't quite understand why it is so costly for the
   single 8K row txn (35%) though.
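
To make the BufferSync() observation concrete, here is a toy model in
Python. It is not the backend code; it only assumes, as described above,
that every commit walks the whole buffer pool looking for dirty buffers.
The function name and the one-dirty-buffer-per-transaction workload are
invented for the sketch.

```python
def commit_scan_cost(nbuffers, ncommits, dirty_per_commit=1):
    """Count buffer headers examined when every commit scans the whole pool.

    Returns (examined, flushed). This is a model of the behavior described
    in the message, not the real BufferSync() implementation.
    """
    examined = 0
    flushed = 0
    pool_dirty = [False] * nbuffers
    for txn in range(ncommits):
        # hypothetical workload: each transaction dirties a few buffers
        for k in range(dirty_per_commit):
            pool_dirty[(txn + k) % nbuffers] = True
        # commit: look at every buffer and write out the dirty ones
        for slot in range(nbuffers):
            examined += 1
            if pool_dirty[slot]:
                pool_dirty[slot] = False
                flushed += 1
    return examined, flushed
```

In this model the number of buffers flushed is the same for both pool
sizes, but 8192 commits examine 8192 * 256 = 2,097,152 headers with -B256
and 8192 * 2048 = 16,777,216 with -B2048, eight times as many. That would
at least point in the same direction as the 8K txn slowdown.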

It would be nice to have some more tests. Maybe the Wisconsin stuff will
be useful.

----------------- patch to test harness. apply from pgsql ------------
*** src/test/performance/runtests.pl.orig	Sun Jun 14 11:34:04 1998
--- src/test/performance/runtests.pl	Sun Jun 14 12:07:30 1998
***************
*** 84,123 ****
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;
  
! for ($i = 0; $i <= $#perftests; $i++)
! {
  	$test = $perftests[$i];
  	($test, $XACTBLOCK) = split (/ /, $test);
  	$runtest = $test;
! 	if ( $test =~ /\.ntm/ )
! 	{
! 		# 
  		# No timing for this queries
- 		# 
  		close (STDERR);		# close $TmpFile
  		open (STDERR, ">/dev/null") or die;
  		$runtest =~ s/\.ntm//;
  	}
! 	else
! 	{
  		close (STDOUT);
  		open(STDOUT, ">&SAVEOUT");
  		print STDOUT "\nRunning: $perftests[$i+1] ...";
  		close (STDOUT);
  		open (STDOUT, ">/dev/null") or die;
  		select (STDERR); $| = 1;
! 		printf "$perftests[$i+1]: ";
  	}
  
  	do "sqls/$runtest";
  
  	# Restore STDERR to $TmpFile
! 	if ( $test =~ /\.ntm/ )
! 	{
  		close (STDERR);
  		open (STDERR, ">>$TmpFile") or die;
  	}
- 
  	select (STDERR); $| = 1;
  	$i++;
  }
--- 84,116 ----
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;
  
! for ($i = 0; $i <= $#perftests; $i++) {
  	$test = $perftests[$i];
  	($test, $XACTBLOCK) = split (/ /, $test);
  	$runtest = $test;
! 	if ( $test =~ /\.ntm/ ) {
  		# No timing for this queries
  		close (STDERR);		# close $TmpFile
  		open (STDERR, ">/dev/null") or die;
  		$runtest =~ s/\.ntm//;
  	}
! 	else {
  		close (STDOUT);
  		open(STDOUT, ">&SAVEOUT");
  		print STDOUT "\nRunning: $perftests[$i+1] ...";
  		close (STDOUT);
  		open (STDOUT, ">/dev/null") or die;
  		select (STDERR); $| = 1;
! 		print "$perftests[$i+1]: ";
  	}
  
  	do "sqls/$runtest";
  
  	# Restore STDERR to $TmpFile
! 	if ( $test =~ /\.ntm/ ) {
  		close (STDERR);
  		open (STDERR, ">>$TmpFile") or die;
  	}
  	select (STDERR); $| = 1;
  	$i++;
  }
***************
*** 128,138 ****
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;
  
! while (<TMPF>)
! {
! 	$str = $_;
! 	($test, $rtime) = split (/:/, $str);
! 	($tmp, $rtime, $rest) = split (/[ 	]+/, $rtime);
! 	print RESF "$test: $rtime\n";
  }
  
--- 121,130 ----
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;
  
! while (<TMPF>) {
!         if (m/^(.*: ).* ([0-9:.]+) *elapsed/) {
! 	    ($test, $rtime) = ($1, $2);
! 	     print RESF $test, $rtime, "\n";
!         }
  }

------------------------------------------------------------------------
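
As a quick way to see what the new parsing loop in the patch accepts, here
is the same regular expression exercised from Python. The pattern is copied
from the patch; the function name is invented, and the sample line in the
note below is an assumption about what the temp file might contain (a test
label followed by GNU time style output), not captured output from the
harness.

```python
import re

# Same pattern as the patched runtests.pl loop, compiled with Python's re.
ELAPSED = re.compile(r"^(.*: ).* ([0-9:.]+) *elapsed")


def parse_timing(line):
    """Return (test label, elapsed time string), or None if no timing found.

    Lines that carry no 'elapsed' field are skipped, which is what the
    'if (m/.../)' guard in the patch does.
    """
    m = ELAPSED.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

With a hypothetical line such as
"8192 INSERTs INTO SIMPLE (1 xact): 0.03user 0.01system 0:15.48elapsed 0%CPU"
the label comes back as everything up to and including the ": " and the
time comes back as "0:15.48". Unlike the old split on ":" this does not
depend on where the elapsed field sits in the line.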

------------------------- testcase detail --------------------------

1. from distribution
   DBMS:		PostgreSQL 6.2b10
   OS:		FreeBSD 2.1.5-RELEASE
   HardWare:	i586/90, 24M RAM, IDE
   StartUp:	postmaster -B 256 '-o -S 2048' -S
   Compiler:	gcc 2.6.3
   Compiled:	-O, without CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.20
   8192 INSERTs INTO SIMPLE (1 xact): 39.58
   8192 INSERTs INTO SIMPLE (8192 xacts): 1190.98
   Create INDEX on SIMPLE: 3.69
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 46.65
   8192 random INDEX scans on SIMPLE (1 xact): 65.49
   ORDER BY SIMPLE: 2.27


2. run on leslie with asserts
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.48
   8192 INSERTs INTO SIMPLE (8192 xacts): 326.75
   Create INDEX on SIMPLE: 2.99
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.69
   8192 random INDEX scans on SIMPLE (1 xact): 35.81
   ORDER BY SIMPLE: 1.68


3. with -F to avoid forced I/O
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.90
   8192 INSERTs INTO SIMPLE (8192 xacts): 24.98
   Create INDEX on SIMPLE: 2.63
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.46
   8192 random INDEX scans on SIMPLE (1 xact): 36.43
   ORDER BY SIMPLE: 1.69


4. no asserts, -F to avoid forced I/O
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, No CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 14.92
   8192 INSERTs INTO SIMPLE (8192 xacts): 23.23
   Create INDEX on SIMPLE: 1.38
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 18.67
   8192 random INDEX scans on SIMPLE (1 xact): 33.79
   ORDER BY SIMPLE: 1.58


5. with more buffers (2048 vs 256) and -F to avoid forced I/O
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 21.31
   8192 INSERTs INTO SIMPLE (8192 xacts): 42.28
   Create INDEX on SIMPLE: 2.65
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 25.74
   8192 random INDEX scans on SIMPLE (1 xact): 42.26
   ORDER BY SIMPLE: 1.72


6. No Asserts, more buffers (2048 vs 256) and -F to avoid forced I/O
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, No CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 20.52
   8192 INSERTs INTO SIMPLE (8192 xacts): 39.79
   Create INDEX on SIMPLE: 1.40
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 24.77
   8192 random INDEX scans on SIMPLE (1 xact): 39.51
   ORDER BY SIMPLE: 1.55
---------------------------------------------------------------------

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468 
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken


From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
	Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 10:11:55 -0400
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id KAA30030
	for pgsql-hackers-outgoing; Tue, 19 Oct 1999 10:11:00 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by hub.org (8.9.3/8.9.3) with ESMTP id KAA29914
	for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 10:10:33 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id KAA09038;
	Tue, 19 Oct 1999 10:09:15 -0400 (EDT)
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
cc: "Vadim Mikheev" <vadim@krs.ru>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations 
In-reply-to: Your message of Tue, 19 Oct 1999 19:03:22 +0900 
             <000801bf1a19$2d88ae20$2801007e@cadzone.tpf.co.jp> 
Date: Tue, 19 Oct 1999 10:09:15 -0400
Message-ID: <9036.940342155@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 1. shared cache holds committed system tuples.
> 2. private cache holds uncommitted system tuples.
> 3. relpages of shared cache are updated immediately by
>     physical change and corresponding buffer pages are
>     marked dirty.
> 4. on commit, the contents of uncommitted tuples except
>    relpages, reltuples,... are copied to corresponding tuples
>    in shared cache and the combined contents are
>    committed.
> If so, catalog cache invalidation would be no longer needed.
> But synchronization of step 4 may be difficult.

I think the main problem is that relpages and reltuples shouldn't
be kept in pg_class columns at all, because they need to have
very different update behavior from the other pg_class columns.

The rest of pg_class is update-on-commit, and we can lock down any one
row in the normal MVCC way (if transaction A has modified a row and
transaction B also wants to modify it, B waits for A to commit or abort,
so it can know which version of the row to start from).  Furthermore,
there can legitimately be several different values of a row in use in
different places: the latest committed, an uncommitted modification, and
one or more old values that are still being used by active transactions
because they were current when those transactions started.  (BTW, the
present relcache is pretty bad about maintaining pure MVCC transaction
semantics like this, but it seems clear to me that that's the direction
we want to go in.)

relpages cannot operate this way.  To be useful for avoiding lseeks,
relpages *must* change exactly when the physical file changes.  It
matters not at all whether the particular transaction that extended the
file ultimately commits or not.  Moreover there can be only one correct
value (per relation) across the whole system, because there is only one
length of the relation file.
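
A minimal sketch of that relpages behavior, in Python rather than backend
C: a block counter that is bumped at the moment the file is physically
extended, so later callers can skip the lseek. The RelFile class, the demo
helper and the 8192-byte BLCKSZ are assumptions made for the illustration;
nothing here reflects how md.c is actually organized.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed page size for this sketch


class RelFile:
    """Toy relation file that tracks its own block count."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.nblocks = self.nblocks_by_lseek()  # one lseek at open time

    def nblocks_by_lseek(self):
        # the per-call cost the thread wants to avoid: ask the kernel
        return os.lseek(self.fd, 0, os.SEEK_END) // BLCKSZ

    def extend(self):
        # the counter changes exactly when the file grows, whether or
        # not the calling transaction later commits
        os.lseek(self.fd, 0, os.SEEK_END)
        os.write(self.fd, b"\0" * BLCKSZ)
        self.nblocks += 1

    def close(self):
        os.close(self.fd)


def demo():
    """Extend a scratch file three times; return (cached, measured) counts."""
    with tempfile.TemporaryDirectory() as d:
        rel = RelFile(os.path.join(d, "rel"))
        for _ in range(3):
            rel.extend()
        result = (rel.nblocks, rel.nblocks_by_lseek())
        rel.close()
        return result
```

The point of the sketch is only that the cached count and the measured
length agree without any transactional bookkeeping. A real version would
have to keep the single counter in shared memory so every backend sees it.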

If we want to take reltuples seriously and try to maintain it
on-the-fly, then I think it needs still a third behavior.  Clearly
it cannot be updated using MVCC rules, or we lose all writer
concurrency (if A has added tuples to a rel, B would have to wait
for A to commit before it could update reltuples...).  Furthermore
"updating" isn't a simple matter of storing what you think the new
value is; otherwise two transactions adding tuples in parallel would
leave the wrong answer after B commits and overwrites A's value.
I think it would work for each transaction to keep track of a net delta
in reltuples for each table it's changed (total tuples added less total
tuples deleted), and then atomically add that value to the table's
shared reltuples counter during commit.  But that still leaves the
problem of how you use the counter during a transaction to get an
accurate answer to the question "If I scan this table now, how many tuples
will I see?"  At the time the question is asked, the current shared
counter value might include the effects of transactions that have
committed since your transaction started, and therefore are not visible
under MVCC rules.  I think getting the correct answer would involve
making an instantaneous copy of the current counter at the start of
your xact, and then adding your own private net-uncommitted-delta to
the saved shared counter value when asked the question.  This doesn't
look real practical --- you'd have to save the reltuples counts of
*all* tables in the database at the start of each xact, on the off
chance that you might need them.  Ugh.  Perhaps someone has a better
idea.  In any case, reltuples clearly needs different mechanisms than
the ordinary fields in pg_class do, because updating it will be a
performance bottleneck otherwise.
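
The net-delta scheme described above can be sketched for a single table as
follows. This is an illustration in Python, not proposed backend code: the
class names are hypothetical, a lock stands in for whatever atomic add the
backend would use, and the snapshot covers only one table, which sidesteps
the "save the counts of *all* tables" problem raised in the paragraph.

```python
import threading


class SharedRelTuples:
    """Shared per-table tuple counter; committed deltas are added atomically."""

    def __init__(self, initial=0):
        self._lock = threading.Lock()
        self._count = initial

    def read(self):
        with self._lock:
            return self._count

    def add(self, delta):
        with self._lock:
            self._count += delta


class Xact:
    """Hypothetical transaction tracking its own net tuple delta."""

    def __init__(self, shared):
        self.shared = shared
        self.snapshot = shared.read()  # copy taken at transaction start
        self.delta = 0                 # tuples added less tuples deleted

    def insert(self, n=1):
        self.delta += n

    def delete(self, n=1):
        self.delta -= n

    def visible_count(self):
        # start-of-xact value plus our own uncommitted changes
        return self.snapshot + self.delta

    def commit(self):
        # add the delta rather than store a value, so parallel
        # writers cannot overwrite each other
        self.shared.add(self.delta)

    def abort(self):
        self.delta = 0
```

If A and B both start against a counter of 100, A inserts 10 and commits,
and B inserts 5 and deletes 2, then B still computes 103 for its own view
while the shared counter reads 110, and 113 once B commits.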

If we allow reltuples to be updated only by vacuum-like events, as
it is now, then I think keeping it in pg_class is still OK.

In short, it seems clear to me that relpages should be removed from
pg_class and kept somewhere else if we want to make it more reliable
than it is now, and the same for reltuples (but reltuples doesn't
behave the same as relpages, and probably ought to be handled
differently).

			regards, tom lane

************

From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
	Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 21:07:01 -0400
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id VAA50644
	for pgsql-hackers-outgoing; Tue, 19 Oct 1999 21:06:06 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by hub.org (8.9.3/8.9.3) with ESMTP id VAA50584
	for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 21:05:26 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id KAA01715; Wed, 20 Oct 1999 10:05:14 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations 
Date: Wed, 20 Oct 1999 10:09:13 +0900
Message-ID: <000501bf1a97$b925a860$2801007e@cadzone.tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4
Importance: Normal
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

> -----Original Message-----
> From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp]
> Sent: Tuesday, October 19, 1999 6:45 PM
> To: Tom Lane
> Cc: pgsql-hackers@postgreSQL.org
> Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge
> relations 
> 
> 
> > 
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 
> [snip]
>  
> > 
> > > Deletion is necessary only not to consume disk space.
> > >
> > > For example vacuum could remove not deleted files.
> > 
> > Hmm ... interesting idea ... but I can hear the complaints
> > from users already...
> >
> 
> My idea is only an analogy of PostgreSQL's simple recovery
> mechanism of tuples.
> 
> And my main point is
> 	"delete fails after commit" doesn't harm the database
> 	except that not deleted files consume disk space.
> 
> Of course, it's preferable to delete relation files immediately
> after (or just when) commit.
> Useless files are visible though useless tuples are invisible.
>

Anyway I don't need "DROP TABLE inside transactions" now
and my idea is originally for that issue.

After some thought, I propose the following solution.

1. mdcreate() couldn't create existing relation files.
    If the existing file is of length zero, we would overwrite
    the file (it seems the comment in md.c says so, but the
    code doesn't do so).
    If the file is an index relation file, we would overwrite
    the file.

2. mdunlink() couldn't unlink non-existent relation files.
    mdunlink() doesn't call elog(ERROR) even if the file
    doesn't exist, though I couldn't find where to change
    now.
    mdopen() doesn't call elog(ERROR) even if the file
    doesn't exist and leaves the relation as CLOSED.
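
One possible reading of the two proposals above, sketched in Python with
ordinary file calls: creation refuses an existing file unless it is empty
or belongs to an index, and unlinking a missing file is reported rather
than treated as an error. The function names and the is_index flag are
invented for the sketch, and the interpretation itself is an assumption;
the real mdcreate() and mdunlink() take different arguments.

```python
import os


def create_relation_file(path, is_index=False):
    """Create a relation file, reusing a leftover only if empty or an index.

    Returns an open file descriptor. Raises FileExistsError when a
    non-empty, non-index file is already present.
    """
    try:
        return os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    except FileExistsError:
        if not is_index and os.path.getsize(path) > 0:
            raise  # non-empty heap file already exists: refuse
        # zero-length leftover or index file: overwrite it
        return os.open(path, os.O_RDWR | os.O_TRUNC)


def unlink_relation_file(path):
    """Remove a relation file; a missing file is not an error.

    Returns True if a file was removed, False if there was none.
    """
    try:
        os.unlink(path)
        return True
    except FileNotFoundError:
        return False
```

Under this reading a failed delete after commit leaves only a stray file
behind, and a later create or drop of the same relation can still proceed.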

Comments?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp

************

From pgsql-hackers-owner+M6267@hub.org Sun Aug 27 21:46:37 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA07972
	for <pgman@candle.pha.pa.us>; Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7S0kaL27996;
	Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7S05aL24107
	for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id UAA01604
	for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:29 -0400 (EDT)
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Possible performance improvement: buffer replacement policy
Date: Sun, 27 Aug 2000 20:05:29 -0400
Message-ID: <1601.967421129@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: RO

Those of you with long memories may recall a benchmark that Edmund Mergl
drew our attention to back in May '99.  That test showed extremely slow
performance for updating a table with many indexes (about 20).  At the
time, it seemed the problem was due to bad performance of btree with
many equal keys, so I thought I'd go back and retry the benchmark after
this latest round of btree hackery.

The good news is that btree itself seems to be pretty well fixed; the
bad news is that the benchmark is still slow for large numbers of rows.
The problem is I/O: the CPU mostly sits idle waiting for the disk.
As best I can tell, the difficulty is that the working set of pages
needed to update this many indexes is too large compared to the number
of disk buffers Postgres is using.  (I was running with -B 1000 and
looking at behavior for a 100000-row test table.  This gave me a table
size of 3876 pages, plus 11526 pages in 20 indexes.)
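
To put those page counts side by side, here is a small worked calculation
in Python. The page counts and -B value come from the paragraph above; the
8192-byte page size is an assumption (the message does not state it), and
the variable names are made up for the example.

```python
BLCKSZ = 8192        # assumed page size; not stated in the message

table_pages = 3876   # heap pages for the 100000-row table
index_pages = 11526  # pages across the 20 indexes
nbuffers = 1000      # from -B 1000

working_set_pages = table_pages + index_pages

# share of the table-plus-index pages that fits in shared buffers at once
cache_fraction = nbuffers / working_set_pages


def mib(pages):
    """Convert a page count to MiB under the assumed page size."""
    return pages * BLCKSZ / (1024 * 1024)
```

That is 15402 pages in total against 1000 buffers, so the buffer pool can
hold roughly 6.5% of the pages the UPDATE touches at any one time.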
 | 
						||
 | 
						||
Of course, there's only so much we can do when the number of buffers
 | 
						||
is too small, but I still started to wonder if we are using the buffers
 | 
						||
as effectively as we can.  Some tracing showed that most of the pages
 | 
						||
of the indexes were being read and written multiple times within a
 | 
						||
single UPDATE query, while most of the pages of the table proper were
 | 
						||
fetched and written only once.  That says we're not using the buffers
 | 
						||
as well as we could; the index pages are not being kept in memory when
 | 
						||
they should be.  In a query like this, we should displace main-table
 | 
						||
pages sooner to allow keeping more index pages in cache --- but with
 | 
						||
the simple LRU replacement method we use, once a page has been loaded
 | 
						||
it will stay in cache for at least the next NBuffers (-B) page
 | 
						||
references, no matter what.  With a large NBuffers that's a long time.
 | 
						||
 | 
						||
I've come across an interesting article:
 | 
						||
	The LRU-K Page Replacement Algorithm For Database Disk Buffering
 | 
						||
	Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum
 | 
						||
	Proceedings of the 1993 ACM SIGMOD international conference
 | 
						||
	on Management of Data, May 1993
 | 
						||
(If you subscribe to the ACM digital library, you can get a PDF of this
 | 
						||
from there.)  This article argues that standard LRU buffer management is
 | 
						||
inherently not great for database caches, and that it's much better to
 | 
						||
replace pages on the basis of time since the K'th most recent reference,
 | 
						||
not just time since the most recent one.  K=2 is enough to get most of
 | 
						||
the benefit.  The big win is that you are measuring an actual page
 | 
						||
interreference time (between the last two references) and not just
 | 
						||
dealing with a lower-bound guess on the interreference time.  Frequently
 | 
						||
used pages are thus much more likely to stay in cache.
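As a rough illustration of the difference, the following sketch replays a
synthetic page-reference trace through plain LRU and through a simplified
LRU-2.  Everything here is an assumption made for illustration: the function
names are made up, the trace is artificial, reference history is kept forever
for every page, and the paper's refinements (such as the correlated reference
period and a bounded history retention period) are left out.  It is not
PostgreSQL's buffer manager code.

```python
from collections import OrderedDict

NEVER = float("-inf")  # stands for "no such earlier reference"


def simulate_lru(refs, nbuffers):
    """Count cache hits under plain LRU replacement."""
    cache = OrderedDict()  # least recently used page first
    hits = 0
    for page in refs:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            if len(cache) >= nbuffers:
                cache.popitem(last=False)  # evict least recently used
            cache[page] = None
    return hits


def simulate_lru2(refs, nbuffers):
    """Count cache hits under a simplified LRU-2 replacement."""
    history = {}      # page -> (last ref time, ref time before that)
    resident = set()  # pages currently in the buffer pool
    hits = 0
    for now, page in enumerate(refs):
        last, _ = history.get(page, (NEVER, NEVER))
        history[page] = (now, last)
        if page in resident:
            hits += 1
            continue
        if len(resident) >= nbuffers:
            # Victim: oldest second-most-recent reference.  Pages seen only
            # once sort first (NEVER); ties are broken by plain LRU order.
            victim = min(resident, key=lambda p: (history[p][1], history[p][0]))
            resident.remove(victim)
        resident.add(page)
    return hits


def update_like_trace(n_table_pages, n_index_pages):
    """Each table page is touched once; index pages are revisited in rotation."""
    refs = []
    for t in range(n_table_pages):
        refs.append(("table", t))
        refs.append(("index", t % n_index_pages))
    return refs


if __name__ == "__main__":
    trace = update_like_trace(100, 5)
    print("LRU hits:  ", simulate_lru(trace, 6))
    print("LRU-2 hits:", simulate_lru2(trace, 6))
```

On this artificial trace with six buffers, plain LRU never gets a hit, because
nine other pages are touched between two visits to the same index page.  The
LRU-2 variant soon settles into keeping all five index pages resident and
evicting the once-referenced table pages instead.  Real workloads are far
messier, so this shows only the mechanism, not an expected speedup.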
 | 
						||
 | 
						||
It looks like it wouldn't take too much work to replace shared buffers
 | 
						||
on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.
 | 
						||
 | 
						||
Has anyone looked into this area?  Is there a better method to try?
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
From prlw1@newn.cam.ac.uk Fri Jan 19 12:54:45 2001
 | 
						||
Received: from henry.newn.cam.ac.uk (henry.newn.cam.ac.uk [131.111.204.130])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA29822
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 12:54:44 -0500 (EST)
 | 
						||
Received: from [131.111.204.180] (helo=quartz.newn.cam.ac.uk)
 | 
						||
	by henry.newn.cam.ac.uk with esmtp (Exim 3.13 #1)
 | 
						||
	id 14JfkU-0001WA-00; Fri, 19 Jan 2001 17:54:54 +0000
 | 
						||
Received: from prlw1 by quartz.newn.cam.ac.uk with local (Exim 3.13 #1)
 | 
						||
	id 14Jfj6-0001cq-00; Fri, 19 Jan 2001 17:53:28 +0000
 | 
						||
Date: Fri, 19 Jan 2001 17:53:28 +0000
 | 
						||
From: Patrick Welche <prlw1@newn.cam.ac.uk>
 | 
						||
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgreSQL.org
 | 
						||
Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy
 | 
						||
Message-ID: <20010119175328.A6223@quartz.newn.cam.ac.uk>
 | 
						||
Reply-To: prlw1@cam.ac.uk
 | 
						||
References: <1601.967421129@sss.pgh.pa.us> <200101191703.MAA25873@candle.pha.pa.us>
 | 
						||
Mime-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Disposition: inline
 | 
						||
User-Agent: Mutt/1.2i
 | 
						||
In-Reply-To: <200101191703.MAA25873@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Fri, Jan 19, 2001 at 12:03:58PM -0500
 | 
						||
Status: RO
 | 
						||
 | 
						||
On Fri, Jan 19, 2001 at 12:03:58PM -0500, Bruce Momjian wrote:
 | 
						||
> 
 | 
						||
> Tom, did we ever test this?  I think we did and found that it was the
 | 
						||
> same or worse, right?
 | 
						||
 | 
						||
(Funnily enough, I just read that message:)
 | 
						||
 | 
						||
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
cc: pgsql-hackers@postgreSQL.org
 | 
						||
Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy 
 | 
						||
In-reply-to: <200010161541.LAA06653@candle.pha.pa.us> 
 | 
						||
References: <200010161541.LAA06653@candle.pha.pa.us>
 | 
						||
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
	message dated "Mon, 16 Oct 2000 11:41:41 -0400"
 | 
						||
Date: Mon, 16 Oct 2000 11:49:52 -0400
 | 
						||
Message-ID: <26100.971711392@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Mailing-List: pgsql-hackers@postgresql.org
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@hub.org
 | 
						||
Status: RO
 | 
						||
Content-Length: 947
 | 
						||
Lines: 19
 | 
						||
 | 
						||
Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						||
>> It looks like it wouldn't take too much work to replace shared buffers
 | 
						||
>> on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.
 | 
						||
>> 
 | 
						||
>> Has anyone looked into this area?  Is there a better method to try?
 | 
						||
 | 
						||
> Sounds like a perfect idea.  Good luck.  :-)
 | 
						||
 | 
						||
Actually, the idea went down in flames :-(, but I neglected to report
 | 
						||
back to pghackers about it.  I did do some code to manage buffers as
 | 
						||
LRU-2.  I didn't have any good performance test cases to try it with,
 | 
						||
but Richard Brosnahan was kind enough to re-run the TPC tests previously
 | 
						||
published by Great Bridge with that code in place.  Wasn't any faster,
 | 
						||
in fact possibly a little slower, likely due to the extra CPU time spent
 | 
						||
on buffer freelist management.  It's possible that other scenarios might
 | 
						||
show a better result, but right now I feel pretty discouraged about the
 | 
						||
LRU-2 idea and am not pursuing it.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M3455@postgresql.org Fri Jan 19 13:18:12 2001
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA02092
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 13:18:12 -0500 (EST)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0JIFJ037872;
 | 
						||
	Fri, 19 Jan 2001 13:15:19 -0500 (EST)
 | 
						||
	(envelope-from pgsql-hackers-owner+M3455@postgresql.org)
 | 
						||
Received: from sectorbase2.sectorbase.com ([208.48.122.131])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0JI7V036780
 | 
						||
	for <pgsql-hackers@postgreSQL.org>; Fri, 19 Jan 2001 13:07:31 -0500 (EST)
 | 
						||
	(envelope-from vmikheev@SECTORBASE.COM)
 | 
						||
Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
 | 
						||
	id <DG1W4LRZ>; Fri, 19 Jan 2001 09:46:14 -0800
 | 
						||
Message-ID: <8F4C99C66D04D4118F580090272A7A234D329F@sectorbase1.sectorbase.com>
 | 
						||
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
 | 
						||
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Cc: pgsql-hackers@postgresql.org
 | 
						||
Subject: RE: [HACKERS] Possible performance improvement: buffer replacemen
 | 
						||
	t policy 
 | 
						||
Date: Fri, 19 Jan 2001 10:07:27 -0800
 | 
						||
MIME-Version: 1.0
 | 
						||
X-Mailer: Internet Mail Service (5.5.2653.19)
 | 
						||
Content-Type: text/plain;
 | 
						||
	charset="iso-8859-1"
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: RO
 | 
						||
 | 
						||
> > Tom, did we ever test this?  I think we did and found that 
 | 
						||
> > it was the same or worse, right?
 | 
						||
> 
 | 
						||
> I tried it and didn't see any noticeable improvement on the particular
 | 
						||
> test case I was using, so I got discouraged and didn't pursue the idea
 | 
						||
> further.  I'd like to come back to it someday, though.
 | 
						||
 | 
						||
I don't know how useful LRU-2 could be, but with WAL we should try
 | 
						||
to reuse undirty free buffers first, not dirty ones, just to postpone
 | 
						||
writes as long as we can. (BTW, this is what Oracle does.)
 | 
						||
So, we probably should put new unfree dirty buffer just before first
 | 
						||
dirty one in LRU.
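One way to read the suggestion above is as a change to victim selection:
walk the LRU list from the cold end and take the first clean buffer, falling
back to the ordinary LRU victim only when every buffer is dirty.  The sketch
below is a hedged illustration of that reading.  The OrderedDict model of the
LRU list and the name pick_victim are assumptions made here, not anything
from the PostgreSQL sources.

```python
from collections import OrderedDict


def pick_victim(lru):
    """Return the page to evict from an LRU list, preferring clean pages.

    `lru` maps page id -> dirty flag, ordered least recently used first.
    Returns None if the list is empty.
    """
    for page, dirty in lru.items():
        if not dirty:
            return page  # oldest clean page: reusable without a write
    # Everything is dirty: fall back to the plain LRU choice (needs a write).
    return next(iter(lru), None)


if __name__ == "__main__":
    buffers = OrderedDict([("A", True), ("B", False), ("C", True), ("D", False)])
    print("evict:", pick_victim(buffers))
```

The trade-off is that a dirty page can now outlive a more recently used clean
page, which postpones the write at the cost of departing from strict LRU order.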
 | 
						||
 | 
						||
Vadim
 | 
						||
 | 
						||
From markw@mohawksoft.com Thu Jun  7 14:40:02 2001
 | 
						||
Return-path: <markw@mohawksoft.com>
 | 
						||
Received: from gromit.dotclick.com (ipn9-f8366.net-resource.net [216.204.83.66])
 | 
						||
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Ie1c14004
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 14:40:02 -0400 (EDT)
 | 
						||
Received: from mohawksoft.com (IDENT:markw@localhost.localdomain [127.0.0.1])
 | 
						||
	by gromit.dotclick.com (8.9.3/8.9.3) with ESMTP id OAA04973;
 | 
						||
	Thu, 7 Jun 2001 14:37:00 -0400
 | 
						||
Sender: markw@gromit.dotclick.com
 | 
						||
Message-ID: <3B1FC9CB.57C72AD6@mohawksoft.com>
 | 
						||
Date: Thu, 07 Jun 2001 14:36:59 -0400
 | 
						||
From: mlw <markw@mohawksoft.com>
 | 
						||
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.2 i686)
 | 
						||
X-Accept-Language: en
 | 
						||
MIME-Version: 1.0
 | 
						||
To: Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: 7.2 items
 | 
						||
References: <200106071503.f57F32n03924@candle.pha.pa.us>
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Status: RO
 | 
						||
 | 
						||
Bruce Momjian wrote:
 | 
						||
 | 
						||
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						||
> >
 | 
						||
> > > Here is a small list of big TODO items.  I was wondering which ones
 | 
						||
> > > people were thinking about for 7.2?
 | 
						||
> >
 | 
						||
> > A friend of mine wants to use PostgreSQL instead of Oracle for a large
 | 
						||
> > application, but has run into a snag when speed comparisons looked
 | 
						||
> > good until the Oracle folks added a couple of BITMAP indexes.  I can't
 | 
						||
> > recall seeing any discussion about that here -- are there any plans?
 | 
						||
>
 | 
						||
> It is not on our list and I am not sure what they do.
 | 
						||
 | 
						||
Do you have access to any Oracle Documentation? There is a good explanation
 | 
						||
of them.
 | 
						||
 | 
						||
However, I will try to explain.
 | 
						||
 | 
						||
Suppose you have a table, locations, with 1,000,000 records.
 | 
						||
 | 
						||
In Oracle you do this:
 | 
						||
 | 
						||
create bitmap index bitmap_foo on locations (state) ;
 | 
						||
 | 
						||
For each unique value of 'state', Oracle will create a bitmap with 1,000,000
bits in it, with a one representing a match and a zero representing no
match. Record '0' in the table is represented by bit '0' in the bitmap,
record '1' is represented by bit '1', record '2' by bit '2', and so on.
 | 
						||
 | 
						||
When comparatively few distinct values are to be indexed in a
large table, a bitmap index can be quite small and not suffer the N * log(N)
disk I/O most tree-based indexes suffer. If the bitmap is fairly sparse or
dense (or has periods of denseness and sparseness), it can be compressed
very efficiently as well.
 | 
						||
 | 
						||
When the statement:
 | 
						||
 | 
						||
select * from locations where state = 'MA';
 | 
						||
 | 
						||
is executed, the bitmap is read into memory in very few disk operations.
 | 
						||
(Perhaps even as few as one or two). It is a simple operation of rifling
 | 
						||
through the bitmap for '1's that indicate the record has the property,
 | 
						||
'state' = 'MA';
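The description above can be sketched in a few lines.  This is only an
illustration of the idea, not how Oracle stores or compresses its bitmaps:
each bitmap is held as a Python integer, rows are numbered by their position
in a list, and the function names are invented for this example.

```python
def build_bitmap_index(rows, column):
    """Return {value: bitmap}; bit n of a bitmap is 1 if row n has that value."""
    bitmaps = {}
    for rowno, row in enumerate(rows):
        value = row[column]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << rowno)
    return bitmaps


def matching_rows(bitmap):
    """Yield the row numbers whose bit is set, lowest first."""
    rowno = 0
    while bitmap:
        if bitmap & 1:
            yield rowno
        bitmap >>= 1
        rowno += 1


if __name__ == "__main__":
    locations = [
        {"city": "Boston", "state": "MA"},
        {"city": "Columbus", "state": "OH"},
        {"city": "Worcester", "state": "MA"},
        {"city": "Albany", "state": "NY"},
    ]
    index = build_bitmap_index(locations, "state")
    # select * from locations where state = 'MA';
    for rowno in matching_rows(index.get("MA", 0)):
        print(locations[rowno])
```

With only a handful of distinct states, the whole index is a handful of
bitmaps, one bit per row each, which is why it stays small for
low-cardinality columns.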
 | 
						||
 | 
						||
 | 
						||
From mascarm@mascari.com Thu Jun  7 15:36:25 2001
 | 
						||
Return-path: <mascarm@mascari.com>
 | 
						||
Received: from corvette.mascari.com (dhcp065-024-161-045.columbus.rr.com [65.24.161.45])
 | 
						||
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57JaOc21943
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:36:24 -0400 (EDT)
 | 
						||
Received: from ferrari (ferrari.mascari.com [192.168.2.1])
 | 
						||
	by corvette.mascari.com (8.9.3/8.9.3) with SMTP id PAA25607;
 | 
						||
	Thu, 7 Jun 2001 15:29:31 -0400
 | 
						||
Received: by localhost with Microsoft MAPI; Thu, 7 Jun 2001 15:34:18 -0400
 | 
						||
Message-ID: <01C0EF67.5105D2E0.mascarm@mascari.com>
 | 
						||
From: Mike Mascari <mascarm@mascari.com>
 | 
						||
Reply-To: "mascarm@mascari.com" <mascarm@mascari.com>
 | 
						||
To: "'mlw'" <markw@mohawksoft.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | 
						||
Subject: RE: [HACKERS] Re: 7.2 items
 | 
						||
Date: Thu, 7 Jun 2001 15:34:17 -0400
 | 
						||
Organization: Mascari Development Inc.
 | 
						||
X-Mailer: Microsoft Internet E-mail/MAPI - 8.0.0.4211
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset="us-ascii"
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Status: RO
 | 
						||
 | 
						||
And in addition,
 | 
						||
 | 
						||
If you submitted the query:
 | 
						||
 | 
						||
SELECT * FROM addresses WHERE state = 'OH'
 | 
						||
AND areacode = '614'
 | 
						||
 | 
						||
Then, with bitmap indexes, the bitmaps are just logically ANDed 
 | 
						||
together, and the final bitmap determines the matching rows.
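A minimal sketch of that ANDing step, under the same kind of assumptions as
any toy model: bitmaps are plain Python integers with bit n standing for row
n, the sample rows are invented, and the helper names are hypothetical.

```python
def bitmap_for(rows, column, value):
    """Build the bitmap of rows whose `column` equals `value`."""
    bits = 0
    for rowno, row in enumerate(rows):
        if row[column] == value:
            bits |= 1 << rowno
    return bits


def rows_in(bitmap):
    """List the row numbers whose bit is set in the bitmap."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]


if __name__ == "__main__":
    addresses = [
        {"state": "OH", "areacode": "614"},
        {"state": "OH", "areacode": "216"},
        {"state": "MA", "areacode": "617"},
        {"state": "OH", "areacode": "614"},
    ]
    state_oh = bitmap_for(addresses, "state", "OH")
    area_614 = bitmap_for(addresses, "areacode", "614")
    # WHERE state = 'OH' AND areacode = '614'
    both = state_oh & area_614
    print(rows_in(both))
```

An OR condition would use `|` in the same way, so several single-column
bitmap indexes can be combined without a dedicated multi-column index.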
 | 
						||
 | 
						||
Mike Mascari
 | 
						||
mascarm@mascari.com
 | 
						||
 | 
						||
-----Original Message-----
 | 
						||
From:	mlw [SMTP:markw@mohawksoft.com]
 | 
						||
 | 
						||
Bruce Momjian wrote:

> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >
> > > Here is a small list of big TODO items.  I was wondering which ones
> > > people were thinking about for 7.2?
> >
> > A friend of mine wants to use PostgreSQL instead of Oracle for a large
> > application, but has run into a snag when speed comparisons looked
> > good until the Oracle folks added a couple of BITMAP indexes.  I can't
> > recall seeing any discussion about that here -- are there any plans?
>
> It is not on our list and I am not sure what they do.

Do you have access to any Oracle Documentation? There is a good explanation
of them.

However, I will try to explain.

If you have a table, locations. It has 1,000,000 records.

In oracle you do this:

create bitmap index bitmap_foo on locations (state) ;

For each unique value of 'state' oracle will create a bitmap with 1,000,000
bits in it. With a one representing a match and a zero representing no
match. Record '0' in the table is represented by bit '0' in the bitmap,
record '1' is represented by bit '1', record two by bit '2' and so on.

In a table where comparatively few different values are to be indexed in a
large table, a bitmap index can be quite small and not suffer the N * log(N)
disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
dense (or have periods of denseness and sparseness), it can be compressed
very efficiently as well.

When the statement:

select * from locations where state = 'MA';

Is executed, the bitmap is read into memory in very few disk operations.
(Perhaps even as few as one or two). It is a simple operation of rifling
through the bitmap for '1's that indicate the record has the property,
'state' = 'MA';
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From oleg@sai.msu.su Thu Jun  7 15:39:15 2001
 | 
						||
Return-path: <oleg@sai.msu.su>
 | 
						||
Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2])
 | 
						||
	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Jd7c22010
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:39:08 -0400 (EDT)
 | 
						||
Received: from ra (ra [158.250.29.2])
 | 
						||
	by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id WAA07783;
 | 
						||
	Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
 | 
						||
Date: Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
 | 
						||
From: Oleg Bartunov <oleg@sai.msu.su>
 | 
						||
X-X-Sender: <megera@ra.sai.msu.su>
 | 
						||
To: mlw <markw@mohawksoft.com>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Re: 7.2 items
 | 
						||
In-Reply-To: <3B1FC9CB.57C72AD6@mohawksoft.com>
 | 
						||
Message-ID: <Pine.GSO.4.33.0106072234120.6015-100000@ra.sai.msu.su>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						||
Status: RO
 | 
						||
 | 
						||
I think it's possible to implement bitmap indexes with a little
effort using GiST.  At least I know of one implementation:
http://www.it.iitb.ernet.in/~rvijay/dbms/proj/
If you are interested, you could implement bitmap indexes yourself;
unfortunately, we're very busy.
 | 
						||
 | 
						||
	Oleg
 | 
						||
On Thu, 7 Jun 2001, mlw wrote:
 | 
						||
 | 
						||
> Bruce Momjian wrote:
 | 
						||
>
 | 
						||
> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						||
> > >
 | 
						||
> > > > Here is a small list of big TODO items.  I was wondering which ones
 | 
						||
> > > > people were thinking about for 7.2?
 | 
						||
> > >
 | 
						||
> > > A friend of mine wants to use PostgreSQL instead of Oracle for a large
 | 
						||
> > > application, but has run into a snag when speed comparisons looked
 | 
						||
> > > good until the Oracle folks added a couple of BITMAP indexes.  I can't
 | 
						||
> > > recall seeing any discussion about that here -- are there any plans?
 | 
						||
> >
 | 
						||
> > It is not on our list and I am not sure what they do.
 | 
						||
>
 | 
						||
> Do you have access to any Oracle Documentation? There is a good explanation
 | 
						||
> of them.
 | 
						||
>
 | 
						||
> However, I will try to explain.
 | 
						||
>
 | 
						||
> If you have a table, locations. It has 1,000,000 records.
 | 
						||
>
 | 
						||
> In oracle you do this:
 | 
						||
>
 | 
						||
> create bitmap index bitmap_foo on locations (state) ;
 | 
						||
>
 | 
						||
> For each unique value of 'state' oracle will create a bitmap with 1,000,000
 | 
						||
> bits in it. With a one representing a match and a zero representing no
 | 
						||
> match. Record '0' in the table is represented by bit '0' in the bitmap,
 | 
						||
> record '1' is represented by bit '1', record two by bit '2' and so on.
 | 
						||
>
 | 
						||
> In a table where comparatively few different values are to be indexed in a
 | 
						||
> large table, a bitmap index can be quite small and not suffer the N * log(N)
 | 
						||
> disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
 | 
						||
> dense (or have periods of denseness and sparseness), it can be compressed
 | 
						||
> very efficiently as well.
 | 
						||
>
 | 
						||
> When the statement:
 | 
						||
>
 | 
						||
> select * from locations where state = 'MA';
 | 
						||
>
 | 
						||
> Is executed, the bitmap is read into memory in very few disk operations.
 | 
						||
> (Perhaps even as few as one or two). It is a simple operation of rifling
 | 
						||
> through the bitmap for '1's that indicate the record has the property,
 | 
						||
> 'state' = 'MA';
 | 
						||
>
 | 
						||
>
 | 
						||
 | 
						||
	Regards,
 | 
						||
		Oleg
 | 
						||
_____________________________________________________________
 | 
						||
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
 | 
						||
Sternberg Astronomical Institute, Moscow University (Russia)
 | 
						||
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
 | 
						||
phone: +007(095)939-16-83, +007(095)939-23-83
 | 
						||
 | 
						||
 | 
						||
From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
 | 
						||
Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
 | 
						||
Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.17 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
 | 
						||
Received: from hub.org (majordom@localhost [127.0.0.1])
 | 
						||
	by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
 | 
						||
	Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
 | 
						||
Received: from home.dialix.com ([203.15.150.26])
 | 
						||
	by hub.org (8.10.1/8.10.1) with ESMTP id e5GLCQM14064
 | 
						||
	for <pgsql-general@postgresql.org>; Fri, 16 Jun 2000 17:12:27 -0400 (EDT)
 | 
						||
Received: from nemeton.com.au ([202.76.153.71])
 | 
						||
	by home.dialix.com (8.9.3/8.9.3/JustNet) with SMTP id HAA95516
 | 
						||
	for <pgsql-general@postgresql.org>; Sat, 17 Jun 2000 07:11:44 +1000 (EST)
 | 
						||
	(envelope-from giles@nemeton.com.au)
 | 
						||
Received: (qmail 10213 invoked from network); 16 Jun 2000 09:52:29 -0000
 | 
						||
Received: from nemeton.com.au (203.8.3.17)
 | 
						||
  by nemeton.com.au with SMTP; 16 Jun 2000 09:52:29 -0000
 | 
						||
To: Jurgen Defurne <defurnj@glo.be>
 | 
						||
cc: Mark Stier <kalium@gmx.de>,
 | 
						||
        postgreSQL general mailing list <pgsql-general@postgresql.org>
 | 
						||
Subject: Re: [GENERAL] optimization by removing the file system layer? 
 | 
						||
In-Reply-To: Message from Jurgen Defurne <defurnj@glo.be> 
 | 
						||
   of "Thu, 15 Jun 2000 20:26:57 +0200." <39491FF1.E1E583F8@glo.be> 
 | 
						||
Date: Fri, 16 Jun 2000 19:52:28 +1000
 | 
						||
Message-ID: <10210.961149148@nemeton.com.au>
 | 
						||
From: Giles Lean <giles@nemeton.com.au>
 | 
						||
X-Mailing-List: pgsql-general@postgresql.org
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-general-owner@hub.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
 | 
						||
> I think that the Un*x filesystem is one of the reasons that large
 | 
						||
> database vendors rather use raw devices, than filesystem storage
 | 
						||
> files.
 | 
						||
 | 
						||
This used to be the preference, back in the late 80s and possibly
 | 
						||
early 90s.  I'm seeing a preference toward using the filesystem now,
 | 
						||
possibly with some sort of async I/O and co-operation from the OS
 | 
						||
filesystem about interactions with the filesystem cache.
 | 
						||
 | 
						||
Performance preferences don't stand still.  The hardware changes, the
 | 
						||
software changes, the volume of data changes, and different solutions
 | 
						||
become preferable.
 | 
						||
 | 
						||
> Using a raw device on the disk gives them the possibility to have
 | 
						||
> complete control over their files, indices and objects without being
 | 
						||
> bothered by the operating system.
 | 
						||
>
 | 
						||
> This speeds up things in several ways :
 | 
						||
> - the least possible OS intervention
 | 
						||
 | 
						||
Not that this is especially useful, necessarily.  If the "raw" device
 | 
						||
is in fact managed by a logical volume manager doing mirroring onto
 | 
						||
some sort of storage array there is still plenty of OS code involved.
 | 
						||
 | 
						||
The cost of using a filesystem in addition may not be much if anything
 | 
						||
and of course a filesystem is considerably more flexible to
 | 
						||
administer (backup, move, change size, check integrity, etc.)
 | 
						||
 | 
						||
> - choose block sizes according to applications
 | 
						||
> - reducing fragmentation
 | 
						||
> - packing data in nearby cylinders
 | 
						||
 | 
						||
... but when this storage area is spread over multiple mechanisms in a
 | 
						||
smart storage array with write caching, you've no idea what is where
 | 
						||
anyway.  Better to let the hardware or at least the OS manage this;
 | 
						||
there are so many levels of caching between a database and the
 | 
						||
magnetic media that working hard to influence layout is almost
 | 
						||
certainly a waste of time.
 | 
						||
 | 
						||
Kirk McKusick tells a lovely story that once upon a time it used to be
 | 
						||
sensible to check some registers on a particular disk controller to
 | 
						||
find out where the heads were when scheduling I/O.  Needless to say,
 | 
						||
that is history now!
 | 
						||
 | 
						||
There's a considerable cost in complexity and code in using "raw"
 | 
						||
storage too, and it's not a one off cost: as the technologies change,
 | 
						||
the "fast" way to do things will change and the code will have to be
 | 
						||
updated to match.  Better to leave this to the OS vendor where
 | 
						||
possible, and take advantage of the tuning they do.
 | 
						||
 | 
						||
> - Any other ideas -> the sky is the limit here
 | 
						||
 | 
						||
> It also aids portability, at least on platforms that have an
 | 
						||
> equivalent of a raw device.
 | 
						||
 | 
						||
I don't understand that claim.  Not much is portable about raw
 | 
						||
devices, and they're typically not nearly as well documented as the
 | 
						||
filesystem interfaces.
 | 
						||
 | 
						||
> It is also independent of the standard implemented Un*x filesystems,
 | 
						||
> for which you will have to pay extra if you want to take extra
 | 
						||
> measures against power loss.
 | 
						||
 | 
						||
Rather, it is worse.  With a Unix filesystem you get quite defined
 | 
						||
semantics about what is written when.
 | 
						||
 | 
						||
> The problem with e.g. e2fs, is that it is not robust enough if a CPU
 | 
						||
> fails.
 | 
						||
 | 
						||
ext2fs doesn't even claim to have Unix filesystem semantics.
 | 
						||
 | 
						||
Regards,
 | 
						||
 | 
						||
Giles
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M1795@postgresql.org Thu Dec  7 18:47:52 2000
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA09172
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 18:47:52 -0500 (EST)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB7NjFP10612;
 | 
						||
	Thu, 7 Dec 2000 18:45:15 -0500 (EST)
 | 
						||
	(envelope-from pgsql-hackers-owner+M1795@postgresql.org)
 | 
						||
Received: from thor.tht.net (thor.tht.net [209.47.145.4])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB7N6BP08233
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:06:11 -0500 (EST)
 | 
						||
	(envelope-from bright@fw.wintelcom.net)
 | 
						||
Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20])
 | 
						||
	by thor.tht.net (8.9.3/8.9.3) with ESMTP id SAA97456
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:57:32 GMT
 | 
						||
	(envelope-from bright@fw.wintelcom.net)
 | 
						||
Received: (from bright@localhost)
 | 
						||
	by fw.wintelcom.net (8.10.0/8.10.0) id eB7MvWE21269
 | 
						||
	for pgsql-hackers@postgresql.org; Thu, 7 Dec 2000 14:57:32 -0800 (PST)
 | 
						||
Date: Thu, 7 Dec 2000 14:57:32 -0800
 | 
						||
From: Alfred Perlstein <bright@wintelcom.net>
 | 
						||
To: pgsql-hackers@postgresql.org
 | 
						||
Subject: [HACKERS] Patches with vacuum fixes available for 7.0.x
 | 
						||
Message-ID: <20001207145732.X16205@fw.wintelcom.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Disposition: inline
 | 
						||
User-Agent: Mutt/1.2.5i
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: ORr
 | 
						||
 | 
						||
We recently had a very satisfactory contract completed by
 | 
						||
Vadim.
 | 
						||
 | 
						||
Basically Vadim has been able to reduce the amount of time
 | 
						||
taken by a vacuum from 10-15 minutes down to under 10 seconds.
 | 
						||
 | 
						||
We've been running with these patches under heavy load for
 | 
						||
about a week now without any problems except one:
 | 
						||
  don't 'lazy' (new option for vacuum) a table which has just
 | 
						||
  had an index created on it, or at least don't expect it to
 | 
						||
  take any less time than a normal vacuum would.
 | 
						||
 | 
						||
There are three patchsets, available at:
 | 
						||
 | 
						||
http://people.freebsd.org/~alfred/vacfix/
 | 
						||
 | 
						||
complete diff:
 | 
						||
http://people.freebsd.org/~alfred/vacfix/v.diff
 | 
						||
 | 
						||
only lazy vacuum option to speed up index vacuums:
 | 
						||
http://people.freebsd.org/~alfred/vacfix/vlazy.tgz
 | 
						||
 | 
						||
only lazy vacuum option to only scan from start of modified
 | 
						||
data:
 | 
						||
http://people.freebsd.org/~alfred/vacfix/mnmb.tgz
 | 
						||
 | 
						||
Although the patches are for 7.0.x I'm hoping that they
 | 
						||
can be forward ported (if Vadim hasn't done it already)
 | 
						||
to 7.1.
 | 
						||
 | 
						||
enjoy!
 | 
						||
 | 
						||
-- 
 | 
						||
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | 
						||
"I have the heart of a child; I keep it in a jar on my desk."
 | 
						||
 | 
						||
From pgsql-hackers-owner+M1809@postgresql.org Thu Dec  7 20:27:39 2000
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11827
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 20:27:38 -0500 (EST)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB81PsP22362;
 | 
						||
	Thu, 7 Dec 2000 20:25:54 -0500 (EST)
 | 
						||
	(envelope-from pgsql-hackers-owner+M1809@postgresql.org)
 | 
						||
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB81JkP21783
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 20:19:46 -0500 (EST)
 | 
						||
	(envelope-from bright@fw.wintelcom.net)
 | 
						||
Received: (from bright@localhost)
 | 
						||
	by fw.wintelcom.net (8.10.0/8.10.0) id eB81JwU25447;
 | 
						||
	Thu, 7 Dec 2000 17:19:58 -0800 (PST)
 | 
						||
Date: Thu, 7 Dec 2000 17:19:58 -0800
 | 
						||
From: Alfred Perlstein <bright@wintelcom.net>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Patches with vacuum fixes available for 7.0.x
 | 
						||
Message-ID: <20001207171958.B16205@fw.wintelcom.net>
 | 
						||
References: <20001207145732.X16205@fw.wintelcom.net> <28791.976236143@sss.pgh.pa.us>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Disposition: inline
 | 
						||
User-Agent: Mutt/1.2.5i
 | 
						||
In-Reply-To: <28791.976236143@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Thu, Dec 07, 2000 at 07:42:23PM -0500
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
* Tom Lane <tgl@sss.pgh.pa.us> [001207 17:10] wrote:
 | 
						||
> Alfred Perlstein <bright@wintelcom.net> writes:
 | 
						||
> > Basically Vadim has been able to reduce the amount of time
 | 
						||
> > taken by a vacuum from 10-15 minutes down to under 10 seconds.
 | 
						||
> 
 | 
						||
> Cool.  What's it do, exactly?
 | 
						||
 | 
						||
================================================================
 | 
						||
 | 
						||
The first is a bonus that Vadim gave us to speed up index
vacuums. I'm not sure I understand it completely, but it
works really well. :)
 | 
						||
 | 
						||
here's the README he gave us:
 | 
						||
 | 
						||
           Vacuum LAZY index cleanup option
 | 
						||
 | 
						||
The LAZY vacuum option introduces a new way of cleaning up indices.
Instead of reading the entire index file to remove index tuples
pointing to deleted table records, with the LAZY option vacuum
performs index scans using keys fetched from the table records
to be deleted. Vacuum checks whether each result returned by the
index scan points to the target heap record and, if so, removes the
corresponding index tuple.
This can greatly speed up index cleaning if not many table
records were deleted/modified between vacuum runs.
Vacuum uses the new option only on user demand.
 | 
						||
 | 
						||
New vacuum syntax is:
 | 
						||
 | 
						||
vacuum [verbose] [analyze] [lazy] [table [(columns)]]
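
The difference between the two cleanup strategies described in the README
can be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not Vadim's patch: the index is modelled as a list of
(key, tid) pairs sorted by key, and the function names full_scan_cleanup and
lazy_cleanup are hypothetical.

```python
import bisect

def full_scan_cleanup(index, dead_tids):
    """Baseline: read every index entry and drop those pointing at dead rows."""
    dead = set(dead_tids)
    return [(key, tid) for key, tid in index if tid not in dead]

def lazy_cleanup(index, dead_rows):
    """LAZY-style cleanup: probe the index once per dead row, by that row's key.

    index     -- list of (key, tid) pairs kept sorted by key (assumed model)
    dead_rows -- (key, tid) pairs taken from the heap rows being removed
    """
    for key, tid in dead_rows:
        # (key,) sorts before every (key, tid), so this finds the first
        # index entry carrying the key.
        pos = bisect.bisect_left(index, (key,))
        # Walk the duplicates of the key until the entry that points at
        # the target heap row appears, then remove just that entry.
        while pos < len(index) and index[pos][0] == key:
            if index[pos][1] == tid:
                del index[pos]
                break
            pos += 1
    return index
```

The lazy variant does work proportional to the number of dead rows rather
than to the size of the index, which is why it only pays off when few rows
were deleted or modified between vacuum runs.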
 | 
						||
 | 
						||
================================================================
 | 
						||
 | 
						||
The second is one of the suggestions I gave on the lists a while
back: keeping track of the "last dirtied" block in the data files
so that vacuum only scans the tail end of the file for deleted rows.
I think what he did instead was keep a table that holds all the
modified blocks, and vacuum only scans those:
 | 
						||
 | 
						||
              Minimal Number Modified Block (MNMB)
 | 
						||
 | 
						||
This feature tracks the MNMB of selected tables with triggers
so that vacuum can avoid reading unmodified table pages. The triggers
store the MNMB in per-table files in a specified directory
($LIBDIR/contrib/mnmb by default) and create these files if they do
not exist.
 | 
						||
 | 
						||
Vacuum first looks up functions
 | 
						||
 | 
						||
mnmb_getblock(Oid databaseId, Oid tableId)
 | 
						||
mnmb_setblock(Oid databaseId, Oid tableId, Oid block)
 | 
						||
 | 
						||
in the catalog. If *both* functions are found *and* no ANALYZE
option was specified, then vacuum calls mnmb_getblock to obtain the
MNMB for the table being vacuumed and starts reading the table from
the block number returned. After the table has been processed, vacuum
calls mnmb_setblock to update the file to the last table block number.
Neither mnmb_getblock nor mnmb_setblock tries to create the file.
If there is no file for the table being vacuumed, then mnmb_getblock
returns 0 and mnmb_setblock does nothing.
mnmb_setblock() may be used to set the MNMB in the file to 0 and
force vacuum to read the entire table if required.
 | 
						||
 | 
						||
To compile MNMB you have to add -DMNMB to CUSTOM_COPT
 | 
						||
in src/Makefile.custom.
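
The MNMB bookkeeping described above can be modelled as follows. This is a
rough sketch under assumptions, not the contrib code: the per-table files are
replaced by an in-memory dict, the class MnmbStore and the method
trigger_modified are hypothetical names, and skipping MNMB entirely when
ANALYZE is given is one reading of the README.

```python
class MnmbStore:
    """In-memory stand-in for the per-table MNMB files (an assumption)."""

    def __init__(self):
        self._files = {}  # (database_id, table_id) -> lowest modified block

    def trigger_modified(self, database_id, table_id, block):
        """What a trigger would do: create the entry if missing, keep the minimum."""
        key = (database_id, table_id)
        self._files[key] = min(self._files.get(key, block), block)

    def getblock(self, database_id, table_id):
        """Return the stored MNMB, or 0 when no file exists for the table."""
        return self._files.get((database_id, table_id), 0)

    def setblock(self, database_id, table_id, block):
        """Update an existing entry; never create one."""
        key = (database_id, table_id)
        if key in self._files:
            self._files[key] = block


def vacuum(store, database_id, table_id, n_blocks, analyze=False):
    """Return the list of block numbers a vacuum pass would read."""
    start = 0 if analyze else store.getblock(database_id, table_id)
    scanned = list(range(start, n_blocks))
    if not analyze:
        # After processing, point the MNMB at the last block of the table.
        store.setblock(database_id, table_id, max(n_blocks - 1, 0))
    return scanned
```

Calling store.setblock(db, table, 0) on an existing entry forces the next
vacuum to read the whole table, as the README notes.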
 | 
						||
 | 
						||
-- 
 | 
						||
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | 
						||
"I have the heart of a child; I keep it in a jar on my desk."
 | 
						||
 | 
						||
From pgsql-general-owner+M4010@postgresql.org Mon Feb  5 18:50:47 2001
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02209
 | 
						||
	for <pgman@candle.pha.pa.us>; Mon, 5 Feb 2001 18:50:46 -0500 (EST)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15Nn8x86486;
 | 
						||
	Mon, 5 Feb 2001 18:49:08 -0500 (EST)
 | 
						||
	(envelope-from pgsql-general-owner+M4010@postgresql.org)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15N7Ux81124
 | 
						||
	for <pgsql-general@postgresql.org>; Mon, 5 Feb 2001 18:07:30 -0500 (EST)
 | 
						||
	(envelope-from pgsql-general-owner@postgresql.org)
 | 
						||
Received: from news.tht.net (news.hub.org [216.126.91.242])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0V0Twq69854
 | 
						||
	for <pgsql-general@postgresql.org>; Tue, 30 Jan 2001 19:29:58 -0500 (EST)
 | 
						||
	(envelope-from news@news.tht.net)
 | 
						||
Received: (from news@localhost)
 | 
						||
	by news.tht.net (8.11.1/8.11.1) id f0V0RAO01011
 | 
						||
	for pgsql-general@postgresql.org; Tue, 30 Jan 2001 19:27:10 -0500 (EST)
 | 
						||
	(envelope-from news)
 | 
						||
From: Mike Hoskins <mikehoskins@yahoo.com>
 | 
						||
X-Newsgroups: comp.databases.postgresql.general
 | 
						||
Subject: Re: [GENERAL] MySQL file system
 | 
						||
Date: Tue, 30 Jan 2001 18:30:36 -0600
 | 
						||
Organization: Hub.Org Networking Services (http://www.hub.org)
 | 
						||
Lines: 120
 | 
						||
Message-ID: <3A775CAB.C416AA16@yahoo.com>
 | 
						||
References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
X-Complaints-To: scrappy@hub.org
 | 
						||
X-Mailer: Mozilla 4.76 [en] (Windows NT 5.0; U)
 | 
						||
X-Accept-Language: en
 | 
						||
To: pgsql-general@postgresql.org
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-general-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
This idea is such a popular (even old) one that Oracle developed it for 8i --
 | 
						||
IFS.  Yep, AS/400 has had it forever, and BeOS is another example.  Informix has
 | 
						||
had its DataBlades for years, as well.  In fact, Reiser-FS is an FS implemented
 | 
						||
on a DB, albeit probably not a SQL DB.  AIX's LVM and JFS are extent/DB-based, as
 | 
						||
well. Let's see now, why would all those guys do that?  (Now, some of those that
 | 
						||
aren't SQL-based probably won't allow SQL queries on files, so just think about
 | 
						||
those that do, for a minute)....
 | 
						||
 | 
						||
Rather than asking why, a far better question is why not?  There is SO much
 | 
						||
functionality to be gained here that it's silly to ask why.  At a higher level,
 | 
						||
treating BLOBs as files and as DB entries simultaneously has so many uses, that
 | 
						||
one has trouble answering the question properly without the puzzled stare back
 | 
						||
at the questioner.  Again, look at the above list, particularly at AS/400 -- the
 | 
						||
entire OS's FS sits on top of DB/2!
 | 
						||
 | 
						||
For example, think how easily dynamically generated web sites could access online
 | 
						||
catalog information, with all those JPEG's, GIFs, PNGs, HTML files, Text files,
 | 
						||
.PDF's, etc., both in the DB and in the FS.  This would be so much easier to
 | 
						||
maintain, when you have webmasters, web designers, artists, programmers,
 | 
						||
sysadmins, dba's, etc., all trying to manage a big, dynamic, graphics-rich web
 | 
						||
site.  Who cares if the FS is a bit slow, as long as it's not too slow?  That's
 | 
						||
not the point, anyway.
 | 
						||
 | 
						||
The point is easy access to data:  asset management, version control, the
 | 
						||
ability to access the same data as a file and as a BLOB simultaneously, the
 | 
						||
ability to replicate easier, the ability to use more tools on the same info,
 | 
						||
etc.  It's not for speed, per se; instead, it's for accessibility.
 | 
						||
 | 
						||
Think about this issue.  You have some already compiled text-based program that
 | 
						||
works on binary files, but not on databases -- it was simply never designed into
 | 
						||
the program.  How are you going to get your graphics BLOBs into that program?
 | 
						||
Oh yeah, let's write another program to transform our data into files, first,
 | 
						||
then after processing delete them in some cleanup routine....  Why?  If you have
 | 
						||
a DB'ed FS, then file data can simultaneously have two views -- one for the DB
 | 
						||
and one as an FS.  (You can easily reverse the scenario.)  Not only does this
 | 
						||
save time and disk space; it saves you from having to pay for the most expensive
 | 
						||
element of all -- programmer time.
 | 
						||
 | 
						||
BTW, once this FS-on-a-DB concept really sinks in, imagine how tightly
 | 
						||
integrated Linux/Unix apps could be written.  Imagine if a bunch of GPL'ed
 | 
						||
software started coding for this and used this as a means to exchange data, all
 | 
						||
using a common set of libraries.  You could get to the point of uniting files,
 | 
						||
BLOBs, data of all sorts, IPC, version control, etc., all under one umbrella,
 | 
						||
especially if XML was the means data was exchanged.  Heck, distributed
 | 
						||
authentication, file access, data access, etc., could be improved greatly.
 | 
						||
Well, this paragraph sounds like flame bait, but really consider the
 | 
						||
ramifications.  Also, read the next paragraph....
 | 
						||
 | 
						||
Something like this *has* existed for Postgres for a long time -- PGFS, by Brian
 | 
						||
Bartholomew.  It's even supposedly matured with age.  Unfortunately, I cannot
 | 
						||
get to http://www.wv.com/ (Working Version's main site).  Working Version is a
 | 
						||
version control system that keeps old versions of files around in the FS.  It
 | 
						||
uses PG as the back-end DB and lets you mount it like another FS.  It's
 | 
						||
supposedly an awesome system, but where is it?  It's not some clunky korbit
 | 
						||
thingy, either.  (If someone can find it, please let me know by email, if
 | 
						||
possible.)
 | 
						||
 | 
						||
The only thing I can find on this is from a Google search, which caches
 | 
						||
everything but the actual software:
 | 
						||
 | 
						||
http://www.google.com/search?q=pgfs+postgres&num=100&hl=en&lr=lang_en&newwindow=1&safe=active
 | 
						||
 | 
						||
Also, there is the Perl-FS that can be transformed into something like PGFS:
 | 
						||
http://www.assurdo.com/perlfs/  It allows you to write Perl code that can mount
 | 
						||
various protocols or data types as an FS, in user space.  (One example is the
 | 
						||
ability to mount FTP sites, BTW.)
 | 
						||
 | 
						||
Instead of ridiculing something you've never tried, consider that MySQL-FS,
 | 
						||
Oracle (IFS), Informix (DataBlades), AS/400 (DB/2), BeOS, and Reiser-FS are
 | 
						||
doing this today.  Do you want to be left behind and let them tell us what it's
 | 
						||
good for?  Or, do we want this for PG?  (Reiser-FS, BTW, is FASTER than ext2,
 | 
						||
but has no SQL hooks).
 | 
						||
 | 
						||
There were many posts on this on slashdot:
 | 
						||
    http://slashdot.org/article.pl?sid=01/01/16/1855253&mode=thread
 | 
						||
    (I wrote some comments here, as well, just look for mikehoskins)
 | 
						||
 | 
						||
I, for one, want to see this succeed for MySQL, PostgreSQL, msql, etc.  It's an
 | 
						||
awesome feature that doesn't need to be speedy because it can save HUMANS time.
 | 
						||
 | 
						||
The question really is, "When do we want to catch up to everyone else?"  We are
 | 
						||
always moving to higher levels of abstraction, anyway, so it's just a matter of
 | 
						||
time.  PG should participate.
 | 
						||
 | 
						||
 | 
						||
Adam Lang wrote:
 | 
						||
 | 
						||
> I wasn't following the thread too closely, but database for a filesystem has
 | 
						||
> been done.  BeOS uses a database for a filesystem as well as AS/400 and
 | 
						||
> Mainframes.
 | 
						||
>
 | 
						||
> Adam Lang
 | 
						||
> Systems Engineer
 | 
						||
> Rutgers Casualty Insurance Company
 | 
						||
> http://www.rutgersinsurance.com
 | 
						||
> ----- Original Message -----
 | 
						||
> From: "Alfred Perlstein" <bright@wintelcom.net>
 | 
						||
> To: "Robert D. Nelson" <RDNELSON@co.centre.pa.us>
 | 
						||
> Cc: "Joseph Shraibman" <jks@selectacast.net>; "Karl DeBisschop"
 | 
						||
> <karl@debisschop.net>; "Ned Lilly" <ned@greatbridge.com>; "PostgreSQL
 | 
						||
> General" <pgsql-general@postgresql.org>
 | 
						||
> Sent: Wednesday, January 17, 2001 12:23 PM
 | 
						||
> Subject: Re: [GENERAL] MySQL file system
 | 
						||
>
 | 
						||
> > * Robert D. Nelson <RDNELSON@co.centre.pa.us> [010117 05:17] wrote:
 | 
						||
> > > >Raw disk access allows:
 | 
						||
> > >
 | 
						||
> > > If I'm correct, mysql is providing a filesystem, not a way to access raw
 | 
						||
> > > disk, like Oracle does. Huge difference there - with a filesystem, you
 | 
						||
> have
 | 
						||
> > > overhead of FS *and* SQL at the same time.
 | 
						||
> >
 | 
						||
> > Oh, so it's sort of like /proc for mysql?
 | 
						||
> >
 | 
						||
> > What a terrible waste of time and resources. :(
 | 
						||
> >
 | 
						||
> > --
 | 
						||
> > -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | 
						||
> > "I have the heart of a child; I keep it in a jar on my desk."
 | 
						||
 | 
						||
 | 
						||
From pgsql-general-owner+M4049@postgresql.org Tue Feb  6 01:26:19 2001
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA21425
 | 
						||
	for <pgman@candle.pha.pa.us>; Tue, 6 Feb 2001 01:26:18 -0500 (EST)
 | 
						||
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f166Nxx26400;
 | 
						||
	Tue, 6 Feb 2001 01:23:59 -0500 (EST)
 | 
						||
	(envelope-from pgsql-general-owner+M4049@postgresql.org)
 | 
						||
Received: from simecity.com ([202.188.254.2])
 | 
						||
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f166GUx25754
 | 
						||
	for <pgsql-general@postgresql.org>; Tue, 6 Feb 2001 01:16:30 -0500 (EST)
 | 
						||
	(envelope-from lyeoh@pop.jaring.my)
 | 
						||
Received: (from mail@localhost)
 | 
						||
	by simecity.com (8.9.3/8.8.7) id OAA23910;
 | 
						||
	Tue, 6 Feb 2001 14:28:48 +0800
 | 
						||
Received: from <lyeoh@pop.jaring.my> (ilab2.mecomb.po.my [192.168.3.22]) by cirrus.simecity.com via smap (V2.1)
 | 
						||
	id xma023908; Tue, 6 Feb 01 14:28:34 +0800
 | 
						||
Message-ID: <3.0.5.32.20010206141555.00a3d100@192.228.128.13>
 | 
						||
X-Sender: lyeoh@192.228.128.13
 | 
						||
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.5 (32)
 | 
						||
Date: Tue, 06 Feb 2001 14:15:55 +0800
 | 
						||
To: Mike Hoskins <mikehoskins@yahoo.com>, pgsql-general@postgresql.org
 | 
						||
From: Lincoln Yeoh <lyeoh@pop.jaring.my>
 | 
						||
Subject: [GENERAL] Re: MySQL file system
 | 
						||
In-Reply-To: <3A775CF7.3C5F1909@yahoo.com>
 | 
						||
References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset="us-ascii"
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-general-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
What you're saying seems to be to have a data structure where the same data
can be accessed in both the filesystem style and the RDBMS style. How does
that work? How is the mapping done between both structures? Slapping a
filesystem on top of an RDBMS doesn't do that, does it?
 | 
						||
 | 
						||
Most filesystems are basically databases already, just differently
 | 
						||
structured and featured databases. And so far most of them do their job
 | 
						||
pretty well. You move a folder/directory somewhere, and everything inside
 | 
						||
it moves. Tons of data are already arranged in that form. Though porting
 | 
						||
over data from one filesystem to another is not always straightforward,
 | 
						||
RDBMSes are far worse.
 | 
						||
 | 
						||
Maybe what would be nice is not a filesystem based on a database, rather
 | 
						||
one influenced by databases. One with a decent fulltextindex for data and
 | 
						||
filenames, where you have the option to ignore or not ignore
 | 
						||
nonalphanumerics and still get an indexed search.
 | 
						||
 | 
						||
Then perhaps we could do something like the following:
 | 
						||
 | 
						||
select file.name from path "/var/logs/" where file.name like '%.log%' and
file.lastmodified > '2000/1/1' and file.contents =~ 'te_st[0-9]+\.gif$' use
index
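
For comparison, here is roughly what that query has to do today without any
index: walk the directory and test every file. This is only a sketch; the
function name select_files and its parameters are hypothetical, and the LIKE
pattern is approximated by a plain substring test.

```python
import os
import re
from datetime import datetime

def select_files(path, name_contains, modified_after, contents_pattern):
    """Brute-force version of the query: no index, every file is opened."""
    pattern = re.compile(contents_pattern, re.MULTILINE)
    cutoff = modified_after.timestamp()
    matches = []
    for entry in os.scandir(path):
        if not entry.is_file():
            continue
        if name_contains not in entry.name:       # file.name like '%...%'
            continue
        if entry.stat().st_mtime <= cutoff:       # file.lastmodified > date
            continue
        with open(entry.path, "r", errors="replace") as fh:
            if pattern.search(fh.read()):         # file.contents =~ regex
                matches.append(entry.name)
    return sorted(matches)
```

A call such as select_files("/var/logs/", ".log", datetime(2000, 1, 1),
r"te_st[0-9]+\.gif$") reads every candidate file in full, which is the cost a
maintained full-text index would avoid.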
 | 
						||
 | 
						||
Checkpoints would be nice too. Then I can rollback to a known point if I
 | 
						||
screw up ;).
 | 
						||
 | 
						||
In fact the SQL style interface doesn't have to be built in at all. Neither
 | 
						||
does the index have to be realtime. I suppose there could be an option to
 | 
						||
make it realtime if performance is not an issue. 
 | 
						||
 | 
						||
What could be done is to use some fast filesystem. Then we add tools to
 | 
						||
maintain indexes, for SQL style interfaces and other style interfaces.
 | 
						||
Checkpoints and rollbacks would be harder of course.
 | 
						||
 | 
						||
Cheerio,
 | 
						||
Link.
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M20329@postgresql.org Tue Mar 19 18:00:15 2002
 | 
						||
Return-path: <pgsql-hackers-owner+M20329@postgresql.org>
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g2K00EA02465
 | 
						||
	for <pgman@candle.pha.pa.us>; Tue, 19 Mar 2002 19:00:14 -0500 (EST)
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by postgresql.org (Postfix) with SMTP
 | 
						||
	id 8C7164763EF; Tue, 19 Mar 2002 18:22:08 -0500 (EST)
 | 
						||
Received: from CopelandConsulting.Net (dsl-24293-ld.customer.centurytel.net [209.142.135.135])
 | 
						||
	by postgresql.org (Postfix) with ESMTP id E4DAD475F1F
 | 
						||
	for <pgsql-hackers@postgresql.org>; Tue, 19 Mar 2002 18:02:17 -0500 (EST)
 | 
						||
Received: from mouse.copelandconsulting.net (mouse.copelandconsulting.net [192.168.1.2])
 | 
						||
	by CopelandConsulting.Net (8.10.1/8.10.1) with ESMTP id g2JN0jh13185;
 | 
						||
	Tue, 19 Mar 2002 17:00:45 -0600 (CST)
 | 
						||
X-Trade-Id: <CCC.Tue, 19 Mar 2002 17:00:45 -0600 (CST).Tue, 19 Mar 2002 17:00:45 -0600 (CST).200203192300.g2JN0jh13185.g2JN0jh13185@CopelandConsulting.Net.
 | 
						||
Subject: Re: [HACKERS] Bitmap indexes?
 | 
						||
From: Greg Copeland <greg@CopelandConsulting.Net>
 | 
						||
To: Matthew Kirkwood <matthew@hairy.beasts.org>
 | 
						||
cc: Oleg Bartunov <oleg@sai.msu.su>,
 | 
						||
   PostgresSQL Hackers Mailing List <pgsql-hackers@postgresql.org>
 | 
						||
	<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
 | 
						||
	<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
 | 
						||
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature";
 | 
						||
	boundary="=-Ivchb84S75fOMzJ9DxwK"
 | 
						||
X-Mailer: Evolution/1.0.2 
 | 
						||
Date: 19 Mar 2002 17:00:53 -0600
 | 
						||
Message-ID: <1016578854.14670.450.camel@mouse.copelandconsulting.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
--=-Ivchb84S75fOMzJ9DxwK
 | 
						||
Content-Type: text/plain
 | 
						||
Content-Transfer-Encoding: quoted-printable
 | 
						||
 | 
						||
On Tue, 2002-03-19 at 15:30, Matthew Kirkwood wrote:
> On Tue, 19 Mar 2002, Oleg Bartunov wrote:
>
> Sorry to reply over you, Oleg.
>
> > On 13 Mar 2002, Greg Copeland wrote:
> >
> > > One of the reasons why I originally started following the hackers list is
> > > because I wanted to implement bitmap indexes.  I found in the archives,
> > > the following link, http://www.it.iitb.ernet.in/~rvijay/dbms/proj/, which
> > > was extracted from this,
> > > http://groups.google.com/groups?hl=en&threadm=01C0EF67.5105D2E0.mascarm%40mascari.com&rnum=1&prev=/groups%3Fq%3Dbitmap%2Bindex%2Bgroup:comp.databases.postgresql.hackers%26hl%3Den%26selm%3D01C0EF67.5105D2E0.mascarm%2540mascari.com%26rnum%3D1, archive thread.
>
> For every case I have used a bitmap index on Oracle, a
> partial index[0] made more sense (especially since it
> could usefully be compound).
 | 
						||
 | 
						||
That's very true; however, bitmap indexes are often used where partial
indexes may not work well.  It may be that you were trying to apply the
cure for the wrong disease.  ;)
 | 
						||
 | 
						||
>
 | 
						||
> Our troublesome case (on Oracle) is a table of "events"
 | 
						||
> where maybe fifty to a couple of hundred are "published"
 | 
						||
> (ie. web-visible) at any time.  The events are categorised
 | 
						||
> by sport (about a dozen) and by "event type" (about five).
 | 
						||
> We never really query events except by PK or by sport/type/
 | 
						||
> published.
 | 
						||
 | 
						||
The reason why bitmap indexes are primarily used for DSS and data
warehousing applications is that they are best used on large to
extremely large tables which have low cardinality (e.g., 10,000,000
rows having 200 distinct values).  On top of that, bitmap indexes also
tend to be much smaller than their "standard" cousins.  On large and
very large tables, this can sometimes save gigs in index space alone
(serious space benefit).  Plus, their small index size tends to result
in much less I/O (serious speed benefit).  This, of course, can result
in several orders of magnitude speed improvements when index scans are
required.  As an added bonus, AND, OR, XOR and NOT predicates are
exceptionally fast and, if implemented properly, can even take advantage
of some 64-bit hardware for further speed improvements.  This, of
course, further speeds lookups.  The primary downside is that inserts
and updates to bitmap indexes are very costly (comparatively), which is,
yet again, why they excel in read-only environments (DSS & data
warehousing).
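
To make the AND/OR/NOT point concrete, here is a toy bitmap index in Python.
It is only a sketch of the general idea, not how Oracle or any other RDBMS
implements it; the names build_bitmap_index and rows_of are hypothetical, and
each bitmap is simply a Python integer with bit i set when row i holds the
value.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is set when row i has the value."""
    bitmaps = defaultdict(int)
    for row, value in enumerate(column):
        bitmaps[value] |= 1 << row
    return dict(bitmaps)

def rows_of(bitmap):
    """List the row numbers whose bits are set, in ascending order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Two low-cardinality columns of a six-row table.
sport = ["golf", "tennis", "golf", "rugby", "tennis", "golf"]
published = ["Y", "N", "N", "Y", "Y", "Y"]

by_sport = build_bitmap_index(sport)
by_published = build_bitmap_index(published)
all_rows = (1 << len(sport)) - 1

golf_and_published = by_sport["golf"] & by_published["Y"]   # AND
golf_or_rugby = by_sport["golf"] | by_sport["rugby"]        # OR
not_published = all_rows & ~by_published["Y"]               # NOT
```

Each predicate is a single integer operation over the whole column, which is
where the speed comes from. The cost shows up on the write side: inserting or
updating a row means touching a bit in the bitmap of every affected value.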
 | 
						||
 | 
						||
It should also be noted that RDBMSs, such as Oracle, often use multiple
types of bitmap indexes.  This further impedes insert/update
performance; however, the additional bitmap index types usually allow
for range predicates while still making use of the bitmap index.  If I'm
not mistaken, several other types of bitmaps are available, as well as
many ways to encode and compress (RLE, quad compression, etc.) bitmap
indexes, which further save on an already compact indexing scheme.
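
As a small illustration of why run-length encoding suits these bitmaps, here
is a minimal RLE round trip in Python. It is a generic sketch, not the
encoding any particular product uses; rle_encode and rle_decode are
hypothetical names.

```python
from itertools import groupby

def rle_encode(bits):
    """Collapse a sequence of 0/1 values into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in run)) for bit, run in groupby(bits)]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original list of bits."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits
```

With 200 distinct values, the bitmap for any one value is mostly zeros, so it
collapses into a handful of long runs.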
 | 
						||
 | 
						||
Given the proper problem domain, bitmap indexes can be a big win.
 | 
						||
 | 
						||
>
> We make a bitmap index on "published", and trust Oracle to
> use it correctly, and hope that our other indexes are also
> useful.
>
> On Postgres[1] we would make a partial compound index:
>
> create index ... on events(sport_id,event_type_id)
> where published='Y';
 | 
						||
 | 
						||
 | 
						||
Generally speaking, bitmap indexes will not serve you very well on
tables having low row counts or high cardinality, or where they are
attached to tables which are primarily used in an OLTP capacity.
Situations where you have a low row count and low cardinality or high
row count and high cardinality tend to be better addressed by partial
indexes, which seem to make much more sense.  In your example, it sounds
like you did "the right thing"(tm).  ;)
 | 
						||
 | 
						||
 | 
						||
Greg
 | 
						||
 | 
						||
 | 
						||
--=-Ivchb84S75fOMzJ9DxwK
 | 
						||
Content-Type: application/pgp-signature; name=signature.asc
 | 
						||
Content-Description: This is a digitally signed message part
 | 
						||
 | 
						||
-----BEGIN PGP SIGNATURE-----
 | 
						||
Version: GnuPG v1.0.6 (GNU/Linux)
 | 
						||
Comment: For info see http://www.gnupg.org
 | 
						||
 | 
						||
iD8DBQA8l8Ml4lr1bpbcL6kRAhldAJ9Aoi9dwm1OteZjySfsd1o42trWLACfegQj
 | 
						||
OEV6eO8MnBSlbJMHiQ08gNE=
 | 
						||
=PQvW
 | 
						||
-----END PGP SIGNATURE-----
 | 
						||
 | 
						||
--=-Ivchb84S75fOMzJ9DxwK--
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26157@postgresql.org Tue Aug  6 23:06:34 2002
 | 
						||
Date: Wed, 7 Aug 2002 13:07:38 +1000 (EST)
 | 
						||
From: Gavin Sherry <swm@linuxworld.com.au>
 | 
						||
To: Curt Sampson <cjs@cynic.net>
 | 
						||
cc: pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered
 | 
						||
In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net>
 | 
						||
Message-ID: <Pine.LNX.4.21.0208071259210.13438-100000@linuxworld.com.au>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1357
 | 
						||
 | 
						||
On Wed, 7 Aug 2002, Curt Sampson wrote:
 | 
						||
 | 
						||
> But after doing some benchmarking of various sorts of random reads
 | 
						||
> and writes, it occurred to me that there might be optimizations
 | 
						||
> that could help a lot with this sort of thing. What if, when we've
 | 
						||
> got an index block with a bunch of entries, instead of doing the
 | 
						||
> reads in the order of the entries, we do them in the order of the
 | 
						||
> blocks the entries point to? That would introduce a certain amount
 | 
						||
> of "sequentialness" to the reads that the OS is not capable of
 | 
						||
> introducing (since it can't reschedule the reads you're doing, the
 | 
						||
> way it could reschedule, say, random writes).
 | 
						||
 | 
						||
This sounds more or less like the method employed by Firebird as described
by Ann Harrison to Tom at OSCON (correct me if I get this wrong).

Basically, Firebird populates a bitmap with entries the scan is interested
in. The bitmap is populated in page order so that all entries on the same
heap page can be fetched at once.
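
The reordering being discussed amounts to something like the following
sketch. It assumes heap pointers are (block, offset) pairs; the function name
fetch_plan is hypothetical, and nothing here reflects Firebird's actual
code.

```python
from itertools import groupby

def fetch_plan(index_tids):
    """Group (block, offset) pointers so each heap block is read once, in block order."""
    ordered = sorted(set(index_tids))
    return [
        (block, [offset for _, offset in group])
        for block, group in groupby(ordered, key=lambda tid: tid[0])
    ]
```

Given index entries in key order such as [(7, 2), (3, 1), (7, 1), (3, 4),
(9, 1)], the plan visits block 3, then 7, then 9, instead of bouncing between
blocks 7 and 3.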
 | 
						||
 | 
						||
This is totally different to the way postgres does things and would
 | 
						||
require significant modification to the index access methods.
 | 
						||
 | 
						||
Gavin
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 3: if posting/reading through Usenet, please send an appropriate
 | 
						||
subscribe-nomail command to majordomo@postgresql.org so that your
 | 
						||
message can get through to the mailing list cleanly
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26162@postgresql.org Wed Aug  7 00:42:35 2002
 | 
						||
To: Curt Sampson <cjs@cynic.net>
 | 
						||
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered 
 | 
						||
In-Reply-To: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net> 
 | 
						||
References: <Pine.NEB.4.44.0208071126590.1214-100000@angelic.cynic.net>
 | 
						||
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
 | 
						||
	message dated "Wed, 07 Aug 2002 11:31:32 +0900"
 | 
						||
Date: Wed, 07 Aug 2002 00:41:47 -0400
 | 
						||
Message-ID: <12593.1028695307@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  3063
 | 
						||
 | 
						||
Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> But after doing some benchmarking of various sorts of random reads
 | 
						||
> and writes, it occurred to me that there might be optimizations
 | 
						||
> that could help a lot with this sort of thing. What if, when we've
 | 
						||
> got an index block with a bunch of entries, instead of doing the
 | 
						||
> reads in the order of the entries, we do them in the order of the
 | 
						||
> blocks the entries point to?
 | 
						||
 | 
						||
I thought to myself "didn't I just post something about that?"
 | 
						||
and then realized it was on a different mailing list.  Here ya go
 | 
						||
(and no, this is not the first time around on this list either...)
 | 
						||
 | 
						||
 | 
						||
I am currently thinking that bitmap indexes per se are not all that
 | 
						||
interesting.  What does interest me is bitmapped index lookup, which
 | 
						||
came back into mind after hearing Ann Harrison describe how FireBird/
 | 
						||
InterBase does it.
 | 
						||
 | 
						||
The idea is that you don't scan the index and base table concurrently
 | 
						||
as we presently do it.  Instead, you scan the index and make a list
 | 
						||
of the TIDs of the table tuples you need to visit.  This list can
 | 
						||
be conveniently represented as a sparse bitmap.  After you've finished
 | 
						||
looking at the index, you visit all the required table tuples *in
 | 
						||
physical order* using the bitmap.  This eliminates multiple fetches
 | 
						||
of the same heap page, and can possibly let you get some win from
 | 
						||
sequential access.
 | 
						||
 | 
						||
Once you have built this mechanism, you can then move on to using
 | 
						||
multiple indexes in interesting ways: you can do several indexscans
 | 
						||
in one query and then AND or OR their bitmaps before doing the heap
 | 
						||
scan.  This would allow, for example, "WHERE a = foo and b = bar"
 | 
						||
to be handled by ANDing results from separate indexes on the a and b
 | 
						||
columns, rather than having to choose only one index to use as we do
 | 
						||
now.
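
A minimal sketch of that scheme, under stated assumptions: TIDs are modelled
as (block, offset) pairs, the sparse bitmap is a dict from block number to an
integer bitmask, and all function names are hypothetical. It illustrates the
idea only and is not a proposal for the actual data structure.

```python
def tid_bitmap(tids):
    """Sparse bitmap: heap block number -> int whose bit k marks line pointer k."""
    bitmap = {}
    for block, offset in tids:
        bitmap[block] = bitmap.get(block, 0) | (1 << offset)
    return bitmap

def bitmap_and(a, b):
    """Keep only the TIDs present in both bitmaps."""
    result = {}
    for block in a.keys() & b.keys():
        bits = a[block] & b[block]
        if bits:
            result[block] = bits
    return result

def bitmap_or(a, b):
    """Keep the TIDs present in either bitmap."""
    result = dict(a)
    for block, bits in b.items():
        result[block] = result.get(block, 0) | bits
    return result

def heap_visit_order(bitmap):
    """Yield TIDs in physical order: ascending block, then ascending offset."""
    for block in sorted(bitmap):
        bits, offset = bitmap[block], 0
        while bits:
            if bits & 1:
                yield (block, offset)
            bits >>= 1
            offset += 1
```

For "WHERE a = foo and b = bar", the TIDs returned by the index on a and by
the index on b would each go through tid_bitmap, be combined with bitmap_and,
and the heap would then be read once via heap_visit_order.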
 | 
						||
 | 
						||
Some thoughts about implementation: FireBird's implementation seems
 | 
						||
to depend on an assumption about a fixed number of tuple pointers
 | 
						||
per page.  We don't have that, but we could probably get away with
 | 
						||
just allocating BLCKSZ/sizeof(HeapTupleHeaderData) bits per page.
 | 
						||
Also, the main downside of this approach is that the bitmap could
 | 
						||
get large --- but you could have some logic that causes you to fall
 | 
						||
back to plain sequential scan if you get too many index hits.  (It's
 | 
						||
interesting to think of this as lossy compression of the bitmap...
 | 
						||
which leads to the idea of only being fuzzy in limited areas of the
 | 
						||
bitmap, rather than losing all the information you have.)
 | 
						||
 | 
						||
A possibly nasty issue is that lazy VACUUM has some assumptions in it
 | 
						||
about indexscans holding pins on index pages --- that's what prevents
 | 
						||
it from removing heap tuples that a concurrent indexscan is just about
 | 
						||
to visit.  It might be that there is no problem: even if lazy VACUUM
 | 
						||
removes a heap tuple and someone else then installs a new tuple in that
 | 
						||
same TID slot, you should be okay because the new tuple is too new to
 | 
						||
pass your visibility test.  But I'm not convinced this is safe.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 6: Have you searched our list archives?
 | 
						||
 | 
						||
http://archives.postgresql.org
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26172@postgresql.org Wed Aug  7 02:49:56 2002
 | 
						||
X-Authentication-Warning: rh72.home.ee: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, 
 | 
						||
	   Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <12776.1028697148@sss.pgh.pa.us>
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> 
 | 
						||
	<12776.1028697148@sss.pgh.pa.us>
 | 
						||
X-Mailer: Ximian Evolution 1.0.7 
 | 
						||
Date: 07 Aug 2002 09:46:29 +0500
 | 
						||
Message-ID: <1028695589.2133.11.camel@rh72.home.ee>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1064
 | 
						||
 | 
						||
On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
 | 
						||
> Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> > On Wed, 7 Aug 2002, Tom Lane wrote:
 | 
						||
> >> Also, the main downside of this approach is that the bitmap could
 | 
						||
> >> get large --- but you could have some logic that causes you to fall
 | 
						||
> >> back to plain sequential scan if you get too many index hits.
 | 
						||
> 
 | 
						||
> > Well, what I was thinking of, should the list of TIDs to fetch get too
 | 
						||
> > long, was just to break it down in to chunks.
 | 
						||
> 
 | 
						||
> But then you lose the possibility of combining multiple indexes through
 | 
						||
> bitmap AND/OR steps, which seems quite interesting to me.  If you've
 | 
						||
> visited only a part of each index then you can't apply that concept.
 | 
						||
 | 
						||
When the tuples are small relative to pagesize, you may get some
"compression" by saving just pages and not the actual tids in the
bitmap.
 | 
						||
 | 
						||
-------------
 | 
						||
Hannu
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 2: you can get off all lists at once with the unregister command
 | 
						||
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26166@postgresql.org Wed Aug  7 00:55:52 2002
 | 
						||
Date: Wed, 7 Aug 2002 13:55:41 +0900 (JST)
 | 
						||
From: Curt Sampson <cjs@cynic.net>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>,  <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered 
 | 
						||
In-Reply-To: <12593.1028695307@sss.pgh.pa.us>
 | 
						||
Message-ID: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1840
 | 
						||
 | 
						||
On Wed, 7 Aug 2002, Tom Lane wrote:
 | 
						||
 | 
						||
> I thought to myself "didn't I just post something about that?"
 | 
						||
> and then realized it was on a different mailing list.  Here ya go
 | 
						||
> (and no, this is not the first time around on this list either...)
 | 
						||
 | 
						||
Wow. I'm glad to see you looking at this, because this feature would do
 | 
						||
*so* much for the performance of some of my queries, and really, really
 | 
						||
impress my "billion-row-database" client.
 | 
						||
 | 
						||
> The idea is that you don't scan the index and base table concurrently
 | 
						||
> as we presently do it.  Instead, you scan the index and make a list
 | 
						||
> of the TIDs of the table tuples you need to visit.
 | 
						||
 | 
						||
Right.
 | 
						||
 | 
						||
> Also, the main downside of this approach is that the bitmap could
 | 
						||
> get large --- but you could have some logic that causes you to fall
 | 
						||
> back to plain sequential scan if you get too many index hits.
 | 
						||
 | 
						||
Well, what I was thinking of, should the list of TIDs to fetch get too
 | 
						||
long, was just to break it down into chunks. If you want to limit to,
 | 
						||
say, 1000 TIDs, and your index has 3000, just do the first 1000, then
 | 
						||
the next 1000, then the last 1000. This would still result in much less
 | 
						||
disk head movement and speed the query immensely.
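
A rough sketch of the chunking idea, assuming TIDs are (block, offset) pairs and that each chunk is sorted independently before the heap is visited. The function name chunked_visits is hypothetical.

```python
def chunked_visits(index_tids, chunk_size):
    """Yield the TIDs in batches of at most chunk_size, each batch sorted
    into physical order. Ordering across batches is not preserved."""
    for start in range(0, len(index_tids), chunk_size):
        yield sorted(index_tids[start:start + chunk_size])

# Five index hits processed two at a time.
batches = list(chunked_visits([(9, 1), (2, 3), (5, 1), (1, 1), (8, 2)], 2))
```

Memory stays bounded by the chunk size, at the cost of possibly revisiting a heap page in a later batch.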
 | 
						||
 | 
						||
(BTW, I have verified this empirically during testing of random read vs.
 | 
						||
random write on a RAID controller. The writes were 5-10 times faster
 | 
						||
than the reads because the controller was caching a number of writes and
 | 
						||
then doing them in the best possible order, whereas the reads had to be
 | 
						||
satisfied in the order they were submitted to the controller.)
 | 
						||
 | 
						||
cjs
 | 
						||
-- 
 | 
						||
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | 
						||
    Don't you know, in this new Dark Age, we're all light.  --XTC
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 5: Have you checked our extensive FAQ?
 | 
						||
 | 
						||
http://www.postgresql.org/users-lounge/docs/faq.html
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26167@postgresql.org Wed Aug  7 01:12:54 2002
 | 
						||
To: Curt Sampson <cjs@cynic.net>
 | 
						||
cc: mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered 
 | 
						||
In-Reply-To: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> 
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
 | 
						||
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
 | 
						||
	message dated "Wed, 07 Aug 2002 13:55:41 +0900"
 | 
						||
Date: Wed, 07 Aug 2002 01:12:28 -0400
 | 
						||
Message-ID: <12776.1028697148@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1428
 | 
						||
 | 
						||
Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> On Wed, 7 Aug 2002, Tom Lane wrote:
 | 
						||
>> Also, the main downside of this approach is that the bitmap could
 | 
						||
>> get large --- but you could have some logic that causes you to fall
 | 
						||
>> back to plain sequential scan if you get too many index hits.
 | 
						||
 | 
						||
> Well, what I was thinking of, should the list of TIDs to fetch get too
 | 
						||
> long, was just to break it down in to chunks.
 | 
						||
 | 
						||
But then you lose the possibility of combining multiple indexes through
 | 
						||
bitmap AND/OR steps, which seems quite interesting to me.  If you've
 | 
						||
visited only a part of each index then you can't apply that concept.
 | 
						||
 | 
						||
Another point to keep in mind is that the bigger the bitmap gets, the
 | 
						||
less useful an indexscan is, by definition --- sooner or later you might
 | 
						||
as well fall back to a seqscan.  So the idea of lossy compression of a
 | 
						||
large bitmap seems really ideal to me.  In principle you could seqscan
 | 
						||
the parts of the table where matching tuples are thick on the ground,
 | 
						||
and indexscan the parts where they ain't.  Maybe this seems natural
 | 
						||
to me as an old JPEG campaigner, but if you don't see the logic I
 | 
						||
recommend thinking about it a little ...
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 3: if posting/reading through Usenet, please send an appropriate
 | 
						||
subscribe-nomail command to majordomo@postgresql.org so that your
 | 
						||
message can get through to the mailing list cleanly
 | 
						||
 | 
						||
From tgl@sss.pgh.pa.us Wed Aug  7 09:27:05 2002
 | 
						||
To: Hannu Krosing <hannu@tm.ee>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, 
 | 
						||
	   Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered 
 | 
						||
In-Reply-To: <1028726966.13418.12.camel@taru.tm.ee> 
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee>
 | 
						||
Comments: In-reply-to Hannu Krosing <hannu@tm.ee>
 | 
						||
	message dated "07 Aug 2002 15:29:26 +0200"
 | 
						||
Date: Wed, 07 Aug 2002 09:26:42 -0400
 | 
						||
Message-ID: <15010.1028726802@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
Content-Length:  1120
 | 
						||
 | 
						||
Hannu Krosing <hannu@tm.ee> writes:
 | 
						||
> Now I remembered my original preference for page bitmaps (vs. tuple
 | 
						||
> bitmaps): one can't actually make good use of a bitmap of tuples because
 | 
						||
> there is no fixed tuples/page ratio and thus no way to quickly go from
 | 
						||
> bit position to actual tuple. You mention the same problem but propose a
 | 
						||
> different solution.
 | 
						||
 | 
						||
> Using page bitmap, we will at least avoid fetching any unneeded pages -
 | 
						||
> essentially we will have a sequential scan over possibly interesting
 | 
						||
> pages.
 | 
						||
 | 
						||
Right.  One form of the "lossy compression" idea I suggested is to
 | 
						||
switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
 | 
						||
too large to work with.  Again, one could imagine doing that only in
 | 
						||
denser areas of the bitmap.
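
One way to picture "lossy only in the denser areas" is the sketch below. It is an assumption-laden illustration, not a proposed design: the class name, the per-page threshold, and the use of None to mean "recheck the whole page" are all invented here.

```python
class LossyTidBitmap:
    """Exact offsets per page until a page gets too dense, then page-level only."""

    def __init__(self, max_offsets_per_page):
        self.max_offsets_per_page = max_offsets_per_page  # hypothetical knob
        self.exact = {}      # block -> set of matching offsets
        self.lossy = set()   # blocks where we only know "something matches"

    def add(self, block, offset):
        if block in self.lossy:
            return                       # already degraded to per-page
        offsets = self.exact.setdefault(block, set())
        offsets.add(offset)
        if len(offsets) > self.max_offsets_per_page:
            del self.exact[block]        # drop the detail for this page only
            self.lossy.add(block)

    def pages(self):
        """Yield (block, offsets) in physical order; offsets is None for
        lossy pages, meaning every tuple on the page must be rechecked."""
        for block in sorted(self.exact.keys() | self.lossy):
            if block in self.lossy:
                yield block, None
            else:
                yield block, sorted(self.exact[block])

bm = LossyTidBitmap(max_offsets_per_page=2)
for tid in [(4, 1), (4, 2), (4, 3), (9, 7), (4, 8)]:
    bm.add(*tid)
result = list(bm.pages())
```

Page 4 collects more than two hits and degrades to a page-level entry, while page 9 keeps its exact offset.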
 | 
						||
 | 
						||
> But I guess that CLUSTER support for INSERT will not be touched for 7.3
 | 
						||
> as will real bitmap indexes ;)
 | 
						||
 | 
						||
All of this is far-future work I think.  Adding a new scan type to the
 | 
						||
executor would probably be pretty localized, but the ramifications in
 | 
						||
the planner could be extensive --- especially if you want to do plans
 | 
						||
involving ANDed or ORed bitmaps.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26178@postgresql.org Wed Aug  7 08:28:14 2002
 | 
						||
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Hannu Krosing <hannu@tm.ee>
 | 
						||
cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>, 
 | 
						||
	   mark Kirkwood <markir@slithery.org>, Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <1028695589.2133.11.camel@rh72.home.ee>
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> 
 | 
						||
	<12776.1028697148@sss.pgh.pa.us>  <1028695589.2133.11.camel@rh72.home.ee>
 | 
						||
X-Mailer: Ximian Evolution 1.0.3.99 
 | 
						||
Date: 07 Aug 2002 15:29:26 +0200
 | 
						||
Message-ID: <1028726966.13418.12.camel@taru.tm.ee>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1837
 | 
						||
 | 
						||
On Wed, 2002-08-07 at 06:46, Hannu Krosing wrote:
 | 
						||
> On Wed, 2002-08-07 at 10:12, Tom Lane wrote:
 | 
						||
> > Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> > > On Wed, 7 Aug 2002, Tom Lane wrote:
 | 
						||
> > >> Also, the main downside of this approach is that the bitmap could
 | 
						||
> > >> get large --- but you could have some logic that causes you to fall
 | 
						||
> > >> back to plain sequential scan if you get too many index hits.
 | 
						||
> > 
 | 
						||
> > > Well, what I was thinking of, should the list of TIDs to fetch get too
 | 
						||
> > > long, was just to break it down in to chunks.
 | 
						||
> > 
 | 
						||
> > But then you lose the possibility of combining multiple indexes through
 | 
						||
> > bitmap AND/OR steps, which seems quite interesting to me.  If you've
 | 
						||
> > visited only a part of each index then you can't apply that concept.
 | 
						||
> 
 | 
						||
> When the tuples are small relative to pagesize, you may get some
 | 
						||
> "compression" by saving just pages and not the actual tids in the the
 | 
						||
> bitmap.
 | 
						||
 | 
						||
Now I remembered my original preference for page bitmaps (vs. tuple
 | 
						||
bitmaps): one can't actually make good use of a bitmap of tuples because
 | 
						||
there is no fixed tuples/page ratio and thus no way to quickly go from
 | 
						||
bit position to actual tuple. You mention the same problem but propose a
 | 
						||
different solution.
 | 
						||
 | 
						||
Using page bitmap, we will at least avoid fetching any unneeded pages -
 | 
						||
essentially we will have a sequential scan over possibly interesting
 | 
						||
pages.
 | 
						||
 | 
						||
If we were to use page-bitmap index for something with only a few values
 | 
						||
like booleans, some insert-time local clustering should be useful, so
 | 
						||
that TRUEs and FALSEs end up on different pages.
 | 
						||
 | 
						||
But I guess that CLUSTER support for INSERT will not be touched for 7.3
 | 
						||
as will real bitmap indexes ;)
 | 
						||
 | 
						||
---------------
 | 
						||
Hannu
 | 
						||
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26192@postgresql.org Wed Aug  7 10:26:30 2002
 | 
						||
To: Hannu Krosing <hannu@tm.ee>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, 
 | 
						||
	   Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered 
 | 
						||
In-Reply-To: <1028733234.13418.113.camel@taru.tm.ee> 
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net> <12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee> <1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us> <1028733234.13418.113.camel@taru.tm.ee>
 | 
						||
Comments: In-reply-to Hannu Krosing <hannu@tm.ee>
 | 
						||
	message dated "07 Aug 2002 17:13:54 +0200"
 | 
						||
Date: Wed, 07 Aug 2002 10:26:13 -0400
 | 
						||
Message-ID: <15622.1028730373@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  1224
 | 
						||
 | 
						||
Hannu Krosing <hannu@tm.ee> writes:
 | 
						||
> On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
 | 
						||
>> Right.  One form of the "lossy compression" idea I suggested is to
 | 
						||
>> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
 | 
						||
>> too large to work with.  
 | 
						||
 | 
						||
> If it is a real bitmap, should it not be easyeast to allocate at the
 | 
						||
> start ?
 | 
						||
 | 
						||
But it isn't a "real bitmap".  That would be a really poor
 | 
						||
implementation, both for space and speed --- do you really want to scan
 | 
						||
over a couple of megs of zeroes to find the few one-bits you care about,
 | 
						||
in the typical case?  "Bitmap" is a convenient term because it describes
 | 
						||
the abstract behavior we want, but the actual data structure will
 | 
						||
probably be nontrivial.  If I recall Ann's description correctly,
 | 
						||
Firebird's implementation uses run length coding of some kind (anyone
 | 
						||
care to dig in their source and get all the details?).  If we tried
 | 
						||
anything in the way of lossy compression then there'd be even more stuff
 | 
						||
lurking under the hood.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M26188@postgresql.org Wed Aug  7 10:12:26 2002
 | 
						||
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, 
 | 
						||
	   Gavin Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <15010.1028726802@sss.pgh.pa.us>
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
 | 
						||
	<12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee>
 | 
						||
	<1028726966.13418.12.camel@taru.tm.ee>  <15010.1028726802@sss.pgh.pa.us>
 | 
						||
X-Mailer: Ximian Evolution 1.0.3.99 
 | 
						||
Date: 07 Aug 2002 17:13:54 +0200
 | 
						||
Message-ID: <1028733234.13418.113.camel@taru.tm.ee>
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by AMaViS new-20020517
 | 
						||
Content-Length:  2812
 | 
						||
 | 
						||
On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
 | 
						||
> Hannu Krosing <hannu@tm.ee> writes:
 | 
						||
> > Now I remembered my original preference for page bitmaps (vs. tuple
 | 
						||
> > bitmaps): one can't actually make good use of a bitmap of tuples because
 | 
						||
> > there is no fixed tuples/page ratio and thus no way to quickly go from
 | 
						||
> > bit position to actual tuple. You mention the same problem but propose a
 | 
						||
> > different solution.
 | 
						||
> 
 | 
						||
> > Using page bitmap, we will at least avoid fetching any unneeded pages -
 | 
						||
> > essentially we will have a sequential scan over possibly interesting
 | 
						||
> > pages.
 | 
						||
> 
 | 
						||
> Right.  One form of the "lossy compression" idea I suggested is to
 | 
						||
> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
 | 
						||
> too large to work with.  
 | 
						||
 | 
						||
If it is a real bitmap, should it not be easiest to allocate at the
 | 
						||
start ?
 | 
						||
 | 
						||
A page bitmap for a 100 000 000 tuple table with 10 tuples/page will be
 | 
						||
sized 10000000/8 = 1.25 MB, which does not look too big for me for that
 | 
						||
amount of data (the data table itself would occupy 80 GB).
 | 
						||
 | 
						||
Even having the bitmap of 16 bits/page (with the bits 0-14 meaning
 | 
						||
tuples 0-14 and bit 15 meaning "seq scan the rest of page") would
 | 
						||
consume just 20 MB of _local_ memory, and would be quite justifiable
 | 
						||
for a query on a table that large.
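
The figures above can be checked with a few lines of arithmetic. The only assumption added here is an 8 kB page (8192 bytes), which is what makes the table come out at roughly 80 GB; the variable names are illustrative.

```python
TUPLES = 100_000_000
TUPLES_PER_PAGE = 10
PAGE_BYTES = 8192                       # assumed 8 kB page size

pages = TUPLES // TUPLES_PER_PAGE       # number of heap pages
one_bit_per_page_bytes = pages // 8     # 1 bit per page, 8 bits per byte
sixteen_bits_per_page_bytes = pages * 2 # 16 bits per page = 2 bytes
table_bytes = pages * PAGE_BYTES        # size of the heap itself
```

That gives 10 000 000 pages, 1.25 MB for the one-bit-per-page bitmap, 20 MB for the 16-bits-per-page variant, and about 82 * 10^9 bytes for the table.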
 | 
						||
 | 
						||
For a real bitmap index the tuples-per-page should be a user-supplied
 | 
						||
tuning parameter.
 | 
						||
 | 
						||
> Again, one could imagine doing that only in denser areas of the bitmap.
 | 
						||
 | 
						||
I would hardly call the resulting structure "a bitmap" ;)
 | 
						||
 | 
						||
And I'm not sure the overhead for a more complex structure would win us
 | 
						||
any additional performance for most cases.
 | 
						||
 | 
						||
> > But I guess that CLUSTER support for INSERT will not be touched for 7.3
 | 
						||
> > as will real bitmap indexes ;)
 | 
						||
> 
 | 
						||
> All of this is far-future work I think. 
 | 
						||
 | 
						||
After we do that we will probably be able to claim support for
 | 
						||
"datawarehousing" ;)
 | 
						||
 | 
						||
> Adding a new scan type to the
 | 
						||
> executor would probably be pretty localized, but the ramifications in
 | 
						||
> the planner could be extensive --- especially if you want to do plans
 | 
						||
> involving ANDed or ORed bitmaps.
 | 
						||
 | 
						||
Also going to "smart inserter" which can do local clustering on sets of
 | 
						||
real bitmap indexes for INSERTS (and INSERT side of UPDATE) would
 | 
						||
probably be a major change from our current "stupid inserter" ;)
 | 
						||
 | 
						||
This will not be needed for bitmap resolution higher than 1bit/page but
 | 
						||
default local clustering on bitmap indexes will probably buy us some
 | 
						||
extra performance by avoiding data page fetches when such indexes are
 | 
						||
used.
 | 
						||
 | 
						||
And anyway, the support for INSERT being aware of clustering will probably
 | 
						||
come up sometime.
 | 
						||
 | 
						||
------------
 | 
						||
Hannu
 | 
						||
 | 
						||
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From hannu@tm.ee Wed Aug  7 11:22:53 2002
 | 
						||
X-Authentication-Warning: taru.tm.ee: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] CLUSTER and indisclustered
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, mark Kirkwood <markir@slithery.org>, 
 | 
						||
	   Gavin 
 | 
						||
	 Sherry <swm@linuxworld.com.au>, 
 | 
						||
	   Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <15622.1028730373@sss.pgh.pa.us>
 | 
						||
References: <Pine.NEB.4.44.0208071351440.1214-100000@angelic.cynic.net>
 | 
						||
	<12776.1028697148@sss.pgh.pa.us> <1028695589.2133.11.camel@rh72.home.ee>
 | 
						||
	<1028726966.13418.12.camel@taru.tm.ee> <15010.1028726802@sss.pgh.pa.us>
 | 
						||
	<1028733234.13418.113.camel@taru.tm.ee>  <15622.1028730373@sss.pgh.pa.us>
 | 
						||
X-Mailer: Ximian Evolution 1.0.3.99 
 | 
						||
Date: 07 Aug 2002 18:24:30 +0200
 | 
						||
Message-ID: <1028737470.13419.182.camel@taru.tm.ee>
 | 
						||
Content-Length:  2382
 | 
						||
 | 
						||
On Wed, 2002-08-07 at 16:26, Tom Lane wrote:
 | 
						||
> Hannu Krosing <hannu@tm.ee> writes:
 | 
						||
> > On Wed, 2002-08-07 at 15:26, Tom Lane wrote:
 | 
						||
> >> Right.  One form of the "lossy compression" idea I suggested is to
 | 
						||
> >> switch from a per-tuple bitmap to a per-page bitmap once the bitmap gets
 | 
						||
> >> too large to work with.  
 | 
						||
> 
 | 
						||
> > If it is a real bitmap, should it not be easyeast to allocate at the
 | 
						||
> > start ?
 | 
						||
> 
 | 
						||
> But it isn't a "real bitmap".  That would be a really poor
 | 
						||
> implementation, both for space and speed --- do you really want to scan
 | 
						||
> over a couple of megs of zeroes to find the few one-bits you care about,
 | 
						||
> in the typical case?
 | 
						||
 | 
						||
I guess that depends on data. The typical case should be something the
 | 
						||
stats process will find out so the optimiser can use it.
 | 
						||
 | 
						||
The bitmap must be less than 1/48 (size of TID) full for best
 | 
						||
uncompressed "active-tid-list" to be smaller than plain bitmap. If there
 | 
						||
were some structure above list then this ratio would be even higher.
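
The 1/48 break-even can be spelled out as follows, taking a TID as 48 bits (6 bytes) and ignoring any per-list or per-bitmap overhead. The function name is hypothetical.

```python
TID_BITS = 48   # a TID is 6 bytes

def smaller_representation(slots, hits):
    """Compare a plain bitmap (one bit per tuple slot) against an
    uncompressed list of TIDs (48 bits per hit)."""
    bitmap_bits = slots
    list_bits = hits * TID_BITS
    return "tid-list" if list_bits < bitmap_bits else "bitmap"

# With 4800 slots the crossover sits at 4800 / 48 = 100 hits.
below = smaller_representation(4800, 99)
at = smaller_representation(4800, 100)
above = smaller_representation(4800, 101)
```

Below 100 hits the list is smaller; at or above it the plain bitmap is no larger.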
 | 
						||
 | 
						||
I have had good experience using "compressed delta lists", which will
 | 
						||
scale well over the whole "fullness" spectrum of bitmap, but this is for
 | 
						||
storage, not for initial constructing of lists.
 | 
						||
 | 
						||
>  "Bitmap" is a convenient term because it describes
 | 
						||
> the abstract behavior we want, but the actual data structure will
 | 
						||
> probably be nontrivial.  If I recall Ann's description correctly,
 | 
						||
> Firebird's implementation uses run length coding of some kind (anyone
 | 
						||
> care to dig in their source and get all the details?).
 | 
						||
 | 
						||
Plain RLL is probably a good way to store it and for merging two or more
 | 
						||
bitmaps, but not as good for constructing it bit-by-bit. I guess the
 | 
						||
most effective structure for updating is often still a plain bitmap
 | 
						||
(maybe not if it is very sparse and all of it does not fit in cache),
 | 
						||
followed by some kind of balanced tree (maybe rb-tree).
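
To make the storage-versus-merging point concrete, here is a small run-length sketch. It is an illustration only and says nothing about how Firebird actually does it; the [value, run_length] layout and both function names are assumptions.

```python
def rle_encode(bits):
    """Encode a sequence of 0/1 values as [value, run_length] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

def rle_and(a, b):
    """AND two run-length bitmaps of equal total length without expanding them."""
    out = []
    i = j = 0
    left_a = a[0][1] if a else 0   # bits remaining in the current run of a
    left_b = b[0][1] if b else 0   # bits remaining in the current run of b
    while i < len(a) and j < len(b):
        step = min(left_a, left_b)
        value = a[i][0] & b[j][0]
        if out and out[-1][0] == value:
            out[-1][1] += step     # extend the previous output run
        else:
            out.append([value, step])
        left_a -= step
        left_b -= step
        if left_a == 0:
            i += 1
            left_a = a[i][1] if i < len(a) else 0
        if left_b == 0:
            j += 1
            left_b = b[j][1] if j < len(b) else 0
    return out

x = [0, 0, 1, 1, 1, 0, 1, 1]
y = [0, 1, 1, 0, 1, 1, 1, 0]
merged = rle_and(rle_encode(x), rle_encode(y))
```

The merge walks both run lists once, which is why this form suits combining bitmaps. Setting a single bit in the middle of a run means splitting runs, which is the awkward part for bit-by-bit construction.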
 | 
						||
 | 
						||
If the bitmap is relatively full then the plain bitmap is almost always
 | 
						||
the most effective to update.
 | 
						||
 | 
						||
> If we tried anything in the way of lossy compression then there'd
 | 
						||
> be even more stuff lurking under the hood.
 | 
						||
 | 
						||
Having three-valued (0,1,maybe) RLL-encoded "tritmap" would be a good
 | 
						||
way to represent lossy compression, and it would also be quite
 | 
						||
straightforward to merge two of these using AND or OR. It may even be
 | 
						||
possible to easily construct it using a fixed-length b-tree and going
 | 
						||
from 1 to "maybe" for nodes that get too dense.
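
The merge rules for such a three-valued map can be sketched like this. The run-length layer is left out for brevity, and the numeric encoding (0 = no, 1 = maybe, 2 = yes) is an assumption chosen so that AND and OR reduce to min and max.

```python
NO, MAYBE, YES = 0, 1, 2   # hypothetical encoding of the three states

def trit_and(a, b):
    # NO wins over everything; YES only if both sides are YES.
    return min(a, b)

def trit_or(a, b):
    # YES wins over everything; NO only if both sides are NO.
    return max(a, b)

def merge(xs, ys, op):
    """Combine two equal-length tritmaps element by element."""
    return [op(x, y) for x, y in zip(xs, ys)]

left = [YES, MAYBE, NO, YES]
right = [MAYBE, MAYBE, YES, YES]
anded = merge(left, right, trit_and)
ored = merge(left, right, trit_or)
```

Any position that comes out as MAYBE still has to be rechecked against the heap tuple, which is the price of the lossy representation.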
 | 
						||
 | 
						||
---------------
 | 
						||
Hannu
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M21991@postgresql.org Wed Apr 24 23:37:37 2002
 | 
						||
Return-path: <pgsql-hackers-owner+M21991@postgresql.org>
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3ba416337
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:37:36 -0400 (EDT)
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by postgresql.org (Postfix) with SMTP
 | 
						||
	id CF13447622B; Wed, 24 Apr 2002 23:37:31 -0400 (EDT)
 | 
						||
Received: from sraigw.sra.co.jp (sraigw.sra.co.jp [202.32.10.2])
 | 
						||
	by postgresql.org (Postfix) with ESMTP id 3EE92474E4B
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 24 Apr 2002 23:37:19 -0400 (EDT)
 | 
						||
Received: from srascb.sra.co.jp (srascb [133.137.8.65])
 | 
						||
	by sraigw.sra.co.jp (8.9.3/3.7W-sraigw) with ESMTP id MAA76393;
 | 
						||
	Thu, 25 Apr 2002 12:35:44 +0900 (JST)
 | 
						||
Received: (from root@localhost)
 | 
						||
	by srascb.sra.co.jp (8.11.6/8.11.6) id g3P3ZCK64299;
 | 
						||
	Thu, 25 Apr 2002 12:35:12 +0900 (JST)
 | 
						||
	(envelope-from t-ishii@sra.co.jp)
 | 
						||
Received: from sranhm.sra.co.jp (sranhm [133.137.170.62])
 | 
						||
	by srascb.sra.co.jp (8.11.6/8.11.6av) with ESMTP id g3P3ZBV64291;
 | 
						||
	Thu, 25 Apr 2002 12:35:11 +0900 (JST)
 | 
						||
	(envelope-from t-ishii@sra.co.jp)
 | 
						||
Received: from localhost (IDENT:t-ishii@srapc1474.sra.co.jp [133.137.170.59])
 | 
						||
	by sranhm.sra.co.jp (8.9.3+3.2W/3.7W-srambox) with ESMTP id MAA25562;
 | 
						||
	Thu, 25 Apr 2002 12:35:43 +0900
 | 
						||
To: tgl@sss.pgh.pa.us
 | 
						||
cc: cjs@cynic.net, pgman@candle.pha.pa.us, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
 | 
						||
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
 | 
						||
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
 | 
						||
	<12342.1019705420@sss.pgh.pa.us>
 | 
						||
X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.1
 | 
						||
	=?iso-2022-jp?B?KBskQjAqGyhCKQ==?=
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: Text/Plain; charset=us-ascii
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Message-ID: <20020425123429E.t-ishii@sra.co.jp>
 | 
						||
Date: Thu, 25 Apr 2002 12:34:29 +0900
 | 
						||
From: Tatsuo Ishii <t-ishii@sra.co.jp>
 | 
						||
X-Dispatcher: imput version 20000228(IM140)
 | 
						||
Lines: 12
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
> Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
 | 
						||
> > *too* big and you use the data. A single 64K read takes very little
 | 
						||
> > longer than a single 8K read.
 | 
						||
> 
 | 
						||
> Proof?
 | 
						||
 | 
						||
A long time ago I tested with a 32k block size and got a 1.5-2x speedup
 | 
						||
compared with the ordinary 8k block size in the sequential scan case.
 | 
						||
FYI, if this is the case.
 | 
						||
--
 | 
						||
Tatsuo Ishii
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From mloftis@wgops.com Thu Apr 25 01:43:14 2002
 | 
						||
Return-path: <mloftis@wgops.com>
 | 
						||
Received: from free.wgops.com (root@dsl092-002-178.sfo1.dsl.speakeasy.net [66.92.2.178])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P5hC426529
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 01:43:13 -0400 (EDT)
 | 
						||
Received: from wgops.com ([10.1.2.207])
 | 
						||
	by free.wgops.com (8.11.3/8.11.3) with ESMTP id g3P5hBR43020;
 | 
						||
	Wed, 24 Apr 2002 22:43:11 -0700 (PDT)
 | 
						||
	(envelope-from mloftis@wgops.com)
 | 
						||
Message-ID: <3CC7976F.7070407@wgops.com>
 | 
						||
Date: Wed, 24 Apr 2002 22:43:11 -0700
 | 
						||
From: Michael Loftis <mloftis@wgops.com>
 | 
						||
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4.1) Gecko/20020314 Netscape6/6.2.2
 | 
						||
X-Accept-Language: en-us
 | 
						||
MIME-Version: 1.0
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
 | 
						||
References: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net> <12342.1019705420@sss.pgh.pa.us>
 | 
						||
Content-Type: text/plain; charset=us-ascii; format=flowed
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
 | 
						||
Tom Lane wrote:
 | 
						||
 | 
						||
>Curt Sampson <cjs@cynic.net> writes:
 | 
						||
>
 | 
						||
>>Grabbing bigger chunks is always optimal, AFICT, if they're not
 | 
						||
>>*too* big and you use the data. A single 64K read takes very little
 | 
						||
>>longer than a single 8K read.
 | 
						||
>>
 | 
						||
>
 | 
						||
>Proof?
 | 
						||
>
 | 
						||
I contend this statement.
 | 
						||
 | 
						||
It's optimal to a point.  I know that my system settles into its best
 | 
						||
read-speeds @ 32K or 64K chunks.  8K chunks are far below optimal for my 
 | 
						||
system.  Most systems I work on do far better at 16K than at 8K, and 
 | 
						||
most don't see any degradation when going to 32K chunks.  (this is 
 | 
						||
across numerous OSes and configs -- results are interpretations from 
 | 
						||
bonnie disk i/o marks).
 | 
						||
 | 
						||
Depending on what you're doing it is more efficient to read bigger
 | 
						||
blocks up to a point.  If you're multi-thread or reading in non-blocking 
 | 
						||
mode, take as big a chunk as you can handle or are ready to process in 
 | 
						||
quick order.  If you're picking up a bunch of little chunks here and 
 | 
						||
there and know you're not using them again, then choose a size that will
 | 
						||
hopefully cause some of the reads to overlap; failing that, pick the
 | 
						||
smallest usable read size.
 | 
						||
 | 
						||
The OS can never do that stuff for you.
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From cjs@cynic.net Thu Apr 25 03:29:05 2002
 | 
						||
Return-path: <cjs@cynic.net>
 | 
						||
Received: from angelic.cynic.net ([202.232.117.21])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7T3404027
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:29:03 -0400 (EDT)
 | 
						||
Received: from localhost (localhost [127.0.0.1])
 | 
						||
	by angelic.cynic.net (Postfix) with ESMTP
 | 
						||
	id 1C44E870E; Thu, 25 Apr 2002 16:28:51 +0900 (JST)
 | 
						||
Date: Thu, 25 Apr 2002 16:28:51 +0900 (JST)
 | 
						||
From: Curt Sampson <cjs@cynic.net>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
 | 
						||
In-Reply-To: <12342.1019705420@sss.pgh.pa.us>
 | 
						||
Message-ID: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						||
Status: OR
 | 
						||
 | 
						||
On Wed, 24 Apr 2002, Tom Lane wrote:
 | 
						||
 | 
						||
> Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> > Grabbing bigger chunks is always optimal, AFICT, if they're not
 | 
						||
> > *too* big and you use the data. A single 64K read takes very little
 | 
						||
> > longer than a single 8K read.
 | 
						||
>
 | 
						||
> Proof?
 | 
						||
 | 
						||
Well, there are various sorts of "proof" for this assertion. What
 | 
						||
sort do you want?
 | 
						||
 | 
						||
Here are a few samples; if you're looking for something different to
 | 
						||
satisfy you, let's discuss it.
 | 
						||
 | 
						||
1. Theoretical proof: two components of the delay in retrieving a
 | 
						||
block from disk are the disk arm movement and the wait for the
 | 
						||
right block to rotate under the head.
 | 
						||
 | 
						||
When retrieving, say, eight adjacent blocks, these will be spread
 | 
						||
across no more than two cylinders (with luck, only one). The worst
 | 
						||
case access time for a single block is the disk arm movement plus
 | 
						||
the full rotational wait; this is the same as the worst case for
 | 
						||
eight blocks if they're all on one cylinder. If they're not on one
 | 
						||
cylinder, they're still on adjacent cylinders, requiring a very
 | 
						||
short seek.
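
The argument above can be put into numbers with a very small model. This is only a sketch: the seek time, spindle speed, and per-block transfer time below are illustrative assumptions, not measurements from the thread, and the transfer term is added here to account for the extra data moved.

```python
# Simple model of one disk request: seek + rotational wait + transfer.
# All three timing constants are illustrative assumptions, not measurements.
SEEK_MS = 8.0                  # assumed arm movement time
ROTATION_MS = 60_000 / 5400    # one full revolution of an assumed 5400 rpm disk
TRANSFER_MS_PER_BLOCK = 0.5    # assumed time to stream one 8K block


def worst_case_ms(blocks: int) -> float:
    """Worst-case time for one request of `blocks` adjacent 8K blocks,
    assuming they all sit on one cylinder (one seek, one full rotation)."""
    return SEEK_MS + ROTATION_MS + blocks * TRANSFER_MS_PER_BLOCK


one = worst_case_ms(1)
eight = worst_case_ms(8)
print(f"1 block:  {one:.2f} ms/request, {one:.2f} ms/block")
print(f"8 blocks: {eight:.2f} ms/request, {eight / 8:.2f} ms/block")
```

Under these assumptions the seek and rotational wait are paid once per request, so only the transfer term grows with the number of blocks.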
 | 
						||
 | 
						||
2. Proof by others using it: MS SQL Server uses 64K reads when doing
 | 
						||
table scans, as they say that their research indicates that the
 | 
						||
major limitation is usually the number of I/O requests, not the
 | 
						||
I/O capacity of the disk. BSD's file system explicitly separates the optimum
 | 
						||
allocation size for storage (1K fragments) and optimum read size
 | 
						||
(8K blocks) because they found performance to be much better when
 | 
						||
a larger size block was read. Most file system vendors, too, do
 | 
						||
read-ahead for this very reason.
 | 
						||
 | 
						||
3. Proof by testing. I wrote a little ruby program to seek to a
 | 
						||
random point in the first 2 GB of my raw disk partition and read
 | 
						||
1-8 8K blocks of data. (This was done as one I/O request.) (Using
 | 
						||
the raw disk partition I avoid any filesystem buffering.) Here are
 | 
						||
typical results:
 | 
						||
 | 
						||
 125 reads of 16x8K blocks: 1.9 sec, 66.04 req/sec. 15.1 ms/req, 0.946 ms/block
 | 
						||
 250 reads of  8x8K blocks: 1.9 sec, 132.3 req/sec. 7.56 ms/req, 0.945 ms/block
 | 
						||
 500 reads of  4x8K blocks: 2.5 sec, 199 req/sec.   5.03 ms/req, 1.26 ms/block
 | 
						||
1000 reads of  2x8K blocks: 3.8 sec, 261.6 req/sec. 3.82 ms/req, 1.91 ms/block
 | 
						||
2000 reads of  1x8K blocks: 6.4 sec, 310.4 req/sec. 3.22 ms/req, 3.22 ms/block
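
The original Ruby program is not shown in the thread. A rough Python equivalent of the test as described might look like the sketch below. The function names and the `span_bytes` parameter are hypothetical, and the path to read from is left to the caller. Note that reading a regular file this way goes through the OS cache; the test described above used a raw partition to avoid that.

```python
import os
import random
import time

BLOCK = 8192  # 8K blocks, as in the test described above


def timed_random_reads(path, blocks_per_read, n_reads, span_bytes, seed=None):
    """Seek to random block-aligned offsets within the first `span_bytes`
    and read `blocks_per_read` adjacent blocks with one read call each time.
    Returns the elapsed time in seconds."""
    rng = random.Random(seed)
    size = blocks_per_read * BLOCK
    max_block = (span_bytes - size) // BLOCK
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(n_reads):
            os.lseek(fd, rng.randint(0, max_block) * BLOCK, os.SEEK_SET)
            os.read(fd, size)  # one request for the whole chunk
        return time.perf_counter() - start
    finally:
        os.close(fd)


def report(path, span_bytes, total_blocks=2000):
    """Read the same total number of blocks at each request size."""
    for n in (16, 8, 4, 2, 1):
        reads = total_blocks // n
        secs = timed_random_reads(path, n, reads, span_bytes)
        print(f"{reads} reads of {n}x8K blocks: {secs:.1f} sec, "
              f"{1000 * secs / reads:.2f} ms/req, "
              f"{1000 * secs / (reads * n):.3f} ms/block")
```

Calling `report(path, 2 * 1024**3)` on a raw device would mirror the "first 2 GB" setup, assuming the process has read permission on that device.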
 | 
						||
 | 
						||
The ratios of data retrieval speed per read for groups of adjacent
 | 
						||
8K blocks, assuming a single 8K block reads in 1 time unit, are:
 | 
						||
 | 
						||
    1 block	1.00
 | 
						||
    2 blocks	1.18
 | 
						||
    4 blocks	1.56
 | 
						||
    8 blocks	2.34
 | 
						||
    16 blocks	4.68
 | 
						||
 | 
						||
At less than 20% more expensive, certainly two-block read requests
 | 
						||
could be considered to cost "very little more" than one-block read
 | 
						||
requests. Even four-block read requests are only half-again as
 | 
						||
expensive. And if you know you're really going to be using the
 | 
						||
data, read in 8 block chunks and your cost per block (in terms of
 | 
						||
time) drops to less than a third of the cost of single-block reads.
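
For anyone who wants to check the arithmetic, the ratios can be recomputed from the ms/req figures in the single-process results. This is just a sketch; because the published figures are already rounded, the last digit can differ slightly from the table above.

```python
# ms/req figures copied from the single-process results above
ms_per_req = {1: 3.22, 2: 3.82, 4: 5.03, 8: 7.56, 16: 15.1}

base = ms_per_req[1]
for blocks, ms in ms_per_req.items():
    cost_per_request = ms / base         # relative cost of one request
    cost_per_block = ms / blocks / base  # relative cost of each 8K block
    print(f"{blocks:2d} blocks  request x{cost_per_request:.2f}  "
          f"per-block x{cost_per_block:.2f}")
```

The per-block column is the one behind the "less than a third" claim for 8-block reads.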
 | 
						||
 | 
						||
Let me put paid to comments about multiple simultaneous readers
 | 
						||
making this invalid. Here's a typical result I get with four
 | 
						||
instances of the program running simultaneously:
 | 
						||
 | 
						||
125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
 | 
						||
250 reads of 8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
 | 
						||
500 reads of 4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
 | 
						||
1000 reads of 2x8K blocks: 10 sec, 100.2 req/sec. 9.98 ms/req, 4.99 ms/block
 | 
						||
2000 reads of 1x8K blocks: 18 sec, 110 req/sec. 9.09 ms/req, 9.09 ms/block
 | 
						||
 | 
						||
Here's the ratio table again, with another column comparing the
 | 
						||
aggregate number of requests per second for one process and four
 | 
						||
processes:
 | 
						||
 | 
						||
    1 block	1.00		310 : 440
 | 
						||
    2 blocks	1.10		262 : 401
 | 
						||
    4 blocks	1.28		199 : 346
 | 
						||
    8 blocks	1.69		132 : 260
 | 
						||
    16 blocks	3.89		 66 : 113
 | 
						||
 | 
						||
Note that here the relative increase in performance for increasing
 | 
						||
sizes of reads is even *better* until we get past 64K chunks. The
 | 
						||
overall throughput is better, of course, because with more requests
 | 
						||
per second coming in, the disk seek ordering code has more to work
 | 
						||
with and the average time spent seeking vs. reading will be
 | 
						||
reduced.
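
The aggregate column can be reproduced from the figures above, on the assumption (inferred from the table, not stated outright) that each four-process req/sec figure is the rate seen by one of the four processes. The sketch also converts request rates into data throughput, which the tables do not show directly.

```python
BLOCK_KB = 8

# (blocks per request, req/sec with one process, req/sec per process with four)
results = [
    (1, 310.4, 110.0),
    (2, 261.6, 100.2),
    (4, 199.0, 86.52),
    (8, 132.3, 64.88),
    (16, 66.04, 28.21),
]


def aggregate(per_process_rate, processes):
    """Total requests per second across all processes."""
    return per_process_rate * processes


def throughput_kb(req_per_sec, blocks):
    """Data rate in KB/sec for requests of `blocks` 8K blocks."""
    return req_per_sec * blocks * BLOCK_KB


for blocks, one, four in results:
    agg = aggregate(four, 4)
    print(f"{blocks:2d} blocks  {one:6.1f} : {agg:6.1f} req/sec  "
          f"{throughput_kb(one, blocks):7.0f} : {throughput_kb(agg, blocks):7.0f} KB/sec")
```

Request rate falls as requests get larger, but the amount of data delivered per second rises, which is the point being argued.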
 | 
						||
 | 
						||
You know, this is not rocket science; I'm sure there must be papers
 | 
						||
all over the place about this. If anybody still disagrees that it's
 | 
						||
a good thing to read chunks up to 64K or so when the blocks are
 | 
						||
adjacent and you know you'll need the data, I'd like to see some
 | 
						||
tangible evidence to support that.
 | 
						||
 | 
						||
cjs
 | 
						||
-- 
 | 
						||
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | 
						||
    Don't you know, in this new Dark Age, we're all light.  --XTC
 | 
						||
 | 
						||
 | 
						||
From cjs@cynic.net Thu Apr 25 03:55:59 2002
 | 
						||
Return-path: <cjs@cynic.net>
 | 
						||
Received: from angelic.cynic.net ([202.232.117.21])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P7tv405489
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 03:55:57 -0400 (EDT)
 | 
						||
Received: from localhost (localhost [127.0.0.1])
 | 
						||
	by angelic.cynic.net (Postfix) with ESMTP
 | 
						||
	id 188EC870E; Thu, 25 Apr 2002 16:55:51 +0900 (JST)
 | 
						||
Date: Thu, 25 Apr 2002 16:55:50 +0900 (JST)
 | 
						||
From: Curt Sampson <cjs@cynic.net>
 | 
						||
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
 | 
						||
In-Reply-To: <200204250404.g3P44OI19061@candle.pha.pa.us>
 | 
						||
Message-ID: <Pine.NEB.4.43.0204251636550.3111-100000@angelic.cynic.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						||
Status: OR
 | 
						||
 | 
						||
On Thu, 25 Apr 2002, Bruce Momjian wrote:
 | 
						||
 | 
						||
> Well, we are guilty of trying to push as much as possible on to other
 | 
						||
> software.  We do this for portability reasons, and because we think our
 | 
						||
> time is best spent dealing with db issues, not issues that can be dealt
 | 
						||
> with by other existing software, as long as the software is decent.
 | 
						||
 | 
						||
That's fine. I think that's a perfectly fair thing to do.
 | 
						||
 | 
						||
It was just the wording (i.e., "it's this other software's fault
 | 
						||
that blah de blah") that got to me. To say, "We don't do readahead
 | 
						||
because most OSes supply it, and we feel that other things would
 | 
						||
help more to improve performance," is fine by me. Or even, "Well,
 | 
						||
nobody feels like doing it. You want it, do it yourself," I have
 | 
						||
no problem with.
 | 
						||
 | 
						||
> Sure, that is certainly true.  However, it is hard to know what the
 | 
						||
> future will hold even if we had perfect knowledge of what was happening
 | 
						||
> in the kernel.  We don't know who else is going to start doing I/O once
 | 
						||
> our I/O starts.  We may have a better idea with kernel knowledge, but we
 | 
						||
> still don't know 100% what will be cached.
 | 
						||
 | 
						||
Well, we do if we use raw devices and do our own caching, using
 | 
						||
pages that are pinned in RAM. That was sort of what I was aiming
 | 
						||
at for the long run.
 | 
						||
 | 
						||
> We have free-behind on our list.
 | 
						||
 | 
						||
Uh...can't do it, if you're relying on the OS to do the buffering.
 | 
						||
How do you tell the OS that you're no longer going to use a page?
 | 
						||
 | 
						||
> I think LRU-K will do this quite well
 | 
						||
> and be a nice general solution for more than just sequential scans.
 | 
						||
 | 
						||
LRU-K sounds like a great idea to me, as does putting pages read
 | 
						||
for a table scan at the LRU end of the cache, rather than the MRU
 | 
						||
(assuming we do something to ensure that they stay in cache until
 | 
						||
read once, at any rate).
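
To make the LRU-end idea concrete, here is a toy cache sketch. It is not PostgreSQL's buffer manager; the `BufferCache` class and its `from_scan` flag are hypothetical, and it does not handle the caveat above about keeping a scan page in cache until it has been read once.

```python
from collections import OrderedDict


class BufferCache:
    """Toy page cache. The first entry in the OrderedDict is the next
    eviction victim (the LRU end); the last entry is the MRU end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def get(self, page_id, from_scan=False):
        if page_id in self.pages:
            if not from_scan:
                # Normal hit: page becomes most recently used.
                self.pages.move_to_end(page_id)
            return self.pages[page_id]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # evict from the LRU end
        self.pages[page_id] = f"data for {page_id}"  # stand-in for a disk read
        if from_scan:
            # Table-scan page: put it first in line for eviction.
            self.pages.move_to_end(page_id, last=False)
        return self.pages[page_id]
```

With this policy a long table scan keeps recycling a single slot instead of pushing every other page out of the cache.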
 | 
						||
 | 
						||
But again, great for your own cache, but doesn't work with the OS
 | 
						||
cache. And I'm a bit scared to crank up too high the amount of
 | 
						||
memory I give Postgres, lest the OS try to too aggressively buffer
 | 
						||
all that I/O in what memory remains to it, and start blowing programs
 | 
						||
(like maybe the backend binary itself) out of RAM. But maybe this
 | 
						||
isn't typically a problem; I don't know.
 | 
						||
 | 
						||
> There may be validity in this.  It is easy to do (I think) and could be
 | 
						||
> a win.
 | 
						||
 | 
						||
It didn't look too difficult to me, when I looked at the code, and
 | 
						||
you can see what kind of win it is from the response I just made
 | 
						||
to Tom.
 | 
						||
 | 
						||
> >     1. It is *not* true that you have no idea where data is when
 | 
						||
> >     using a storage array or other similar system. While you
 | 
						||
> >     certainly ought not worry about things such as head positions
 | 
						||
> >     and so on, it's been a given for a long, long time that two
 | 
						||
> >     blocks that have close index numbers are going to be close
 | 
						||
> >     together in physical storage.
 | 
						||
>
 | 
						||
> SCSI drivers, for example, are pretty smart.  Not sure we can take
 | 
						||
> advantage of that from user-land I/O.
 | 
						||
 | 
						||
Looking at the NetBSD ones, I don't see what they're doing that's
 | 
						||
so smart. (Aside from some awfully clever workarounds for stupid
 | 
						||
hardware limitations that would otherwise kill performance.) What
 | 
						||
sorts of "smart" are you referring to?
 | 
						||
 | 
						||
> Yes, but we are seeing some db's moving away from raw I/O.
 | 
						||
 | 
						||
Such as whom? And are you certain that they're moving to using the
 | 
						||
OS buffer cache, too? MS SQL Server, for example, uses the filesystem,
 | 
						||
but turns off all buffering on those files.
 | 
						||
 | 
						||
> Our performance numbers beat most of the big db's already, so we must
 | 
						||
> be doing something right.
 | 
						||
 | 
						||
Really? Do the performance numbers for simple, bulk operations
 | 
						||
(imports, exports, table scans) beat the others handily? My intuition
 | 
						||
says not, but I'll happily be convinced otherwise.
 | 
						||
 | 
						||
> Yes, but do we spend our time doing that.  Is the payoff worth it, vs.
 | 
						||
> working on other features.  Sure it would be great to have all these
 | 
						||
> fancy things, but is this where our time should be spent, considering
 | 
						||
> other items on the TODO list?
 | 
						||
 | 
						||
I agree that these things need to be assessed.
 | 
						||
 | 
						||
> Jumping in and doing the I/O ourselves is a big undertaking, and looking
 | 
						||
> at our TODO list, I am not sure if it is worth it right now.
 | 
						||
 | 
						||
Right. I'm not trying to say this is a critical priority, I'm just
 | 
						||
trying to determine what we do right now, what we could do, and
 | 
						||
the potential performance increase that would give us.
 | 
						||
 | 
						||
cjs
 | 
						||
-- 
 | 
						||
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | 
						||
    Don't you know, in this new Dark Age, we're all light.  --XTC
 | 
						||
 | 
						||
 | 
						||
From cjs@cynic.net Thu Apr 25 05:19:11 2002
 | 
						||
Return-path: <cjs@cynic.net>
 | 
						||
Received: from angelic.cynic.net ([202.232.117.21])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P9J9412878
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 05:19:10 -0400 (EDT)
 | 
						||
Received: from localhost (localhost [127.0.0.1])
 | 
						||
	by angelic.cynic.net (Postfix) with ESMTP
 | 
						||
	id 50386870E; Thu, 25 Apr 2002 18:19:03 +0900 (JST)
 | 
						||
Date: Thu, 25 Apr 2002 18:19:02 +0900 (JST)
 | 
						||
From: Curt Sampson <cjs@cynic.net>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
 | 
						||
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
 | 
						||
Message-ID: <Pine.NEB.4.43.0204251805000.3111-100000@angelic.cynic.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: TEXT/PLAIN; charset=US-ASCII
 | 
						||
Status: OR
 | 
						||
 | 
						||
On Thu, 25 Apr 2002, Curt Sampson wrote:
 | 
						||
 | 
						||
> Here's the ratio table again, with another column comparing the
 | 
						||
> aggregate number of requests per second for one process and four
 | 
						||
> processes:
 | 
						||
>
 | 
						||
 | 
						||
Just for interest, I ran this again with 20 processes working
 | 
						||
simultaneously. I did six runs at each blockread size and summed
 | 
						||
the tps for each process to find the aggregate number of reads per
 | 
						||
second during the test. I dropped the highest and the lowest ones,
 | 
						||
and averaged the rest. Here's the new table:
 | 
						||
 | 
						||
		1 proc	4 procs	20 procs
 | 
						||
 | 
						||
    1 block	310	440	260
 | 
						||
    2 blocks	262	401	481
 | 
						||
    4 blocks	199	346	354
 | 
						||
    8 blocks	132	260	250
 | 
						||
    16 blocks	 66	113	116
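
The averaging procedure described above (six runs, drop the highest and lowest, average the rest) is a trimmed mean. A minimal sketch follows; the sample numbers are hypothetical and are not the measurements behind the table.

```python
def trimmed_mean(values):
    """Drop the single highest and lowest value, then average the rest."""
    if len(values) < 3:
        raise ValueError("need at least three values to trim both ends")
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)


# Hypothetical aggregate reads/sec from six runs (not the original data)
runs = [243, 251, 262, 258, 270, 255]
print(trimmed_mean(runs))
```

Trimming both ends keeps one unusually fast or slow run from skewing the reported figure.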
 | 
						||
 | 
						||
I'm not sure at all why performance gets so much *worse* with a lot of
 | 
						||
contention on the single-block (8K) reads. This could have something to do with NetBSD, or
 | 
						||
its buffer cache, or my laptop's crappy little disk drive....
 | 
						||
 | 
						||
Or maybe I'm just running out of CPU.
 | 
						||
 | 
						||
cjs
 | 
						||
-- 
 | 
						||
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | 
						||
    Don't you know, in this new Dark Age, we're all light.  --XTC
 | 
						||
 | 
						||
 | 
						||
From tgl@sss.pgh.pa.us Thu Apr 25 09:54:35 2002
 | 
						||
Return-path: <tgl@sss.pgh.pa.us>
 | 
						||
Received: from sss.pgh.pa.us (root@[192.204.191.242])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3PDsY407038
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 09:54:34 -0400 (EDT)
 | 
						||
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						||
	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3PDsXF25059;
 | 
						||
	Thu, 25 Apr 2002 09:54:33 -0400 (EDT)
 | 
						||
To: Curt Sampson <cjs@cynic.net>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
 | 
						||
   PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
 | 
						||
In-Reply-To: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net> 
 | 
						||
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
 | 
						||
Comments: In-reply-to Curt Sampson <cjs@cynic.net>
 | 
						||
	message dated "Thu, 25 Apr 2002 16:28:51 +0900"
 | 
						||
Date: Thu, 25 Apr 2002 09:54:32 -0400
 | 
						||
Message-ID: <25056.1019742872@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
Status: OR
 | 
						||
 | 
						||
Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> 1. Theoretical proof: two components of the delay in retrieving a
 | 
						||
> block from disk are the disk arm movement and the wait for the
 | 
						||
> right block to rotate under the head.
 | 
						||
 | 
						||
> When retrieving, say, eight adjacent blocks, these will be spread
 | 
						||
> across no more than two cylinders (with luck, only one).
 | 
						||
 | 
						||
Weren't you contending earlier that with modern disk mechs you really
 | 
						||
have no idea where the data is?  You're asserting as an article of 
 | 
						||
faith that the OS has been able to place the file's data blocks
 | 
						||
optimally --- or at least well enough to avoid unnecessary seeks.
 | 
						||
But just a few days ago I was getting told that random_page_cost
 | 
						||
was BS because there could be no such placement.
 | 
						||
 | 
						||
I'm getting a tad tired of sweeping generalizations offered without
 | 
						||
proof, especially when they conflict.
 | 
						||
 | 
						||
> 3. Proof by testing. I wrote a little ruby program to seek to a
 | 
						||
> random point in the first 2 GB of my raw disk partition and read
 | 
						||
> 1-8 8K blocks of data. (This was done as one I/O request.) (Using
 | 
						||
> the raw disk partition I avoid any filesystem buffering.)
 | 
						||
 | 
						||
And also ensure that you aren't testing the point at issue.
 | 
						||
The point at issue is that *in the presence of kernel read-ahead*
 | 
						||
it's quite unclear that there's any benefit to a larger request size.
 | 
						||
Ideally the kernel will have the next block ready for you when you
 | 
						||
ask, no matter what the request is.
 | 
						||
 | 
						||
There's been some talk of using the AIO interface (where available)
 | 
						||
to "encourage" the kernel to do read-ahead.  I don't foresee us
 | 
						||
writing our own substitute filesystem to make this happen, however.
 | 
						||
Oracle may have the manpower for that sort of boondoggle, but we
 | 
						||
don't...
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
From pgsql-hackers-owner+M22053@postgresql.org Thu Apr 25 20:45:42 2002
 | 
						||
Return-path: <pgsql-hackers-owner+M22053@postgresql.org>
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q0jg405210
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 20:45:42 -0400 (EDT)
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by postgresql.org (Postfix) with SMTP
 | 
						||
	id 17CE6476270; Thu, 25 Apr 2002 20:45:38 -0400 (EDT)
 | 
						||
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
 | 
						||
	by postgresql.org (Postfix) with ESMTP id 257DC47591C
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 20:45:25 -0400 (EDT)
 | 
						||
Received: (from kaf@localhost)
 | 
						||
	by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3Q0erX14397;
 | 
						||
	Thu, 25 Apr 2002 17:40:53 -0700
 | 
						||
From: Kyle <kaf@nwlink.com>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Message-ID: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
 | 
						||
Date: Thu, 25 Apr 2002 17:40:53 -0700
 | 
						||
To: PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
 | 
						||
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
 | 
						||
References: <Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net>
 | 
						||
	<25056.1019742872@sss.pgh.pa.us>
 | 
						||
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: ORr
 | 
						||
 | 
						||
Tom Lane wrote:
 | 
						||
> ...
 | 
						||
> Curt Sampson <cjs@cynic.net> writes:
 | 
						||
> > 3. Proof by testing. I wrote a little ruby program to seek to a
 | 
						||
> > random point in the first 2 GB of my raw disk partition and read
 | 
						||
> > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
 | 
						||
> > the raw disk partition I avoid any filesystem buffering.)
 | 
						||
> 
 | 
						||
> And also ensure that you aren't testing the point at issue.
 | 
						||
> The point at issue is that *in the presence of kernel read-ahead*
 | 
						||
> it's quite unclear that there's any benefit to a larger request size.
 | 
						||
> Ideally the kernel will have the next block ready for you when you
 | 
						||
> ask, no matter what the request is.
 | 
						||
> ...
 | 
						||
 | 
						||
I have to agree with Tom.  I think the numbers below show that with
 | 
						||
kernel read-ahead, block size isn't an issue.
 | 
						||
 | 
						||
The big_file1 file used below is 2.0 gig of random data, and the
 | 
						||
machine has 512 mb of main memory.  This ensures that we're not
 | 
						||
just getting cached data.
 | 
						||
 | 
						||
foreach i (4k 8k 16k 32k 64k 128k)
 | 
						||
  echo $i
 | 
						||
  time dd bs=$i if=big_file1 of=/dev/null
 | 
						||
end
 | 
						||
 | 
						||
and the results:
 | 
						||
 | 
						||
bs    user    kernel   elapsed
 | 
						||
4k:   0.260   7.740    1:27.25
 | 
						||
8k:   0.210   8.060    1:30.48
 | 
						||
16k:  0.090   7.790    1:30.88
 | 
						||
32k:  0.060   8.090    1:32.75
 | 
						||
64k:  0.030   8.190    1:29.11
 | 
						||
128k: 0.070   9.830    1:28.74
 | 
						||
 | 
						||
so with kernel read-ahead, we have basically the same elapsed (wall
 | 
						||
time) regardless of block size.  Sure, user time drops to a low at 64k
 | 
						||
blocksize, but kernel time is increasing.
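
For systems without csh or dd handy, roughly the same measurement can be sketched in Python. The function names are hypothetical, and as with the dd test the file has to be much larger than main memory for the numbers to mean anything.

```python
import time


def time_sequential_read(path, block_size):
    """Read `path` from start to finish in `block_size` chunks.
    Returns (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 so each read() call is passed straight to the OS
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_block_sizes(path, sizes_kb=(4, 8, 16, 32, 64, 128)):
    """Print elapsed time for each request size, like the dd loop above."""
    for kb in sizes_kb:
        nbytes, secs = time_sequential_read(path, kb * 1024)
        print(f"{kb:>4}k: {nbytes} bytes in {secs:.2f} s")
```

This only reports wall-clock time; the user and kernel columns in the dd results come from the shell's `time` and are not reproduced here.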
 | 
						||
 | 
						||
 | 
						||
You could argue that this is a contrived example, no other I/O is
 | 
						||
being done.  Well I created a second 2.0g file (big_file2) and did two
 | 
						||
simultaneous reads from the same disk.  Sure performance went to hell
 | 
						||
but it shows blocksize is still irrelevant in a multi I/O environment
 | 
						||
with sequential read-ahead.
 | 
						||
 | 
						||
foreach i ( 4k 8k 16k 32k 64k 128k )
 | 
						||
  echo $i
 | 
						||
  time dd bs=$i if=big_file1 of=/dev/null &
 | 
						||
  time dd bs=$i if=big_file2 of=/dev/null &
 | 
						||
  wait
 | 
						||
end
 | 
						||
 | 
						||
bs    user    kernel   elapsed
 | 
						||
4k:   0.480   8.290    6:34.13  bigfile1
 | 
						||
      0.320   8.730    6:34.33  bigfile2
 | 
						||
8k:   0.250   7.580    6:31.75
 | 
						||
      0.180   8.450    6:31.88
 | 
						||
16k:  0.150   8.390    6:32.47
 | 
						||
      0.100   7.900    6:32.55
 | 
						||
32k:  0.190   8.460    6:24.72
 | 
						||
      0.060   8.410    6:24.73
 | 
						||
64k:  0.060   9.350    6:25.05
 | 
						||
      0.150   9.240    6:25.13
 | 
						||
128k: 0.090  10.610    6:33.14
 | 
						||
      0.110  11.320    6:33.31
 | 
						||
 | 
						||
 | 
						||
the differences in read times are basically in the mud.  Blocksize
 | 
						||
just doesn't matter much with the kernel doing readahead.
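
To put a number on "in the mud", the elapsed times from the two tables above can be converted to seconds and compared. This is a sketch using only the posted figures (the big_file1 rows for the two-reader case).

```python
def to_seconds(elapsed):
    """Convert an 'M:SS.ss' elapsed time string to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + float(seconds)


def spread_percent(times):
    """Gap between slowest and fastest run, as a percentage of the fastest."""
    secs = [to_seconds(t) for t in times]
    return 100 * (max(secs) - min(secs)) / min(secs)


# Elapsed times for 4k, 8k, 16k, 32k, 64k, 128k, copied from the tables above
single = ["1:27.25", "1:30.48", "1:30.88", "1:32.75", "1:29.11", "1:28.74"]
dual_file1 = ["6:34.13", "6:31.75", "6:32.47", "6:24.72", "6:25.05", "6:33.14"]

print(f"one reader:  {spread_percent(single):.1f}% spread")
print(f"two readers: {spread_percent(dual_file1):.1f}% spread")
```

The spread across block sizes comes out at a few percent in both cases, with no consistent trend as the block size grows.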
 | 
						||
 | 
						||
-Kyle
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 6: Have you searched our list archives?
 | 
						||
 | 
						||
http://archives.postgresql.org
 | 
						||
 | 
						||
From pgsql-hackers-owner+M22055@postgresql.org Thu Apr 25 22:19:07 2002
 | 
						||
Return-path: <pgsql-hackers-owner+M22055@postgresql.org>
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2J7411254
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:19:07 -0400 (EDT)
 | 
						||
Received: from postgresql.org (postgresql.org [64.49.215.8])
 | 
						||
	by postgresql.org (Postfix) with SMTP
 | 
						||
	id F3924476208; Thu, 25 Apr 2002 22:19:02 -0400 (EDT)
 | 
						||
Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | 
						||
	by postgresql.org (Postfix) with ESMTP id 6741D474E71
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 25 Apr 2002 22:18:50 -0400 (EDT)
 | 
						||
Received: (from pgman@localhost)
 | 
						||
	by candle.pha.pa.us (8.11.6/8.10.1) id g3Q2Ili11246;
 | 
						||
	Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
 | 
						||
From: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Message-ID: <200204260218.g3Q2Ili11246@candle.pha.pa.us>
 | 
						||
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
 | 
						||
In-Reply-To: <15560.41493.529847.635632@doppelbock.patentinvestor.com>
 | 
						||
To: Kyle <kaf@nwlink.com>
 | 
						||
Date: Thu, 25 Apr 2002 22:18:47 -0400 (EDT)
 | 
						||
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
 | 
						||
X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Content-Type: text/plain; charset=US-ASCII
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
Nice test.  Would you test simultaneous 'dd' on the same file, perhaps
 | 
						||
with a slight delay between the two so they don't read each other's
 | 
						||
blocks?
 | 
						||
 | 
						||
seek() in the file will turn off read-ahead in most OS's.  I am not
saying this is a major issue for PostgreSQL but the numbers would be
interesting.


---------------------------------------------------------------------------

Kyle wrote:
> Tom Lane wrote:
> > ...
> > Curt Sampson <cjs@cynic.net> writes:
> > > 3. Proof by testing. I wrote a little ruby program to seek to a
> > > random point in the first 2 GB of my raw disk partition and read
> > > 1-8 8K blocks of data. (This was done as one I/O request.) (Using
> > > the raw disk partition I avoid any filesystem buffering.)
> > 
> > And also ensure that you aren't testing the point at issue.
> > The point at issue is that *in the presence of kernel read-ahead*
> > it's quite unclear that there's any benefit to a larger request size.
> > Ideally the kernel will have the next block ready for you when you
> > ask, no matter what the request is.
> > ...
> 
> I have to agree with Tom.  I think the numbers below show that with
> kernel read-ahead, block size isn't an issue.
> 
> The big_file1 file used below is 2.0 gig of random data, and the
> machine has 512 mb of main memory.  This ensures that we're not
> just getting cached data.
> 
> foreach i (4k 8k 16k 32k 64k 128k)
>   echo $i
>   time dd bs=$i if=big_file1 of=/dev/null
> end
> 
> and the results:
> 
> bs    user    kernel   elapsed
> 4k:   0.260   7.740    1:27.25
> 8k:   0.210   8.060    1:30.48
> 16k:  0.090   7.790    1:30.88
> 32k:  0.060   8.090    1:32.75
> 64k:  0.030   8.190    1:29.11
> 128k: 0.070   9.830    1:28.74
> 
> so with kernel read-ahead, we have basically the same elapsed (wall
> time) regardless of block size.  Sure, user time drops to a low at 64k
> blocksize, but kernel time is increasing.
> 
> 
> You could argue that this is a contrived example, no other I/O is
> being done.  Well I created a second 2.0g file (big_file2) and did two
> simultaneous reads from the same disk.  Sure performance went to hell
> but it shows blocksize is still irrelevant in a multi I/O environment
> with sequential read-ahead.
> 
> foreach i ( 4k 8k 16k 32k 64k 128k )
>   echo $i
>   time dd bs=$i if=big_file1 of=/dev/null &
>   time dd bs=$i if=big_file2 of=/dev/null &
>   wait
> end
> 
> bs    user    kernel   elapsed
> 4k:   0.480   8.290    6:34.13  bigfile1
>       0.320   8.730    6:34.33  bigfile2
> 8k:   0.250   7.580    6:31.75
>       0.180   8.450    6:31.88
> 16k:  0.150   8.390    6:32.47
>       0.100   7.900    6:32.55
> 32k:  0.190   8.460    6:24.72
>       0.060   8.410    6:24.73
> 64k:  0.060   9.350    6:25.05
>       0.150   9.240    6:25.13
> 128k: 0.090  10.610    6:33.14
>       0.110  11.320    6:33.31
> 
> 
> the differences in read times are basically in the mud.  Blocksize
> just doesn't matter much with the kernel doing readahead.
> 
> -Kyle
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

http://archives.postgresql.org

From cjs@cynic.net Thu Apr 25 22:27:23 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3Q2RL411868
	for <pgman@candle.pha.pa.us>; Thu, 25 Apr 2002 22:27:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id AF60C870E; Fri, 26 Apr 2002 11:27:17 +0900 (JST)
Date: Fri, 26 Apr 2002 11:27:17 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Bruce Momjian <pgman@candle.pha.pa.us>,
   PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead 
In-Reply-To: <25056.1019742872@sss.pgh.pa.us>
Message-ID: <Pine.NEB.4.43.0204261028110.449-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Thu, 25 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > 1. Theoretical proof: two components of the delay in retrieving a
> > block from disk are the disk arm movement and the wait for the
> > right block to rotate under the head.
>
> > When retrieving, say, eight adjacent blocks, these will be spread
> > across no more than two cylinders (with luck, only one).
>
> Weren't you contending earlier that with modern disk mechs you really
> have no idea where the data is?

No, that was someone else. I contend that with pretty much any
large-scale storage mechanism (i.e., anything beyond ramdisks),
you will find that accessing two adjacent blocks is almost always
1) close to as fast as accessing just the one, and 2) much, much
faster than accessing two blocks that are relatively far apart.

There will be the odd case where the two adjacent blocks are
physically far apart, but this is rare.

If this idea doesn't hold true, the whole idea that sequential
reads are faster than random reads falls apart, and the optimizer
shouldn't even have the option to make random reads cost more, much
less have it set to four rather than one (or whatever it's set to).
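
To make the cost argument concrete, here is a toy sketch (not from the
original thread). It assumes a sequential page fetch costs 1 unit and a
random page fetch costs random_page_cost units, using the value of four
mentioned above. The function names are hypothetical, and the model is
far simpler than anything the PostgreSQL optimizer actually computes.

```python
def sequential_scan_cost(table_pages, seq_page_cost=1.0):
    """Cost of reading every page of the table in order."""
    return table_pages * seq_page_cost


def random_fetch_cost(pages_fetched, random_page_cost=4.0):
    """Cost of fetching the given number of pages in random order."""
    return pages_fetched * random_page_cost


def break_even_pages(table_pages, random_page_cost=4.0, seq_page_cost=1.0):
    """Number of random page fetches that costs as much as one full scan."""
    return table_pages * seq_page_cost / random_page_cost


if __name__ == "__main__":
    pages = 1000
    print(sequential_scan_cost(pages))   # 1000.0
    print(random_fetch_cost(250))        # 1000.0
    print(break_even_pages(pages))       # 250.0
```

With the ratio set to one instead of four, the break-even point is the
whole table: random fetches would never be charged more than a scan of
the same number of pages, which is the situation the paragraph above
argues cannot be right if adjacent blocks really are cheaper to read.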

> You're asserting as an article of
> faith that the OS has been able to place the file's data blocks
> optimally --- or at least well enough to avoid unnecessary seeks.

So are you, in the optimizer. But that's all right; the OS often
can and does do this placement; the FFS filesystem is explicitly
designed to do this sort of thing. If the filesystem isn't empty
and the files grow a lot, they'll be split into large fragments,
but the fragments will be contiguous.

> But just a few days ago I was getting told that random_page_cost
> was BS because there could be no such placement.

I've been arguing against that point as well.

> And also ensure that you aren't testing the point at issue.
> The point at issue is that *in the presence of kernel read-ahead*
> it's quite unclear that there's any benefit to a larger request size.

I will test this.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


From cjs@cynic.net Wed Apr 24 23:19:23 2002
Return-path: <cjs@cynic.net>
Received: from angelic.cynic.net ([202.232.117.21])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3P3JM414917
	for <pgman@candle.pha.pa.us>; Wed, 24 Apr 2002 23:19:22 -0400 (EDT)
Received: from localhost (localhost [127.0.0.1])
	by angelic.cynic.net (Postfix) with ESMTP
	id 1F36F870E; Thu, 25 Apr 2002 12:19:14 +0900 (JST)
Date: Thu, 25 Apr 2002 12:19:14 +0900 (JST)
From: Curt Sampson <cjs@cynic.net>
To: Bruce Momjian <pgman@candle.pha.pa.us>
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: Sequential Scan Read-Ahead
In-Reply-To: <200204250156.g3P1ufh05751@candle.pha.pa.us>
Message-ID: <Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: OR

On Wed, 24 Apr 2002, Bruce Momjian wrote:

> >     1. Not all systems do readahead.
>
> If they don't, that isn't our problem.  We expect it to be there, and if
> it isn't, the vendor/kernel is at fault.

It is your problem when another database kicks Postgres' ass
performance-wise.

And at that point, *you're* at fault. You're the one who's knowingly
decided to do things inefficiently.

Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
attitude gets me steamed. It's one thing to say, "We don't support
this." That's fine; there are often good reasons for that. It's a
completely different thing to say, "It's an unrelated entity's fault we
don't support this."

At any rate, relying on the kernel to guess how to optimise for
the workload will never work as well as the software that
knows the workload doing the optimization.

The lack of support thing is no joke. Sure, lots of systems nowadays
support unified buffer cache and read-ahead. But how many, besides
Solaris, support free-behind, which is also very important to avoid
blowing out your buffer cache when doing sequential reads? And who
at all supports read-ahead for reverse scans? (Or does Postgres
not do those, anyway? I can see the support is there.)

And even when the facilities are there, you create problems by
using them.  Look at the OS buffer cache, for example. Not only do
we lose efficiency by using two layers of caching, but (as people
have pointed out recently on the lists), the optimizer can't even
know how much or what is being cached, and thus can't make decisions
based on that.

> Yes, seek() in file will turn off read-ahead.  Grabbing bigger chunks
> would help here, but if you have two people already reading from the
> same file, grabbing bigger chunks of the file may not be optimal.

Grabbing bigger chunks is always optimal, AFAICT, if they're not
*too* big and you use the data. A single 64K read takes very little
longer than a single 8K read.
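
One way to try this kind of comparison is sketched below (not from the
original thread). It reads the same file sequentially with two request
sizes and reports the bytes read, the number of read() calls, and the
elapsed time. The helper names are hypothetical. Note the assumption
baked into the demo: a small temporary file will almost certainly be
served from the OS cache, so this mostly shows the difference in
syscall count, not disk behaviour. Measuring the disk itself would need
a file larger than main memory or a raw device, as in the tests
described in this thread.

```python
import os
import tempfile
import time


def time_sequential_read(path, block_size):
    """Read a file front to back in block_size requests.

    Returns (bytes_read, read_calls, elapsed_seconds).
    """
    bytes_read = 0
    read_calls = 0
    fd = os.open(path, os.O_RDONLY)
    start = time.perf_counter()
    try:
        while True:
            chunk = os.read(fd, block_size)  # one read() syscall per request
            read_calls += 1
            if not chunk:                    # empty result means end of file
                break
            bytes_read += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, read_calls, time.perf_counter() - start


def make_test_file(size_bytes):
    """Create a temporary file of the given size and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(os.urandom(size_bytes))
        return handle.name


if __name__ == "__main__":
    path = make_test_file(4 * 1024 * 1024)
    try:
        for block_size in (8 * 1024, 64 * 1024):
            total, calls, seconds = time_sequential_read(path, block_size)
            print(block_size, total, calls, round(seconds, 4))
    finally:
        os.remove(path)
```

The 64K run issues roughly one eighth as many read() calls as the 8K
run for the same data, which is the kernel/userland transition cost
raised in point 3 below.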

> >     3. Even when the read-ahead does occur, you're still doing more
> >     syscalls, and thus more expensive kernel/userland transitions, than
> >     you have to.
>
> I would guess the performance impact is minimal.

If it were minimal, people wouldn't work so hard to build multi-level
thread systems, where multiple userland threads are scheduled on
top of kernel threads.

However, it does depend on how much CPU your particular application
is using. You may have it to spare.

> 	http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html

Well, this message has some points in it that I feel are just incorrect.

    1. It is *not* true that you have no idea where data is when
    using a storage array or other similar system. While you
    certainly ought not worry about things such as head positions
    and so on, it's been a given for a long, long time that two
    blocks that have close index numbers are going to be close
    together in physical storage.

    2. Raw devices are quite standard across Unix systems (except
    in the unfortunate case of Linux, which I think has been
    remedied, hasn't it?). They're very portable, and have just as
    well--if not better--defined write semantics as a filesystem.

    3. My observations of OS performance tuning over the past six
    or eight years contradict the statement, "There's a considerable
    cost in complexity and code in using "raw" storage too, and
    it's not a one off cost: as the technologies change, the "fast"
    way to do things will change and the code will have to be
    updated to match." While optimizations have been removed over
    the years, the basic optimizations (order reads by block number,
    do larger reads rather than smaller, cache the data) have
    remained unchanged for a long, long time.

    4. "Better to leave this to the OS vendor where possible, and
    take advantage of the tuning they do." Well, sorry guys, but
    have a look at the tuning they do. It hasn't changed in years,
    except to remove now-unnecessary complexity related to really,
    really old and slow disk devices, and to add a few things that
    guess workload but still do a worse job than if the workload
    generator just did its own optimisations in the first place.

> 	http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html

Well, this one, with statements like "Postgres does have control
over its buffer cache," I don't know what to say. You can interpret
the statement however you like, but in the end Postgres has very little
control at all over how data is moved between memory and disk.

BTW, please don't take me as saying that all control over physical
IO should be done by Postgres. I just think that Postgres could do
a better job of managing data transfer between disk and memory than
the OS can. The rest of the things (using raw partitions, read-ahead,
free-behind, etc.) just drop out of that one idea.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


From kaf@nwlink.com Fri Apr 26 14:22:39 2002
Return-path: <kaf@nwlink.com>
Received: from doppelbock.patentinvestor.com (ip146.usw5.rb1.bel.nwlink.com [209.20.249.146])
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3QIMc400783
	for <pgman@candle.pha.pa.us>; Fri, 26 Apr 2002 14:22:38 -0400 (EDT)
Received: (from kaf@localhost)
	by doppelbock.patentinvestor.com (8.11.6/8.11.2) id g3QII0l16824;
	Fri, 26 Apr 2002 11:18:00 -0700
From: Kyle <kaf@nwlink.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15561.39384.296503.501888@doppelbock.patentinvestor.com>
Date: Fri, 26 Apr 2002 11:18:00 -0700
To: Bruce Momjian <pgman@candle.pha.pa.us>
Subject: Re: [HACKERS] Sequential Scan Read-Ahead
In-Reply-To: <200204261444.g3QEiFh11090@candle.pha.pa.us>
References: <15561.26116.817541.950416@doppelbock.patentinvestor.com>
	<200204261444.g3QEiFh11090@candle.pha.pa.us>
X-Mailer: VM 6.95 under 21.1 (patch 14) "Cuyahoga Valley" XEmacs Lucid
Status: ORr

Hey Bruce,

I'll forward this to the list if you think they'd benefit from it.
I'm not sure it says anything about read-ahead; I think this is more a
kernel caching issue.  But I've been known to be wrong in the past.
Anyway...


the test:

foreach i (5 15 20 25 30 )
  echo $i
  time dd bs=8k if=big_file1 of=/dev/null &
  sleep $i
  time dd bs=8k if=big_file1 of=/dev/null &
  wait
end

I did a couple more runs in the low range since there is a drastic
jump in elapsed (wall clock) time after doing a 6 second sleep:

            first process                second process
sleep    user    kernel   elapsed     user    kernel   elapsed
0 sec    0.200   7.980    1:26.57     0.240   7.720    1:26.56
3 sec    0.260   7.600    1:25.71     0.260   8.100    1:22.60
5 sec    0.160   7.890    1:26.04     0.220   8.180    1:21.04
6 sec    0.220   8.070    1:19.59     0.230   7.620    1:25.69
7 sec    0.210   9.270    1:57.92     0.100   8.750    1:50.76
8 sec    0.240   8.060    4:47.47     0.300   7.800    4:40.40
15 sec   0.200   8.500    4:51.11     0.180   7.280    4:44.36
20 sec   0.160   8.040    4:40.72     0.240   7.790    4:37.24
25 sec   0.170   8.150    4:37.58     0.140   8.200    4:33.08
30 sec   0.200   7.390    4:37.01     0.230   8.220    4:31.83



with a sleep of > 6 seconds, either the second process isn't getting
cached data or readahead is being turned off.  I'd guess the former; I
don't see why read-ahead would be turned off since they're both doing
sequential operations.

Although with 512mb of memory and the disk reading at about 22 mb/sec,
maybe we're not hitting the cache.  I'd guess at least ~400 megs of
kernel cache is being used for buffering this 2 gig file.  free(1)
reports:

% free
             total       used       free     shared    buffers     cached
Mem:        512924     508576       4348          0       2640     477960
-/+ buffers/cache:      27976     484948
Swap:       527152      15864     511288

so shouldn't we be getting cached data even with a sleep of up to
about (400/22) 18 seconds...?  Maybe I'm just in the dark on what's
really happening.  I should point out that this is linux 2.4.18.
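
The estimate above can be spelled out as a short calculation (a sketch
added for illustration, not part of the original message). It assumes
the free(1) figures are in kilobytes and, more importantly, that the
whole cache holds the most recently read part of the file; the second
assumption is exactly what the measurements call into question.

```python
def cache_window_seconds(cache_mb, read_rate_mb_per_s):
    """Seconds of sequential reading that fit in a cache of cache_mb."""
    return cache_mb / read_rate_mb_per_s


# Figures quoted in the message above.
READ_RATE_MB_PER_S = 22
GUESSED_CACHE_MB = 400
CACHED_KB_FROM_FREE = 477960  # "cached" column of the free(1) output

guessed_window = cache_window_seconds(GUESSED_CACHE_MB, READ_RATE_MB_PER_S)
reported_window = cache_window_seconds(CACHED_KB_FROM_FREE / 1024,
                                       READ_RATE_MB_PER_S)

if __name__ == "__main__":
    print(round(guessed_window, 1))    # 18.2
    print(round(reported_window, 1))   # 21.2
```

Either way the window comes out at roughly 18 to 21 seconds, while the
table shows elapsed time jumping between sleeps of 6 and 8 seconds, so
the simple "cache size divided by read rate" model does not explain the
result on its own.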




Bruce Momjian wrote:
> 
> I am trying to illustrate how kernel read-ahead could be turned off in
> certain cases.
> 
> ---------------------------------------------------------------------------
> 
> Kyle wrote:
> > What are you trying to test, the kernel's cache vs disk speed?
> > 
> > 
> > Bruce Momjian wrote:
> > > 
> > > Nice test.  Would you test simultaneous 'dd' on the same file, perhaps
> > > with a slight delay between the two so they don't read each other's
> > > blocks?
> > > 
> > > seek() in the file will turn off read-ahead in most OS's.  I am not
> > > saying this is a major issue for PostgreSQL but the numbers would be
> > > interesting.

From pgsql-hackers-owner+M49418=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 15:52:28 2004
Return-path: <pgsql-hackers-owner+M49418=pgman=candle.pha.pa.us@postgresql.org>
Received: from vm2.hub.org ([200.46.204.60])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RKqPe07814
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 15:52:28 -0500 (EST)
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
	by vm2.hub.org (Postfix) with ESMTP id 70DC3CD397A
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 20:52:19 +0000 (GMT)
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (neptune.hub.org [200.46.204.2])
	by svr1.postgresql.org (Postfix) with ESMTP id A93D7D1D3A4
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 20:41:43 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
	with ESMTP id 54186-02
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Tue, 27 Jan 2004 16:41:12 -0400 (AST)
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
	by svr1.postgresql.org (Postfix) with ESMTP id 33243D1E1F2
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 16:36:24 -0400 (AST)
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
	by smtp.istop.com (Postfix) with ESMTP
	id 2A41136C44; Tue, 27 Jan 2004 15:36:21 -0500 (EST)
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
	id 1AlZwa-0006sL-00; Tue, 27 Jan 2004 15:36:20 -0500
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Question about indexes
From: Greg Stark <gsstark@mit.edu>
Organization: The Emacs Conspiracy; member since 1992
Date: 27 Jan 2004 15:36:20 -0500
Message-ID: <87isixt9h7.fsf@stark.xeocode.com>
Lines: 9
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
	version=2.61
Status: OR


How feasible would it be to have a btree index on ctid? I'm thinking it ought
to work simply enough for the normal case of insert/delete/update, but I'm not
completely certain how vacuum, vacuum full, and cluster would interact.

You may think this would be utterly useless, but I have a cunning plan.

-- 
greg


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

From pgsql-hackers-owner+M49439=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:01:59 2004
Return-path: <pgsql-hackers-owner+M49439=pgman=candle.pha.pa.us@postgresql.org>
Received: from bricolage.postgresql.org ([200.46.204.116])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RN1we27517
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:01:59 -0500 (EST)
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
	by bricolage.postgresql.org (Postfix) with ESMTP id 946B3148343C
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 23:01:52 +0000 (GMT)
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (neptune.hub.org [200.46.204.2])
	by svr1.postgresql.org (Postfix) with ESMTP id 778CED1D362
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 22:52:27 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
	with ESMTP id 09353-02
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Tue, 27 Jan 2004 18:51:56 -0400 (AST)
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
	by svr1.postgresql.org (Postfix) with ESMTP id 5C5D5D1B47D
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 18:51:55 -0400 (AST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.12.10/8.12.10) with ESMTP id i0RMpunX029816;
	Tue, 27 Jan 2004 17:51:56 -0500 (EST)
To: Greg Stark <gsstark@mit.edu>
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Question about indexes 
In-Reply-To: <87isixt9h7.fsf@stark.xeocode.com> 
References: <87isixt9h7.fsf@stark.xeocode.com>
Comments: In-reply-to Greg Stark <gsstark@mit.edu>
	message dated "27 Jan 2004 15:36:20 -0500"
Date: Tue, 27 Jan 2004 17:51:56 -0500
Message-ID: <29815.1075243916@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
	version=2.61
Status: OR

Greg Stark <gsstark@mit.edu> writes:
> How feasible would it be to have a btree index on ctid?

Why would you want one?  Direct access by ctid beats out an index lookup
every time.  In any case, vacuum and friends would break such an index
entirely.

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@postgresql.org so that your
      message can get through to the mailing list cleanly

From pgsql-hackers-owner+M49440=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:19:13 2004
Return-path: <pgsql-hackers-owner+M49440=pgman=candle.pha.pa.us@postgresql.org>
Received: from krusty-motorsports.com (IDENT:exim@krusty-motorsports.com [192.94.170.8])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RNJCe00301
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:19:13 -0500 (EST)
Received: from [200.46.204.71] (helo=postgresql.org)
	by krusty-motorsports.com with esmtp (Exim 4.22)
	id 1AldQ9-0007JC-2z
	for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 00:19:05 +0000
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (neptune.hub.org [200.46.204.2])
	by svr1.postgresql.org (Postfix) with ESMTP id 6D641D1D54A
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 23:12:01 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
	with ESMTP id 14466-06
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Tue, 27 Jan 2004 19:11:30 -0400 (AST)
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
	by svr1.postgresql.org (Postfix) with ESMTP id 6D58FD1D49E
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 19:11:29 -0400 (AST)
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
	by smtp.istop.com (Postfix) with ESMTP
	id 9B74536ADA; Tue, 27 Jan 2004 18:11:31 -0500 (EST)
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
	id 1AlcMl-0007Tk-00; Tue, 27 Jan 2004 18:11:31 -0500
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Question about indexes
References: <87isixt9h7.fsf@stark.xeocode.com>
	<29815.1075243916@sss.pgh.pa.us>
In-Reply-To: <29815.1075243916@sss.pgh.pa.us>
From: Greg Stark <gsstark@mit.edu>
Organization: The Emacs Conspiracy; member since 1992
Date: 27 Jan 2004 18:11:31 -0500
Message-ID: <87d695t2ak.fsf@stark.xeocode.com>
Lines: 33
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
	version=2.61
Status: OR

Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
>
> > How feasible would it be to have a btree index on ctid?
> 
> Why would you want one?  Direct access by ctid beats out an index lookup
> every time.  

Of course. But as I mentioned, I have a cunning plan.

If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2
then it would be particularly easy to combine the two efficiently. 

If specially marked btree indexes -- or even all btree indexes -- implicitly
had ctid as a final sort order after all the index columns, then it would
essentially obviate the need for bitmap indexes. They wouldn't have the space
advantage, but they would be possible to combine using arbitrary boolean
expressions without looking at the actual tuples.

This is essentially what is in the TODO about using bitmaps, but without
having to do any extra sorts.

This would only really be an advantage for particularly wide tables where the
combination of boolean clauses narrows the result set down a lot more than any
one clause.
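
The combining step described above can be sketched as a plain merge of
two sorted lists (an illustration added here, not PostgreSQL code). It
assumes each index hands back the ctids matching its clause already in
ctid order, which is what the proposal would provide. The ctids are
modelled as (block, offset) tuples and the function names are
hypothetical.

```python
def intersect_sorted_ctids(left, right):
    """AND: ctids present in both ascending lists, found in one pass."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1          # left entry cannot match anything later in right
        else:
            j += 1          # right entry cannot match anything later in left
    return result


def union_sorted_ctids(left, right):
    """OR: ctids present in either ascending list, without duplicates."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    result.extend(left[i:])   # whatever remains on one side is already sorted
    result.extend(right[j:])
    return result


if __name__ == "__main__":
    a_equals_1 = [(0, 1), (0, 3), (2, 5), (7, 2)]   # from index (a,ctid)
    b_equals_2 = [(0, 3), (1, 1), (7, 2), (9, 4)]   # from index (b,ctid)
    print(intersect_sorted_ctids(a_equals_1, b_equals_2))
    # [(0, 3), (7, 2)]
    print(union_sorted_ctids(a_equals_1, b_equals_2))
    # [(0, 1), (0, 3), (1, 1), (2, 5), (7, 2), (9, 4)]
```

Neither function sorts anything or touches the table; only the ctids
that survive the merge would need a heap fetch.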

> In any case, vacuum and friends would break such an index entirely.

That was what I was afraid of.

-- 
greg


---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faqs/FAQ.html

From pgsql-hackers-owner+M49442=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 18:32:25 2004
Return-path: <pgsql-hackers-owner+M49442=pgman=candle.pha.pa.us@postgresql.org>
Received: from vm2.hub.org ([200.46.204.60])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0RNWNe02539
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 18:32:24 -0500 (EST)
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
	by vm2.hub.org (Postfix) with ESMTP id DC003CD49A4
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 23:32:17 +0000 (GMT)
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (neptune.hub.org [200.46.204.2])
	by svr1.postgresql.org (Postfix) with ESMTP id 34466D1D17D
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Jan 2004 23:25:11 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
	with ESMTP id 20117-05
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Tue, 27 Jan 2004 19:24:41 -0400 (AST)
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
	by svr1.postgresql.org (Postfix) with ESMTP id 33E28D1D548
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 19:24:40 -0400 (AST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.12.10/8.12.10) with ESMTP id i0RNOfnX000404;
	Tue, 27 Jan 2004 18:24:41 -0500 (EST)
To: Greg Stark <gsstark@mit.edu>
cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Question about indexes 
In-Reply-To: <87d695t2ak.fsf@stark.xeocode.com> 
References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com>
Comments: In-reply-to Greg Stark <gsstark@mit.edu>
	message dated "27 Jan 2004 18:11:31 -0500"
Date: Tue, 27 Jan 2004 18:24:41 -0500
Message-ID: <403.1075245881@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Greg Stark <gsstark@mit.edu> writes:
> If you have two indexes (a,ctid) and (b,ctid) and do a query where a=1 and b=2
> then it would be particularly easy to combine the two efficiently.

> If specially marked btree indexes -- or even all btree indexes -- implicitly
> had ctid as a final sort order after all the index columns, then it would
> essentially obviate the need for bitmap indexes.

I don't think so.  You are thinking only of exact-equality queries ---
as soon as the WHERE clause describes a range of index entries, the
readout wouldn't be sorted by ctid anyway.
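
A small sketch of that objection, using a made-up list of (key, ctid)
entries kept in (key, ctid) order. The data and the readout() helper are
hypothetical:

```python
# Hypothetical index entries as (key, ctid), stored in (key, ctid) order.
index_entries = sorted([
    (1, (4, 2)), (1, (9, 1)),
    (2, (0, 5)), (2, (7, 3)),
    (3, (1, 1)),
])

def readout(lo, hi):
    """Return the ctids for lo <= key <= hi, in index order."""
    return [ctid for key, ctid in index_entries if lo <= key <= hi]

equality = readout(1, 1)    # WHERE a = 1
key_range = readout(1, 3)   # WHERE a BETWEEN 1 AND 3

equality_sorted = equality == sorted(equality)
range_sorted = key_range == sorted(key_range)
```

With a single key the ctids come out ascending. Across several keys they
are ascending only within each key, so the readout as a whole is not in
ctid order.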
 | 
						||
 | 
						||
Combining indexes via a bitmap intermediate step (which is not really
the same thing as bitmap indexes, IIUC) seems like a more robust
approach than relying on the index entries to be in ctid order.

But if we did want to sort indexes that way, we could do it today,
I think.  The ctid is already stored in index entries (it is the
"payload" remember...) and we could use it as a tiebreaker when
determining insertion position.  This doesn't have the problems that
putting ctid into the user columns would do, because the system knows
about that ctid as being special; the difficulty with ctid in the user
columns is the code not knowing that it'd need to change on a tuple move.
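
As a rough sketch of the tiebreaker idea only: a flat Python list stands
in for a leaf page here, and insert_entry() is a hypothetical name, not a
btree routine.

```python
import bisect

def insert_entry(leaf, key, ctid):
    """Insert (key, ctid) so that entries with equal keys stay in ctid order."""
    # Tuples compare element by element, so ctid only matters when keys tie.
    bisect.insort(leaf, (key, ctid))

leaf = []
for key, ctid in [(5, (3, 1)), (5, (1, 4)), (2, (8, 2)), (5, (2, 9))]:
    insert_entry(leaf, key, ctid)
```

The user-visible key still decides the primary position. The stored ctid
only orders the duplicates.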
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 21:28:20 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from postgresql.wavefire.com (postgresql.wavefire.com [64.141.14.48])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0S2SIe29755
 | 
						||
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 21:28:19 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71])
 | 
						||
	by postgresql.wavefire.com (8.9.3/8.9.3) with ESMTP id TBM02845
 | 
						||
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 19:06:45 -0800 (PST)
 | 
						||
	(envelope-from pgsql-hackers-owner+M49450=pgman=candle.pha.pa.us@postgresql.org)
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (neptune.hub.org [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 6213BD1B85F
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 02:19:56 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 69438-06
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Tue, 27 Jan 2004 22:19:26 -0400 (AST)
 | 
						||
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 1964FD1B47D
 | 
						||
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 22:19:24 -0400 (AST)
 | 
						||
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
 | 
						||
	by smtp.istop.com (Postfix) with ESMTP
 | 
						||
	id BE92136B37; Tue, 27 Jan 2004 21:19:26 -0500 (EST)
 | 
						||
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
 | 
						||
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
 | 
						||
	id 1AlfIc-00084d-00; Tue, 27 Jan 2004 21:19:26 -0500
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
References: <87isixt9h7.fsf@stark.xeocode.com>
 | 
						||
	<29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com>
 | 
						||
	<403.1075245881@sss.pgh.pa.us>
 | 
						||
In-Reply-To: <403.1075245881@sss.pgh.pa.us>
 | 
						||
From: Greg Stark <gsstark@mit.edu>
 | 
						||
Organization: The Emacs Conspiracy; member since 1992
 | 
						||
Date: 27 Jan 2004 21:19:26 -0500
 | 
						||
Message-ID: <877jzcu85t.fsf@stark.xeocode.com>
 | 
						||
Lines: 43
 | 
						||
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
Tom Lane <tgl@sss.pgh.pa.us> writes:

> I don't think so.  You are thinking only of exact-equality queries ---
> as soon as the WHERE clause describes a range of index entries, the
> readout wouldn't be sorted by ctid anyway.

But then even bitmap indexes would fail in that way too, or at least have a
lot of extra cost that would have to be taken into account based on the number
of values in the range.

> Combining indexes via a bitmap intermediate step (which is not really
> the same thing as bitmap indexes, IIUC) seems like a more robust
> approach than relying on the index entries to be in ctid order.

I would see that as the next step, but it seems to me it would be only a small
set of queries where it would really help enough to outweigh the extra work of
the sort. Whereas if the ctid is already pre-sorted then the extra cost is
fairly low. Sort of like the difference in cost between a merge join where
both sides have to be sorted and a merge join where both sides are pre-sorted.
 | 
						||
 | 
						||
> But if we did want to sort indexes that way, we could do it today,
> I think.  The ctid is already stored in index entries (it is the
> "payload" remember...) and we could use it as a tiebreaker when
> determining insertion position. This doesn't have the problems that
> putting ctid into the user columns would do, because the system knows
> about that ctid as being special; the difficulty with ctid in the user
> columns is the code not knowing that it'd need to change on a tuple move.

That's exactly what I was thinking. I just don't know how badly it would
complicate the vacuum{,full}/cluster code and whether those are the only cases
to worry about.

Note that the space saving of bitmap indexes is still a substantial factor.
Using btree indexes the i/o costs of doing multiple index scans plus a table
scan of the relevant pages would still be quite substantial. So this doesn't
completely obviate the need for bitmap indexes, but I think it would remove a
lot of the pressure from people who just need them to handle a few select
queries.
 | 
						||
 | 
						||
-- 
 | 
						||
greg
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49453=pgman=candle.pha.pa.us@postgresql.org Tue Jan 27 21:53:09 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49453=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0S2r3e04133
 | 
						||
	for <pgman@candle.pha.pa.us>; Tue, 27 Jan 2004 21:53:08 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 791556 for pgman@candle.pha.pa.us; Tue, 27 Jan 2004 18:49:49 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (neptune.hub.org [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id C4A10D1B47D
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 02:49:28 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 76787-10
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Tue, 27 Jan 2004 22:48:59 -0400 (AST)
 | 
						||
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id A5C5CD1B4DC
 | 
						||
	for <pgsql-hackers@postgresql.org>; Tue, 27 Jan 2004 22:48:56 -0400 (AST)
 | 
						||
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						||
	by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0S2mxTx005814;
 | 
						||
	Tue, 27 Jan 2004 21:48:59 -0500 (EST)
 | 
						||
To: Greg Stark <gsstark@mit.edu>
 | 
						||
cc: pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
In-Reply-To: <877jzcu85t.fsf@stark.xeocode.com> 
 | 
						||
References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com>
 | 
						||
Comments: In-reply-to Greg Stark <gsstark@mit.edu>
 | 
						||
	message dated "27 Jan 2004 21:19:26 -0500"
 | 
						||
Date: Tue, 27 Jan 2004 21:48:59 -0500
 | 
						||
Message-ID: <5813.1075258139@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Greg Stark <gsstark@mit.edu> writes:
>> Combining indexes via a bitmap intermediate step (which is not really
>> the same thing as bitmap indexes, IIUC) seems like a more robust
>> approach than relying on the index entries to be in ctid order.

> I would see that as the next step, but it seems to me it would be only a small
> set of queries where it would really help enough to outweigh the extra work of
> the sort.

What sort?  The whole point of a bitmap is that it makes it easy to
visit the tuples in heap order.  You scan the index, you set the
appropriate bits in the bitmap, and then you scan the bitmap and go to
the heap tuples that have their bits set.  If you are using multiple
indexes you can AND or OR their results at the bitmap phase before you
go to the heap.
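
A toy version of those steps, assuming heap tuples can be numbered with
small integers and using a Python int as the bitmap. Both choices, and
the function names, are assumptions made for the sake of the sketch:

```python
def build_bitmap(tuple_ids):
    """Scan one index's matches (any order) and set a bit for each."""
    bitmap = 0
    for tid in tuple_ids:
        bitmap |= 1 << tid
    return bitmap

def set_bits(bitmap):
    """Yield the positions of set bits in ascending, i.e. heap, order."""
    pos = 0
    while bitmap:
        if bitmap & 1:
            yield pos
        bitmap >>= 1
        pos += 1

# Hypothetical matches from two index scans, arriving in index order.
from_index_a = build_bitmap([42, 7, 19, 3])
from_index_b = build_bitmap([19, 88, 3, 5])

both = list(set_bits(from_index_a & from_index_b))    # a AND b
either = list(set_bits(from_index_a | from_index_b))  # a OR b
```

The index scans may return tuples in any order. The heap visit order
falls out of walking the bitmap from the low end.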
 | 
						||
 | 
						||
An implementation of this kind would not produce tuples in index order,
so if you have an ORDER BY to satisfy then you end up doing an explicit
sort after you have the tuples.  It would be up to the planner to
consider this cost versus the advantages of being able to use multiple
indexes; we'd certainly want to keep the existing scan mechanism as an
available alternative.  But if the query is suited to multiple indexes
I suspect it'd be a win pretty often.

> Note that the space saving of bitmap indexes is still a substantial factor.

I think you are still confusing what I'm talking about with a bitmap
index, ie, a persistent structure on-disk.  It's not that at all, but
a transient structure built in-memory during an index scan.

I'm a little dubious that true bitmap indexes would be worth building
for Postgres.  Seems like partial indexes cover the same sorts of
applications and are more flexible.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49462=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 13:10:48 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49462=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SIAle25230
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 13:10:47 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 793300 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 10:07:34 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 19389D1CCAF
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 17:56:46 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 10780-09
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Wed, 28 Jan 2004 13:56:14 -0400 (AST)
 | 
						||
Received: from www.postgresql.com (www.postgresql.com [200.46.204.209])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id A53DAD1DF6B
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 13:52:13 -0400 (AST)
 | 
						||
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
 | 
						||
	by www.postgresql.com (Postfix) with ESMTP id E0414CF6FBA
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 10:47:17 -0400 (AST)
 | 
						||
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
 | 
						||
	by smtp.istop.com (Postfix) with ESMTP
 | 
						||
	id C4D5036BA2; Wed, 28 Jan 2004 09:13:47 -0500 (EST)
 | 
						||
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
 | 
						||
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
 | 
						||
	id 1AlqRv-0001fZ-00; Wed, 28 Jan 2004 09:13:47 -0500
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
References: <87isixt9h7.fsf@stark.xeocode.com>
 | 
						||
	<29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com>
 | 
						||
	<403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com>
 | 
						||
	<5813.1075258139@sss.pgh.pa.us>
 | 
						||
In-Reply-To: <5813.1075258139@sss.pgh.pa.us>
 | 
						||
From: Greg Stark <gsstark@mit.edu>
 | 
						||
Organization: The Emacs Conspiracy; member since 1992
 | 
						||
Date: 28 Jan 2004 09:13:47 -0500
 | 
						||
Message-ID: <871xpktb38.fsf@stark.xeocode.com>
 | 
						||
Lines: 38
 | 
						||
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> >
> > I would see that as the next step, but it seems to me it would be only a small
> > set of queries where it would really help enough to outweigh the extra work of
> > the sort.
>
> What sort?

To build the in-memory bitmap you effectively have to do a sort. If the tuples
come out of the index in heap order then you can combine them without having
to go through that step.
 | 
						||
 | 
						||
> I'm a little dubious that true bitmap indexes would be worth building
> for Postgres.  Seems like partial indexes cover the same sorts of
> applications and are more flexible.

I'm clear on the distinction. I think bitmap indexes still have a place, but
if regular btree indexes could be combined efficiently then that would be an
even narrower niche.

Partial indexes are very handy, and they're useful in corner cases where
bitmap indexes are useful, such as flags for special types of records.

But I think bitmap indexes are specifically wanted by certain types of data
warehousing applications where you have an index on virtually every column and
then want to do arbitrary boolean combinations of all of them. btree indexes
would generate more i/o scanning all the indexes than just doing a sequential
scan would. Whereas bitmap indexes are much denser on disk.

However my experience leans more towards the OLTP side and I very rarely saw
applications like this.
 | 
						||
 | 
						||
 | 
						||
 | 
						||
-- 
 | 
						||
greg
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 3: if posting/reading through Usenet, please send an appropriate
 | 
						||
      subscribe-nomail command to majordomo@postgresql.org so that your
 | 
						||
      message can get through to the mailing list cleanly
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49465=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 13:30:48 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49465=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SIUke29027
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 13:30:47 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 793371 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 10:27:31 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 92005D1D3F7
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 18:14:02 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 21680-08
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Wed, 28 Jan 2004 14:13:31 -0400 (AST)
 | 
						||
Received: from www.postgresql.com (www.postgresql.com [200.46.204.209])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 088B0D1DC77
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 14:08:44 -0400 (AST)
 | 
						||
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | 
						||
	by www.postgresql.com (Postfix) with ESMTP id CFF50CF77BD
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 11:00:42 -0400 (AST)
 | 
						||
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						||
	by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0SExBYA018093;
 | 
						||
	Wed, 28 Jan 2004 09:59:12 -0500 (EST)
 | 
						||
To: Greg Stark <gsstark@mit.edu>
 | 
						||
cc: pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
In-Reply-To: <871xpktb38.fsf@stark.xeocode.com> 
 | 
						||
References: <87isixt9h7.fsf@stark.xeocode.com> <29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com> <403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com> <5813.1075258139@sss.pgh.pa.us> <871xpktb38.fsf@stark.xeocode.com>
 | 
						||
Comments: In-reply-to Greg Stark <gsstark@mit.edu>
 | 
						||
	message dated "28 Jan 2004 09:13:47 -0500"
 | 
						||
Date: Wed, 28 Jan 2004 09:59:11 -0500
 | 
						||
Message-ID: <18092.1075301951@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> What sort?

> To build the in-memory bitmap you effectively have to do a sort.

Hm, you're thinking that the operation of inserting a bit into a bitmap
has to be at least O(log N).  Seems to me that that depends on the data
structure you use.  In principle it could be O(1), if you use a true
bitmap (linear array) -- just index and set the bit.  You might be right
that practical data structures would be O(log N), but I'm not totally
convinced.
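
For concreteness, a sketch of the "index and set the bit" case with a
flat byte array. The class name and the assumption that tuples map to
integers 0..nbits-1 are illustrative only:

```python
class LinearBitmap:
    """A true bitmap: one bit per possible tuple, stored in a flat array."""

    def __init__(self, nbits):
        self.bits = bytearray((nbits + 7) // 8)

    def set(self, n):
        # Constant time: pick the byte, OR in the bit. No searching or sorting.
        self.bits[n >> 3] |= 1 << (n & 7)

    def test(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

bm = LinearBitmap(1000)
for n in (900, 12, 512):   # arrival order does not matter
    bm.set(n)
```

The price in this sketch is that the array is sized by the number of
possible tuples rather than by the number of matches.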
 | 
						||
 | 
						||
> If the tuples come out of the index in heap order then you can combine
> them without having to go through that step.

But considering the restrictions implied by that assumption --- no range
scans, no non-btree indexes --- I doubt we will take the trouble to
implement that variant.  We'll want to do the generalized bitmap code
anyway.

In any case, this discussion is predicated on the assumption that the
operations involving the bitmap are a significant fraction of the total
time, which I think is quite uncertain.  Until we build it and profile
it, we won't know that.
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 4: Don't 'kill -9' the postmaster
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49457=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 10:42:58 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49457=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SFgue00574
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 10:42:57 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 792727 for pgman@candle.pha.pa.us; Wed, 28 Jan 2004 07:39:41 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 08484D1CA01
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 15:38:28 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 36717-02
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Wed, 28 Jan 2004 11:37:55 -0400 (AST)
 | 
						||
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id E27BDD1D201
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 11:37:55 -0400 (AST)
 | 
						||
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
 | 
						||
	by smtp.istop.com (Postfix) with ESMTP
 | 
						||
	id 1E70F36BBA; Wed, 28 Jan 2004 10:09:35 -0500 (EST)
 | 
						||
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
 | 
						||
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
 | 
						||
	id 1AlrJu-0001rj-00; Wed, 28 Jan 2004 10:09:34 -0500
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
References: <87isixt9h7.fsf@stark.xeocode.com>
 | 
						||
	<29815.1075243916@sss.pgh.pa.us> <87d695t2ak.fsf@stark.xeocode.com>
 | 
						||
	<403.1075245881@sss.pgh.pa.us> <877jzcu85t.fsf@stark.xeocode.com>
 | 
						||
	<5813.1075258139@sss.pgh.pa.us> <871xpktb38.fsf@stark.xeocode.com>
 | 
						||
	<18092.1075301951@sss.pgh.pa.us>
 | 
						||
In-Reply-To: <18092.1075301951@sss.pgh.pa.us>
 | 
						||
From: Greg Stark <gsstark@mit.edu>
 | 
						||
Organization: The Emacs Conspiracy; member since 1992
 | 
						||
Date: 28 Jan 2004 10:09:34 -0500
 | 
						||
Message-ID: <87vfmwrtxt.fsf@stark.xeocode.com>
 | 
						||
Lines: 15
 | 
						||
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: ORr
 | 
						||
 | 
						||
 | 
						||
Tom Lane <tgl@sss.pgh.pa.us> writes:

> In any case, this discussion is predicated on the assumption that the
> operations involving the bitmap are a significant fraction of the total
> time, which I think is quite uncertain.  Until we build it and profile
> it, we won't know that.

The other thought I had was that it would be difficult to tell when to follow
this path. Since the main case where it wins is when the individual indexes
aren't very selective but the combination is very selective, and we don't have
inter-column correlation statistics ...
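
A toy illustration of why that is hard, assuming the only available
estimate treats the two clauses as independent. The row counts and the
helper are made up for the example:

```python
ROWS = 1000
a_matches = set(range(0, 100))   # 10% of rows satisfy the clause on a

def independent_estimate(n_a, n_b, rows):
    """Expected rows matching both clauses if the columns were independent."""
    return n_a * n_b // rows

# Two hypothetical tables with the same per-column selectivity for b (10%).
b_correlated = set(range(0, 100))    # b matches exactly where a does
b_disjoint = set(range(100, 200))    # b never matches where a does

estimate = independent_estimate(len(a_matches), 100, ROWS)
actual_correlated = len(a_matches & b_correlated)
actual_disjoint = len(a_matches & b_disjoint)
```

Per-column statistics are identical in both cases, yet the combined
result ranges from every row that matches a down to none.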
 | 
						||
 | 
						||
-- 
 | 
						||
greg
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 9: the planner will ignore your desire to choose an index scan if your
 | 
						||
      joining column's datatypes do not match
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49467=pgman=candle.pha.pa.us@postgresql.org Wed Jan 28 17:29:11 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49467=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0SMT9e09381
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 17:29:10 -0500 (EST)
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 7E6A1D1D0F9
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 22:29:02 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 30501-10 for <pgman@candle.pha.pa.us>;
 | 
						||
	Wed, 28 Jan 2004 18:28:33 -0400 (AST)
 | 
						||
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 002FED1CCDA
 | 
						||
	for <pgman@candle.pha.pa.us>; Wed, 28 Jan 2004 18:28:30 -0400 (AST)
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id BC300D1B4BD
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Wed, 28 Jan 2004 22:16:19 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 29171-03
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Wed, 28 Jan 2004 18:15:50 -0400 (AST)
 | 
						||
Received: from cmailm1.svr.pol.co.uk (cmailm1.svr.pol.co.uk [195.92.193.18])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 99F4BD1C50E
 | 
						||
	for <pgsql-hackers@postgresql.org>; Wed, 28 Jan 2004 18:15:47 -0400 (AST)
 | 
						||
Received: from modem-182.leopard.dialup.pol.co.uk ([217.135.144.182] helo=LaptopDellXP)
 | 
						||
	by cmailm1.svr.pol.co.uk with esmtp (Exim 4.14)
 | 
						||
	id 1AlxyO-0002XD-Ab; Wed, 28 Jan 2004 22:15:48 +0000
 | 
						||
Reply-To: <simon@2ndquadrant.com>
 | 
						||
From: "Simon Riggs" <simon@2ndquadrant.com>
 | 
						||
To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, "'Greg Stark'" <gsstark@mit.edu>
 | 
						||
cc: <pgsql-hackers@postgresql.org>
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
Date: Wed, 28 Jan 2004 22:15:40 -0000
 | 
						||
Organization: 2nd Quadrant
 | 
						||
Message-ID: <003701c3e5ec$44306250$efb887d9@LaptopDellXP>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain;
 | 
						||
	charset="US-ASCII"
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
X-Priority: 3 (Normal)
 | 
						||
X-MSMail-Priority: Normal
 | 
						||
X-Mailer: Microsoft Outlook, Build 10.0.2627
 | 
						||
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2727.1300
 | 
						||
Importance: Normal
 | 
						||
In-Reply-To: <18092.1075301951@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Some potentially helpful background comments on the discussion so far...
 | 
						||
 | 
						||
>Tom Lane writes
 | 
						||
>>Greg Stark writes
 | 
						||
>> Note that the space saving of bitmap indexes is still a substantial 
 | 
						||
>> factor.
 | 
						||
>I think you are still confusing what I'm talking about with a bitmap
>index, ie, a persistent structure on-disk.  It's not that at all, but a
>transient structure built in-memory during an index scan.
 | 
						||
 | 
						||
Oracle allows the creation of bitmap indices as persistent data
 | 
						||
structures. 
 | 
						||
 | 
						||
The "space saving" of bitmap indices is only a saving when compared with
 | 
						||
btree indices. If you don't have them at all because they are built
 | 
						||
dynamically when required, as Tom is suggesting, then you "save" even
 | 
						||
more space. 
 | 
						||
 | 
						||
Maintaining the bitmap index is a costly operation. You tend to want to
 | 
						||
build them on "characteristic" columns, of which there tend to be more
in a database than "partial/full identity" columns on which you build
 | 
						||
btrees (forgive the vagueness of that comment), so you end up with loads
 | 
						||
of the damn things, so the space soon adds up. It can be hard to judge
 | 
						||
which ones are the important ones, especially when each is used by a
 | 
						||
different user/group. Building them dynamically is a good way of solving
 | 
						||
the question "which ones are needed?". Ever seen 58 indices on a table?
 | 
						||
Don't go there.
 | 
						||
 | 
						||
My vote would be to implement the dynamic building capability, then return
 | 
						||
to implement a persisted structure later if that seems like it would be
 | 
						||
a further improvement. [The option would be nice]
 | 
						||
 | 
						||
If we do it dynamically, as Tom suggests, then we don't have to code the
 | 
						||
index maintenance logic at all and the functionality will be with us all
 | 
						||
the sooner. Go Tom!
 | 
						||
 | 
						||
>Tom Lane writes
 | 
						||
> In any case, this discussion is predicated on the assumption that the
 | 
						||
> operations involving the bitmap are a significant fraction of the total
> time, which I think is quite uncertain.  Until we build it and profile
 | 
						||
> it, we won't know that.
 | 
						||
 | 
						||
Dynamically building the bitmaps has been the strategy in use by
 | 
						||
Teradata for nearly a decade on many large data warehouses. I can
 | 
						||
personally vouch for the effectiveness of this approach - I was
 | 
						||
surprised when Oracle went for the persistent option. Certainly in that
 | 
						||
case building the bitmaps adds much less time than is saved overall by
 | 
						||
the better total query strategy.
 | 
						||
 | 
						||
>Greg Stark writes
 | 
						||
> > To build the in-memory bitmap you effectively have to do a sort.
 | 
						||
 | 
						||
Not sure on this latter point: I think I agree with Greg on that point,
 | 
						||
but want to believe Tom because requiring a sort will definitely add
 | 
						||
time. 
 | 
						||
 | 
						||
To shed some light in this area, some other major implementations are:
 | 
						||
 | 
						||
In Teradata, tables are stored based upon a primary index, which is
 | 
						||
effectively an index-organised table. The index pointers are stored in
 | 
						||
sorted order in lock step with the blocks of the associated table - No sort
 | 
						||
required. (The ordering is based upon a hashed index, but that doesn't
 | 
						||
change the technique).
 | 
						||
 | 
						||
Oracle's tables/indexes use heaps/btrees also, though they do provide an
 | 
						||
index-organised table feature similar to Teradata. Maybe the lack of
 | 
						||
heap/btree consistent ordering in Oracle and their subsequent design
 | 
						||
choice of persistent bitmap indices is an indication for PostgreSQL too?
 | 
						||
 | 
						||
In Oracle, bitmap indices are an important precursor to the star join
 | 
						||
technique. AFAICS it is still possible to have a star join plan without
 | 
						||
having persistent bitmap indices. IMHO, the longer term goal of a good
 | 
						||
star join plan is an important one - that may influence the design
 | 
						||
selection for this discussion.
 | 
						||
 | 
						||
Hope some of that helps,
 | 
						||
 | 
						||
Best regards, Simon Riggs
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 8: explain analyze is your friend
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49477=pgman=candle.pha.pa.us@postgresql.org Thu Jan 29 04:24:47 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49477=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0T9Ohe19178
 | 
						||
	for <pgman@candle.pha.pa.us>; Thu, 29 Jan 2004 04:24:43 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 794811 for pgman@candle.pha.pa.us; Thu, 29 Jan 2004 01:21:28 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 639A8D1B4CE
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Thu, 29 Jan 2004 09:17:40 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 24681-09
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Thu, 29 Jan 2004 05:17:16 -0400 (AST)
 | 
						||
Received: from loki.hnit.is (unknown [193.4.243.180])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 98971D1C9FD
 | 
						||
	for <pgsql-hackers@postgresql.org>; Thu, 29 Jan 2004 05:17:07 -0400 (AST)
 | 
						||
Received: from seifur.hnit.is ([193.4.243.99]) by 193.4.243.180 with trend_isnt_name_B; Thu, 29 Jan 2004 09:17:12 -0000
 | 
						||
X-MimeOLE: Produced By Microsoft Exchange V6.0.6487.1
 | 
						||
Content-Class: urn:content-classes:message
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain;
 | 
						||
	charset="us-ascii"
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
Date: Thu, 29 Jan 2004 09:17:11 -0000
 | 
						||
Message-ID: <0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is>
 | 
						||
Thread-Topic: [HACKERS] Question about indexes 
 | 
						||
Thread-Index: AcPl7J1SKohPpCtfSZq2EeeqhKLynAAW3BDw
 | 
						||
From: <lnd@hnit.is>
 | 
						||
To: <pgsql-hackers@postgresql.org>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Content-Transfer-Encoding: 8bit
 | 
						||
X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id i0T9Ohe19178
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.7 required=5.0 tests=BAYES_00,NO_REAL_NAME 
 | 
						||
	autolearn=no version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
A small comment on Oracle's implementation of persistent bitmap indexes:
 | 
						||
 | 
						||
Oracle's bitmap index is concurrently locked by DML, i.e. it is suited to OLAP
(basically read-only data warehouses) but in no way to OLTP. 
 | 
						||
 | 
						||
IMHO, 
 | 
						||
Laimis
 | 
						||
 | 
						||
> Maybe the lack of heap/btree consistent ordering in Oracle 
 | 
						||
> and their subsequent design choice of persistent bitmap 
 | 
						||
> indices is an indication for PostgreSQL too?
 | 
						||
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49497=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 01:22:15 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49497=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0U6MCe03385
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 01:22:14 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 797306 for pgman@candle.pha.pa.us; Thu, 29 Jan 2004 22:18:52 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 6CCBCD1C967
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 06:16:52 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 81674-05
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 02:16:22 -0400 (AST)
 | 
						||
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 6DC4BD1CC98
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 02:16:21 -0400 (AST)
 | 
						||
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
 | 
						||
	by smtp.istop.com (Postfix) with ESMTP
 | 
						||
	id 8FD5F369BB; Fri, 30 Jan 2004 01:16:21 -0500 (EST)
 | 
						||
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
 | 
						||
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
 | 
						||
	id 1AmRwz-0004kf-00; Fri, 30 Jan 2004 01:16:21 -0500
 | 
						||
To: pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
References: <0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is>
 | 
						||
In-Reply-To: <0A5B2E3C3A64CA4AB14F76DBCA76DDA44EF9B2@seifur.hnit.is>
 | 
						||
From: Greg Stark <gsstark@mit.edu>
 | 
						||
Organization: The Emacs Conspiracy; member since 1992
 | 
						||
Date: 30 Jan 2004 01:16:21 -0500
 | 
						||
Message-ID: <87y8rqx8p6.fsf@stark.xeocode.com>
 | 
						||
Lines: 31
 | 
						||
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=us-ascii
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
 | 
						||
<lnd@hnit.is> writes:
 | 
						||
 | 
						||
> A small comment on Oracle's implementation of persistent bitmap indexes:
 | 
						||
> 
 | 
						||
> Oracle's bitmap index is concurrently locked by DML, i.e. it is suited to OLAP
> (basically read-only data warehouses) but in no way to OLTP. 
 | 
						||
 | 
						||
I knew this. I think they figured that was ok because bitmap indexes were
 | 
						||
mainly intended to solve data warehouse problems anyways.
 | 
						||
 | 
						||
Thinking out loud here, I wonder whether this would be less of a problem for
 | 
						||
postgres. Since tuples are never updated in place there would never be a need
 | 
						||
to lock the entire bitmap until a transaction completes.
 | 
						||
 | 
						||
There would never be as much concurrency as btrees, assuming there was any
 | 
						||
kind of compression on the bitmap, but I don't see any reason why a long-term
 | 
						||
lock would have to be held for updates.
 | 
						||
 | 
						||
Even regular vacuum might not have to lock anything for long, just long enough
 | 
						||
to clear the bits. And vacuum full/cluster already take table locks anyway.
 | 
						||
 | 
						||
I think the problem Oracle ran into was that storing rollback ids in the
 | 
						||
bitmap is untenable. The whole point of persistent bitmap indexes is to store
 | 
						||
a very dense representation that represents thousands of records per page.
 | 
						||
Allocating space to store thousands of pending transaction ids and having
 | 
						||
thousands of old versions of the page in the rollback segment would defeat the
 | 
						||
purpose.
 | 
						||
 | 
						||
-- 
 | 
						||
greg
 | 
						||
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 7: don't forget to increase your free space map settings
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49502=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 06:37:25 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49502=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UBbOe07302
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 06:37:25 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 797695 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 03:34:06 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 92A3CD1CCB7
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 11:31:21 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 76882-10
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 07:31:24 -0400 (AST)
 | 
						||
Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 59850D1CACB
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 07:31:20 -0400 (AST)
 | 
						||
Received: (from pgman@localhost)
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) id i0UBVHU04169;
 | 
						||
	Fri, 30 Jan 2004 06:31:17 -0500 (EST)
 | 
						||
From: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Message-ID: <200401301131.i0UBVHU04169@candle.pha.pa.us>
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
In-Reply-To: <87vfmwrtxt.fsf@stark.xeocode.com>
 | 
						||
To: Greg Stark <gsstark@mit.edu>
 | 
						||
Date: Fri, 30 Jan 2004 06:31:17 -0500 (EST)
 | 
						||
cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
X-Mailer: ELM [version 2.4ME+ PL108 (25)]
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Content-Type: text/plain; charset=US-ASCII
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
Greg Stark wrote:
 | 
						||
> 
 | 
						||
> Tom Lane <tgl@sss.pgh.pa.us> writes:
 | 
						||
> 
 | 
						||
> > In any case, this discussion is predicated on the assumption that the
 | 
						||
> > operations involving the bitmap are a significant fraction of the total
 | 
						||
> > time, which I think is quite uncertain.  Until we build it and profile
 | 
						||
> > it, we won't know that.
 | 
						||
> 
 | 
						||
> The other thought I had was that it would be difficult to tell when to follow
 | 
						||
> this path. Since the main case where it wins is when the individual indexes
 | 
						||
> aren't very selective but the combination is very selective, and we don't have
 | 
						||
> inter-column correlation statistics ...
 | 
						||
 | 
						||
I like the idea of building in-memory bitmapped indexes.
 | 
						||
 | 
						||
In your example, if you are restricting on A and B, and have no A,B
 | 
						||
index but an A index and B index, why wouldn't you always create an
 | 
						||
in-memory bitmapped index from indexes A and B, unless index A hits only
 | 
						||
a few rows?  In fact, from the optimizer statistics, you can guess
 | 
						||
how many bits you will hit from index A and index B, so we only have to
 | 
						||
decide if it is better to take the more restrictive index and do heap
 | 
						||
lookups for those, or scan the second index and then hit the heap.  The
 | 
						||
only thing A,B combined statistics would tell you is how many heap
 | 
						||
matches you will find.  The time to scan A and B indexes and create the
 | 
						||
bitmap is already guessable from the single column statistics.
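
To make the "bitmap from indexes A and B" idea concrete, here is a minimal sketch of combining two index scans. It assumes each scan yields (page, item) row pointers and represents each page's bits as a Python integer; the function names are hypothetical and this is not how any PostgreSQL code is written.

```python
from collections import defaultdict


def build_bitmap(tids):
    """Map page number -> integer bitmask of matching item numbers."""
    bitmap = defaultdict(int)
    for page, item in tids:
        bitmap[page] |= 1 << item
    return dict(bitmap)


def and_bitmaps(left, right):
    """Keep only the rows present in both bitmaps."""
    result = {}
    for page, bits in left.items():
        both = bits & right.get(page, 0)
        if both:                      # drop pages with no surviving rows
            result[page] = both
    return result


def tids_in_heap_order(bitmap):
    """Yield (page, item) pairs sorted by page, then item."""
    for page in sorted(bitmap):
        bits = bitmap[page]
        item = 0
        while bits:
            if bits & 1:
                yield (page, item)
            bits >>= 1
            item += 1


# Row pointers as they might come back from two separate index scans.
from_index_a = [(7, 2), (3, 1), (3, 4), (9, 0)]
from_index_b = [(3, 4), (9, 0), (9, 5), (1, 1)]

matches = and_bitmaps(build_bitmap(from_index_a), build_bitmap(from_index_b))
print(list(tids_in_heap_order(matches)))
```

Only the rows that satisfy both restrictions are left, and they come out in page order, so each heap page needs to be visited at most once.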
 | 
						||
 | 
						||
Also, what does an in-memory bitmapped index look like?  Is it:
 | 
						||
 | 
						||
	value:  bitmap...
 | 
						||
	value:  bitmap...
 | 
						||
 | 
						||
with the values organized in a btree fashion?
 | 
						||
 | 
						||
-- 
 | 
						||
  Bruce Momjian                        |  http://candle.pha.pa.us
 | 
						||
  pgman@candle.pha.pa.us               |  (610) 359-1001
 | 
						||
  +  If your life is a hard drive,     |  13 Roberts Road
 | 
						||
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 | 
						||
 | 
						||
---------------------------(end of broadcast)---------------------------
 | 
						||
TIP 6: Have you searched our list archives?
 | 
						||
 | 
						||
               http://archives.postgresql.org
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49505=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 09:55:27 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49505=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from zippy.ims.net (IDENT:BTCTknqFfnMWdPgoZjvES928uVdg+CPr@zippy.ims.net [208.166.202.2])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UEtPe12397
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:55:26 -0500 (EST)
 | 
						||
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
 | 
						||
	by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UEsQt01250
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 08:54:31 -0600
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 3DF5DD1C9E1
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 14:48:26 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 55394-05
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 10:48:29 -0400 (AST)
 | 
						||
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 79B71D1C992
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 10:48:25 -0400 (AST)
 | 
						||
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						||
	by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0UEmJw9012966;
 | 
						||
	Fri, 30 Jan 2004 09:48:19 -0500 (EST)
 | 
						||
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
In-Reply-To: <200401301131.i0UBVHU04169@candle.pha.pa.us> 
 | 
						||
References: <200401301131.i0UBVHU04169@candle.pha.pa.us>
 | 
						||
Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
	message dated "Fri, 30 Jan 2004 06:31:17 -0500"
 | 
						||
Date: Fri, 30 Jan 2004 09:48:19 -0500
 | 
						||
Message-ID: <12965.1075474099@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no 
 | 
						||
	version=2.61
 | 
						||
Status: ORr
 | 
						||
 | 
						||
Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						||
> Also, what does an in-memory bitmapped index look like?
 | 
						||
 | 
						||
One idea that might work: a binary search tree in which each node
 | 
						||
represents a single page of the table, and contains a bit array with
 | 
						||
one bit for each possible item number on the page.  You could not need
 | 
						||
more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits
 | 
						||
in a node, or about 36 bytes at default BLCKSZ --- for most tables you
 | 
						||
could probably prove it would be a great deal less.  You only allocate
 | 
						||
nodes for pages that have at least one interesting row.
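
The "about 36 bytes" figure can be checked with a short calculation. The two struct sizes below are assumptions chosen for illustration (the real values depend on the PostgreSQL version and platform alignment); only the 8192-byte default block size is taken as given.

```python
BLCKSZ = 8192                 # default block size in bytes
HEAP_TUPLE_HEADER_SIZE = 24   # assumed sizeof(HeapTupleHeaderData)
ITEM_ID_SIZE = 4              # assumed sizeof(ItemIdData)

# Upper bound on items per page: every tuple needs at least a header
# plus a line pointer (the page header itself is ignored here).
max_items_per_page = BLCKSZ // (HEAP_TUPLE_HEADER_SIZE + ITEM_ID_SIZE)

# One bit per possible item number.
bytes_per_node = max_items_per_page / 8

print(max_items_per_page, bytes_per_node)
```

With these assumed sizes the bound comes out at a few hundred bits per page, consistent with the estimate in the text.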
 | 
						||
 | 
						||
I think this would represent a reasonable compromise between size and
 | 
						||
insertion speed.  It would only get large if the indexscan output
 | 
						||
demanded visiting many different pages --- but at some point you could
 | 
						||
abandon index usage and do a sequential scan, so I think that property
 | 
						||
is okay.
 | 
						||
 | 
						||
A variant is to make the per-page bit arrays be entries in a hash table
 | 
						||
with page number as hash key.  This would reduce insertion to a nearly
 | 
						||
constant-time operation, but the drawback is that you'd need an explicit
 | 
						||
sort at the end to put the per-page entries into page number order
 | 
						||
before you scan 'em.  You might come out ahead anyway, not sure.
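
A minimal sketch of the hash-table variant, assuming a fixed per-page bit array and using a Python dict as the hash table. The class name and the default of 292 items per page are illustrative assumptions, not anything from the PostgreSQL source.

```python
class PageBitmapHash:
    """Per-page bit arrays stored in a hash table keyed by page number."""

    def __init__(self, max_items_per_page=292):
        self.max_items = max_items_per_page
        self.pages = {}   # page number -> bytearray used as a bit array

    def add(self, page, item):
        """Set the bit for one row; expected constant time."""
        if not 0 <= item < self.max_items:
            raise ValueError("item number out of range")
        bits = self.pages.get(page)
        if bits is None:
            # Allocate a node only for pages with at least one match.
            bits = self.pages[page] = bytearray((self.max_items + 7) // 8)
        bits[item // 8] |= 1 << (item % 8)

    def scan_order(self):
        """Sort the page numbers once at the end, then yield (page, items)."""
        for page in sorted(self.pages):
            bits = self.pages[page]
            items = [i for i in range(self.max_items)
                     if bits[i // 8] & (1 << (i % 8))]
            yield page, items


bm = PageBitmapHash()
for page, item in [(42, 3), (7, 0), (42, 1), (7, 0)]:
    bm.add(page, item)
print(list(bm.scan_order()))
```

Insertion order does not matter and duplicates are absorbed; the explicit sort is deferred to `scan_order`, which is the trade-off described above.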
 | 
						||
 | 
						||
Or we could try a true linear bitmap (indexed by page number times
 | 
						||
max-items-per-page plus item number) that's compressed in some fashion,
 | 
						||
probably just by eliminating large runs of zeroes.  The difficulty here
 | 
						||
is that inserting a new one-bit could be pretty expensive, and we need
 | 
						||
it to be cheap.
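
The linear-bitmap idea and its insertion cost can be sketched as follows. The encoding shown (storing the length of each run of zeroes before a one-bit) is just one simple choice made for illustration; the names and the 292 items-per-page figure are assumptions.

```python
def linear_position(page, item, max_items_per_page):
    """Bit index = page number times max-items-per-page plus item number."""
    return page * max_items_per_page + item


def compress_zero_runs(positions):
    """Encode one-bit positions as the zero-run length before each one-bit."""
    runs = []
    prev = -1
    for pos in sorted(set(positions)):
        runs.append(pos - prev - 1)
        prev = pos
    return runs


def decompress(runs):
    """Recover the sorted one-bit positions from the zero-run lengths."""
    positions = []
    pos = -1
    for gap in runs:
        pos += gap + 1
        positions.append(pos)
    return positions


MAX_ITEMS = 292  # assumed bound on items per page
tids = [(0, 5), (0, 6), (3, 0)]
positions = [linear_position(p, i, MAX_ITEMS) for p, i in tids]
runs = compress_zero_runs(positions)
print(positions, runs)
```

Adding a new one-bit in the middle of such an encoding means locating the run that contains it and splitting that run in two, which in a flat array shifts everything after it. That is the insertion cost the paragraph above worries about.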
 | 
						||
 | 
						||
Perhaps someone can come up with other better ideas ...
 | 
						||
 | 
						||
			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49506=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:23:37 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49506=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFNZe17036
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:23:36 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 797996 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 07:20:18 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 8901ED1C9B3
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:14:26 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 67347-02
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 11:14:30 -0400 (AST)
 | 
						||
Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id F021AD1C95E
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:14:24 -0400 (AST)
 | 
						||
Received: (from pgman@localhost)
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) id i0UFEMl15556;
 | 
						||
	Fri, 30 Jan 2004 10:14:22 -0500 (EST)
 | 
						||
From: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Message-ID: <200401301514.i0UFEMl15556@candle.pha.pa.us>
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
In-Reply-To: <12965.1075474099@sss.pgh.pa.us>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
Date: Fri, 30 Jan 2004 10:14:22 -0500 (EST)
 | 
						||
cc: Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
 | 
						||
X-Mailer: ELM [version 2.4ME+ PL108 (25)]
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Content-Type: text/plain; charset=US-ASCII
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
Tom Lane wrote:
 | 
						||
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						||
> > Also, what does an in-memory bitmapped index look like?
 | 
						||
> 
 | 
						||
> One idea that might work: a binary search tree in which each node
 | 
						||
> represents a single page of the table, and contains a bit array with
 | 
						||
> one bit for each possible item number on the page.  You could not need
 | 
						||
> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits
 | 
						||
> in a node, or about 36 bytes at default BLCKSZ --- for most tables you
 | 
						||
> could probably prove it would be a great deal less.  You only allocate
 | 
						||
> nodes for pages that have at least one interesting row.
 | 
						||
 | 
						||
Actually, I think I made a mistake.  I was wondering what on-disk
 | 
						||
bitmapped indexes look like.
 | 
						||
 | 
						||
-- 
 | 
						||
  Bruce Momjian                        |  http://candle.pha.pa.us
 | 
						||
  pgman@candle.pha.pa.us               |  (610) 359-1001
 | 
						||
  +  If your life is a hard drive,     |  13 Roberts Road
 | 
						||
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 | 
						||
 | 
						||
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49507=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:31:27 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49507=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from zippy.ims.net (IDENT:AWZrLd+EfFmX1x4Ch6+4AfIqn908pAfY@zippy.ims.net [208.166.202.2])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFVOe18065
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:31:26 -0500 (EST)
 | 
						||
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
 | 
						||
	by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UFURt02719
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:30:32 -0600
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 9DF9ED1CCA7
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:22:35 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 66733-09
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 11:22:39 -0400 (AST)
 | 
						||
Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 235C3D1CCB2
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:22:33 -0400 (AST)
 | 
						||
Received: (from pgman@localhost)
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) id i0UFMYr16926;
 | 
						||
	Fri, 30 Jan 2004 10:22:34 -0500 (EST)
 | 
						||
From: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Message-ID: <200401301522.i0UFMYr16926@candle.pha.pa.us>
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
In-Reply-To: <87vfmwrtxt.fsf@stark.xeocode.com>
 | 
						||
To: Greg Stark <gsstark@mit.edu>
 | 
						||
Date: Fri, 30 Jan 2004 10:22:34 -0500 (EST)
 | 
						||
cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgresql.org
 | 
						||
X-Mailer: ELM [version 2.4ME+ PL108 (25)]
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Content-Type: text/plain; charset=US-ASCII
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
Greg Stark wrote:
>
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > In any case, this discussion is predicated on the assumption that the
> > operations involving the bitmap are a significant fraction of the total
> > time, which I think is quite uncertain.  Until we build it and profile
> > it, we won't know that.
>
> The other thought I had was that it would be difficult to tell when to follow
> this path. Since the main case where it wins is when the individual indexes
> aren't very selective but the combination is very selective, and we don't have
> inter-column correlation statistics ...

We actually have both heap access cost and index access cost estimates.
You could compare the cost of visiting every heap row that index A
points to against the cost of also scanning index B and then, hopefully,
visiting fewer heap rows.
 | 
						||
 | 
						||
-- 
 | 
						||
  Bruce Momjian                        |  http://candle.pha.pa.us
 | 
						||
  pgman@candle.pha.pa.us               |  (610) 359-1001
 | 
						||
  +  If your life is a hard drive,     |  13 Roberts Road
 | 
						||
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 | 
						||
 | 
						||
 | 
						||
From alvherre@CM-lcon2-51-253.cm.vtr.net Fri Jan 30 10:24:32 2004
 | 
						||
Return-path: <alvherre@CM-lcon2-51-253.cm.vtr.net>
 | 
						||
Received: from CM-lcon2-51-253.cm.vtr.net (CM-lcon2-51-253.cm.vtr.net [200.83.51.253])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFOSe17199
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:24:31 -0500 (EST)
 | 
						||
Received: by CM-lcon2-51-253.cm.vtr.net (Postfix, from userid 500)
 | 
						||
	id 9A93157578; Fri, 30 Jan 2004 10:24:18 -0500 (EST)
 | 
						||
Date: Fri, 30 Jan 2004 12:24:18 -0300
 | 
						||
From: Alvaro Herrera <alvherre@dcc.uchile.cl>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
 | 
						||
   pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
Message-ID: <20040130152418.GB24123@dcc.uchile.cl>
 | 
						||
References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us>
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Type: text/plain; charset=iso-8859-1
 | 
						||
Content-Disposition: inline
 | 
						||
Content-Transfer-Encoding: 8bit
 | 
						||
In-Reply-To: <12965.1075474099@sss.pgh.pa.us>
 | 
						||
User-Agent: Mutt/1.4.1i
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: ORr
 | 
						||
 | 
						||
On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote:

> A variant is to make the per-page bit arrays be entries in a hash table
> with page number as hash key.  This would reduce insertion to a nearly
> constant-time operation, but the drawback is that you'd need an explicit
> sort at the end to put the per-page entries into page number order
> before you scan 'em.  You might come out ahead anyway, not sure.

Is there a reason to sort the pages before scanning them?  The result won't
come out sorted one way or the other.
 | 
						||
 | 
						||
-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Para tener más hay que desear menos"
 | 
						||
 | 
						||
From pgsql-hackers-owner+M49509=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 10:39:11 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49509=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from zippy.ims.net (IDENT:QumGpJuSSF+qB+W577trqd4FqP6fc1O+@zippy.ims.net [208.166.202.2])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UFd9e19273
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 10:39:10 -0500 (EST)
 | 
						||
Received: from postgresql.org (svr1.postgresql.org [200.46.204.71])
 | 
						||
	by zippy.ims.net (8.11.6/linuxconf) with ESMTP id i0UFcDt02990
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 09:38:17 -0600
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id 606FBD1BA96
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 15:31:24 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 73148-04
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 11:31:28 -0400 (AST)
 | 
						||
Received: from candle.pha.pa.us (candle.pha.pa.us [207.106.42.251])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id D7A47D1B4BD
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 11:31:22 -0400 (AST)
 | 
						||
Received: (from pgman@localhost)
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) id i0UFUgQ18014;
 | 
						||
	Fri, 30 Jan 2004 10:30:42 -0500 (EST)
 | 
						||
From: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						||
Message-ID: <200401301530.i0UFUgQ18014@candle.pha.pa.us>
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
In-Reply-To: <20040130152418.GB24123@dcc.uchile.cl>
 | 
						||
To: Alvaro Herrera <alvherre@dcc.uchile.cl>
 | 
						||
Date: Fri, 30 Jan 2004 10:30:42 -0500 (EST)
 | 
						||
cc: Tom Lane <tgl@sss.pgh.pa.us>, Greg Stark <gsstark@mit.edu>,
 | 
						||
   pgsql-hackers@postgresql.org
 | 
						||
X-Mailer: ELM [version 2.4ME+ PL108 (25)]
 | 
						||
MIME-Version: 1.0
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Content-Type: text/plain; charset=US-ASCII
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
Status: OR
 | 
						||
 | 
						||
Alvaro Herrera wrote:
> On Fri, Jan 30, 2004 at 09:48:19AM -0500, Tom Lane wrote:
>
> > A variant is to make the per-page bit arrays be entries in a hash table
> > with page number as hash key.  This would reduce insertion to a nearly
> > constant-time operation, but the drawback is that you'd need an explicit
> > sort at the end to put the per-page entries into page number order
> > before you scan 'em.  You might come out ahead anyway, not sure.
>
> Is there a reason to sort the pages before scanning them?  The result won't
> come out sorted one way or the other.

I think the goal would be to hit the heap in sequential order as much as
possible.  When we read straight from the index, we haven't collected
all the heap locations in one place.  Here we have them in memory, so we
might as well sort them.  I don't think that is a requirement, just a
performance enhancement, or at least that is my guess.
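
A minimal sketch of the hash-table variant with a final sort, to make the trade-off concrete. The function names and the (page, item) tuples are hypothetical illustrations, not PostgreSQL code; a Python dict stands in for the hash table.

```python
from collections import defaultdict


def build_page_bitmap(tids):
    """Collect (page, item) pairs from an index scan into {page: bitmask}.

    Insertion is a hash lookup plus a bit-or, so it is close to constant time.
    """
    pages = defaultdict(int)
    for page, item in tids:
        pages[page] |= 1 << item
    return pages


def heap_visit_order(pages):
    """Sort the page numbers once, so the heap is read in ascending order."""
    return sorted(pages)


# Index entries arrive in index order, not heap order (made-up sample data).
tids = [(907, 3), (12, 1), (455, 7), (12, 9), (907, 0)]
pages = build_page_bitmap(tids)
order = heap_visit_order(pages)
```

The sort is not needed for correctness; it only turns scattered heap reads into a mostly forward pass over the table.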
 | 
						||
 | 
						||
-- 
 | 
						||
  Bruce Momjian                        |  http://candle.pha.pa.us
 | 
						||
  pgman@candle.pha.pa.us               |  (610) 359-1001
 | 
						||
  +  If your life is a hard drive,     |  13 Roberts Road
 | 
						||
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 | 
						||
 | 
						||
 | 
						||
From hannu@tm.ee Fri Jan 30 17:44:13 2004
 | 
						||
Return-path: <hannu@tm.ee>
 | 
						||
Received: from fuji.krosing.net (217-159-136-226-dsl.kt.estpak.ee [217.159.136.226])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UMi5e23093
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 17:44:12 -0500 (EST)
 | 
						||
Received: from fuji.krosing.net (localhost.localdomain [127.0.0.1])
 | 
						||
	by fuji.krosing.net (8.12.8/8.12.8) with ESMTP id i0UMhuEl005243;
 | 
						||
	Sat, 31 Jan 2004 00:43:57 +0200
 | 
						||
Received: (from hannu@localhost)
 | 
						||
	by fuji.krosing.net (8.12.8/8.12.8/Submit) id i0UMhs94005241;
 | 
						||
	Sat, 31 Jan 2004 00:43:54 +0200
 | 
						||
X-Authentication-Warning: fuji.krosing.net: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
 | 
						||
   pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <12965.1075474099@sss.pgh.pa.us>
 | 
						||
References: <200401301131.i0UBVHU04169@candle.pha.pa.us>
 | 
						||
  <12965.1075474099@sss.pgh.pa.us>
 | 
						||
Content-Type: text/plain; charset=
 | 
						||
Message-ID: <1075502634.4007.32.camel@fuji.krosing.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
X-Mailer: Ximian Evolution 1.4.5 
 | 
						||
Date: Sat, 31 Jan 2004 00:43:54 +0200
 | 
						||
Content-Transfer-Encoding: 8bit
 | 
						||
X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id i0UMi5e23093
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Tom Lane wrote on Fri, 30.01.2004 at 16:48:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Also, what does an in-memory bitmapped index look like?
>
> One idea that might work: a binary search tree in which each node
> represents a single page of the table, and contains a bit array with
> one bit for each possible item number on the page.  You could not need
> more than BLCKSZ/(sizeof(HeapTupleHeaderData)+sizeof(ItemIdData)) bits
> in a node, or about 36 bytes at default BLCKSZ --- for most tables you
> could probably prove it would be a great deal less.  You only allocate
> nodes for pages that have at least one interesting row.

Another idea would be to use bitmaps that hold just one bit per
database page, and then do a seq scan over only the marked pages.

Even when allocated in full, such an index would occupy just
1/(8k*8bit) of the data it describes, so the index for a 1GB table
would be 1G/(8k*8bit) = 16 kilobytes (2 pages).
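
The size arithmetic above can be checked with a short sketch. It assumes 8k pages and binary units (1G = 1024^3 bytes); the function name is a hypothetical illustration.

```python
PAGE_SIZE = 8 * 1024   # bytes per page, the 8k assumed in the estimate above
BITS_PER_BYTE = 8


def page_bitmap_bytes(table_bytes, page_size=PAGE_SIZE):
    """Bytes needed for a bitmap holding one bit per table page."""
    pages = -(-table_bytes // page_size)   # ceiling division
    return -(-pages // BITS_PER_BYTE)


one_gb = 1024 ** 3
bitmap_bytes = page_bitmap_bytes(one_gb)     # 131072 pages -> 16384 bytes
bitmap_pages = bitmap_bytes // PAGE_SIZE     # 2 pages of 8k each
```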

Also, such indexes, if persistent, could be used (together with the
FSM) when deciding where to place new tuples, so they would provide a
form of clustering.

This would of course be most useful for data-warehouse type operations,
where the database is significantly bigger than memory.

And the seqscan over the bitmap should not be done in simple page order,
but rather in two passes:
 1. over those pages which are already in cache (either PostgreSQL's
    or the system's, if we find a way to get such info from the system)
 2. in sequential order over the rest.
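
The two passes can be sketched as follows. The set of cached pages is taken as given here, since how to learn it from the buffer manager or the operating system is exactly the open question above. Sorting the first pass is an extra assumption, and the function name is hypothetical.

```python
def two_pass_order(marked_pages, cached_pages):
    """Return marked pages with the cached ones first, then the rest.

    Pass 1: marked pages already in cache (cheap, no disk read expected).
    Pass 2: the remaining marked pages in ascending (sequential) order.
    """
    marked = set(marked_pages)
    cached = set(cached_pages)
    first_pass = sorted(marked & cached)
    second_pass = sorted(marked - cached)
    return first_pass + second_pass


# Pages 4 and 9 are marked and cached; page 100 is cached but not marked.
visit = two_pass_order([1, 4, 5, 9, 12], {9, 4, 100})
```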

> I think this would represent a reasonable compromise between size and
> insertion speed.  It would only get large if the indexscan output
> demanded visiting many different pages --- but at some point you could
> abandon index usage and do a sequential scan, so I think that property
> is okay.

One case where an almost full intermediate bitmap could be needed is
when doing a star join, or just an AND of several conditions, where each
single index covers a significant part of the table but the result does
not.

> A variant is to make the per-page bit arrays be entries in a hash table
> with page number as hash key.  This would reduce insertion to a nearly
> constant-time operation, but the drawback is that you'd need an explicit
> sort at the end to put the per-page entries into page number order
> before you scan 'em.  You might come out ahead anyway, not sure.
>
> Or we could try a true linear bitmap (indexed by page number times
> max-items-per-page plus item number) that's compressed in some fashion,
> probably just by eliminating large runs of zeroes.  The difficulty here
> is that inserting a new one-bit could be pretty expensive, and we need
> it to be cheap.
>
> Perhaps someone can come up with other better ideas ...

I have also contemplated a scenario where we use a linear bitmap with
some not-quite-max power-of-2 number of bits per page, and record
intra-page wraps (when we try to mark an item past that not-quite-max
number in a page) in a high bit (or in another bitmap), which makes the
info for that page folded.  An example would be setting bit 40 in a
32-bits/page index: this would set bit 40&31 (that is, bit 8) and mark
the page folded.
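
A minimal sketch of the folding step in that example. The class name and its fields are hypothetical, and the folded flag is kept as a separate boolean rather than a high bit.

```python
BITS_PER_PAGE = 32   # the "not-quite-max" power of 2 from the example


class FoldedPage:
    """Per-page bitmap that wraps item numbers at BITS_PER_PAGE."""

    def __init__(self):
        self.bits = 0
        self.folded = False

    def mark(self, item):
        if item >= BITS_PER_PAGE:
            # The item number wrapped around, so set bits are now ambiguous.
            self.folded = True
        self.bits |= 1 << (item & (BITS_PER_PAGE - 1))


page = FoldedPage()
page.mark(40)   # 40 & 31 == 8, so bit 8 is set and the page is folded
```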

When combining such indexes using AND or OR, we need some special
handling of folded pages, but we could still get non-folded (0) results
out of an AND of 2 folded pages if the bits are distributed nicely.
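
One possible conservative rule for that special handling, sketched under assumptions that are not spelled out in the thread: a page entry is a (bits, folded) pair, a combined result is treated as folded whenever either input was, and an all-zero AND is exact because no item can match.

```python
def and_pages(a, b):
    """AND two (bits, folded) page entries."""
    bits = a[0] & b[0]
    # An empty result is exact even if both inputs were folded.
    folded = (a[1] or b[1]) and bits != 0
    return bits, folded


def or_pages(a, b):
    """OR two (bits, folded) page entries; any folded input taints the result."""
    return a[0] | b[0], a[1] or b[1]


page_a = (1 << 8, True)    # item 40 folded onto bit 8
page_b = (1 << 3, True)    # item 35 folded onto bit 3
anded = and_pages(page_a, page_b)   # no common bit, so nothing to recheck
ored = or_pages(page_a, page_b)
```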

--------------
Hannu
 | 
						||
From pgsql-hackers-owner+M49529=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 18:10:22 2004
 | 
						||
Return-path: <pgsql-hackers-owner+M49529=pgman=candle.pha.pa.us@postgresql.org>
 | 
						||
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UNAKe25860
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 18:10:21 -0500 (EST)
 | 
						||
Received: from postgresql.org ([200.46.204.71] verified)
 | 
						||
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
 | 
						||
  with ESMTP id 799059 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 15:07:00 -0800
 | 
						||
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
 | 
						||
Received: from localhost (unknown [200.46.204.2])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id C2AB7D1CCDD
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Fri, 30 Jan 2004 23:03:05 +0000 (GMT)
 | 
						||
Received: from svr1.postgresql.org ([200.46.204.71])
 | 
						||
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
 | 
						||
	with ESMTP id 46819-09
 | 
						||
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
 | 
						||
	Fri, 30 Jan 2004 19:03:08 -0400 (AST)
 | 
						||
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | 
						||
	by svr1.postgresql.org (Postfix) with ESMTP id AD55DD1C967
 | 
						||
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 19:03:04 -0400 (AST)
 | 
						||
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						||
	by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0UN2wBL020777;
 | 
						||
	Fri, 30 Jan 2004 18:02:58 -0500 (EST)
 | 
						||
To: Hannu Krosing <hannu@tm.ee>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
 | 
						||
   pgsql-hackers@postgresql.org
 | 
						||
Subject: Re: [HACKERS] Question about indexes 
 | 
						||
In-Reply-To: <1075502634.4007.32.camel@fuji.krosing.net> 
 | 
						||
References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us> <1075502634.4007.32.camel@fuji.krosing.net>
 | 
						||
Comments: In-reply-to Hannu Krosing <hannu@tm.ee>
 | 
						||
	message dated "Sat, 31 Jan 2004 00:43:54 +0200"
 | 
						||
Date: Fri, 30 Jan 2004 18:02:58 -0500
 | 
						||
Message-ID: <20776.1075503778@sss.pgh.pa.us>
 | 
						||
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
X-Virus-Scanned: by amavisd-new at postgresql.org
 | 
						||
X-Mailing-List: pgsql-hackers
 | 
						||
Precedence: bulk
 | 
						||
Sender: pgsql-hackers-owner@postgresql.org
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Hannu Krosing <hannu@tm.ee> writes:
> Another idea would be using bitmaps where we have just one bit per
> database page and do a seq scan but just over marked pages.

That seems a bit too lossy for me, but I really like your later idea
about folding.  Generalizing that a little, we can choose any fold point
we like.  We could allocate, say, one 32-bit word per page and set the
(i mod 32) bit when item i is fingered by the index.  After retrieving
the heap page, we'd need to test all the valid rows that have item
numbers matching a set bit mod 32.  On typical tables (with circa 100
items per page) this would require testing only about 3 rows per page.
ORing and ANDing of such bitmaps still works, with the understanding
that it's lossy and you have to double check each retrieved tuple.
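
A minimal sketch of the one-word-per-page scheme, assuming 0-based item numbers for simplicity. The function names are hypothetical. It shows where the "about 3 rows" estimate comes from: with 100 items on a page, one set bit covers the items that share its value mod 32.

```python
FOLD = 32   # fold point: one 32-bit word per page


def set_item(word, item):
    """Set the (item mod FOLD) bit in a page's word."""
    return word | (1 << (item % FOLD))


def candidates(word, items_on_page):
    """Item numbers whose folded bit is set; each one must be rechecked."""
    return [i for i in range(items_on_page) if (word >> (i % FOLD)) & 1]


word = set_item(0, 40)           # the index pointed at item 40
to_test = candidates(word, 100)  # items 8, 40 and 72 all share bit 8

# ANDing two lossy words narrows the candidates but never drops a true match.
other = set_item(set_item(0, 40), 5)
both = candidates(word & other, 100)
```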

If the fold point is above about 100, your idea of keeping track of
whether we actually set any wrapped-around bits would become useful,
but below that I think we'd just be wasting a bit.

			regards, tom lane
 | 
						||
 | 
						||
 | 
						||
From hannu@tm.ee Fri Jan 30 18:21:59 2004
 | 
						||
Return-path: <hannu@tm.ee>
 | 
						||
Received: from fuji.krosing.net (217-159-136-226-dsl.kt.estpak.ee [217.159.136.226])
 | 
						||
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0UNLue27301
 | 
						||
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 18:21:57 -0500 (EST)
 | 
						||
Received: from fuji.krosing.net (localhost.localdomain [127.0.0.1])
 | 
						||
	by fuji.krosing.net (8.12.8/8.12.8) with ESMTP id i0UNLpEl006023;
 | 
						||
	Sat, 31 Jan 2004 01:21:51 +0200
 | 
						||
Received: (from hannu@localhost)
 | 
						||
	by fuji.krosing.net (8.12.8/8.12.8/Submit) id i0UNLgx1006021;
 | 
						||
	Sat, 31 Jan 2004 01:21:42 +0200
 | 
						||
X-Authentication-Warning: fuji.krosing.net: hannu set sender to hannu@tm.ee using -f
 | 
						||
Subject: Re: [HACKERS] Question about indexes
 | 
						||
From: Hannu Krosing <hannu@tm.ee>
 | 
						||
To: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						||
cc: Bruce Momjian <pgman@candle.pha.pa.us>, Greg Stark <gsstark@mit.edu>,
 | 
						||
   pgsql-hackers@postgresql.org
 | 
						||
In-Reply-To: <20776.1075503778@sss.pgh.pa.us>
 | 
						||
References: <200401301131.i0UBVHU04169@candle.pha.pa.us>
 | 
						||
  <12965.1075474099@sss.pgh.pa.us>
 | 
						||
  <1075502634.4007.32.camel@fuji.krosing.net>
 | 
						||
  <20776.1075503778@sss.pgh.pa.us>
 | 
						||
Content-Type: text/plain
 | 
						||
Content-Transfer-Encoding: 7bit
 | 
						||
Message-ID: <1075504902.4007.43.camel@fuji.krosing.net>
 | 
						||
MIME-Version: 1.0
 | 
						||
X-Mailer: Ximian Evolution 1.4.5 
 | 
						||
Date: Sat, 31 Jan 2004 01:21:42 +0200
 | 
						||
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
 | 
						||
	candle.pha.pa.us
 | 
						||
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
 | 
						||
	version=2.61
 | 
						||
Status: OR
 | 
						||
 | 
						||
Tom Lane wrote on Sat, 31.01.2004 at 01:02:
> Hannu Krosing <hannu@tm.ee> writes:
> > Another idea would be using bitmaps where we have just one bit per
> > database page and do a seq scan but just over marked pages.
>
> That seems a bit too lossy for me,

I originally thought of it in the context of data warehousing and
persistent bitmap indexes.  There, using these same bitmaps for
clustering would un-lossify this approach.

>  but I really like your later idea
> about folding.  Generalizing that a little, we can choose any fold point
> we like.  We could allocate, say, one 32-bit word per page and set the
> (i mod 32) bit when item i is fingered by the index.  After retrieving
> the heap page, we'd need to test all the valid rows that have item
> numbers matching a set bit mod 32.  On typical tables (with circa 100
> items per page) this would require testing only about 3 rows per page.
> ORing and ANDing of such bitmaps still works, with the understanding
> that it's lossy and you have to double check each retrieved tuple.
>
> If the fold point is above about 100, your idea of keeping track of
> whether we actually set any wrapped-around bits would become useful,
> but below that I think we'd just be wasting a bit.

Not only wasting bits, but also making the code hairier - we can't just
do simple ANDs and ORs.

--------------
Hannu
 | 
						||
 | 
						||
From gsstark@mit.edu Fri Jan 30 19:04:21 2004
Return-path: <gsstark@mit.edu>
Received: from smtp.istop.com (dci.doncaster.on.ca [66.11.168.194])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0V04De01505
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 19:04:21 -0500 (EST)
Received: from stark.xeocode.com (gsstark.mtl.istop.com [66.11.160.162])
	by smtp.istop.com (Postfix) with ESMTP
	id 7CC2436E2F; Fri, 30 Jan 2004 19:04:04 -0500 (EST)
Received: from localhost ([127.0.0.1] helo=stark.xeocode.com)
	by stark.xeocode.com with smtp (Exim 3.36 #1 (Debian))
	id 1AmicG-0007zf-00; Fri, 30 Jan 2004 19:04:04 -0500
Sender: gsstark@mit.edu
To: Tom Lane <tgl@sss.pgh.pa.us>
cc: Hannu Krosing <hannu@tm.ee>, Bruce Momjian <pgman@candle.pha.pa.us>,
   Greg Stark <gsstark@mit.edu>, pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Question about indexes
References: <200401301131.i0UBVHU04169@candle.pha.pa.us>
	<12965.1075474099@sss.pgh.pa.us>
	<1075502634.4007.32.camel@fuji.krosing.net>
	<20776.1075503778@sss.pgh.pa.us>
In-Reply-To: <20776.1075503778@sss.pgh.pa.us>
From: Greg Stark <gsstark@mit.edu>
Organization: The Emacs Conspiracy; member since 1992
Date: 30 Jan 2004 19:04:03 -0500
Message-ID: <87wu79vv9o.fsf@stark.xeocode.com>
Lines: 21
User-Agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=ham 
	version=2.61
Status: OR

Tom Lane <tgl@sss.pgh.pa.us> writes:

> That seems a bit too lossy for me, but I really like your later idea
> about folding.  Generalizing that a little, we can choose any fold point
> we like.  We could allocate, say, one 32-bit word per page and set the
> (i mod 32) bit when item i is fingered by the index.  After retrieving
> the heap page, we'd need to test all the valid rows that have item
> numbers matching a set bit mod 32.  On typical tables (with circa 100
> items per page) this would require testing only about 3 rows per page.
> ORing and ANDing of such bitmaps still works, with the understanding
> that it's lossy and you have to double check each retrieved tuple.

That would make it really hard to ever clear the bits. What do you do when you
vacuum and one of the tuples is no longer needed? You can't be sure you can
clear the bit in the index because there could be multiple tuples represented
by the bit being set. You would have to test the condition on the other tuples
covered by the bit to see if it can be cleared.
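
To make the objection concrete, here is a minimal C sketch of the folded
per-page word. It assumes 1-based item numbers and a hypothetical
still_matches[] array saying whether each live item satisfies the indexed
condition. The names fold_mask and can_clear_bit are illustrative only and
are not PostgreSQL functions. With 32 bits per page, items 5, 37 and 69 all
share bit 5, so removing one of them is not enough to justify clearing it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FOLD_BITS 32            /* one 32-bit word per heap page */

/* Bit that stands for a given item number (1-based; an assumption). */
static uint32_t fold_mask(unsigned item)
{
    assert(item >= 1);
    return UINT32_C(1) << (item % FOLD_BITS);
}

/*
 * After removing dead_item, decide whether its bit may be cleared.
 * still_matches[i] tells whether live item i (1..nitems) satisfies the
 * indexed condition; index 0 is unused.  Hypothetical helper.
 */
static bool can_clear_bit(unsigned dead_item, unsigned nitems,
                          const bool *still_matches)
{
    unsigned bit = dead_item % FOLD_BITS;
    /* smallest item number >= 1 that folds onto this bit */
    unsigned first = (bit == 0) ? FOLD_BITS : bit;

    for (unsigned i = first; i <= nitems; i += FOLD_BITS)
    {
        if (i != dead_item && still_matches[i])
            return false;       /* another tuple still needs the bit */
    }
    return true;
}
```

The loop in can_clear_bit is the extra work described above: every other
item covered by the same bit has to be tested before the bit can go.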

-- 
greg

From pgsql-hackers-owner+M49533=pgman=candle.pha.pa.us@postgresql.org Fri Jan 30 19:56:45 2004
Return-path: <pgsql-hackers-owner+M49533=pgman=candle.pha.pa.us@postgresql.org>
Received: from joeconway.com (66-146-172-86.skyriver.net [66.146.172.86])
	by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id i0V0uhe05716
	for <pgman@candle.pha.pa.us>; Fri, 30 Jan 2004 19:56:44 -0500 (EST)
Received: from postgresql.org ([200.46.204.71] verified)
  by joeconway.com (CommuniGate Pro SMTP 4.1.8)
  with ESMTP id 799253 for pgman@candle.pha.pa.us; Fri, 30 Jan 2004 16:53:23 -0800
X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
Received: from localhost (unknown [200.46.204.2])
	by svr1.postgresql.org (Postfix) with ESMTP id B7F53D1CC9B
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Sat, 31 Jan 2004 00:50:25 +0000 (GMT)
Received: from svr1.postgresql.org ([200.46.204.71])
	by localhost (neptune.hub.org [200.46.204.2]) (amavisd-new, port 10024)
	with ESMTP id 76472-01
	for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
	Fri, 30 Jan 2004 20:50:28 -0400 (AST)
Received: from sss.pgh.pa.us (unknown [192.204.191.242])
	by svr1.postgresql.org (Postfix) with ESMTP id 0A06FD1CB1D
	for <pgsql-hackers@postgresql.org>; Fri, 30 Jan 2004 20:50:25 -0400 (AST)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss.pgh.pa.us (8.12.11/8.12.11) with ESMTP id i0V0oN9U023293;
	Fri, 30 Jan 2004 19:50:24 -0500 (EST)
To: Greg Stark <gsstark@mit.edu>
cc: Hannu Krosing <hannu@tm.ee>, Bruce Momjian <pgman@candle.pha.pa.us>,
   pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Question about indexes 
In-Reply-To: <87wu79vv9o.fsf@stark.xeocode.com> 
References: <200401301131.i0UBVHU04169@candle.pha.pa.us> <12965.1075474099@sss.pgh.pa.us> <1075502634.4007.32.camel@fuji.krosing.net> <20776.1075503778@sss.pgh.pa.us> <87wu79vv9o.fsf@stark.xeocode.com>
Comments: In-reply-to Greg Stark <gsstark@mit.edu>
	message dated "30 Jan 2004 19:04:03 -0500"
Date: Fri, 30 Jan 2004 19:50:23 -0500
Message-ID: <23292.1075510223@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Virus-Scanned: by amavisd-new at postgresql.org
X-Mailing-List: pgsql-hackers
Precedence: bulk
Sender: pgsql-hackers-owner@postgresql.org
X-Spam-Checker-Version: SpamAssassin 2.61 (1.212.2.1-2003-12-09-exp) on 
	candle.pha.pa.us
X-Spam-Status: No, hits=-4.9 required=5.0 tests=BAYES_00 autolearn=no 
	version=2.61
Status: OR

Greg Stark <gsstark@mit.edu> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> ORing and ANDing of such bitmaps still works, with the understanding
>> that it's lossy and you have to double check each retrieved tuple.

> That would make it really hard to ever clear the bits.

We're speaking of in-memory bitmaps constructed on-the-fly here.  You're
right that it wouldn't work for persistent indexes, but I'm not very
interested in that case at the moment ...
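
A minimal C sketch of the in-memory case, under stated assumptions: one
32-bit word per heap page as proposed earlier in the thread, 1-based item
numbers, and hypothetical helper names (fold_set, fold_candidates) that do
not come from the PostgreSQL source. Each index scan builds its own word for
a page, the words are combined with plain & or |, and every candidate item
is then rechecked against the real condition:

```c
#include <assert.h>
#include <stdint.h>

#define FOLD_BITS 32            /* fold point: bits per heap page */

/* Record that item `item` (1-based) on this page was returned by a scan. */
static void fold_set(uint32_t *page_word, unsigned item)
{
    assert(item >= 1);
    *page_word |= UINT32_C(1) << (item % FOLD_BITS);
}

/*
 * Collect the item numbers on a page holding `nitems` items whose folded
 * bit is set in page_word.  `out` must have room for nitems entries.
 * Returns the number of candidates; each one still has to be rechecked
 * against the actual tuple, because the bitmap is lossy.
 */
static unsigned fold_candidates(uint32_t page_word, unsigned nitems,
                                unsigned *out)
{
    unsigned n = 0;

    for (unsigned i = 1; i <= nitems; i++)
    {
        if (page_word & (UINT32_C(1) << (i % FOLD_BITS)))
            out[n++] = i;
    }
    return n;
}
```

For example, if one scan fingers items 7 and 40 and another fingers items 12
and 71, ANDing the two words leaves only bit 7 set, even though no single
item was returned by both scans. On a 100-item page the candidates are items
7, 39 and 71, and the recheck rejects all three. Since the bitmap is thrown
away after the query, no bit ever has to be cleared.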

			regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
