From owner-pgsql-hackers@hub.org Sun Jun 14 18:45:04 1998
Received: from hub.org (hub.org [209.47.148.200])
	by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id SAA03690
	for <maillist@candle.pha.pa.us>; Sun, 14 Jun 1998 18:45:00 -0400 (EDT)
Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id SAA28049; Sun, 14 Jun 1998 18:39:42 -0400 (EDT)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 14 Jun 1998 18:36:06 +0000 (EDT)
Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id SAA27943 for pgsql-hackers-outgoing; Sun, 14 Jun 1998 18:36:04 -0400 (EDT)
Received: from angular.illustra.com (ifmxoak.illustra.com [206.175.10.34]) by hub.org (8.8.8/8.7.5) with ESMTP id SAA27925 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 18:35:47 -0400 (EDT)
Received: from hawk.illustra.com (hawk.illustra.com [158.58.61.70]) by angular.illustra.com (8.7.4/8.7.3) with SMTP id PAA21293 for <pgsql-hackers@postgresql.org>; Sun, 14 Jun 1998 15:35:12 -0700 (PDT)
Received: by hawk.illustra.com (5.x/smail2.5/06-10-94/S)
	id AA07922; Sun, 14 Jun 1998 15:35:13 -0700
From: dg@illustra.com (David Gould)
Message-Id: <9806142235.AA07922@hawk.illustra.com>
Subject: [HACKERS] performance tests, initial results
To: pgsql-hackers@postgreSQL.org
Date: Sun, 14 Jun 1998 15:35:13 -0700 (PDT)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@hub.org
Precedence: bulk
Status: RO


I have been playing a little with the performance tests found in
pgsql/src/test/performance and have a few observations that might be of
minor interest.

The tests themselves are simple enough, although the result parsing in the
driver did not work on Linux. I am enclosing a patch below to fix this. I
think it will also work better on the other systems.

A summary of the results from my testing is below. Details are at the bottom
of this message.

My test system is 'leslie':

 linux 2.0.32, gcc version 2.7.2.3
 P133, HX chipset, 512K L2, 32MB mem
 NCR810 fast scsi, Quantum Atlas 2GB drive (7200 rpm).


                     Results Summary (times in seconds)

                    Single txn 8K txn    Create 8K idx 8K random Simple
Case Description    8K insert  8K insert Index  Insert Scans     Orderby
=================== ========== ========= ====== ====== ========= =======
1 From Distribution
  P90 FreeBSD -B256      39.56   1190.98   3.69  46.65     65.49    2.27
  IDE

2 Running on leslie
  P133 Linux 2.0.32      15.48    326.75   2.99  20.69     35.81    1.68
  SCSI 32M

3 leslie, -o -F
  no forced writes       15.90     24.98   2.63  20.46     36.43    1.69

4 leslie, -o -F
  no ASSERTS             14.92     23.23   1.38  18.67     33.79    1.58

5 leslie, -o -F -B2048
  more buffers           21.31     42.28   2.65  25.74     42.26    1.72

6 leslie, -o -F -B2048
  more bufs, no ASSERT   20.52     39.79   1.40  24.77     39.51    1.55



                 Case to Case Difference Factors (+ is faster)

                    Single txn 8K txn    Create 8K idx 8K random Simple
Case Description    8K insert  8K insert Index  Insert Scans     Orderby
=================== ========== ========= ====== ====== ========= =======

leslie vs BSD P90.        2.56      3.65   1.23   2.25      1.83    1.35

(noflush -F) vs no -F    -1.03     13.08   1.14   1.01     -1.02    1.00

No Assert vs Assert       1.05      1.07   1.90   1.06      1.07    1.09

-B256 vs -B2048           1.34      1.69   1.01   1.26      1.16    1.02
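
The factors in the table above can be recomputed from the elapsed times in
the results summary. What follows is a minimal Python sketch, assuming the
sign convention implied by the "(+ is faster)" heading: the slower time
divided by the faster time, negated when the second configuration is the
slower one. The function name `factor` and the list names are illustrative
only and are not part of the test harness. Because the sketch starts from
the already-rounded summary times, a few entries may differ from the table
in the last digit.

```python
def factor(before: float, after: float) -> float:
    """Signed speed factor: positive if `after` is faster, negative if slower."""
    if after <= before:
        return round(before / after, 2)
    return -round(after / before, 2)


# Elapsed seconds from the results summary, in column order:
# single txn insert, 8K txn insert, create index, indexed insert,
# random scans, order by.
bsd_p90 = [39.56, 1190.98, 3.69, 46.65, 65.49, 2.27]   # case 1
leslie = [15.48, 326.75, 2.99, 20.69, 35.81, 1.68]     # case 2
leslie_f = [15.90, 24.98, 2.63, 20.46, 36.43, 1.69]    # case 3 (-F)

# "leslie vs BSD P90" row: case 1 times against case 2 times.
print([factor(b, a) for b, a in zip(bsd_p90, leslie)])

# "(noflush -F) vs no -F" row: case 2 times against case 3 times.
print([factor(b, a) for b, a in zip(leslie, leslie_f)])
```

The 8K txn column of the second row is the one that stands out: 326.75
seconds without -F against 24.98 seconds with it.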

Observations:

 - leslie (P133 Linux) appears to be about 1.8 times faster than the
   P90 BSD system used for the test result distributed with the source, not
   counting the 8K txn insert case, which was completely disk bound.

 - SCSI disks make a big (factor of 3.6) difference. During this test the
   disk was hammering and CPU utilization was < 10%.

 - Assertion checking seems to cost about 7%, except for create index, where
   it costs 90%.

 - The -F option to avoid flushing buffers has a tremendous effect if there
   are many very small transactions. Put another way, flushing at the end
   of the transaction is a major disaster for performance.

 - Something is very wrong with our buffer cache implementation. Going from
   256 buffers to 2048 buffers costs an average of 25%. In the 8K txn case
   it costs about 70%. From looking at the code and profiling, I see that
   in the 8K txn case the time goes to BufferSync(), which examines all the
   buffers at commit time. I don't quite understand why it is so costly for
   the single 8K row txn (35%), though.

It would be nice to have some more tests. Maybe the Wisconsin stuff will
be useful.

----------------- patch to test harness. apply from pgsql ------------
*** src/test/performance/runtests.pl.orig	Sun Jun 14 11:34:04 1998
--- src/test/performance/runtests.pl	Sun Jun 14 12:07:30 1998
***************
*** 84,123 ****
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;
  
! for ($i = 0; $i <= $#perftests; $i++)
! {
  	$test = $perftests[$i];
  	($test, $XACTBLOCK) = split (/ /, $test);
  	$runtest = $test;
! 	if ( $test =~ /\.ntm/ )
! 	{
! 		# 
  		# No timing for this queries
- 		# 
  		close (STDERR);		# close $TmpFile
  		open (STDERR, ">/dev/null") or die;
  		$runtest =~ s/\.ntm//;
  	}
! 	else
! 	{
  		close (STDOUT);
  		open(STDOUT, ">&SAVEOUT");
  		print STDOUT "\nRunning: $perftests[$i+1] ...";
  		close (STDOUT);
  		open (STDOUT, ">/dev/null") or die;
  		select (STDERR); $| = 1;
! 		printf "$perftests[$i+1]: ";
  	}
  
  	do "sqls/$runtest";
  
  	# Restore STDERR to $TmpFile
! 	if ( $test =~ /\.ntm/ )
! 	{
  		close (STDERR);
  		open (STDERR, ">>$TmpFile") or die;
  	}
- 
  	select (STDERR); $| = 1;
  	$i++;
  }
--- 84,116 ----
  open (STDERR, ">$TmpFile") or die;
  select (STDERR); $| = 1;
  
! for ($i = 0; $i <= $#perftests; $i++) {
  	$test = $perftests[$i];
  	($test, $XACTBLOCK) = split (/ /, $test);
  	$runtest = $test;
! 	if ( $test =~ /\.ntm/ ) {
  		# No timing for this queries
  		close (STDERR);		# close $TmpFile
  		open (STDERR, ">/dev/null") or die;
  		$runtest =~ s/\.ntm//;
  	}
! 	else {
  		close (STDOUT);
  		open(STDOUT, ">&SAVEOUT");
  		print STDOUT "\nRunning: $perftests[$i+1] ...";
  		close (STDOUT);
  		open (STDOUT, ">/dev/null") or die;
  		select (STDERR); $| = 1;
! 		print "$perftests[$i+1]: ";
  	}
  
  	do "sqls/$runtest";
  
  	# Restore STDERR to $TmpFile
! 	if ( $test =~ /\.ntm/ ) {
  		close (STDERR);
  		open (STDERR, ">>$TmpFile") or die;
  	}
  	select (STDERR); $| = 1;
  	$i++;
  }
***************
*** 128,138 ****
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;
  
! while (<TMPF>)
! {
! 	$str = $_;
! 	($test, $rtime) = split (/:/, $str);
! 	($tmp, $rtime, $rest) = split (/[ 	]+/, $rtime);
! 	print RESF "$test: $rtime\n";
  }
  
--- 121,130 ----
  open (TMPF, "<$TmpFile") or die;
  open (RESF, ">$ResFile") or die;
  
! while (<TMPF>) {
!         if (m/^(.*: ).* ([0-9:.]+) *elapsed/) {
! 	    ($test, $rtime) = ($1, $2);
! 	     print RESF $test, $rtime, "\n";
!         }
  }

------------------------------------------------------------------------
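
The new parsing loop in the patch keeps only lines that contain the word
"elapsed" and takes the token just before it as the run time. As a rough
illustration, here is the same regular expression in a small Python sketch.
The sample line is hypothetical, written in the style of GNU time's default
output; the message does not show the actual contents of $TmpFile, so the
exact fields around the elapsed time are an assumption. The function name
`parse_timing` is also illustrative.

```python
import re

# Same pattern as the patched Perl loop: the test label up to ": ",
# then the token immediately before the word "elapsed".
ELAPSED = re.compile(r"^(.*: ).* ([0-9:.]+) *elapsed")


def parse_timing(line: str):
    """Return (label, elapsed) for a timing line, or None if it has none."""
    m = ELAPSED.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2)


# Hypothetical timing line (assumed format, not taken from the message).
sample = "Create INDEX on SIMPLE: 0.05user 0.01system 0:02.99elapsed 2%CPU"
print(parse_timing(sample))          # ('Create INDEX on SIMPLE: ', '0:02.99')
print(parse_timing("no timing"))     # None
```

Lines without "elapsed" are skipped rather than split on fixed field
positions, which is what the replaced split()-based loop relied on.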

------------------------- testcase detail --------------------------

1. from distribution
   DBMS:		PostgreSQL 6.2b10
   OS:		FreeBSD 2.1.5-RELEASE
   HardWare:	i586/90, 24M RAM, IDE
   StartUp:	postmaster -B 256 '-o -S 2048' -S
   Compiler:	gcc 2.6.3
   Compiled:	-O, without CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.20
   8192 INSERTs INTO SIMPLE (1 xact): 39.58
   8192 INSERTs INTO SIMPLE (8192 xacts): 1190.98
   Create INDEX on SIMPLE: 3.69
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 46.65
   8192 random INDEX scans on SIMPLE (1 xact): 65.49
   ORDER BY SIMPLE: 2.27


2. run on leslie with asserts
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.48
   8192 INSERTs INTO SIMPLE (8192 xacts): 326.75
   Create INDEX on SIMPLE: 2.99
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.69
   8192 random INDEX scans on SIMPLE (1 xact): 35.81
   ORDER BY SIMPLE: 1.68


3. with -F to avoid forced i/o
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 15.90
   8192 INSERTs INTO SIMPLE (8192 xacts): 24.98
   Create INDEX on SIMPLE: 2.63
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 20.46
   8192 random INDEX scans on SIMPLE (1 xact): 36.43
   ORDER BY SIMPLE: 1.69


4. no asserts, -F to avoid forced I/O
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 256 '-o -S 2048' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, No CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.10
   8192 INSERTs INTO SIMPLE (1 xact): 14.92
   8192 INSERTs INTO SIMPLE (8192 xacts): 23.23
   Create INDEX on SIMPLE: 1.38
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 18.67
   8192 random INDEX scans on SIMPLE (1 xact): 33.79
   ORDER BY SIMPLE: 1.58


5. with more buffers (2048 vs 256) and -F to avoid forced i/o
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, WITH CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 21.31
   8192 INSERTs INTO SIMPLE (8192 xacts): 42.28
   Create INDEX on SIMPLE: 2.65
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 25.74
   8192 random INDEX scans on SIMPLE (1 xact): 42.26
   ORDER BY SIMPLE: 1.72


6. No Asserts, more buffers (2048 vs 256) and -F to avoid forced i/o
   DBMS:		PostgreSQL 6.3.2 (plus changes to 98/06/01)
   OS:		Linux 2.0.32 leslie
   HardWare:	i586/133 HX 512, 32M RAM, fast SCSI, 7200rpm
   StartUp:	postmaster -B 2048 '-o -S 2048 -F' -S
   Compiler:	gcc 2.7.2.3
   Compiled:	-O, No CASSERT checking, with
   		-DTBL_FREE_CMD_MEMORY (to free memory
   		if BEGIN/END after each query execution)
   DB connection startup: 0.11
   8192 INSERTs INTO SIMPLE (1 xact): 20.52
   8192 INSERTs INTO SIMPLE (8192 xacts): 39.79
   Create INDEX on SIMPLE: 1.40
   8192 INSERTs INTO SIMPLE with INDEX (1 xact): 24.77
   8192 random INDEX scans on SIMPLE (1 xact): 39.51
   ORDER BY SIMPLE: 1.55
---------------------------------------------------------------------

-dg

David Gould            dg@illustra.com           510.628.3783 or 510.305.9468 
Informix Software  (No, really)         300 Lakeside Drive  Oakland, CA 94612
"Don't worry about people stealing your ideas.  If your ideas are any
 good, you'll have to ram them down people's throats." -- Howard Aiken


From owner-pgsql-hackers@hub.org Tue Oct 19 10:31:10 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA29087
	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:31:08 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.12 $) with ESMTP id KAA27535 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 10:19:47 -0400 (EDT)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id KAA30328;
	Tue, 19 Oct 1999 10:12:10 -0400 (EDT)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 10:11:55 -0400
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id KAA30030
	for pgsql-hackers-outgoing; Tue, 19 Oct 1999 10:11:00 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by hub.org (8.9.3/8.9.3) with ESMTP id KAA29914
	for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 10:10:33 -0400 (EDT)
	(envelope-from tgl@sss.pgh.pa.us)
Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1])
	by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id KAA09038;
	Tue, 19 Oct 1999 10:09:15 -0400 (EDT)
To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
cc: "Vadim Mikheev" <vadim@krs.ru>, pgsql-hackers@postgreSQL.org
Subject: Re: [HACKERS] mdnblocks is an amazing time sink in huge relations 
In-reply-to: Your message of Tue, 19 Oct 1999 19:03:22 +0900 
             <000801bf1a19$2d88ae20$2801007e@cadzone.tpf.co.jp> 
Date: Tue, 19 Oct 1999 10:09:15 -0400
Message-ID: <9036.940342155@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 1. shared cache holds committed system tuples.
> 2. private cache holds uncommitted system tuples.
> 3. relpages of shared cache are updated immediately by
>     physical change and corresponding buffer pages are
>     marked dirty.
> 4. on commit, the contents of uncommitted tuples except
>    relpages, reltuples, ... are copied to corresponding tuples
>    in shared cache and the combined contents are
>    committed.
> If so, catalog cache invalidation would be no longer needed.
> But synchronization of the step 4. may be difficult.

I think the main problem is that relpages and reltuples shouldn't
be kept in pg_class columns at all, because they need to have
very different update behavior from the other pg_class columns.

The rest of pg_class is update-on-commit, and we can lock down any one
row in the normal MVCC way (if transaction A has modified a row and
transaction B also wants to modify it, B waits for A to commit or abort,
so it can know which version of the row to start from).  Furthermore,
there can legitimately be several different values of a row in use in
different places: the latest committed, an uncommitted modification, and
one or more old values that are still being used by active transactions
because they were current when those transactions started.  (BTW, the
present relcache is pretty bad about maintaining pure MVCC transaction
semantics like this, but it seems clear to me that that's the direction
we want to go in.)

relpages cannot operate this way.  To be useful for avoiding lseeks,
relpages *must* change exactly when the physical file changes.  It
matters not at all whether the particular transaction that extended the
file ultimately commits or not.  Moreover there can be only one correct
value (per relation) across the whole system, because there is only one
length of the relation file.

If we want to take reltuples seriously and try to maintain it
on-the-fly, then I think it needs still a third behavior.  Clearly
it cannot be updated using MVCC rules, or we lose all writer
concurrency (if A has added tuples to a rel, B would have to wait
for A to commit before it could update reltuples...).  Furthermore
"updating" isn't a simple matter of storing what you think the new
value is; otherwise two transactions adding tuples in parallel would
leave the wrong answer after B commits and overwrites A's value.
I think it would work for each transaction to keep track of a net delta
in reltuples for each table it's changed (total tuples added less total
tuples deleted), and then atomically add that value to the table's
shared reltuples counter during commit.  But that still leaves the
problem of how you use the counter during a transaction to get an
accurate answer to the question "If I scan this table now, how many tuples
will I see?"  At the time the question is asked, the current shared
counter value might include the effects of transactions that have
committed since your transaction started, and therefore are not visible
under MVCC rules.  I think getting the correct answer would involve
making an instantaneous copy of the current counter at the start of
your xact, and then adding your own private net-uncommitted-delta to
the saved shared counter value when asked the question.  This doesn't
look real practical --- you'd have to save the reltuples counts of
*all* tables in the database at the start of each xact, on the off
chance that you might need them.  Ugh.  Perhaps someone has a better
idea.  In any case, reltuples clearly needs different mechanisms than
the ordinary fields in pg_class do, because updating it will be a
performance bottleneck otherwise.
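
The delta-counter scheme described above can be sketched in a few lines of
Python. This is only an illustration of the idea under simplifying
assumptions, not backend code: the class and method names
(`SharedTupleCounts`, `Transaction`, `visible_tuples`) are hypothetical, a
mutex stands in for whatever atomic update the backend would use, and
transaction abort is not modeled. The copy of every table's count taken in
`Transaction.__init__` is exactly the step called impractical above.

```python
import threading
from collections import defaultdict


class SharedTupleCounts:
    """Shared per-table tuple counters, changed only at commit time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def snapshot(self):
        # Copy of every table's counter as of "now".
        with self._lock:
            return dict(self._counts)

    def apply_deltas(self, deltas):
        # Atomically add a committing transaction's net deltas.
        with self._lock:
            for table, delta in deltas.items():
                self._counts[table] += delta


class Transaction:
    def __init__(self, shared):
        self._shared = shared
        self._start_counts = shared.snapshot()  # saved at xact start
        self._deltas = defaultdict(int)         # private net deltas

    def insert(self, table, n=1):
        self._deltas[table] += n

    def delete(self, table, n=1):
        self._deltas[table] -= n

    def visible_tuples(self, table):
        # Saved shared value plus this transaction's own uncommitted delta.
        return self._start_counts.get(table, 0) + self._deltas[table]

    def commit(self):
        self._shared.apply_deltas(self._deltas)


shared = SharedTupleCounts()
a, b = Transaction(shared), Transaction(shared)
a.insert("t", 3)
b.insert("t", 2)
a.commit()
print(b.visible_tuples("t"))     # 2: A's commit is not visible to B
b.commit()
print(shared.snapshot()["t"])    # 5: both deltas were added, none overwritten
```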

If we allow reltuples to be updated only by vacuum-like events, as
it is now, then I think keeping it in pg_class is still OK.

In short, it seems clear to me that relpages should be removed from
pg_class and kept somewhere else if we want to make it more reliable
than it is now, and the same for reltuples (but reltuples doesn't
behave the same as relpages, and probably ought to be handled
differently).

			regards, tom lane

************

From owner-pgsql-hackers@hub.org Tue Oct 19 21:25:30 1999
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA28130
	for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:25:26 -0400 (EDT)
Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.12 $) with ESMTP id VAA10512 for <maillist@candle.pha.pa.us>; Tue, 19 Oct 1999 21:15:28 -0400 (EDT)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.3/8.9.3) with SMTP id VAA50745;
	Tue, 19 Oct 1999 21:07:23 -0400 (EDT)
	(envelope-from owner-pgsql-hackers)
Received: by hub.org (bulk_mailer v1.5); Tue, 19 Oct 1999 21:07:01 -0400
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id VAA50644
	for pgsql-hackers-outgoing; Tue, 19 Oct 1999 21:06:06 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
	by hub.org (8.9.3/8.9.3) with ESMTP id VAA50584
	for <pgsql-hackers@postgreSQL.org>; Tue, 19 Oct 1999 21:05:26 -0400 (EDT)
	(envelope-from Inoue@tpf.co.jp)
Received: from cadzone ([126.0.1.40] (may be forged))
          by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
   id KAA01715; Wed, 20 Oct 1999 10:05:14 +0900
From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: <pgsql-hackers@postgreSQL.org>
Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge relations 
Date: Wed, 20 Oct 1999 10:09:13 +0900
Message-ID: <000501bf1a97$b925a860$2801007e@cadzone.tpf.co.jp>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4
Importance: Normal
Sender: owner-pgsql-hackers@postgreSQL.org
Status: RO

> -----Original Message-----
> From: Hiroshi Inoue [mailto:Inoue@tpf.co.jp]
> Sent: Tuesday, October 19, 1999 6:45 PM
> To: Tom Lane
> Cc: pgsql-hackers@postgreSQL.org
> Subject: RE: [HACKERS] mdnblocks is an amazing time sink in huge
> relations 
> 
> 
> > 
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> 
> [snip]
>  
> > 
> > > Deletion is necessary only not to consume disk space.
> > >
> > > For example vacuum could remove not deleted files.
> > 
> > Hmm ... interesting idea ... but I can hear the complaints
> > from users already...
> >
> 
> My idea is only an analogy of PostgreSQL's simple recovery
> mechanism of tuples.
> 
> And my main point is
> 	"delete fails after commit" doesn't harm the database
> 	except that not deleted files consume disk space.
> 
> Of course, it's preferable to delete relation files immediately
> after (or just when) commit.
> Useless files are visible though useless tuples are invisible.
>

Anyway, I don't need "DROP TABLE inside transactions" now,
and my idea was originally for that issue.

After some thought, I propose the following solution.

1. mdcreate() couldn't create existing relation files.
    If the existing file is of length zero, we would overwrite
    the file. (It seems the comment in md.c says so, but the
    code doesn't do so.)
    If the file is an index relation file, we would overwrite
    the file.

2. mdunlink() couldn't unlink non-existent relation files.
    mdunlink() doesn't call elog(ERROR) even if the file
    doesn't exist, though I couldn't find where to change
    now.
    mdopen() doesn't call elog(ERROR) even if the file
    doesn't exist and leaves the relation as CLOSED.
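
One way to read the create and unlink rules proposed above is as plain
file operations. The Python sketch below is a file-level analogue only, not
md.c code: the function names and the `is_index` flag are hypothetical, and
error reporting via elog() is replaced by ordinary Python exceptions and
return values. mdopen() is not modeled.

```python
import os


def create_relation_file(path: str, is_index: bool = False) -> None:
    """Create a relation file; refuse to clobber a non-empty heap file."""
    if os.path.exists(path):
        if not is_index and os.path.getsize(path) > 0:
            raise FileExistsError(path)
    # Creates a new file, or overwrites a leftover zero-length file
    # or an index file, as rule 1 proposes.
    with open(path, "wb"):
        pass


def unlink_relation_file(path: str) -> bool:
    """Remove a relation file; a missing file is reported, not an error."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False
```

Under this reading, a leftover file from a failed delete is harmless: a
later create either reuses it (if empty or an index) or fails cleanly, and a
repeated unlink is a no-op.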

Comments ?

Regards. 

Hiroshi Inoue
Inoue@tpf.co.jp

************

From pgsql-hackers-owner+M6267@hub.org Sun Aug 27 21:46:37 2000
Received: from hub.org (root@hub.org [216.126.84.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA07972
	for <pgman@candle.pha.pa.us>; Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from hub.org (majordom@localhost [127.0.0.1])
	by hub.org (8.10.1/8.10.1) with SMTP id e7S0kaL27996;
	Sun, 27 Aug 2000 20:46:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
	by hub.org (8.10.1/8.10.1) with ESMTP id e7S05aL24107
	for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:36 -0400 (EDT)
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id UAA01604
	for <pgsql-hackers@postgreSQL.org>; Sun, 27 Aug 2000 20:05:29 -0400 (EDT)
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Possible performance improvement: buffer replacement policy
Date: Sun, 27 Aug 2000 20:05:29 -0400
Message-ID: <1601.967421129@sss.pgh.pa.us>
From: Tom Lane <tgl@sss.pgh.pa.us>
X-Mailing-List: pgsql-hackers@postgresql.org
Precedence: bulk
Sender: pgsql-hackers-owner@hub.org
Status: RO

Those of you with long memories may recall a benchmark that Edmund Mergl
drew our attention to back in May '99.  That test showed extremely slow
performance for updating a table with many indexes (about 20).  At the
time, it seemed the problem was due to bad performance of btree with
many equal keys, so I thought I'd go back and retry the benchmark after
this latest round of btree hackery.

The good news is that btree itself seems to be pretty well fixed; the
bad news is that the benchmark is still slow for large numbers of rows.
The problem is I/O: the CPU mostly sits idle waiting for the disk.
As best I can tell, the difficulty is that the working set of pages
needed to update this many indexes is too large compared to the number
of disk buffers Postgres is using.  (I was running with -B 1000 and
looking at behavior for a 100000-row test table.  This gave me a table
size of 3876 pages, plus 11526 pages in 20 indexes.)

Of course, there's only so much we can do when the number of buffers
is too small, but I still started to wonder if we are using the buffers
as effectively as we can.  Some tracing showed that most of the pages
of the indexes were being read and written multiple times within a
single UPDATE query, while most of the pages of the table proper were
fetched and written only once.  That says we're not using the buffers
as well as we could; the index pages are not being kept in memory when
they should be.  In a query like this, we should displace main-table
pages sooner to allow keeping more index pages in cache --- but with
the simple LRU replacement method we use, once a page has been loaded
it will stay in cache for at least the next NBuffers (-B) page
references, no matter what.  With a large NBuffers that's a long time.

I've come across an interesting article:
	The LRU-K Page Replacement Algorithm For Database Disk Buffering
	Elizabeth J. O'Neil, Patrick E. O'Neil, Gerhard Weikum
	Proceedings of the 1993 ACM SIGMOD international conference
	on Management of Data, May 1993
(If you subscribe to the ACM digital library, you can get a PDF of this
from there.)  This article argues that standard LRU buffer management is
inherently not great for database caches, and that it's much better to
replace pages on the basis of time since the K'th most recent reference,
not just time since the most recent one.  K=2 is enough to get most of
the benefit.  The big win is that you are measuring an actual page
interreference time (between the last two references) and not just
dealing with a lower-bound guess on the interreference time.  Frequently
used pages are thus much more likely to stay in cache.
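
To make the replacement rule concrete, here is a small Python sketch of an
LRU-2 style cache. It is a simplified illustration, not the algorithm as
published and not PostgreSQL buffer manager code: the class name
`LRU2Cache` and its methods are hypothetical, history is discarded when a
page is evicted, and pages referenced only once are treated as having an
infinitely old second-most-recent reference (ties broken by plain LRU).

```python
import itertools
from collections import deque


class LRU2Cache:
    """Evict the page whose 2nd most recent reference is oldest."""

    K = 2

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._clock = itertools.count(1)   # logical reference counter
        self._history = {}                 # page -> last K reference times

    def _kth_time(self, page):
        h = self._history[page]
        # Fewer than K references so far: count as infinitely old (0).
        return h[0] if len(h) == self.K else 0

    def reference(self, page) -> bool:
        """Touch a page; return True on a cache hit."""
        now = next(self._clock)
        hit = page in self._history
        if not hit:
            if len(self._history) >= self.capacity:
                victim = min(
                    self._history,
                    key=lambda p: (self._kth_time(p), self._history[p][-1]),
                )
                del self._history[victim]
            self._history[page] = deque(maxlen=self.K)
        self._history[page].append(now)
        return hit

    def pages(self):
        return list(self._history)


cache = LRU2Cache(capacity=3)
for page in ["idx1", "idx2", "idx1", "idx2"]:   # index pages, each used twice
    cache.reference(page)
for page in ["tbl1", "tbl2", "tbl3", "tbl4"]:   # one-pass scan of table pages
    cache.reference(page)
print(sorted(cache.pages()))                    # ['idx1', 'idx2', 'tbl4']
```

With plain LRU and the same three buffers, the four table pages would have
pushed both index pages out; here each table page displaces the previous
table page instead.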

It looks like it wouldn't take too much work to replace shared buffers
on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.

Has anyone looked into this area?  Is there a better method to try?

			regards, tom lane

| From prlw1@newn.cam.ac.uk Fri Jan 19 12:54:45 2001
 | |
| Received: from henry.newn.cam.ac.uk (henry.newn.cam.ac.uk [131.111.204.130])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA29822
 | |
| 	for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 12:54:44 -0500 (EST)
 | |
| Received: from [131.111.204.180] (helo=quartz.newn.cam.ac.uk)
 | |
| 	by henry.newn.cam.ac.uk with esmtp (Exim 3.13 #1)
 | |
| 	id 14JfkU-0001WA-00; Fri, 19 Jan 2001 17:54:54 +0000
 | |
| Received: from prlw1 by quartz.newn.cam.ac.uk with local (Exim 3.13 #1)
 | |
| 	id 14Jfj6-0001cq-00; Fri, 19 Jan 2001 17:53:28 +0000
 | |
| Date: Fri, 19 Jan 2001 17:53:28 +0000
 | |
| From: Patrick Welche <prlw1@newn.cam.ac.uk>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Cc: Tom Lane <tgl@sss.pgh.pa.us>, pgsql-hackers@postgreSQL.org
 | |
| Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy
 | |
| Message-ID: <20010119175328.A6223@quartz.newn.cam.ac.uk>
 | |
| Reply-To: prlw1@cam.ac.uk
 | |
| References: <1601.967421129@sss.pgh.pa.us> <200101191703.MAA25873@candle.pha.pa.us>
 | |
| Mime-Version: 1.0
 | |
| Content-Type: text/plain; charset=us-ascii
 | |
| Content-Disposition: inline
 | |
| User-Agent: Mutt/1.2i
 | |
| In-Reply-To: <200101191703.MAA25873@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Fri, Jan 19, 2001 at 12:03:58PM -0500
 | |
| Status: RO
 | |
| 
 | |
| On Fri, Jan 19, 2001 at 12:03:58PM -0500, Bruce Momjian wrote:
 | |
| > 
 | |
| > Tom, did we ever test this?  I think we did and found that it was the
 | |
| > same or worse, right?
 | |
| 
 | |
| (Funnily enough, I just read that message:)
 | |
| 
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: pgsql-hackers@postgreSQL.org
 | |
| Subject: Re: [HACKERS] Possible performance improvement: buffer replacement policy 
 | |
| In-reply-to: <200010161541.LAA06653@candle.pha.pa.us> 
 | |
| References: <200010161541.LAA06653@candle.pha.pa.us>
 | |
| Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| 	message dated "Mon, 16 Oct 2000 11:41:41 -0400"
 | |
| Date: Mon, 16 Oct 2000 11:49:52 -0400
 | |
| Message-ID: <26100.971711392@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| X-Mailing-List: pgsql-hackers@postgresql.org
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@hub.org
 | |
| Status: RO
 | |
| Content-Length: 947
 | |
| Lines: 19
 | |
| 
 | |
| Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| >> It looks like it wouldn't take too much work to replace shared buffers
 | |
| >> on the basis of LRU-2 instead of LRU, so I'm thinking about trying it.
 | |
| >> 
 | |
| >> Has anyone looked into this area?  Is there a better method to try?
 | |
| 
 | |
| > Sounds like a perfect idea.  Good luck.  :-)
 | |
| 
 | |
| Actually, the idea went down in flames :-(, but I neglected to report
 | |
| back to pghackers about it.  I did do some code to manage buffers as
 | |
| LRU-2.  I didn't have any good performance test cases to try it with,
 | |
| but Richard Brosnahan was kind enough to re-run the TPC tests previously
 | |
| published by Great Bridge with that code in place.  Wasn't any faster,
 | |
| in fact possibly a little slower, likely due to the extra CPU time spent
 | |
| on buffer freelist management.  It's possible that other scenarios might
 | |
| show a better result, but right now I feel pretty discouraged about the
 | |
| LRU-2 idea and am not pursuing it.
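
For reference, LRU-2 picks as its victim the page whose second-most-recent access is oldest, so a page touched only once is evicted before one that has been touched twice. What follows is a minimal Python sketch of that rule, not the code that was benchmarked; the class `LRU2Cache` and its methods are hypothetical names, and a logical counter stands in for timestamps.

```python
import itertools


class LRU2Cache:
    """Toy LRU-2 page cache (hypothetical; not PostgreSQL's buffer manager)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count(1)   # logical time instead of timestamps
        self.history = {}                 # page -> (previous access, latest access)

    def access(self, page):
        """Touch a page; return the evicted page, or None if nothing was evicted."""
        now = next(self.clock)
        evicted = None
        if page in self.history:
            _, last = self.history[page]
            self.history[page] = (last, now)
        else:
            if len(self.history) >= self.capacity:
                evicted = self._victim()
                del self.history[evicted]
            # A page seen only once has no second-to-last access: record 0,
            # which makes it the preferred victim.
            self.history[page] = (0, now)
        return evicted

    def _victim(self):
        # Oldest second-to-last access loses; ties go to the oldest last access.
        return min(self.history, key=lambda p: self.history[p])
```

With two slots and the access pattern a, a, b, c, plain LRU would evict a, whereas this sketch evicts b. The linear scan for the victim is only for brevity, though some extra bookkeeping per access is inherent in the scheme.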
 | |
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M3455@postgresql.org Fri Jan 19 13:18:12 2001
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA02092
 | |
| 	for <pgman@candle.pha.pa.us>; Fri, 19 Jan 2001 13:18:12 -0500 (EST)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0JIFJ037872;
 | |
| 	Fri, 19 Jan 2001 13:15:19 -0500 (EST)
 | |
| 	(envelope-from pgsql-hackers-owner+M3455@postgresql.org)
 | |
| Received: from sectorbase2.sectorbase.com ([208.48.122.131])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0JI7V036780
 | |
| 	for <pgsql-hackers@postgreSQL.org>; Fri, 19 Jan 2001 13:07:31 -0500 (EST)
 | |
| 	(envelope-from vmikheev@SECTORBASE.COM)
 | |
| Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
 | |
| 	id <DG1W4LRZ>; Fri, 19 Jan 2001 09:46:14 -0800
 | |
| Message-ID: <8F4C99C66D04D4118F580090272A7A234D329F@sectorbase1.sectorbase.com>
 | |
| From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
 | |
| To: "'Tom Lane'" <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Cc: pgsql-hackers@postgresql.org
 | |
| Subject: RE: [HACKERS] Possible performance improvement: buffer replacemen
 | |
| 	t policy 
 | |
| Date: Fri, 19 Jan 2001 10:07:27 -0800
 | |
| MIME-Version: 1.0
 | |
| X-Mailer: Internet Mail Service (5.5.2653.19)
 | |
| Content-Type: text/plain;
 | |
| 	charset="iso-8859-1"
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: RO
 | |
| 
 | |
| > > Tom, did we ever test this?  I think we did and found that 
 | |
| > > it was the same or worse, right?
 | |
| > 
 | |
| > I tried it and didn't see any noticeable improvement on the particular
 | |
| > test case I was using, so I got discouraged and didn't pursue the idea
 | |
| > further.  I'd like to come back to it someday, though.
 | |
| 
 | |
| I don't know how useful LRU-2 could be, but with WAL we should try
 | |
| to reuse clean free buffers first, not dirty ones, just to postpone
 | |
| writes as long as we can. (BTW, this is what Oracle does.)
 | |
| So we should probably put a new unfree dirty buffer just before the first
 | |
| dirty one in the LRU.
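
The clean-first preference described above might look roughly like this. It is a sketch under stated assumptions, not PostgreSQL or Oracle code: the LRU list is modelled as an `OrderedDict` from buffer id to a dirty flag, ordered oldest first, and `pick_victim` is a hypothetical name.

```python
from collections import OrderedDict


def pick_victim(lru):
    """Pick a buffer to reuse from a non-empty LRU-ordered map of id -> dirty flag.

    `lru` is ordered from least to most recently used.
    """
    for buf, dirty in lru.items():
        if not dirty:
            return buf          # oldest clean buffer: reuse costs no write
    # Every buffer is dirty: fall back to plain LRU and accept the write.
    return next(iter(lru))
```

For example, with buffers 1 (dirty), 2 (clean), 3 (dirty) in LRU order, the function returns 2 even though buffer 1 is older, which postpones the write of buffer 1.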
 | |
| 
 | |
| Vadim
 | |
| 
 | |
| From markw@mohawksoft.com Thu Jun  7 14:40:02 2001
 | |
| Return-path: <markw@mohawksoft.com>
 | |
| Received: from gromit.dotclick.com (ipn9-f8366.net-resource.net [216.204.83.66])
 | |
| 	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Ie1c14004
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 14:40:02 -0400 (EDT)
 | |
| Received: from mohawksoft.com (IDENT:markw@localhost.localdomain [127.0.0.1])
 | |
| 	by gromit.dotclick.com (8.9.3/8.9.3) with ESMTP id OAA04973;
 | |
| 	Thu, 7 Jun 2001 14:37:00 -0400
 | |
| Sender: markw@gromit.dotclick.com
 | |
| Message-ID: <3B1FC9CB.57C72AD6@mohawksoft.com>
 | |
| Date: Thu, 07 Jun 2001 14:36:59 -0400
 | |
| From: mlw <markw@mohawksoft.com>
 | |
| X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.2 i686)
 | |
| X-Accept-Language: en
 | |
| MIME-Version: 1.0
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: 7.2 items
 | |
| References: <200106071503.f57F32n03924@candle.pha.pa.us>
 | |
| Content-Type: text/plain; charset=us-ascii
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Status: RO
 | |
| 
 | |
| Bruce Momjian wrote:
 | |
| 
 | |
| > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| > >
 | |
| > > > Here is a small list of big TODO items.  I was wondering which ones
 | |
| > > > people were thinking about for 7.2?
 | |
| > >
 | |
| > > A friend of mine wants to use PostgreSQL instead of Oracle for a large
 | |
| > > application, but has run into a snag when speed comparisons looked
 | |
| > > good until the Oracle folks added a couple of BITMAP indexes.  I can't
 | |
| > > recall seeing any discussion about that here -- are there any plans?
 | |
| >
 | |
| > It is not on our list and I am not sure what they do.
 | |
| 
 | |
| Do you have access to any Oracle Documentation? There is a good explanation
 | |
| of them.
 | |
| 
 | |
| However, I will try to explain.
 | |
| 
 | |
| Suppose you have a table, locations, with 1,000,000 records.
 | |
| 
 | |
| In Oracle you do this:
 | |
| 
 | |
| create bitmap index bitmap_foo on locations (state) ;
 | |
| 
 | |
| For each unique value of 'state', Oracle will create a bitmap with 1,000,000
 | |
| bits in it, with a one representing a match and a zero representing no
 | |
| match. Record '0' in the table is represented by bit '0' in the bitmap,
 | |
| record '1' is represented by bit '1', record '2' by bit '2', and so on.
 | |
| 
 | |
| Where comparatively few distinct values are to be indexed in a
 | |
| large table, a bitmap index can be quite small and not suffer the N * log(N)
 | |
| disk I/O most tree-based indexes suffer. If the bitmap is fairly sparse or
 | |
| dense (or has runs of denseness and sparseness), it can be compressed
 | |
| very efficiently as well.
 | |
| 
 | |
| When the statement:
 | |
| 
 | |
| select * from locations where state = 'MA';
 | |
| 
 | |
| is executed, the bitmap is read into memory in very few disk operations
 | |
| (perhaps even as few as one or two). It is then a simple matter of rifling
 | |
| through the bitmap for '1's that indicate the record has the property
 | |
| 'state' = 'MA'.
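
To make the layout concrete, here is a small Python sketch of one bitmap per distinct value, with bit i standing for record i. It only illustrates the idea in this message; it says nothing about how Oracle actually stores or compresses its bitmaps, and the function names are made up for the example.

```python
def build_bitmap_index(values):
    """Return {distinct value: int bitmap}; bit i is set when row i has that value."""
    index = {}
    for row, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row)
    return index


def matching_rows(bitmap):
    """Yield the row numbers whose bit is set, lowest first."""
    row = 0
    while bitmap:
        if bitmap & 1:
            yield row
        bitmap >>= 1
        row += 1


states = ["MA", "OH", "MA", "NY", "OH"]
index = build_bitmap_index(states)
```

Here `list(matching_rows(index["MA"]))` gives `[0, 2]`, the positions of the 'MA' records, found without looking at the table itself.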
 | |
| 
 | |
| 
 | |
| From mascarm@mascari.com Thu Jun  7 15:36:25 2001
 | |
| Return-path: <mascarm@mascari.com>
 | |
| Received: from corvette.mascari.com (dhcp065-024-161-045.columbus.rr.com [65.24.161.45])
 | |
| 	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57JaOc21943
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:36:24 -0400 (EDT)
 | |
| Received: from ferrari (ferrari.mascari.com [192.168.2.1])
 | |
| 	by corvette.mascari.com (8.9.3/8.9.3) with SMTP id PAA25607;
 | |
| 	Thu, 7 Jun 2001 15:29:31 -0400
 | |
| Received: by localhost with Microsoft MAPI; Thu, 7 Jun 2001 15:34:18 -0400
 | |
| Message-ID: <01C0EF67.5105D2E0.mascarm@mascari.com>
 | |
| From: Mike Mascari <mascarm@mascari.com>
 | |
| Reply-To: "mascarm@mascari.com" <mascarm@mascari.com>
 | |
| To: "'mlw'" <markw@mohawksoft.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | |
| Subject: RE: [HACKERS] Re: 7.2 items
 | |
| Date: Thu, 7 Jun 2001 15:34:17 -0400
 | |
| Organization: Mascari Development Inc.
 | |
| X-Mailer: Microsoft Internet E-mail/MAPI - 8.0.0.4211
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain; charset="us-ascii"
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Status: RO
 | |
| 
 | |
| And in addition,
 | |
| 
 | |
| If you submitted the query:
 | |
| 
 | |
| SELECT * FROM addresses WHERE state = 'OH'
 | |
| AND areacode = '614'
 | |
| 
 | |
| Then, with bitmap indexes, the bitmaps are just logically ANDed 
 | |
| together, and the final bitmap determines the matching rows.
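
The ANDing step can be sketched in a few lines of Python, using plain integers as bitmaps. The sample rows and the helper names `bitmap_for` and `rows_from_bitmap` are invented for illustration; a real implementation would read prebuilt bitmaps from the index rather than scanning the rows.

```python
def bitmap_for(rows, column, value):
    """Build an integer bitmap: bit i is set when rows[i][column] == value."""
    bitmap = 0
    for i, row in enumerate(rows):
        if row[column] == value:
            bitmap |= 1 << i
    return bitmap


def rows_from_bitmap(bitmap):
    """Return the row numbers whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


addresses = [
    {"state": "OH", "areacode": "614"},
    {"state": "OH", "areacode": "216"},
    {"state": "MA", "areacode": "617"},
    {"state": "OH", "areacode": "614"},
]

# One bitwise AND combines both predicates.
both = bitmap_for(addresses, "state", "OH") & bitmap_for(addresses, "areacode", "614")
```

`rows_from_bitmap(both)` then yields rows 0 and 3, the only ones satisfying both conditions.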
 | |
| 
 | |
| Mike Mascari
 | |
| mascarm@mascari.com
 | |
| 
 | |
| -----Original Message-----
 | |
| From:	mlw [SMTP:markw@mohawksoft.com]
 | |
| 
 | |
| Bruce Momjian wrote:
 | |
| 
 | |
| > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| > >
 | |
| > > > Here is a small list of big TODO items.  I was wondering which ones
 | |
| > > > people were thinking about for 7.2?
 | |
| > >
 | |
| > > A friend of mine wants to use PostgreSQL instead of Oracle for a large
 | |
| > > application, but has run into a snag when speed comparisons looked
 | |
| > > good until the Oracle folks added a couple of BITMAP indexes.  I can't
 | |
| > > recall seeing any discussion about that here -- are there any plans?
 | |
| >
 | |
| > It is not on our list and I am not sure what they do.
 | |
| 
 | |
| Do you have access to any Oracle Documentation? There is a good explanation
 | |
| of them.
 | |
| 
 | |
| However, I will try to explain.
 | |
| 
 | |
| If you have a table, locations. It has 1,000,000 records.
 | |
| 
 | |
| In oracle you do this:
 | |
| 
 | |
| create bitmap index bitmap_foo on locations (state) ;
 | |
| 
 | |
| For each unique value of 'state' oracle will create a bitmap with 1,000,000
 | |
| bits in it. With a one representing a match and a zero representing no
 | |
| match. Record '0' in the table is represented by bit '0' in the bitmap,
 | |
| record '1' is represented by bit '1', record two by bit '2' and so on.
 | |
| 
 | |
| In a table where comparatively few different values are to be indexed in a
 | |
| large table, a bitmap index can be quite small and not suffer the N * log(N)
 | |
| disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
 | |
| dense (or have periods of denseness and sparseness), it can be compressed
 | |
| very efficiently as well.
 | |
| 
 | |
| When the statement:
 | |
| 
 | |
| select * from locations where state = 'MA';
 | |
| 
 | |
| Is executed, the bitmap is read into memory in very few disk operations.
 | |
| (Perhaps even as few as one or two). It is a simple operation of rifling
 | |
| through the bitmap for '1's that indicate the record has the property,
 | |
| 'state' = 'MA';
 | |
| 
 | |
| 
 | |
| 
 | |
| From oleg@sai.msu.su Thu Jun  7 15:39:15 2001
 | |
| Return-path: <oleg@sai.msu.su>
 | |
| Received: from ra.sai.msu.su (ra.sai.msu.su [158.250.29.2])
 | |
| 	by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f57Jd7c22010
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 7 Jun 2001 15:39:08 -0400 (EDT)
 | |
| Received: from ra (ra [158.250.29.2])
 | |
| 	by ra.sai.msu.su (8.9.3/8.9.3) with ESMTP id WAA07783;
 | |
| 	Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
 | |
| Date: Thu, 7 Jun 2001 22:38:20 +0300 (GMT)
 | |
| From: Oleg Bartunov <oleg@sai.msu.su>
 | |
| X-X-Sender: <megera@ra.sai.msu.su>
 | |
| To: mlw <markw@mohawksoft.com>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Re: 7.2 items
 | |
| In-Reply-To: <3B1FC9CB.57C72AD6@mohawksoft.com>
 | |
| Message-ID: <Pine.GSO.4.33.0106072234120.6015-100000@ra.sai.msu.su>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: RO
 | |
| 
 | |
| I think it's possible to implement bitmap indexes with a little
 | |
| effort using GiST. At least I know of one implementation:
 | |
| http://www.it.iitb.ernet.in/~rvijay/dbms/proj/
 | |
| If you are interested, you could implement bitmap indexes yourself;
 | |
| unfortunately, we're very busy.
 | |
| 
 | |
| 	Oleg
 | |
| On Thu, 7 Jun 2001, mlw wrote:
 | |
| 
 | |
| > Bruce Momjian wrote:
 | |
| >
 | |
| > > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| > > >
 | |
| > > > > Here is a small list of big TODO items.  I was wondering which ones
 | |
| > > > > people were thinking about for 7.2?
 | |
| > > >
 | |
| > > > A friend of mine wants to use PostgreSQL instead of Oracle for a large
 | |
| > > > application, but has run into a snag when speed comparisons looked
 | |
| > > > good until the Oracle folks added a couple of BITMAP indexes.  I can't
 | |
| > > > recall seeing any discussion about that here -- are there any plans?
 | |
| > >
 | |
| > > It is not on our list and I am not sure what they do.
 | |
| >
 | |
| > Do you have access to any Oracle Documentation? There is a good explanation
 | |
| > of them.
 | |
| >
 | |
| > However, I will try to explain.
 | |
| >
 | |
| > If you have a table, locations. It has 1,000,000 records.
 | |
| >
 | |
| > In oracle you do this:
 | |
| >
 | |
| > create bitmap index bitmap_foo on locations (state) ;
 | |
| >
 | |
| > For each unique value of 'state' oracle will create a bitmap with 1,000,000
 | |
| > bits in it. With a one representing a match and a zero representing no
 | |
| > match. Record '0' in the table is represented by bit '0' in the bitmap,
 | |
| > record '1' is represented by bit '1', record two by bit '2' and so on.
 | |
| >
 | |
| > In a table where comparatively few different values are to be indexed in a
 | |
| > large table, a bitmap index can be quite small and not suffer the N * log(N)
 | |
| > disk I/O most tree based indexes suffer. If the bitmap is fairly sparse or
 | |
| > dense (or have periods of denseness and sparseness), it can be compressed
 | |
| > very efficiently as well.
 | |
| >
 | |
| > When the statement:
 | |
| >
 | |
| > select * from locations where state = 'MA';
 | |
| >
 | |
| > Is executed, the bitmap is read into memory in very few disk operations.
 | |
| > (Perhaps even as few as one or two). It is a simple operation of rifling
 | |
| > through the bitmap for '1's that indicate the record has the property,
 | |
| > 'state' = 'MA';
 | |
| >
 | |
| >
 | |
| 
 | |
| 	Regards,
 | |
| 		Oleg
 | |
| _____________________________________________________________
 | |
| Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
 | |
| Sternberg Astronomical Institute, Moscow University (Russia)
 | |
| Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
 | |
| phone: +007(095)939-16-83, +007(095)939-23-83
 | |
| 
 | |
| 
 | |
| From pgsql-general-owner+M2497@hub.org Fri Jun 16 18:31:03 2000
 | |
| Received: from renoir.op.net (root@renoir.op.net [207.29.195.4])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA04165
 | |
| 	for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:31:01 -0400 (EDT)
 | |
| Received: from hub.org (root@hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.12 $) with ESMTP id RAA13110 for <pgman@candle.pha.pa.us>; Fri, 16 Jun 2000 17:20:12 -0400 (EDT)
 | |
| Received: from hub.org (majordom@localhost [127.0.0.1])
 | |
| 	by hub.org (8.10.1/8.10.1) with SMTP id e5GLDaM14477;
 | |
| 	Fri, 16 Jun 2000 17:13:36 -0400 (EDT)
 | |
| Received: from home.dialix.com ([203.15.150.26])
 | |
| 	by hub.org (8.10.1/8.10.1) with ESMTP id e5GLCQM14064
 | |
| 	for <pgsql-general@postgresql.org>; Fri, 16 Jun 2000 17:12:27 -0400 (EDT)
 | |
| Received: from nemeton.com.au ([202.76.153.71])
 | |
| 	by home.dialix.com (8.9.3/8.9.3/JustNet) with SMTP id HAA95516
 | |
| 	for <pgsql-general@postgresql.org>; Sat, 17 Jun 2000 07:11:44 +1000 (EST)
 | |
| 	(envelope-from giles@nemeton.com.au)
 | |
| Received: (qmail 10213 invoked from network); 16 Jun 2000 09:52:29 -0000
 | |
| Received: from nemeton.com.au (203.8.3.17)
 | |
|   by nemeton.com.au with SMTP; 16 Jun 2000 09:52:29 -0000
 | |
| To: Jurgen Defurne <defurnj@glo.be>
 | |
| cc: Mark Stier <kalium@gmx.de>,
 | |
|         postgreSQL general mailing list <pgsql-general@postgresql.org>
 | |
| Subject: Re: [GENERAL] optimization by removing the file system layer? 
 | |
| In-Reply-To: Message from Jurgen Defurne <defurnj@glo.be> 
 | |
|    of "Thu, 15 Jun 2000 20:26:57 +0200." <39491FF1.E1E583F8@glo.be> 
 | |
| Date: Fri, 16 Jun 2000 19:52:28 +1000
 | |
| Message-ID: <10210.961149148@nemeton.com.au>
 | |
| From: Giles Lean <giles@nemeton.com.au>
 | |
| X-Mailing-List: pgsql-general@postgresql.org
 | |
| Precedence: bulk
 | |
| Sender: pgsql-general-owner@hub.org
 | |
| Status: OR
 | |
| 
 | |
| 
 | |
| 
 | |
| > I think that the Un*x filesystem is one of the reasons that large
 | |
| > database vendors rather use raw devices, than filesystem storage
 | |
| > files.
 | |
| 
 | |
| This used to be the preference, back in the late 80s and possibly
 | |
| early 90s.  I'm seeing a preference toward using the filesystem now,
 | |
| possibly with some sort of async I/O and co-operation from the OS
 | |
| filesystem about interactions with the filesystem cache.
 | |
| 
 | |
| Performance preferences don't stand still.  The hardware changes, the
 | |
| software changes, the volume of data changes, and different solutions
 | |
| become preferable.
 | |
| 
 | |
| > Using a raw device on the disk gives them the possibility to have
 | |
| > complete control over their files, indices and objects without being
 | |
| > bothered by the operating system.
 | |
| >
 | |
| > This speeds up things in several ways :
 | |
| > - the least possible OS intervention
 | |
| 
 | |
| Not that this is especially useful, necessarily.  If the "raw" device
 | |
| is in fact managed by a logical volume manager doing mirroring onto
 | |
| some sort of storage array there is still plenty of OS code involved.
 | |
| 
 | |
| The cost of using a filesystem in addition may not be much if anything
 | |
| and of course a filesystem is considerably more flexible to
 | |
| administer (backup, move, change size, check integrity, etc.)
 | |
| 
 | |
| > - choose block sizes according to applications
 | |
| > - reducing fragmentation
 | |
| > - packing data in nearby cylinders
 | |
| 
 | |
| ... but when this storage area is spread over multiple mechanisms in a
 | |
| smart storage array with write caching, you've no idea what is where
 | |
| anyway.  Better to let the hardware or at least the OS manage this;
 | |
| there are so many levels of caching between a database and the
 | |
| magnetic media that working hard to influence layout is almost
 | |
| certainly a waste of time.
 | |
| 
 | |
| Kirk McKusick tells a lovely story that once upon a time it used to be
 | |
| sensible to check some registers on a particular disk controller to
 | |
| find out where the heads were when scheduling I/O.  Needless to say,
 | |
| that is history now!
 | |
| 
 | |
| There's a considerable cost in complexity and code in using "raw"
 | |
| storage too, and it's not a one off cost: as the technologies change,
 | |
| the "fast" way to do things will change and the code will have to be
 | |
| updated to match.  Better to leave this to the OS vendor where
 | |
| possible, and take advantage of the tuning they do.
 | |
| 
 | |
| > - Anyone other ideas -> the sky is the limit here
 | |
| 
 | |
| > It also aids portability, at least on platforms that have an
 | |
| > equivalent of a raw device.
 | |
| 
 | |
| I don't understand that claim.  Not much is portable about raw
 | |
| devices, and they're typically not nearly as well documented as the
 | |
| filesystem interfaces.
 | |
| 
 | |
| > It is also independent of the standard implemented Un*x filesystems,
 | |
| > for which you will have to pay extra if you want to take extra
 | |
| > measures against power loss.
 | |
| 
 | |
| Rather, it is worse.  With a Unix filesystem you get quite defined
 | |
| semantics about what is written when.
 | |
| 
 | |
| > The problem with e.g. e2fs, is that it is not robust enough if a CPU
 | |
| > fails.
 | |
| 
 | |
| ext2fs doesn't even claim to have Unix filesystem semantics.
 | |
| 
 | |
| Regards,
 | |
| 
 | |
| Giles
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M1795@postgresql.org Thu Dec  7 18:47:52 2000
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA09172
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 18:47:52 -0500 (EST)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB7NjFP10612;
 | |
| 	Thu, 7 Dec 2000 18:45:15 -0500 (EST)
 | |
| 	(envelope-from pgsql-hackers-owner+M1795@postgresql.org)
 | |
| Received: from thor.tht.net (thor.tht.net [209.47.145.4])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB7N6BP08233
 | |
| 	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:06:11 -0500 (EST)
 | |
| 	(envelope-from bright@fw.wintelcom.net)
 | |
| Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20])
 | |
| 	by thor.tht.net (8.9.3/8.9.3) with ESMTP id SAA97456
 | |
| 	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 18:57:32 GMT
 | |
| 	(envelope-from bright@fw.wintelcom.net)
 | |
| Received: (from bright@localhost)
 | |
| 	by fw.wintelcom.net (8.10.0/8.10.0) id eB7MvWE21269
 | |
| 	for pgsql-hackers@postgresql.org; Thu, 7 Dec 2000 14:57:32 -0800 (PST)
 | |
| Date: Thu, 7 Dec 2000 14:57:32 -0800
 | |
| From: Alfred Perlstein <bright@wintelcom.net>
 | |
| To: pgsql-hackers@postgresql.org
 | |
| Subject: [HACKERS] Patches with vacuum fixes available for 7.0.x
 | |
| Message-ID: <20001207145732.X16205@fw.wintelcom.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain; charset=us-ascii
 | |
| Content-Disposition: inline
 | |
| User-Agent: Mutt/1.2.5i
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: ORr
 | |
| 
 | |
| We recently had a very satisfactory contract completed by
 | |
| Vadim.
 | |
| 
 | |
| Basically Vadim has been able to reduce the amount of time
 | |
| taken by a vacuum from 10-15 minutes down to under 10 seconds.
 | |
| 
 | |
| We've been running with these patches under heavy load for
 | |
| about a week now without any problems except one:
 | |
|   don't 'lazy' (new option for vacuum) a table which has just
 | |
|   had an index created on it, or at least don't expect it to
 | |
|   take any less time than a normal vacuum would.
 | |
| 
 | |
| There are three patchsets and they are available at:
 | |
| 
 | |
| http://people.freebsd.org/~alfred/vacfix/
 | |
| 
 | |
| complete diff:
 | |
| http://people.freebsd.org/~alfred/vacfix/v.diff
 | |
| 
 | |
| only lazy vacuum option to speed up index vacuums:
 | |
| http://people.freebsd.org/~alfred/vacfix/vlazy.tgz
 | |
| 
 | |
| only lazy vacuum option to only scan from start of modified
 | |
| data:
 | |
| http://people.freebsd.org/~alfred/vacfix/mnmb.tgz
 | |
| 
 | |
| Although the patches are for 7.0.x I'm hoping that they
 | |
| can be forward ported (if Vadim hasn't done it already)
 | |
| to 7.1.
 | |
| 
 | |
| enjoy!
 | |
| 
 | |
| -- 
 | |
| -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | |
| "I have the heart of a child; I keep it in a jar on my desk."
 | |
| 
 | |
| From pgsql-hackers-owner+M1809@postgresql.org Thu Dec  7 20:27:39 2000
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11827
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 7 Dec 2000 20:27:38 -0500 (EST)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eB81PsP22362;
 | |
| 	Thu, 7 Dec 2000 20:25:54 -0500 (EST)
 | |
| 	(envelope-from pgsql-hackers-owner+M1809@postgresql.org)
 | |
| Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eB81JkP21783
 | |
| 	for <pgsql-hackers@postgresql.org>; Thu, 7 Dec 2000 20:19:46 -0500 (EST)
 | |
| 	(envelope-from bright@fw.wintelcom.net)
 | |
| Received: (from bright@localhost)
 | |
| 	by fw.wintelcom.net (8.10.0/8.10.0) id eB81JwU25447;
 | |
| 	Thu, 7 Dec 2000 17:19:58 -0800 (PST)
 | |
| Date: Thu, 7 Dec 2000 17:19:58 -0800
 | |
| From: Alfred Perlstein <bright@wintelcom.net>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: pgsql-hackers@postgresql.org
 | |
| Subject: Re: [HACKERS] Patches with vacuum fixes available for 7.0.x
 | |
| Message-ID: <20001207171958.B16205@fw.wintelcom.net>
 | |
| References: <20001207145732.X16205@fw.wintelcom.net> <28791.976236143@sss.pgh.pa.us>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain; charset=us-ascii
 | |
| Content-Disposition: inline
 | |
| User-Agent: Mutt/1.2.5i
 | |
| In-Reply-To: <28791.976236143@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Thu, Dec 07, 2000 at 07:42:23PM -0500
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| * Tom Lane <tgl@sss.pgh.pa.us> [001207 17:10] wrote:
 | |
| > Alfred Perlstein <bright@wintelcom.net> writes:
 | |
| > > Basically Vadim has been able to reduce the amount of time
 | |
| > > taken by a vacuum from 10-15 minutes down to under 10 seconds.
 | |
| > 
 | |
| > Cool.  What's it do, exactly?
 | |
| 
 | |
| ================================================================
 | |
| 
 | |
| The first is a bonus that Vadim gave us to speed up index
 | |
| vacuums. I'm not sure I understand it completely, but it
 | |
| works really well. :)
 | |
| 
 | |
| here's the README he gave us:
 | |
| 
 | |
|            Vacuum LAZY index cleanup option
 | |
| 
 | |
| The LAZY vacuum option introduces a new way of cleaning up indices.
 | |
| Instead of reading the entire index file to remove index tuples
 | |
| pointing to deleted table records, with the LAZY option vacuum
 | |
| performs index scans using keys fetched from the table records
 | |
| to be deleted. Vacuum checks each result returned by the index
 | |
| scan to see whether it points to the target heap record and, if so,
 | |
| removes the corresponding index tuple.
 | |
| This can greatly speed up index cleaning if not many
 | |
| table records were deleted/modified between vacuum runs.
 | |
| Vacuum uses the new option only on the user's demand.
 | |
| 
 | |
| New vacuum syntax is:
 | |
| 
 | |
| vacuum [verbose] [analyze] [lazy] [table [(columns)]]
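
As a rough illustration of the README's idea (and only that: this is not Vadim's patch), the Python sketch below removes index entries by looking up each dead record's key instead of sweeping the whole index. A sorted list of `(key, heap_row_id)` pairs stands in for a B-tree, and `lazy_index_cleanup` is a hypothetical name.

```python
import bisect


def lazy_index_cleanup(index, dead_rows):
    """Remove index entries for dead heap rows by key lookup, not a full scan.

    index     -- sorted list of (key, heap_row_id) pairs standing in for a B-tree
    dead_rows -- {heap_row_id: key} for the heap records being removed
    """
    for row_id, key in dead_rows.items():
        # "Index scan" on the key fetched from the heap record:
        # (key,) sorts before every (key, row_id) pair with the same key.
        pos = bisect.bisect_left(index, (key,))
        while pos < len(index) and index[pos][0] == key:
            if index[pos][1] == row_id:      # entry points at the target record
                del index[pos]
                break
            pos += 1
    return index
```

The work is proportional to the number of dead records times the cost of a lookup, which is why the README expects a gain only when few records changed between vacuum runs.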
 | |
| 
 | |
| ================================================================
 | |
| 
 | |
| The second is one of the suggestions I gave on the lists a while
 | |
| back: keeping track of the "last dirtied" block in the data files
 | |
| so as to scan only the tail end of the file for deleted rows. I think
 | |
| what he did instead was keep a table that holds all the modified
 | |
| blocks, and vacuum scans only those:
 | |
| 
 | |
|               Minimal Number Modified Block (MNMB)
 | |
| 
 | |
| This feature tracks the MNMB of the required tables with triggers,
 | |
| so that vacuum can avoid reading unmodified table pages. The triggers
 | |
| store the MNMB in per-table files in a specified directory
 | |
| ($LIBDIR/contrib/mnmb by default) and create these files if they do
 | |
| not exist.
 | |
| 
 | |
| Vacuum first looks up functions
 | |
| 
 | |
| mnmb_getblock(Oid databaseId, Oid tableId)
 | |
| mnmb_setblock(Oid databaseId, Oid tableId, Oid block)
 | |
| 
 | |
| in the catalog. If *both* functions are found *and* no
 | |
| ANALYZE option was specified, then vacuum calls mnmb_getblock to obtain
 | |
| the MNMB for the table being vacuumed and starts reading this table from
 | |
| the block number returned. After the table has been processed, vacuum calls
 | |
| mnmb_setblock to update the data in the file to the last table block number.
 | |
| Neither mnmb_getblock nor mnmb_setblock tries to create the file.
 | |
| If there is no file for the table being vacuumed, then mnmb_getblock
 | |
| returns 0 and mnmb_setblock does nothing.
 | |
| mnmb_setblock() may be used to set the MNMB in the file to 0 and force
 | |
| vacuum to read the entire table if required.
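
The flow the README describes can be modelled in Python roughly as follows. This is a toy model under stated assumptions, not the patch itself: a dictionary stands in for the per-table files, `on_row_modified` is a hypothetical stand-in for the trigger, and `vacuum_blocks` merely lists the blocks a vacuum would read. Only the names `mnmb_getblock` and `mnmb_setblock` come from the README.

```python
mnmb_files = {}   # (database_id, table_id) -> MNMB; stands in for per-table files


def on_row_modified(database_id, table_id, block):
    """Trigger stand-in: create the entry if missing, keep the minimum block."""
    key = (database_id, table_id)
    mnmb_files[key] = min(mnmb_files.get(key, block), block)


def mnmb_getblock(database_id, table_id):
    """Return the recorded MNMB, or 0 when no file exists for the table."""
    return mnmb_files.get((database_id, table_id), 0)


def mnmb_setblock(database_id, table_id, block):
    """Update the recorded MNMB; do nothing (never create) when no file exists."""
    if (database_id, table_id) in mnmb_files:
        mnmb_files[(database_id, table_id)] = block


def vacuum_blocks(database_id, table_id, table_block_count):
    """Return the block numbers a vacuum would read under this scheme."""
    start = mnmb_getblock(database_id, table_id)
    blocks = list(range(start, table_block_count))
    # After processing, record the last table block number, per the README.
    mnmb_setblock(database_id, table_id, max(table_block_count - 1, 0))
    return blocks
```

A table with no recorded MNMB is read from block 0, as before; once the trigger has recorded that only blocks 3 and 4 of a five-block table changed, the next vacuum reads just those two.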
 | |
| 
 | |
| To compile MNMB you have to add -DMNMB to CUSTOM_COPT
 | |
| in src/Makefile.custom.
 | |
| 
 | |
| -- 
 | |
| -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | |
| "I have the heart of a child; I keep it in a jar on my desk."
 | |
| 
 | |
| From pgsql-general-owner+M4010@postgresql.org Mon Feb  5 18:50:47 2001
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02209
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 5 Feb 2001 18:50:46 -0500 (EST)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15Nn8x86486;
 | |
| 	Mon, 5 Feb 2001 18:49:08 -0500 (EST)
 | |
| 	(envelope-from pgsql-general-owner+M4010@postgresql.org)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f15N7Ux81124
 | |
| 	for <pgsql-general@postgresql.org>; Mon, 5 Feb 2001 18:07:30 -0500 (EST)
 | |
| 	(envelope-from pgsql-general-owner@postgresql.org)
 | |
| Received: from news.tht.net (news.hub.org [216.126.91.242])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0V0Twq69854
 | |
| 	for <pgsql-general@postgresql.org>; Tue, 30 Jan 2001 19:29:58 -0500 (EST)
 | |
| 	(envelope-from news@news.tht.net)
 | |
| Received: (from news@localhost)
 | |
| 	by news.tht.net (8.11.1/8.11.1) id f0V0RAO01011
 | |
| 	for pgsql-general@postgresql.org; Tue, 30 Jan 2001 19:27:10 -0500 (EST)
 | |
| 	(envelope-from news)
 | |
| From: Mike Hoskins <mikehoskins@yahoo.com>
 | |
| X-Newsgroups: comp.databases.postgresql.general
 | |
| Subject: Re: [GENERAL] MySQL file system
 | |
| Date: Tue, 30 Jan 2001 18:30:36 -0600
 | |
| Organization: Hub.Org Networking Services (http://www.hub.org)
 | |
| Lines: 120
 | |
| Message-ID: <3A775CAB.C416AA16@yahoo.com>
 | |
| References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain; charset=us-ascii
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Complaints-To: scrappy@hub.org
 | |
| X-Mailer: Mozilla 4.76 [en] (Windows NT 5.0; U)
 | |
| X-Accept-Language: en
 | |
| To: pgsql-general@postgresql.org
 | |
| Precedence: bulk
 | |
| Sender: pgsql-general-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| This idea is such a popular (even old) one that Oracle developed it for 8i --
 | |
| IFS.  Yep, AS/400 has had it forever, and BeOS is another example.  Informix has
 | |
| had its DataBlades for years, as well.  In fact, Reiser-FS is an FS implemented
 | |
| on a DB, albeit probably not a SQL DB.  AIX's LVM and JFS are extent/DB-based, as
 | |
| well. Let's see now, why would all those guys do that?  (Now, some of those that
 | |
| aren't SQL-based probably won't allow SQL queries on files, so just think about
 | |
| those that do, for a minute)....
 | |
| 
 | |
| Rather than asking why, a far better question is why not?  There is SO much
 | |
| functionality to be gained here that it's silly to ask why.  At a higher level,
 | |
| treating BLOBs as files and as DB entries simultaneously has so many uses, that
 | |
| one has trouble answering the question properly without the puzzled stare back
 | |
| at the questioner.  Again, look at the above list, particularly at AS/400 -- the
 | |
| entire OS's FS sits on top of DB2!
 | |
| 
 | |
| For example, think how easily dynamically generated web sites could access online
 | |
| catalog information, with all those JPEG's, GIFs, PNGs, HTML files, Text files,
 | |
| .PDF's, etc., both in the DB and in the FS.  This would be so much easier to
 | |
| maintain, when you have webmasters, web designers, artists, programmers,
 | |
| sysadmins, dba's, etc., all trying to manage a big, dynamic, graphics-rich web
 | |
| site.  Who cares if the FS is a bit slow, as long as it's not too slow?  That's
 | |
| not the point, anyway.
 | |
| 
 | |
| The point is easy access to data:  asset management, version control, the
 | |
| ability to access the same data as a file and as a BLOB simultaneously, the
 | |
| ability to replicate more easily, the ability to use more tools on the same info,
 | |
| etc.  It's not for speed, per se; instead, it's for accessibility.
 | |
| 
 | |
| Think about this issue.  You have some already compiled text-based program that
 | |
| works on binary files, but not on databases -- it was simply never designed into
 | |
| the program.  How are you going to get your graphics BLOBs into that program?
 | |
| Oh yeah, let's write another program to transform our data into files, first,
 | |
| then after processing delete them in some cleanup routine....  Why?  If you have
 | |
| a DB'ed FS, then file data can simultaneously have two views -- one for the DB
 | |
| and one as an FS.  (You can easily reverse the scenario.)  Not only does this
 | |
| save time and disk space; it saves you from having to pay for the most expensive
 | |
| element of all -- programmer time.
 | |
| 
 | |
| BTW, once this FS-on-a-DB concept really sinks in, imagine how tightly
 | |
| integrated Linux/Unix apps could be written.  Imagine if a bunch of GPL'ed
 | |
| software started coding for this and used this as a means to exchange data, all
 | |
| using a common set of libraries.  You could get to the point of uniting files,
 | |
| BLOBs, data of all sorts, IPC, version control, etc., all under one umbrella,
 | |
| especially if XML were the means of exchanging data.  Heck, distributed
 | |
| authentication, file access, data access, etc., could be improved greatly.
 | |
| Well, this paragraph sounds like flame bait, but really consider the
 | |
| ramifications.  Also, read the next paragraph....
 | |
| 
 | |
| Something like this *has* existed for Postgres for a long time -- PGFS, by Brian
 | |
| Bartholomew.  It's even supposedly matured with age.  Unfortunately, I cannot
 | |
| get to http://www.wv.com/ (Working Version's main site).  Working Version is a
 | |
| version control system that keeps old versions of files around in the FS.  It
 | |
| uses PG as the back-end DB and lets you mount it like another FS.  It's
 | |
| supposedly an awesome system, but where is it?  It's not some clunky korbit
 | |
| thingy, either.  (If someone can find it, please let me know by email, if
 | |
| possible.)
 | |
| 
 | |
| The only thing I can find on this is from a Google search, which caches
 | |
| everything but the actual software:
 | |
| 
 | |
| http://www.google.com/search?q=pgfs+postgres&num=100&hl=en&lr=lang_en&newwindow=1&safe=active
 | |
| 
 | |
| Also, there is the Perl-FS that can be transformed into something like PGFS:
 | |
| http://www.assurdo.com/perlfs/  It allows you to write Perl code that can mount
 | |
| various protocols or data types as an FS, in user space.  (One example is the
 | |
| ability to mount FTP sites, BTW.)
 | |
| 
 | |
| Instead of ridiculing something you've never tried, consider that MySQL-FS,
 | |
| Oracle (IFS), Informix (DataBlades), AS/400 (DB/2), BeOS, and Reiser-FS are
 | |
| doing this today.  Do you want to be left behind and let them tell us what it's
 | |
| good for?  Or, do we want this for PG?  (Reiser-FS, BTW, is FASTER than ext2,
 | |
| but has no SQL hooks).
 | |
| 
 | |
| There were many posts on this on slashdot:
 | |
|     http://slashdot.org/article.pl?sid=01/01/16/1855253&mode=thread
 | |
|     (I wrote some comments here, as well, just look for mikehoskins)
 | |
| 
 | |
| I, for one, want to see this succeed for MySQL, PostgreSQL, msql, etc.  It's an
 | |
| awesome feature that doesn't need to be speedy because it can save HUMANS time.
 | |
| 
 | |
| The question really is, "When do we want to catch up to everyone else?"  We are
 | |
| always moving to higher levels of abstraction, anyway, so it's just a matter of
 | |
| time.  PG should participate.
 | |
| 
 | |
| 
 | |
| Adam Lang wrote:
 | |
| 
 | |
| > I wasn't following the thread too closely, but database for a filesystem has
 | |
| > been done.  BeOS uses a database for a filesystem as well as AS/400 and
 | |
| > Mainframes.
 | |
| >
 | |
| > Adam Lang
 | |
| > Systems Engineer
 | |
| > Rutgers Casualty Insurance Company
 | |
| > http://www.rutgersinsurance.com
 | |
| > ----- Original Message -----
 | |
| > From: "Alfred Perlstein" <bright@wintelcom.net>
 | |
| > To: "Robert D. Nelson" <RDNELSON@co.centre.pa.us>
 | |
| > Cc: "Joseph Shraibman" <jks@selectacast.net>; "Karl DeBisschop"
 | |
| > <karl@debisschop.net>; "Ned Lilly" <ned@greatbridge.com>; "PostgreSQL
 | |
| > General" <pgsql-general@postgresql.org>
 | |
| > Sent: Wednesday, January 17, 2001 12:23 PM
 | |
| > Subject: Re: [GENERAL] MySQL file system
 | |
| >
 | |
| > > * Robert D. Nelson <RDNELSON@co.centre.pa.us> [010117 05:17] wrote:
 | |
| > > > >Raw disk access allows:
 | |
| > > >
 | |
| > > > If I'm correct, mysql is providing a filesystem, not a way to access raw
 | |
| > > > disk, like Oracle does. Huge difference there - with a filesystem, you
 | |
| > have
 | |
| > > > overhead of FS *and* SQL at the same time.
 | |
| > >
 | |
| > > Oh, so it's sort of like /proc for mysql?
 | |
| > >
 | |
| > > What a terrible waste of time and resources. :(
 | |
| > >
 | |
| > > --
 | |
| > > -Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | |
| > > "I have the heart of a child; I keep it in a jar on my desk."
 | |
| 
 | |
| 
 | |
| From pgsql-general-owner+M4049@postgresql.org Tue Feb  6 01:26:19 2001
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA21425
 | |
| 	for <pgman@candle.pha.pa.us>; Tue, 6 Feb 2001 01:26:18 -0500 (EST)
 | |
| Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f166Nxx26400;
 | |
| 	Tue, 6 Feb 2001 01:23:59 -0500 (EST)
 | |
| 	(envelope-from pgsql-general-owner+M4049@postgresql.org)
 | |
| Received: from simecity.com ([202.188.254.2])
 | |
| 	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f166GUx25754
 | |
| 	for <pgsql-general@postgresql.org>; Tue, 6 Feb 2001 01:16:30 -0500 (EST)
 | |
| 	(envelope-from lyeoh@pop.jaring.my)
 | |
| Received: (from mail@localhost)
 | |
| 	by simecity.com (8.9.3/8.8.7) id OAA23910;
 | |
| 	Tue, 6 Feb 2001 14:28:48 +0800
 | |
| Received: from <lyeoh@pop.jaring.my> (ilab2.mecomb.po.my [192.168.3.22]) by cirrus.simecity.com via smap (V2.1)
 | |
| 	id xma023908; Tue, 6 Feb 01 14:28:34 +0800
 | |
| Message-ID: <3.0.5.32.20010206141555.00a3d100@192.228.128.13>
 | |
| X-Sender: lyeoh@192.228.128.13
 | |
| X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.5 (32)
 | |
| Date: Tue, 06 Feb 2001 14:15:55 +0800
 | |
| To: Mike Hoskins <mikehoskins@yahoo.com>, pgsql-general@postgresql.org
 | |
| From: Lincoln Yeoh <lyeoh@pop.jaring.my>
 | |
| Subject: [GENERAL] Re: MySQL file system
 | |
| In-Reply-To: <3A775CF7.3C5F1909@yahoo.com>
 | |
| References: <016e01c080b7$ea554080$330a0a0a@6014cwpza006>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain; charset="us-ascii"
 | |
| Precedence: bulk
 | |
| Sender: pgsql-general-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| What you're saying seems to be to have a data structure where the same data
 | |
| can be accessed in both the filesystem style and the RDBMS style. How does
 | |
| that work? How is the mapping done between both structures? Slapping a
 | |
| filesystem on top of an RDBMS doesn't do that, does it?
 | |
| 
 | |
| Most filesystems are basically databases already, just differently
 | |
| structured and featured databases. And so far most of them do their job
 | |
| pretty well. You move a folder/directory somewhere, and everything inside
 | |
| it moves. Tons of data are already arranged in that form. Though porting
 | |
| over data from one filesystem to another is not always straightforward,
 | |
| RDBMSes are far worse.
 | |
| 
 | |
| Maybe what would be nice is not a filesystem based on a database, rather
 | |
| one influenced by databases. One with a decent full-text index for data and
 | |
| filenames, where you have the option to ignore or not ignore
 | |
| nonalphanumerics and still get an indexed search.
 | |
| 
 | |
| Then perhaps we could do something like the following:
 | |
| 
 | |
| select file.name from path '/var/logs/' where file.name like '%.log%' and
 | |
| file.lastmodified > '2000/1/1' and file.contents =~ 'te_st[0-9]+\.gif$' use
 | |
| index
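
The query above is wishful pseudo-SQL, but its three predicates (name pattern, modification time, content regex) can be approximated today with a plain directory walk. The following is a minimal sketch only: `find_files` and its parameters are hypothetical names, and the un-indexed walk stands in for the indexed search the message imagines.

```python
import re
from pathlib import Path


def find_files(root, name_contains, modified_after, content_regex):
    """Return sorted names of files under root that satisfy all three predicates.

    Hypothetical helper: a linear walk, not the indexed lookup the
    pseudo-SQL asks for. modified_after is a Unix timestamp in seconds.
    """
    # MULTILINE lets '$' match at the end of any line, not just end of file.
    pattern = re.compile(content_regex, re.MULTILINE)
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if name_contains not in path.name:         # file.name like '%...%'
            continue
        if path.stat().st_mtime <= modified_after:  # file.lastmodified > ...
            continue
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue                                # unreadable file: skip it
        if pattern.search(text):                    # file.contents =~ ...
            matches.append(path.name)
    return sorted(matches)
```

Passing the directory, the substring ".log", a timestamp for 2000/1/1 and the regex from the query would mirror the example; every call rereads file contents, which is exactly the cost a maintained index would avoid.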
 | |
| 
 | |
| Checkpoints would be nice too. Then I can roll back to a known point if I
 | |
| screw up ;).
 | |
| 
 | |
| In fact the SQL style interface doesn't have to be built in at all. Neither
 | |
| does the index have to be realtime. I suppose there could be an option to
 | |
| make it realtime if performance is not an issue. 
 | |
| 
 | |
| What could be done is to use some fast filesystem. Then we add tools to
 | |
| maintain indexes, for SQL style interfaces and other style interfaces.
 | |
| Checkpoints and rollbacks would be harder of course.
 | |
| 
 | |
| Cheerio,
 | |
| Link.
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M20329@postgresql.org Tue Mar 19 18:00:15 2002
 | |
| Return-path: <pgsql-hackers-owner+M20329@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g2K00EA02465
 | |
| 	for <pgman@candle.pha.pa.us>; Tue, 19 Mar 2002 19:00:14 -0500 (EST)
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 8C7164763EF; Tue, 19 Mar 2002 18:22:08 -0500 (EST)
 | |
| Received: from CopelandConsulting.Net (dsl-24293-ld.customer.centurytel.net [209.142.135.135])
 | |
| 	by postgresql.org (Postfix) with ESMTP id E4DAD475F1F
 | |
| 	for <pgsql-hackers@postgresql.org>; Tue, 19 Mar 2002 18:02:17 -0500 (EST)
 | |
| Received: from mouse.copelandconsulting.net (mouse.copelandconsulting.net [192.168.1.2])
 | |
| 	by CopelandConsulting.Net (8.10.1/8.10.1) with ESMTP id g2JN0jh13185;
 | |
| 	Tue, 19 Mar 2002 17:00:45 -0600 (CST)
 | |
| X-Trade-Id: <CCC.Tue, 19 Mar 2002 17:00:45 -0600 (CST).Tue, 19 Mar 2002 17:00:45 -0600 (CST).200203192300.g2JN0jh13185.g2JN0jh13185@CopelandConsulting.Net.
 | |
| Subject: Re: [HACKERS] Bitmap indexes?
 | |
| From: Greg Copeland <greg@CopelandConsulting.Net>
 | |
| To: Matthew Kirkwood <matthew@hairy.beasts.org>
 | |
| cc: Oleg Bartunov <oleg@sai.msu.su>,
 | |
|    PostgresSQL Hackers Mailing List <pgsql-hackers@postgresql.org>
 | |
| 	<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
 | |
| 	<Pine.LNX.4.33.0203192118140.29494-100000@sphinx.mythic-beasts.com>
 | |
| Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature";
 | |
| 	boundary="=-Ivchb84S75fOMzJ9DxwK"
 | |
| X-Mailer: Evolution/1.0.2 
 | |
| Date: 19 Mar 2002 17:00:53 -0600
 | |
| Message-ID: <1016578854.14670.450.camel@mouse.copelandconsulting.net>
 | |
| MIME-Version: 1.0
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| --=-Ivchb84S75fOMzJ9DxwK
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: quoted-printable
 | |
| 
 | |
| On Tue, 2002-03-19 at 15:30, Matthew Kirkwood wrote:
 | |
| > On Tue, 19 Mar 2002, Oleg Bartunov wrote:
 | |
| >
 | |
| > Sorry to reply over you, Oleg.
 | |
| >
 | |
| > > On 13 Mar 2002, Greg Copeland wrote:
 | |
| > >
 | |
| > > > One of the reasons why I originally started following the hackers list is
 | |
| > > > because I wanted to implement bitmap indexes.  I found in the archives,
 | |
| > > > the following link, http://www.it.iitb.ernet.in/~rvijay/dbms/proj/, which
 | |
| > > > was extracted from this,
 | |
| > > > http://groups.google.com/groups?hl=en&threadm=01C0EF67.5105D2E0.mascarm%40mascari.com&rnum=1&prev=/groups%3Fq%3Dbitmap%2Bindex%2Bgroup:comp.databases.postgresql.hackers%26hl%3Den%26selm%3D01C0EF67.5105D2E0.mascarm%2540mascari.com%26rnum%3D1, archive thread.
 | |
| >
 | |
| > For every case I have used a bitmap index on Oracle, a
 | |
| > partial index[0] made more sense (especially since it
 | |
| > could usefully be compound).
 | |
| 
 | |
| That's very true; however, bitmap indexes are often used where partial
 | |
| indexes may not work well.  It may be that you were trying to apply the cure
 | |
| for the wrong disease.  ;)
 | |
| 
 | |
| >
 | |
| > Our troublesome case (on Oracle) is a table of "events"
 | |
| > where maybe fifty to a couple of hundred are "published"
 | |
| > (ie. web-visible) at any time.  The events are categorised
 | |
| > by sport (about a dozen) and by "event type" (about five).
 | |
| > We never really query events except by PK or by sport/type/
 | |
| > published.
 | |
| 
 | |
| The reason why bitmap indexes are primarily used for DSS and data
 | |
| warehousing applications is that they are best used on very large
 | |
| to extremely large tables which have low cardinality (e.g., 10,000,000
 | |
| rows having 200 distinct values).  On top of that, bitmap indexes also
 | |
| tend to be much smaller than their "standard" cousins.  On large and
 | |
| very large tables, this can sometimes save gigs in index space alone
 | |
| (serious space benefit).  Plus, their small index size tends to result
 | |
| in much less I/O (serious speed benefit).  This, of course, can result
 | |
| in several orders of magnitude speed improvements when index scans are
 | |
| required.  As an added bonus, AND, OR, XOR and NOT predicates are
 | |
| exceptionally fast and, if implemented properly, can even take advantage
 | |
| of some 64-bit hardware for further speed improvements.  This, of
 | |
| course, further speeds lookups.  The primary downside is that inserts
 | |
| and updates to bitmap indexes are very costly (comparatively), which is,
 | |
| yet again, why they excel in read-only environments (DSS & data
 | |
| warehousing).
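
To make the AND/OR/NOT point concrete, here is a toy sketch of a bitmap index, assuming one Python integer per distinct column value with one bit per row. `BitmapIndex`, `rows` and the sample data are hypothetical illustrations; this is not how Oracle or any real engine stores or compresses its bitmaps.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one int per distinct value, one bit per row."""

    def __init__(self, values):
        self.nrows = len(values)
        self.bitmaps = defaultdict(int)
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row  # set the bit for this row

    def eq(self, value):
        """Bitmap of the rows whose column equals value (0 if none)."""
        return self.bitmaps.get(value, 0)

    def negate(self, bitmap):
        """NOT, masked so bits beyond the last row stay clear."""
        return ~bitmap & ((1 << self.nrows) - 1)


def rows(bitmap):
    """Decode a bitmap into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical five-row "events" table with two low-cardinality columns.
sport = BitmapIndex(["golf", "rugby", "golf", "tennis", "rugby"])
published = BitmapIndex(["Y", "N", "Y", "Y", "N"])

# sport = 'golf' AND published = 'Y' is a single bitwise AND.
golf_published = rows(sport.eq("golf") & published.eq("Y"))

# sport = 'rugby' OR NOT published = 'Y' is an OR plus a masked NOT.
rugby_or_hidden = rows(sport.eq("rugby") | published.negate(published.eq("Y")))
```

Each predicate combines whole bitmaps word by word, which is where the speed comes from. The cost mentioned above shows up too: inserting a row means touching the bitmap for its value and growing the row count every other bitmap is masked against.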
 | |
| 
 | |
| It should also be noted that RDBMSs, such as Oracle, often use multiple
 | |
| types of bitmap indexes.  This further impedes insert/update
 | |
| performance; however, the additional bitmap index types usually allow
 | |
| for range predicates while still making use of the bitmap index.  If I'm
 | |
| not mistaken, several other types of bitmaps are available, as well as
 | |
| many ways to encode and compress (RLE, quad compression, etc.) bitmap
 | |
| indexes, which further save on an already compact indexing scheme.
 | |
| 
 | |
| Given the proper problem domain, index bitmaps can be a big win.
 | |
| 
 | |
| >
 | |
| > We make a bitmap index on "published", and trust Oracle to
 | |
| > use it correctly, and hope that our other indexes are also
 | |
| > useful.
 | |
| >
 | |
| > On Postgres[1] we would make a partial compound index:
 | |
| >
 | |
| > create index ... on events(sport_id,event_type_id)
 | |
| > where published='Y';
 | |
| 
 | |
| 
 | |
| Generally speaking, bitmap indexes will not serve you very well on
 | |
| tables having low row counts or high cardinality, or where they are
 | |
| attached to tables which are primarily used in an OLTP capacity.
 | |
| Situations where you have a low row count and low cardinality or high
 | |
| row count and high cardinality tend to be better addressed by partial
 | |
| indexes, which seem to make much more sense.  In your example, it sounds
 | |
| like you did "the right thing"(tm).  ;)
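
The partial compound index from the quoted message can be tried out without a Postgres server. The sketch below uses SQLite through Python's standard `sqlite3` module as a stand-in (SQLite also accepts a `where` clause on `create index`). The index name, the extra `id` column and the generated rows are assumptions made for illustration; only the `events(sport_id, event_type_id) where published='Y'` shape comes from the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table events ("
    " id integer primary key,"
    " sport_id integer, event_type_id integer, published text)"
)

# Only rows with published = 'Y' get index entries, so the index stays
# small even when the table is large and mostly unpublished.
conn.execute(
    "create index events_published_idx"
    " on events(sport_id, event_type_id) where published = 'Y'"
)

# Hypothetical data: 1000 events, about a dozen sports, five event types,
# and only every hundredth event published.
sample = [(i, i % 12, i % 5, "Y" if i % 100 == 0 else "N") for i in range(1000)]
conn.executemany("insert into events values (?, ?, ?, ?)", sample)

# The sport/type/published lookup described in the thread.
published_ids = [
    row[0]
    for row in conn.execute(
        "select id from events"
        " where published = 'Y' and sport_id = ? and event_type_id = ?"
        " order by id",
        (0, 0),
    )
]
```

Whether the planner actually picks the partial index depends on the engine and its statistics, so treat this as a demonstration of the syntax and the query shape rather than a performance claim.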
 | |
| 
 | |
| 
 | |
| Greg
 | |
| 
 | |
| 
 | |
| --=-Ivchb84S75fOMzJ9DxwK
 | |
| Content-Type: application/pgp-signature; name=signature.asc
 | |
| Content-Description: This is a digitally signed message part
 | |
| 
 | |
| -----BEGIN PGP SIGNATURE-----
 | |
| Version: GnuPG v1.0.6 (GNU/Linux)
 | |
| Comment: For info see http://www.gnupg.org
 | |
| 
 | |
| iD8DBQA8l8Ml4lr1bpbcL6kRAhldAJ9Aoi9dwm1OteZjySfsd1o42trWLACfegQj
 | |
| OEV6eO8MnBSlbJMHiQ08gNE=
 | |
| =PQvW
 | |
| -----END PGP SIGNATURE-----
 | |
| 
 | |
| --=-Ivchb84S75fOMzJ9DxwK--
 | |
| 
 | |
| 
 |