mirror of https://github.com/postgres/postgres.git, synced 2025-10-25 13:17:41 +03:00
| From cjs@cynic.net Thu Jun 20 22:18:27 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195
 | |
| 	for <pgman@candle.pha.pa.us>; Thu, 20 Jun 2002 22:18:26 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC)
 | |
| Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL-development <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
 | |
| Message-ID: <Pine.NEB.4.43.0206211106390.437-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On Thu, 20 Jun 2002, Bruce Momjian wrote:
 | |
| 
 | |
| > > MS SQL Server has an interesting way of dealing with this. They have a
 | |
| > > "torn" bit in each 512-byte chunk of a page, and this bit is set the
 | |
| > > same for each chunk. When they are about to write out a page, they first
 | |
| > > flip all of the torn bits and then do the write. If the write does not
 | |
| > > complete due to a system crash or whatever, this can be detected later
 | |
| > > because the torn bits won't match across the entire page.
 | |
| >
 | |
| > I was wondering, how does knowing the block is corrupt help MS SQL?
 | |
| 
 | |
| I'm trying to recall, but I can't off hand. I'll have to look it
 | |
| up in my Inside SQL Server book, which is at home right now,
 | |
| unfortunately. I'll bring the book into work and let you know the
 | |
| details later.
 | |
| 
 | |
| > Right now, we write changed pages to WAL, then later write them to disk.
 | |
| 
 | |
| Ah. You write the entire page? MS writes only the changed tuple.
 | |
| And DB2, in fact, goes one better and writes only the part of the
 | |
| tuple up to the change, IIRC. Thus, if you put smaller and/or more
 | |
| frequently changed columns first, you'll have smaller logs.
 | |
| 
 | |
| > I have always been looking for a way to prevent these WAL writes.  The
 | |
| > 512-byte bit seems interesting, but how does it help?
 | |
| 
 | |
| Well, this would at least let you reduce the write to the 512-byte
 | |
| chunk that changed, rather than writing the entire 8K page.
 | |
| 
 | |
| > And how does the bit help them with partial block writes?  Is the bit at
 | |
| > the end of the block?  Is that reliable?
 | |
| 
 | |
| The bit is somewhere within every 512 byte "disk page" within the
 | |
| 8192 byte "filesystem/database page." So an 8KB page is divided up
 | |
| like this (only 8 of the 16 chunks are drawn):
 | |
| 
 | |
|     | <----------------------- 8 KB ----------------------> |
 | |
| 
 | |
|     | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b |
 | |
| 
 | |
| Thus, the tear bits start out like this:
 | |
| 
 | |
|     |  0   |  0   |  0   |  0   |  0   |  0   |  0   |  0   |
 | |
| 
 | |
| After a successful write of the entire page, you have this:
 | |
| 
 | |
|     |  1   |  1   |  1   |  1   |  1   |  1   |  1   |  1   |
 | |
| 
 | |
| If the write is unsuccessful, you end up with something like this:
 | |
| 
 | |
|     |  1   |  1   |  1   |  1   |  1   |  0   |  0   |  0   |
 | |
| 
 | |
| And now you know which parts of your page got written, and which
 | |
| parts didn't.
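
A minimal sketch of the torn-bit check described above, in C. It assumes an 8192-byte page made of sixteen 512-byte sectors (8192 / 512 = 16) and, purely for illustration, keeps the torn bit in the low bit of each sector's last byte. That bit position and all names here are hypothetical; a real implementation would have to reserve the bit or save the displaced data bit somewhere, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DB_PAGE_SIZE   8192
#define DB_SECTOR_SIZE 512
#define DB_NSECTORS    (DB_PAGE_SIZE / DB_SECTOR_SIZE)  /* 16 sectors per page */

/* Hypothetical layout: the torn bit is the low bit of each sector's last byte. */
size_t torn_bit_offset(int sector)
{
    assert(sector >= 0 && sector < DB_NSECTORS);
    return (size_t) sector * DB_SECTOR_SIZE + (DB_SECTOR_SIZE - 1);
}

/* Flip the torn bit in every sector; done just before the page is written. */
void page_flip_torn_bits(uint8_t *page)
{
    for (int s = 0; s < DB_NSECTORS; s++)
        page[torn_bit_offset(s)] ^= 0x01;
}

/* Return 1 if the torn bits disagree, i.e. only part of the page was written. */
int page_is_torn(const uint8_t *page)
{
    uint8_t first = page[torn_bit_offset(0)] & 0x01;

    for (int s = 1; s < DB_NSECTORS; s++)
        if ((page[torn_bit_offset(s)] & 0x01) != first)
            return 1;
    return 0;
}
```

Note that the check only detects the tear. It says nothing about how to repair the page, which is the question raised later in the thread.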
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sat Jun 22 04:41:54 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net ([63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5M8fpF04711
 | |
| 	for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 04:41:53 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 415C8F820; Sat, 22 Jun 2002 08:41:33 +0000 (UTC)
 | |
| Date: Sat, 22 Jun 2002 17:41:30 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
 | |
|    mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL-development <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <19332.1024668861@sss.pgh.pa.us>
 | |
| Message-ID: <Pine.NEB.4.43.0206221731130.1091-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On Fri, 21 Jun 2002, Tom Lane wrote:
 | |
| 
 | |
| > Curt Sampson <cjs@cynic.net> writes:
 | |
| > > And now you know which parts of your page got written, and which
 | |
| > > parts didn't.
 | |
| >
 | |
| > Yes ... and what do you *do* about it?
 | |
| 
 | |
| Ok. Here's the extract from _Inside Microsoft SQL Server 7.0_, page 207:
 | |
| 
 | |
|     torn page detection   When TRUE, this option causes a bit to be
 | |
| 	flipped for each 512-byte sector in a database page (8 KB)
 | |
| 	whenever the page is written to disk.  This option allows
 | |
| 	SQL Server to detect incomplete I/O operations caused by
 | |
| 	power failures or other system outages. If a bit is in the
 | |
| 	wrong state when the page is later read by SQL Server, this
 | |
| 	means the page was written incorrectly; a torn page has
 | |
| 	been detected. Although SQL Server database pages are 8
 | |
| 	KB, disks perform I/O operations using 512-byte sectors.
 | |
| 	Therefore, 16 sectors are written per database page.  A
 | |
| 	torn page can occur if the system crashes (for example,
 | |
| 	because of power failure) between the time the operating
 | |
| 	system writes the first 512-byte sector to disk and the
 | |
| 	completion of the 8-KB I/O operation.  If the first sector
 | |
| 	of a database page is successfully written before the crash,
 | |
| 	it will appear that the database page on disk was updated,
 | |
| 	although it might not have succeeded. Using battery-backed
 | |
| 	disk caches can ensure that data is [sic] successfully
 | |
| 	written to disk or not written at all. In this case, don't
 | |
| 	set torn page detection to TRUE, as it isn't needed. If a
 | |
| 	torn page is detected, the database will need to be restored
 | |
| 	from backup because it will be physically inconsistent.
 | |
| 
 | |
| As I understand it, this is not a problem for postgres because the
 | |
| entire page is written to the log. So postgres is safe, but quite
 | |
| inefficient. (It would be much more efficient to write just the
 | |
| changed tuple, or even just the changed values within the tuple,
 | |
| to the log.)
 | |
| 
 | |
| Adding these torn bits would allow postgres at least to write to
 | |
| the log just the 512-byte sectors that have changed, rather than
 | |
| the entire 8 KB page.
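
To make the size argument concrete, here is a hedged sketch in C of how one might work out which 512-byte sectors differ between the old and new image of a page, and how many bytes would then go to the log. The function names are hypothetical and this is not how PostgreSQL's WAL code is organized; it only illustrates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LOG_SECTOR_SIZE = 512, LOG_NSECTORS = 16 };   /* 16 * 512 = 8192 bytes */

/* Return a 16-bit mask; bit s is set when sector s differs between the images. */
uint16_t changed_sector_mask(const uint8_t *before, const uint8_t *after)
{
    uint16_t mask = 0;

    assert(before != NULL && after != NULL);
    for (int s = 0; s < LOG_NSECTORS; s++) {
        size_t off = (size_t) s * LOG_SECTOR_SIZE;

        if (memcmp(before + off, after + off, LOG_SECTOR_SIZE) != 0)
            mask |= (uint16_t) (1u << s);
    }
    return mask;
}

/* Bytes of page data that would be logged if only changed sectors are written. */
size_t logged_bytes(uint16_t mask)
{
    size_t n = 0;

    for (int s = 0; s < LOG_NSECTORS; s++)
        if (mask & (1u << s))
            n += LOG_SECTOR_SIZE;
    return n;
}
```

A change confined to one tuple would typically touch one or two sectors, so the logged payload would be 512 or 1024 bytes instead of 8192.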
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002
 | |
| Return-path: <pgsql-hackers-owner+M24060@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014
 | |
| 	for <pgman@candle.pha.pa.us>; Sat, 22 Jun 2002 18:31:20 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT)
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 47AD2475935
 | |
| 	for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:28:40 -0400 (EDT)
 | |
| Received: from hades.usol.com (hades.usol.com [208.232.58.41])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 1D5DA476166
 | |
| 	for <pgsql-hackers@postgresql.org>; Sat, 22 Jun 2002 18:23:16 -0400 (EDT)
 | |
| Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344;
 | |
| 	Sat, 22 Jun 2002 18:22:25 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
 | |
|    mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
 | |
|    Tom Lane <tgl@sss.pgh.pa.us>
 | |
| In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us>
 | |
| References: <200206210158.g5L1wFk20118@candle.pha.pa.us>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Message-ID: <1024784514.1793.242.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 22 Jun 2002 18:22:58 -0400
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: ORr
 | |
| 
 | |
| On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote:
 | |
| > I was wondering, how does knowing the block is corrupt help MS SQL? 
 | |
| > Right now, we write changed pages to WAL, then later write them to disk.
 | |
| > I have always been looking for a way to prevent these WAL writes.  The
 | |
| > 512-byte bit seems interesting, but how does it help?
 | |
| > 
 | |
| > And how does the bit help them with partial block writes?  Is the bit at
 | |
| > the end of the block?  Is that reliable?
 | |
| > 
 | |
| 
 | |
| My understanding of this is as follows:
 | |
| 
 | |
| 1) On most commercial systems, if you get a corrupted block (from
 | |
| partial write or whatever) you need to restore the file(s) from the most
 | |
| recent backup, and replay the log from the log archive (usually only the
 | |
| damaged files will be written to during replay). 
 | |
| 
 | |
| 2) If you can't deal with the downtime to recover the file, then EMC,
 | |
| Sun, or IBM will sell you an expensive disk array with an NVRAM cache
 | |
| that will do atomic writes. Some plain-vanilla SCSI disks are also
 | |
| capable of atomic writes, though usually they don't use NVRAM to do it. 
 | |
| 
 | |
| The database must then make sure that each page-write gets translated
 | |
| into exactly one SCSI-level write. This is one reason why ORACLE and
 | |
| Sybase recommend that you use raw disk partitions for high availability.
 | |
| Some operating systems support this through the filesystem, but it is OS
 | |
| dependent. I think Solaris 7 & 8 have support for this, but I'm not sure.
 | |
| 
 | |
| PostgreSQL has trouble because it can neither archive logs for replay,
 | |
| nor use raw disk partitions.
 | |
| 
 | |
| 
 | |
| One other point:
 | |
| 
 | |
| Page pre-image logging is fundamentally the same as what Jim Gray's
 | |
| book[1] would call "careful writes". I don't believe they should be in
 | |
| the XLOG, because we never need to keep the pre-images after we're sure
 | |
| the buffer has made it to the disk. Instead, we should have the buffer
 | |
| IO routines implement ping-pong writes of some kind if we want
 | |
| protection from partial writes.
 | |
| 
 | |
| 
 | |
| Does any of this make sense?
 | |
| 
 | |
| 
 | |
| 
 | |
| ;jrnield
 | |
| 
 | |
| 
 | |
| [1] Gray, J. and Reuter, A. (1993). "Transaction Processing: Concepts
 | |
| 	and Techniques". Morgan Kaufmann.
 | |
| 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 4: Don't 'kill -9' the postmaster
 | |
| 
 | |
| From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002
 | |
| Return-path: <pgsql-hackers-owner+M24068@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 08:40:27 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT)
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 0D57847592A
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:38 -0400 (EDT)
 | |
| Received: from hades.usol.com (hades.usol.com [208.232.58.41])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 75326475876
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 08:37:36 -0400 (EDT)
 | |
| Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111;
 | |
| 	Sun, 23 Jun 2002 08:37:23 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Michael Loftis <mloftis@wgops.com>,
 | |
|    mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
 | |
|    Tom Lane <tgl@sss.pgh.pa.us>
 | |
| In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us>
 | |
| References: <200206222317.g5MNHBn23427@candle.pha.pa.us>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 23 Jun 2002 08:37:53 -0400
 | |
| Message-ID: <1024835880.1793.264.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
 | |
| > J. R. Nield wrote:
 | |
| > > One other point:
 | |
| > > 
 | |
| > > Page pre-image logging is fundamentally the same as what Jim Gray's
 | |
| > > book[1] would call "careful writes". I don't believe they should be in
 | |
| > > the XLOG, because we never need to keep the pre-images after we're sure
 | |
| > > the buffer has made it to the disk. Instead, we should have the buffer
 | |
| > > IO routines implement ping-pong writes of some kind if we want
 | |
| > > protection from partial writes.
 | |
| > 
 | |
| > Ping-pong writes to where?  We have to fsync, and rather than fsync that
 | |
| > area and WAL, we just do WAL.  Not sure about a win there.
 | |
| > 
 | |
| 
 | |
| The key question is: do we have some method to ensure that the OS
 | |
| doesn't do the writes in parallel?
 | |
| 
 | |
| If the OS will ensure that one of the two block writes of a ping-pong
 | |
| completes before the other starts, then we don't need to fsync() at 
 | |
| all. 
 | |
| 
 | |
| The only thing we are protecting against is the possibility of both
 | |
| writes being partial. If neither is done, that's fine because WAL will
 | |
| protect us. If the first write is partial, we will detect that and use
 | |
| the old data from the other, then recover from WAL. If the first is
 | |
| complete but the second is partial, then we detect that and use the
 | |
| newer block from the first write. If the second is complete but the
 | |
| first is partial, we detect that and use the newer block from the second
 | |
| write.
 | |
| 
 | |
| So does anyone know a way to prevent parallel writes in one of the
 | |
| common unix standards? Do they say anything about this?
 | |
| 
 | |
| It would seem to me that if the same process does both ping-pong writes,
 | |
| then there should be a cheap way to enforce a serial order. I could be
 | |
| wrong though.
 | |
| 
 | |
| As to where the first block of the ping-pong should go, maybe we could
 | |
| reserve a file with nBlocks space for them, and write the information
 | |
| about which block was being written to the XLOG for use in recovery.
 | |
| There are many other ways to do it.
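
The cases above reduce to a small decision table at recovery time. The following is a hedged sketch in C, assuming the first ping-pong write goes to a scratch location and the second to the block's home location, and that each copy has some torn-write detector (a checksum or torn bits) whose result is passed in as a boolean. The enum and function names are hypothetical, not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    USE_HOME_THEN_REPLAY_WAL,  /* home copy intact (old or new); WAL brings it forward */
    COPY_SCRATCH_TO_HOME,      /* home copy torn, scratch copy holds the newer image */
    BOTH_TORN                  /* the one case serialized writes must rule out */
} PingPongAction;

/* Decide what recovery should do, given which of the two copies is torn. */
PingPongAction pingpong_recover(bool scratch_torn, bool home_torn)
{
    if (!home_torn)
        return USE_HOME_THEN_REPLAY_WAL;
    if (!scratch_torn)
        return COPY_SCRATCH_TO_HOME;
    return BOTH_TORN;
}

/* Pick the image to start WAL replay from; must not be called for BOTH_TORN. */
const unsigned char *pingpong_pick(PingPongAction action,
                                   const unsigned char *home,
                                   const unsigned char *scratch)
{
    assert(action != BOTH_TORN);
    return action == COPY_SCRATCH_TO_HOME ? scratch : home;
}
```

The sketch makes the dependency on write ordering visible: BOTH_TORN can only arise if the two writes were in flight at the same time, which is why the question of serializing them matters.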
 | |
| 
 | |
| ;jrnield
 | |
| 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sun Jun 23 09:33:29 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 09:33:28 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC)
 | |
| Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, Michael Loftis <mloftis@wgops.com>,
 | |
|    mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>,
 | |
|    Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain>
 | |
| Message-ID: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On 23 Jun 2002, J. R. Nield wrote:
 | |
| 
 | |
| > On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote:
 | |
| > > J. R. Nield wrote:
 | |
| > > > One other point:
 | |
| > > >
 | |
| > > > Page pre-image logging is fundamentally the same as what Jim Gray's
 | |
| > > > book[1] would call "careful writes". I don't believe they should be in
 | |
| > > > the XLOG, because we never need to keep the pre-images after we're sure
 | |
| > > > the buffer has made it to the disk. Instead, we should have the buffer
 | |
| > > > IO routines implement ping-pong writes of some kind if we want
 | |
| > > > protection from partial writes.
 | |
| > >
 | |
| > > Ping-pong writes to where?  We have to fsync, and rather than fsync that
 | |
| > > area and WAL, we just do WAL.  Not sure about a win there.
 | |
| 
 | |
| Presumably the win is that, "we never need to keep the pre-images
 | |
| after we're sure the buffer has made it to the disk." So the
 | |
| pre-image log can be completely ditched when we shut down the
 | |
| server, do a full system sync, or whatever. This keeps the log file
 | |
| size down, which means faster recovery, less to back up (when we
 | |
| start getting transaction logs that can be backed up), etc.
 | |
| 
 | |
| This should also allow us to disable completely the ping-pong writes
 | |
| if we have a disk subsystem that we trust. (E.g., a disk array with
 | |
| battery backed memory.) That would, in theory, produce a nice little
 | |
| performance increase when lots of inserts and/or updates are being
 | |
| committed, as we have much, much less to write to the log file.
 | |
| 
 | |
| Are there stats that track, e.g., the bandwidth of writes to the
 | |
| log file? I'd be interested in knowing just what kind of savings
 | |
| one might see by doing this.
 | |
| 
 | |
| > The key question is: do we have some method to ensure that the OS
 | |
| > doesn't do the writes in parallel?...
 | |
| > It would seem to me that if the same process does both ping-pong writes,
 | |
| > then there should be a cheap way to enforce a serial order. I could be
 | |
| > wrong though.
 | |
| 
 | |
| Well, whether or not there's a cheap way depends on whether you consider
 | |
| fsync to be cheap. :-)
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002
 | |
| Return-path: <pgsql-hackers-owner+M24073@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 11:19:59 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT)
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id E2353475C4B
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:47 -0400 (EDT)
 | |
| Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 746F8475AEA
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 11:19:46 -0400 (EDT)
 | |
| Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | |
| 	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464;
 | |
| 	Sun, 23 Jun 2002 11:19:15 -0400 (EDT)
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net> 
 | |
| References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net>
 | |
| Comments: In-reply-to Curt Sampson <cjs@cynic.net>
 | |
| 	message dated "Sun, 23 Jun 2002 22:33:07 +0900"
 | |
| Date: Sun, 23 Jun 2002 11:19:15 -0400
 | |
| Message-ID: <8461.1024845555@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| Status: OR
 | |
| 
 | |
| Curt Sampson <cjs@cynic.net> writes:
 | |
| > This should also allow us to disable completely the ping-pong writes
 | |
| > if we have a disk subsystem that we trust.
 | |
| 
 | |
| If we have a disk subsystem we trust, we just disable fsync on the
 | |
| WAL and the performance issue largely goes away.
 | |
| 
 | |
| I concur with Bruce: the reason we keep page images in WAL is to
 | |
| minimize the number of places we have to fsync, and thus the amount of
 | |
| head movement required for a commit.  Putting the page images elsewhere
 | |
| cannot be a win AFAICS.
 | |
| 
 | |
| > Well, whether or not there's a cheap way depends on whether you consider
 | |
| > fsync to be cheap. :-)
 | |
| 
 | |
| It's never cheap :-(
 | |
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sun Jun 23 12:10:44 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 12:10:43 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC)
 | |
| Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
 | |
| Message-ID: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On Sun, 23 Jun 2002, Tom Lane wrote:
 | |
| 
 | |
| > Curt Sampson <cjs@cynic.net> writes:
 | |
| > > This should also allow us to disable completely the ping-pong writes
 | |
| > > if we have a disk subsystem that we trust.
 | |
| >
 | |
| > If we have a disk subsystem we trust, we just disable fsync on the
 | |
| > WAL and the performance issue largely goes away.
 | |
| 
 | |
| No, you can't do this. If you don't fsync(), there's no guarantee
 | |
| that the write ever got out of the computer's buffer cache and to
 | |
| the disk subsystem in the first place.
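
For illustration, a minimal sketch in C of the write-then-fsync() pattern this point is about: write() alone only hands the data to the OS buffer cache, and fsync() is what asks the OS to push it on to the storage device. The function name and the append-to-a-path interface are hypothetical, chosen only to keep the example self-contained.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose fsync() under strict ISO C modes */
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Append len bytes to path and force them out of the OS cache. Returns 0 or -1. */
int append_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd;
    int rc;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1;

    /* write() may be interrupted or write fewer bytes than asked; loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /* Without this call the data may still sit only in the OS buffer cache. */
    rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

How far "to the device" reaches depends on the storage: with an array that has non-volatile cache, fsync() can return once the data is in that cache.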
 | |
| 
 | |
| > I concur with Bruce: the reason we keep page images in WAL is to
 | |
| > minimize the number of places we have to fsync, and thus the amount of
 | |
| > head movement required for a commit.
 | |
| 
 | |
| An fsync() does not necessarily cause head movement, or any real
 | |
| disk writes at all. If you're writing to many external disk arrays,
 | |
| for example, the fsync() ensures that the data are in the disk array's
 | |
| non-volatile or UPS-backed RAM, no more. The array might hold the data
 | |
| for quite some time before it actually writes it to disk.
 | |
| 
 | |
| But you're right that it's faster, if you're going to write out changed
 | |
| pages and have the ping-pong file and the transaction log on the
 | |
| same disk, just to write out the entire page to the transaction log.
 | |
| 
 | |
| So what we would really need to implement, if we wanted to be more
 | |
| efficient with trusted disk subsystems, would be the option of writing
 | |
| to the log only the changed row or changed part of the row, or writing
 | |
| the entire changed page. I don't know how hard this would be....
 | |
| 
 | |
| > > Well, whether or not there's a cheap way depends on whether you consider
 | |
| > > fsync to be cheap. :-)
 | |
| >
 | |
| > It's never cheap :-(
 | |
| 
 | |
| Actually, with a good external RAID system with non-volatile RAM,
 | |
| it's a good two to four orders of magnitude cheaper than writing to a
 | |
| directly connected disk that doesn't claim the write is complete until
 | |
| it's physically on disk. I'd say that it qualifies as at least "not
 | |
| expensive." Not that you want to do it more often than you have to
 | |
| anyway....
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From jrnield@usol.com Sun Jun 23 13:56:59 2002
 | |
| Return-path: <jrnield@usol.com>
 | |
| Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 13:56:58 -0400 (EDT)
 | |
| Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549;
 | |
| 	Sun, 23 Jun 2002 13:56:49 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker
 | |
|   <pgsql-hackers@postgresql.org>
 | |
| In-Reply-To: <8461.1024845555@sss.pgh.pa.us>
 | |
| References: <Pine.NEB.4.43.0206232223300.2100-100000@angelic.cynic.net> 
 | |
| 	<8461.1024845555@sss.pgh.pa.us>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 23 Jun 2002 13:57:19 -0400
 | |
| Message-ID: <1024855044.1793.414.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Status: ORr
 | |
| 
 | |
| On Sun, 2002-06-23 at 11:19, Tom Lane wrote: 
 | |
| > Curt Sampson <cjs@cynic.net> writes:
 | |
| > > This should also allow us to disable completely the ping-pong writes
 | |
| > > if we have a disk subsystem that we trust.
 | |
| > 
 | |
| > If we have a disk subsystem we trust, we just disable fsync on the
 | |
| > WAL and the performance issue largely goes away.
 | |
| 
 | |
| It wouldn't work because the OS buffering interferes, and we need those
 | |
| WAL records on disk up to the greatest LSN of the Buffer we will be writing.
 | |
| 
 | |
| 
 | |
| We already buffer WAL ourselves. We also already buffer regular pages.
 | |
| Whenever we write a Buffer out of the buffer cache, it is because we
 | |
| really want that page on disk and wanted to start an IO. If that's not
 | |
| the case, then we should have more block buffers! 
 | |
| 
 | |
| So since we have all this buffering designed especially to meet our
 | |
| needs, and since the OS buffering is in the way, can someone explain to
 | |
| me why postgresql would ever open a file without the O_DSYNC flag if the
 | |
| platform supports it? 
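
To make the question concrete, here is a minimal C sketch of opening a file with O_DSYNC where the platform defines it, and falling back to an explicit fsync() where it does not. The helper names open_for_sync_writes and write_durably are hypothetical and are not PostgreSQL functions; the open flags and file mode are assumptions chosen for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a PostgreSQL function): open a file so that each
 * write() returns only after the data has reached stable storage, if the
 * platform offers O_DSYNC.  *needs_fsync is set to 1 when the caller must
 * fall back to an explicit fsync() after writing.
 */
int
open_for_sync_writes(const char *path, int *needs_fsync)
{
    int flags = O_RDWR | O_CREAT;

    assert(path != NULL && needs_fsync != NULL);
#ifdef O_DSYNC
    flags |= O_DSYNC;       /* synchronous data writes, no separate sync call */
    *needs_fsync = 0;
#else
    *needs_fsync = 1;       /* old code path: write(), then fsync() */
#endif
    return open(path, flags, 0600);
}

/* Write a buffer and make it durable using whichever method applies. */
int
write_durably(int fd, const void *buf, size_t len, int needs_fsync)
{
    ssize_t n = write(fd, buf, len);

    if (n < 0 || (size_t) n != len)
        return -1;
    if (needs_fsync && fsync(fd) != 0)
        return -1;
    return 0;
}
```

Both code paths have to be kept, since not every platform defines O_DSYNC.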
 | |
| 
 | |
| 
 | |
| 
 | |
| > 
 | |
| > I concur with Bruce: the reason we keep page images in WAL is to
 | |
| > minimize the number of places we have to fsync, and thus the amount of
 | |
| > head movement required for a commit.  Putting the page images elsewhere
 | |
| > cannot be a win AFAICS.
 | |
| 
 | |
| 
 | |
| Why not put all the page images in a single pre-allocated file and treat
 | |
| it as a ring? How could this be any worse than flushing them in the WAL
 | |
| log? 
 | |
| 
 | |
| Maybe fsync would be slower with two files, but I don't see how
 | |
| fdatasync would be, and most platforms support that. 
 | |
| 
 | |
| What would improve performance would be to have a dbflush process that
 | |
| would work in the background flushing buffers in groups and trying to
 | |
| stay ahead of ReadBuffer requests. That would let you do the temporary
 | |
| side of the ping-pong as a huge O_DSYNC writev(2) request (or
 | |
| fdatasync() once) and then write out the other buffers. It would also
 | |
| tend to prevent the other backends from blocking on write requests. 
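
A rough C sketch of the "temporary side" step described above: gather several dirty page images into one writev(2) call on the side file, then force them to disk with a single fdatasync(). Everything here is an assumption for illustration: the names flush_page_batch and open_side_file, the 8192-byte page size, and the batch limit. A real flusher would also retry short writes and track which block each image belongs to.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define PAGE_SZ   8192   /* assumed page size */
#define MAX_BATCH 16     /* assumed batch limit, well under IOV_MAX */

/* Open (and truncate) the hypothetical side file that receives page images. */
int
open_side_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
}

/*
 * Hypothetical flusher step: write `npages` page images to `fd` with one
 * gathered write, then sync once.  Returns 0 on success, -1 on error or
 * short write.
 */
int
flush_page_batch(int fd, char *const pages[], int npages)
{
    struct iovec iov[MAX_BATCH];
    ssize_t written;
    int i;

    assert(npages > 0 && npages <= MAX_BATCH);
    for (i = 0; i < npages; i++)
    {
        iov[i].iov_base = pages[i];
        iov[i].iov_len = PAGE_SZ;
    }

    written = writev(fd, iov, npages);      /* one system call for the batch */
    if (written != (ssize_t) npages * PAGE_SZ)
        return -1;

    return fdatasync(fd);                   /* one sync for the whole batch */
}
```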
 | |
| 
 | |
| A dbflush could also support aio_read/aio_write on platforms like
 | |
| Solaris and WindowsNT that support it. 
 | |
| 
 | |
| Am I correct that right now, buffers only get written when they get
 | |
| removed from the free list for reuse? So a released dirty buffer will
 | |
| sit in the buffer free list until it becomes the Least Recently Used
 | |
| buffer, and will then cause a backend to block for IO in a call to
 | |
| BufferAlloc? 
 | |
| 
 | |
| This would explain why we like using the OS buffer cache, and why our
 | |
| performance is troublesome when we have to do synchronous IO writes, and
 | |
| why fsync() takes so long to complete. All of the backends block for
 | |
| each call to BufferAlloc() after a large table update by a single
 | |
| backend, and then the OS buffers are always full of our "written" data. 
 | |
| 
 | |
| Am I reading the bufmgr code correctly? I already found an imaginary
 | |
| race condition there once :-) 
 | |
| 
 | |
| ;jnield 
 | |
| 
 | |
| 
 | |
| > 
 | |
| > > Well, whether or not there's a cheap way depends on whether you consider
 | |
| > > fsync to be cheap. :-)
 | |
| > 
 | |
| > It's never cheap :-(
 | |
| > 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sun Jun 23 14:15:15 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:15:15 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC)
 | |
| Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
 | |
| Message-ID: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: ORr
 | |
| 
 | |
| On 23 Jun 2002, J. R. Nield wrote:
 | |
| 
 | |
| > So since we have all this buffering designed especially to meet our
 | |
| > needs, and since the OS buffering is in the way, can someone explain to
 | |
| > me why postgresql would ever open a file without the O_DSYNC flag if the
 | |
| > platform supports it?
 | |
| 
 | |
| It's more code, if there are platforms out there that don't support
 | |
| O_DSYNC. (We still have to keep the old fsync code.) On the other hand,
 | |
| O_DSYNC could save us a disk arm movement over fsync() because it
 | |
| appears to me that fsync is also going to force a metadata update, which
 | |
| means that the inode blocks have to be written as well.
 | |
| 
 | |
| > Maybe fsync would be slower with two files, but I don't see how
 | |
| > fdatasync would be, and most platforms support that.
 | |
| 
 | |
| Because, if both files are on the same disk, you still have to move
 | |
| the disk arm from the cylinder at the current log file write point
 | |
| to the cylinder at the current ping-pong file write point. And then back
 | |
| again to the log file write point cylinder.
 | |
| 
 | |
| In the end, having a ping-pong file as well seems to me unnecessary
 | |
| complexity, especially when anyone interested in really good
 | |
| performance is going to buy a disk subsystem that guarantees no
 | |
| torn pages and thus will want to turn off the ping-pong file writes
 | |
| entirely, anyway.
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From jrnield@usol.com Sun Jun 23 14:14:51 2002
 | |
| Return-path: <jrnield@usol.com>
 | |
| Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 14:14:50 -0400 (EDT)
 | |
| Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287;
 | |
| 	Sun, 23 Jun 2002 14:14:46 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| cc: Tom Lane <tgl@sss.pgh.pa.us>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker
 | |
|   <pgsql-hackers@postgresql.org>
 | |
| In-Reply-To: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
 | |
| References: <Pine.NEB.4.43.0206240057070.2100-100000@angelic.cynic.net>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 23 Jun 2002 14:15:17 -0400
 | |
| Message-ID: <1024856120.3054.418.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Status: OR
 | |
| 
 | |
| On Sun, 2002-06-23 at 12:10, Curt Sampson wrote:
 | |
| > 
 | |
| > So what we would really need to implement, if we wanted to be more
 | |
| > efficient with trusted disk subsystems, would be the option of writing
 | |
| > to the log only the changed row or changed part of the row, or writing
 | |
| > the entire changed page. I don't know how hard this would be....
 | |
| > 
 | |
| We already log that stuff. The page images are in addition to the
 | |
| "Logical Changes", so we could just stop logging the page images.
 | |
| 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002
 | |
| Return-path: <pgsql-hackers-owner+M24100@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OHDeF08564
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 13:13:40 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 05602475CBE; Mon, 24 Jun 2002 13:11:10 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 13:11:10 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 929A247633B; Mon, 24 Jun 2002 09:26:54 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 962C147631A
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:31:43 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:31:43 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id C112D475C3C
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:35:20 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5NJYtL07449;
 | |
| 	Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206231934.g5NJYtL07449@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| Date: Sun, 23 Jun 2002 15:34:55 -0400 (EDT)
 | |
| cc: Tom Lane <tgl@sss.pgh.pa.us>, Curt Sampson <cjs@cynic.net>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| J. R. Nield wrote:
 | |
| > So since we have all this buffering designed especially to meet our
 | |
| > needs, and since the OS buffering is in the way, can someone explain to
 | |
| > me why postgresql would ever open a file without the O_DSYNC flag if the
 | |
| > platform supports it? 
 | |
| 
 | |
| We sync only WAL, not the other pages, except for the sync() call we do
 | |
| during checkpoint when we discard old WAL files.
 | |
| 
 | |
| > > I concur with Bruce: the reason we keep page images in WAL is to
 | |
| > > minimize the number of places we have to fsync, and thus the amount of
 | |
| > > head movement required for a commit.  Putting the page images elsewhere
 | |
| > > cannot be a win AFAICS.
 | |
| > 
 | |
| > 
 | |
| > Why not put all the page images in a single pre-allocated file and treat
 | |
| > it as a ring? How could this be any worse than flushing them in the WAL
 | |
| > log? 
 | |
| > 
 | |
| > Maybe fsync would be slower with two files, but I don't see how
 | |
| > fdatasync would be, and most platforms support that. 
 | |
| 
 | |
| We have fdatasync option for WAL in postgresql.conf.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 4: Don't 'kill -9' the postmaster
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002
 | |
| Return-path: <pgsql-hackers-owner+M24091@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:54:22 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 09:48:51 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 2769E4762E3
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:27:39 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:27:39 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id ED459475C61
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 15:37:08 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642;
 | |
| 	Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <Pine.NEB.4.43.0206240307550.511-100000@angelic.cynic.net>
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT)
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| Curt Sampson wrote:
 | |
| > On 23 Jun 2002, J. R. Nield wrote:
 | |
| > 
 | |
| > > So since we have all this buffering designed especially to meet our
 | |
| > > needs, and since the OS buffering is in the way, can someone explain to
 | |
| > > me why postgresql would ever open a file without the O_DSYNC flag if the
 | |
| > > platform supports it?
 | |
| > 
 | |
| > It's more code, if there are platforms out there that don't support
 | |
| > O_DSYNC. (We still have to keep the old fsync code.) On the other hand,
 | |
| > O_DSYNC could save us a disk arm movement over fsync() because it
 | |
| > appears to me that fsync is also going to force a metadata update, which
 | |
| > means that the inode blocks have to be written as well.
 | |
| 
 | |
| Again, see postgresql.conf:
 | |
| 
 | |
| #wal_sync_method = fsync        # the default varies across platforms:
 | |
| #                               # fsync, fdatasync, open_sync, or open_datasync
 | |
| 
 | |
| > 
 | |
| > > Maybe fsync would be slower with two files, but I don't see how
 | |
| > > fdatasync would be, and most platforms support that.
 | |
| > 
 | |
| > Because, if both files are on the same disk, you still have to move
 | |
| > the disk arm from the cylinder at the current log file write point
 | |
| > to the cylinder at the current ping-pong file write point. And then back
 | |
| > again to the log file write point cylinder.
 | |
| > 
 | |
| > In the end, having a ping-pong file as well seems to me unnecessary
 | |
| > complexity, especially when anyone interested in really good
 | |
| > performance is going to buy a disk subsystem that guarantees no
 | |
| > torn pages and thus will want to turn off the ping-pong file writes
 | |
| > entirely, anyway.
 | |
| 
 | |
| Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| when we need to fsync both of them.  What I would really like is to
 | |
| avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| that would be a huge win.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 4: Don't 'kill -9' the postmaster
 | |
| 
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sun Jun 23 20:09:44 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 20:09:43 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC)
 | |
| Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
 | |
| Message-ID: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On Sun, 23 Jun 2002, Bruce Momjian wrote:
 | |
| 
 | |
| > Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| > when we need to fsync both of them.  What I would really like is to
 | |
| > avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| > that would be a huge win.
 | |
| 
 | |
| You mean, the double I/O of writing the block to the WAL and data file?
 | |
| (We'd still have to write the changed columns or whatever to the WAL,
 | |
| right?)
 | |
| 
 | |
| I'd just add an option to turn it off. If you need it, you need it;
 | |
| there's no way around that except to buy hardware that is really going
 | |
| to guarantee your writes (which then means you don't need it).
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From jrnield@usol.com Sun Jun 23 21:28:58 2002
 | |
| Return-path: <jrnield@usol.com>
 | |
| Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 21:28:57 -0400 (EDT)
 | |
| Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303;
 | |
| 	Sun, 23 Jun 2002 21:28:55 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    Michael Loftis
 | |
|   <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker
 | |
|   <pgsql-hackers@postgresql.org>
 | |
| In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us>
 | |
| References: <200206231936.g5NJasa07642@candle.pha.pa.us>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 23 Jun 2002 21:29:23 -0400
 | |
| Message-ID: <1024882167.1793.733.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Status: ORr
 | |
| 
 | |
| On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
 | |
| > Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| > when we need to fsync both of them.  What I would really like is to
 | |
| > avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| > that would be a huge win.
 | |
| > 
 | |
| 
 | |
| It is impossible to do what you want. You cannot protect against
 | |
| partial writes without writing pages twice and calling fdatasync between
 | |
| them while going through a generic filesystem. The best disk array will
 | |
| not protect you if the operating system does not align block writes to
 | |
| the structure of the underlying device. Even with raw devices, you need
 | |
| special support or knowledge of the operating system and/or the disk
 | |
| device to ensure that each write request will be atomic to the
 | |
| underlying hardware. 
 | |
| 
 | |
| All other systems rely on the fact that you can recover a damaged file
 | |
| using the log archive. This means downtime in the rare case, but no data
 | |
| loss. Until PostgreSQL can do this, it will not be acceptable for
 | |
| real critical production use. This is not to knock PostgreSQL, because
 | |
| it is a very good database system, and clearly the best open-source one.
 | |
| It even has feature advantages over the commercial systems. But at the
 | |
| end of the day, unless you have complete understanding of the I/O system
 | |
| from write(2) through to the disk system, the only sure ways to protect
 | |
| against partial writes are by "careful writes" (in the WAL log or
 | |
| elsewhere, writing pages twice), or by requiring (and allowing) users to
 | |
| do log-replay recovery when a file is corrupted by a partial write. As
 | |
| long as there is a UPS, and the operating system doesn't crash, then
 | |
| there still should be no partial writes.
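
The "careful write" ordering described above can be sketched in C as follows. This is only an illustration under stated assumptions: the names careful_write, write_page_at and open_page_file are hypothetical, the page size is assumed to be 8192 bytes, and the side file holds a single page image at offset 0. A real implementation would also record which block the saved image belongs to so that recovery knows where to restore it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SZ 8192    /* assumed page size */

/* Open a file for page I/O (hypothetical helper). */
int
open_page_file(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Write exactly one page at byte offset `off`; returns 0 on success. */
static int
write_page_at(int fd, const char *page, off_t off)
{
    ssize_t n = pwrite(fd, page, PAGE_SZ, off);

    return (n == PAGE_SZ) ? 0 : -1;
}

/*
 * Write the page twice with a sync in between.  If the second write is torn
 * by a crash, the intact copy from step 1 is still on disk.
 */
int
careful_write(int side_fd, int data_fd, const char *page, off_t data_off)
{
    assert(page != NULL && data_off >= 0);

    /* Step 1: save a full copy in the side file and make it durable. */
    if (write_page_at(side_fd, page, 0) != 0 || fdatasync(side_fd) != 0)
        return -1;

    /* Step 2: only now overwrite the page's real location. */
    if (write_page_at(data_fd, page, data_off) != 0 || fdatasync(data_fd) != 0)
        return -1;

    return 0;
}
```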
 | |
| 
 | |
| If we log pages to WAL, they are useless when archived (after a
 | |
| checkpoint). So either we have a separate "log" for them (the ping-pong
 | |
| file), or we should at least remove them when archived, which makes log
 | |
| archiving more complex but is perfectly doable.
 | |
| 
 | |
| Finally, I would love to hear why we are using the operating system
 | |
| buffer manager at all. The OS is acting as a secondary buffer manager
 | |
| for us. Why is that? What flaw in our I/O system does this reveal? I
 | |
| know that:
 | |
| 
 | |
| >We sync only WAL, not the other pages, except for the sync() call we do
 | |
| > during checkpoint when we discard old WAL files.
 | |
| 
 | |
| But this is probably not a good thing. We should only be writing blocks
 | |
| when they need to be on disk. We should not be expecting the OS to write
 | |
| them "sometime later" and avoid blocking (as long) for the write. If we
 | |
| need that, then our buffer management is wrong and we need to fix it.
 | |
| The reason we are doing this is because we expect the OS buffer manager
 | |
| to do asynchronous I/O for us, but then we don't control the order. That
 | |
| is the reason why we have to call fdatasync(), to create "sequence
 | |
| points".
 | |
| 
 | |
| The reason we have performance problems with either O_DSYNC or fdatasync
 | |
| on the normal relations is because we have no dbflush process. This
 | |
| causes an unacceptable amount of I/O blocking by other transactions.
 | |
| 
 | |
| The ORACLE people were not kidding when they said that they could not
 | |
| certify Linux for production use until it supported O_DSYNC. Can you
 | |
| explain why that was the case?
 | |
| 
 | |
| Finally, let me apologize if the above comes across as somewhat
 | |
| belligerent. I know very well that I can't compete with you guys for
 | |
| knowledge of the PostgreSQL system. I am still at a loss when I look at
 | |
| the optimizer and executor modules, and it will take some time before I
 | |
| can follow discussion of that area. Even then, I doubt my ability to
 | |
| compare with people like Mr. Lane and Mr. Momjian in experience and
 | |
| general intelligence, or in the field of database programming and
 | |
| software development in particular. However, this discussion and a
 | |
| search of the pgsql-hackers archives reveals this problem to be the KEY
 | |
| area of PostgreSQL's failing, and general misunderstanding, when
 | |
| compared to its commercial competitors.
 | |
| 
 | |
| Sincerely, 
 | |
| 
 | |
| 	J. R. Nield
 | |
| 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002
 | |
| Return-path: <pgsql-hackers-owner+M24090@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 12:38:03 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 10:18:05 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id CDDFA475CC3
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 08:37:44 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 08:37:44 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 5C971475858
 | |
| 	for <pgsql-hackers@postgresql.org>; Sun, 23 Jun 2002 22:47:12 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992;
 | |
| 	Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT)
 | |
| cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| J. R. Nield wrote:
 | |
| > On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote:
 | |
| > > Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| > > when we need to fsync both of them.  What I would really like is to
 | |
| > > avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| > > that would be a huge win.
 | |
| > > 
 | |
| > 
 | |
| > It is impossible to do what you want. You cannot protect against
 | |
| > partial writes without writing pages twice and calling fdatasync between
 | |
| > them while going through a generic filesystem. The best disk array will
 | |
| > not protect you if the operating system does not align block writes to
 | |
| > the structure of the underlying device. Even with raw devices, you need
 | |
| > special support or knowledge of the operating system and/or the disk
 | |
| > device to ensure that each write request will be atomic to the
 | |
| > underlying hardware. 
 | |
| 
 | |
| Yes, I suspected it was impossible, but that doesn't mean I want it any
 | |
| less.  ;-)
 | |
| 
 | |
| > All other systems rely on the fact that you can recover a damaged file
 | |
| > using the log archive. This means downtime in the rare case, but no data
 | |
| > loss. Until PostgreSQL can do this, it will not be acceptable for
 | |
| > real critical production use. This is not to knock PostgreSQL, because
 | |
| > it is a very good database system, and clearly the best open-source one.
 | |
| > It even has feature advantages over the commercial systems. But at the
 | |
| > end of the day, unless you have complete understanding of the I/O system
 | |
| > from write(2) through to the disk system, the only sure ways to protect
 | |
| > against partial writes are by "careful writes" (in the WAL log or
 | |
| > elsewhere, writing pages twice), or by requiring (and allowing) users to
 | |
| > do log-replay recovery when a file is corrupted by a partial write. As
 | |
| > long as there is a UPS, and the operating system doesn't crash, then
 | |
| > there still should be no partial writes.
 | |
| 
 | |
| You are talking point-in-time recovery, a major missing feature right
 | |
| next to replication, and I agree it makes PostgreSQL unacceptable for
 | |
| some applications.  Point taken.
 | |
| 
 | |
| And the interesting thing you are saying is that with point-in-time
 | |
| recovery, we don't need to write pre-write images of pages because if we
 | |
| detect a partial page write, we then abort the database and tell the
 | |
| user to do a point-in-time recovery, basically meaning we are using the
 | |
| previous full backup as our pre-write page image and roll forward using
 | |
| the logical logs.  This is clearly a nice thing to be able to do because
 | |
| it lets you take a pre-write image of the page once during full backup,
 | |
| keep it offline, and bring it back in the rare case of a full page write
 | |
| failure.  I now can see how the MS SQL torn bits would be used, not
 | |
| for recovery, but to detect a partial write and force a point-in-time
 | |
| recovery from the administrator.
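
As an illustration of detection without repair, here is a small C sketch of the torn-bit scheme as it was described earlier in this thread: one bit per 512-byte chunk, all flipped together before a write, and compared after a crash. The layout is an assumption made for the example (the low bit of the last byte of each chunk); the actual MS SQL Server on-disk format is not given in this thread, and the function names are hypothetical. The sketch also ignores where the data bit displaced by the torn bit would be stored.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ  8192                   /* assumed page size */
#define CHUNK_SZ 512                    /* chunk size named in the thread */
#define NCHUNKS  (PAGE_SZ / CHUNK_SZ)

/* Index of the byte holding chunk i's torn bit (assumed: last byte). */
static size_t
torn_byte_index(size_t i)
{
    return i * CHUNK_SZ + (CHUNK_SZ - 1);
}

/* Flip the torn bit in every chunk; call just before writing the page. */
void
page_flip_torn_bits(unsigned char *page)
{
    size_t i;

    assert(page != NULL);
    for (i = 0; i < NCHUNKS; i++)
        page[torn_byte_index(i)] ^= 0x01;
}

/* Return 1 if the chunks disagree, meaning the last write was partial. */
int
page_is_torn(const unsigned char *page)
{
    unsigned char first;
    size_t i;

    assert(page != NULL);
    first = page[torn_byte_index(0)] & 0x01;
    for (i = 1; i < NCHUNKS; i++)
    {
        if ((page[torn_byte_index(i)] & 0x01) != first)
            return 1;
    }
    return 0;
}
```

A page that fails this check cannot be repaired from the bits alone; it only tells the administrator that recovery from a backup plus the log is needed.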
 | |
| 
 | |
| 
 | |
| > If we log pages to WAL, they are useless when archived (after a
 | |
| > checkpoint). So either we have a separate "log" for them (the ping-pong
 | |
| > file), or we should at least remove them when archived, which makes log
 | |
| > archiving more complex but is perfectly doable.
 | |
| 
 | |
| Yes, that is how we will do point-in-time recovery;  remove the
 | |
| pre-write page images and archive the rest.  It is more complex, but
 | |
| having the fsync all in one file is too big a win.
 | |
| 
 | |
| > Finally, I would love to hear why we are using the operating system
 | |
| > buffer manager at all. The OS is acting as a secondary buffer manager
 | |
| > for us. Why is that? What flaw in our I/O system does this reveal? I
 | |
| > know that:
 | |
| > 
 | |
| > >We sync only WAL, not the other pages, except for the sync() call we do
 | |
| > > during checkpoint when we discard old WAL files.
 | |
| > 
 | |
| > But this is probably not a good thing. We should only be writing blocks
 | |
| > when they need to be on disk. We should not be expecting the OS to write
 | |
| > them "sometime later" and avoid blocking (as long) for the write. If we
 | |
| > need that, then our buffer management is wrong and we need to fix it.
 | |
| > The reason we are doing this is because we expect the OS buffer manager
 | |
| > to do asynchronous I/O for us, but then we don't control the order. That
 | |
| > is the reason why we have to call fdatasync(), to create "sequence
 | |
| > points".
 | |
| 
 | |
| Yes.  I think I understand.  It is true we have to fsync WAL because we
 | |
| can't control the individual writes by the OS.
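The ordering requirement can be sketched in a few lines of Python. This is only an illustration of "log first, force the log, then hand the data page to the OS"; the file names, record format, and the `commit` helper are hypothetical, and `os.pwrite` assumes a Unix-like system.

```python
import os
import tempfile


def commit(wal_fd: int, data_fd: int, record: bytes, page_no: int, page: bytes) -> None:
    """Append the log record, force it to disk, then issue the data page write."""
    os.write(wal_fd, record)                        # 1. append the WAL record
    os.fsync(wal_fd)                                # 2. sequence point: the record is on disk
    os.pwrite(data_fd, page, page_no * len(page))   # 3. data page; the OS writes it when it likes


def demo() -> tuple:
    """Run one commit against scratch files and return (WAL size, data file size)."""
    with tempfile.TemporaryDirectory() as d:
        wal_fd = os.open(os.path.join(d, "wal"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            commit(wal_fd, data_fd, b"update page 2\n", 2, b"x" * 8192)
            return os.fstat(wal_fd).st_size, os.fstat(data_fd).st_size
        finally:
            os.close(wal_fd)
            os.close(data_fd)
```

Only the log is forced. The data file is left to the OS until the next checkpoint, which is the behaviour described in the quoted text.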
 | |
| 
 | |
| > The reason we have performance problems with either O_DSYNC or fdatasync
 | |
| > on the normal relations is because we have no dbflush process. This
 | |
| > causes an unacceptable amount of I/O blocking by other transactions.
 | |
| 
 | |
| Uh, that would force writes all over the disk. Why do we really care how
 | |
| the OS writes them?  If we are going to fsync, let's just do the one
 | |
| file and be done with it.  What would a separate flusher process really
 | |
| buy us if it has to use fsync too?  The main backend doesn't have to
 | |
| wait for the fsync, but then again, we can't say the transaction is
 | |
| committed until it hits the disk, so how does a flusher help?
 | |
| 
 | |
| > The ORACLE people were not kidding when they said that they could not
 | |
| > certify Linux for production use until it supported O_DSYNC. Can you
 | |
| > explain why that was the case?
 | |
| 
 | |
| I don't see O_DSYNC as very different from write/fsync(or fdatasync).
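For comparison, the two approaches look like this in a small Python sketch. The helper names are hypothetical, and the fallbacks to `O_SYNC` and `fsync` are assumptions added so the sketch runs on platforms that lack `O_DSYNC` or `fdatasync`.

```python
import os

# O_DSYNC is not defined on every platform; fall back to O_SYNC (portability assumption).
SYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))
# fdatasync is likewise missing on some platforms; fsync is the stricter substitute.
data_sync = getattr(os, "fdatasync", os.fsync)


def write_with_open_flag(path: str, payload: bytes) -> int:
    """Open with the sync flag: each write() returns only once the data is reported stable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | SYNC_FLAG, 0o600)
    try:
        return os.write(fd, payload)
    finally:
        os.close(fd)


def write_then_sync(path: str, payload: bytes) -> int:
    """Ordinary buffered write(), followed by one explicit flush."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, payload)
        data_sync(fd)
        return written
    finally:
        os.close(fd)
```

The visible difference is granularity: with the open flag every write waits for the disk, while the second form lets several writes share a single flush.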
 | |
| 
 | |
| > Finally, let me apologize if the above comes across as somewhat
 | |
| > belligerent. I know very well that I can't compete with you guys for
 | |
| > knowledge of the PostgreSQL system. I am still at a loss when I look at
 | |
| > the optimizer and executor modules, and it will take some time before I
 | |
| > can follow discussion of that area. Even then, I doubt my ability to
 | |
| > compare with people like Mr. Lane and Mr. Momjian in experience and
 | |
| > general intelligence, or in the field of database programming and
 | |
| > software development in particular. However, this discussion and a
 | |
| > search of the pgsql-hackers archives reveals this problem to be the KEY
 | |
| > area of PostgreSQL's failing, and general misunderstanding, when
 | |
| > compared to its commercial competitors.
 | |
| 
 | |
| We appreciate your ideas.  Few of us are professional db folks so we are
 | |
| always looking for good ideas.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 6: Have you searched our list archives?
 | |
| 
 | |
| http://archives.postgresql.org
 | |
| 
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Sun Jun 23 23:40:59 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O3evF17903
 | |
| 	for <pgman@candle.pha.pa.us>; Sun, 23 Jun 2002 23:40:58 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 37F36F820; Mon, 24 Jun 2002 03:40:54 +0000 (UTC)
 | |
| Date: Mon, 24 Jun 2002 12:40:51 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain>
 | |
| Message-ID: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On 23 Jun 2002, J. R. Nield wrote:
 | |
| 
 | |
| > It is impossible to do what you want. You can not protect against
 | |
| > partial writes without writing pages twice and calling fdatasync
 | |
| > between them while going through a generic filesystem.
 | |
| 
 | |
| I agree with this.
 | |
| 
 | |
| > The best disk array will not protect you if the operating system does
 | |
| > not align block writes to the structure of the underlying device.
 | |
| 
 | |
| This I don't quite understand. Assuming you're using a SCSI drive
 | |
| (and this mostly applies to ATAPI/IDE, too), you can do naught but
 | |
| align block writes to the structure of the underlying device. When you
 | |
| initiate a SCSI WRITE command, you start by telling the device at which
 | |
| block to start writing and how many blocks you intend to write. Then you
 | |
| start passing the data.
 | |
| 
 | |
| (See http://www.danbbs.dk/~dino/SCSI/SCSI2-09.html#9.2.21 for parameter
 | |
| details for the SCSI WRITE(10) command. You may find the SCSI 2
 | |
| specification, at http://www.danbbs.dk/~dino/SCSI/ to be a useful
 | |
| reference here.)
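To illustrate the point that a SCSI write is always expressed as "starting block plus block count", here is a minimal Python sketch that builds a WRITE(10) command descriptor block. The helper names and the 512-byte sector size are assumptions for the example; flag bits and the control byte are simply left at zero.

```python
import struct

WRITE_10 = 0x2A   # operation code of WRITE(10)


def blocks_for(nbytes: int, sector: int = 512) -> int:
    """Number of device blocks covering nbytes; a partial block cannot be expressed."""
    if nbytes % sector:
        raise ValueError("transfer must be a whole number of device blocks")
    return nbytes // sector


def write10_cdb(lba: int, blocks: int) -> bytes:
    """Pack opcode, flags, 32-bit block address, reserved byte, 16-bit length, control."""
    if not 0 <= lba < 2**32 or not 0 <= blocks < 2**16:
        raise ValueError("WRITE(10) carries a 32-bit address and a 16-bit block count")
    return struct.pack(">BBIBHB", WRITE_10, 0, lba, 0, blocks, 0)


# One 8K database page starting at device block 1000.
cdb = write10_cdb(1000, blocks_for(8192))
```

The command has no field for a byte offset or a partial block, so alignment to the device's block structure is built into the command format.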
 | |
| 
 | |
| > Even with raw devices, you need special support or knowledge of the
 | |
| > operating system and/or the disk device to ensure that each write
 | |
| > request will be atomic to the underlying hardware.
 | |
| 
 | |
| Well, so here I guess you're talking about two things:
 | |
| 
 | |
|     1. When you request, say, an 8K block write, will the OS really
 | |
|     write it to disk in a single 8K or multiple of 8K SCSI write
 | |
|     command?
 | |
| 
 | |
|     2. Does the SCSI device you're writing to consider these writes to
 | |
|     be transactional? That is, if the write is interrupted before being
 | |
|     completed, does the SCSI device guarantee that the partially-sent
 | |
|     data is not written, and the old data is maintained? And of course,
 | |
|     does it guarantee that, when it acknowledges a write, that write is
 | |
|     now in stable storage and will never go away?
 | |
| 
 | |
| Both of these are not hard to guarantee, actually. For a BSD-based OS,
 | |
| for example, just make sure that your filesystem block size is the
 | |
| same as or a multiple of the database block size. BSD will never write
 | |
| anything other than a block or a sequence of blocks to a disk in a
 | |
| single SCSI transaction (unless you've got a really odd SCSI driver).
 | |
| And for your disk, buy a Baydel or Clarion disk array, or something
 | |
| similar.
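The block-size condition is easy to check mechanically. The following Python sketch is one way to do it, assuming an 8K database block and assuming that `f_bsize` from `statvfs` reflects the filesystem block size on the system in question (this varies between operating systems, so treat it as a starting point only). The function names are hypothetical.

```python
import os

DB_BLOCK = 8192   # assumed database block size, matching the 8K example above


def block_sizes_compatible(fs_block: int, db_block: int = DB_BLOCK) -> bool:
    """True when the filesystem block is the same as, or a multiple of, the database block."""
    return fs_block >= db_block and fs_block % db_block == 0


def check_path(path: str) -> bool:
    """Apply the rule to the filesystem holding `path` (Unix-only: uses statvfs)."""
    return block_sizes_compatible(os.statvfs(path).f_bsize)
```

For instance, `block_sizes_compatible(16384)` holds, while a 4096-byte filesystem block fails the check because one 8K database block would span two filesystem blocks.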
 | |
| 
 | |
| Given that it's not hard to set up a system that meets these criteria,
 | |
| and this is in fact commonly done for database servers, it would seem a
 | |
| good idea for postgres to have the option to take advantage of the time
 | |
| and money spent and adjust its performance upward appropriately.
 | |
| 
 | |
| > All other systems rely on the fact that you can recover a damaged file
 | |
| > using the log archive.
 | |
| 
 | |
| Not exactly. For MS SQL Server, at any rate, if it detects a page tear
 | |
| you cannot restore based on the log file alone. You need a full or
 | |
| partial backup that includes that entire torn block.
 | |
| 
 | |
| > This means downtime in the rare case, but no data loss. Until
 | |
| > PostgreSQL can do this, then it will not be acceptable for real
 | |
| > critical production use.
 | |
| 
 | |
| It seems to me that it is doing this right now. In fact, it's more
 | |
| reliable than some commercial systems (such as SQL Server) because it can
 | |
| recover from a torn block with just the logfile.
 | |
| 
 | |
| > But at the end of the day, unless you have complete understanding of
 | |
| > the I/O system from write(2) through to the disk system, the only sure
 | |
| > ways to protect against partial writes are by "careful writes" (in
 | |
| > the WAL log or elsewhere, writing pages twice), or by requiring (and
 | |
| > allowing) users to do log-replay recovery when a file is corrupted by
 | |
| > a partial write.
 | |
| 
 | |
| I don't understand how, without a copy of the old data that was in the
 | |
| torn block, you can restore that block from just log file entries. Can
 | |
| you explain this to me? Take, as an example, a block with ten tuples,
 | |
| only one of which has been changed "recently." (I.e., only that change
 | |
| is in the log files.)
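The ten-tuple example can be played through in a toy Python model. Everything here is hypothetical (a page is just a list of ten values and a log record is a slot number plus a new value); it only illustrates why a change record alone cannot rebuild a torn block.

```python
def apply_delta(page, record):
    """Apply one logged change (slot, new_value) to a copy of the page."""
    slot, value = record
    out = list(page)
    out[slot] = value
    return out


before = [f"tuple-{i}" for i in range(10)]   # ten tuples on the block
change = (3, "tuple-3-v2")                   # the only change present in the log
intended = apply_delta(before, change)

# A torn write: the first half of the block reached disk, the second half is garbage.
GARBAGE = None
on_disk = intended[:5] + [GARBAGE] * 5

# Replaying the change record alone repairs slot 3 but not the damaged half.
from_log_only = apply_delta(on_disk, change)

# Replaying it on top of a full copy of the old block rebuilds everything.
from_page_image = apply_delta(before, change)
```

In this model the nine unchanged tuples can only come from a complete copy of the block, whether that copy lives in the log or in a backup.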
 | |
| 
 | |
| > If we log pages to WAL, they are useless when archived (after a
 | |
| > checkpoint). So either we have a separate "log" for them (the
 | |
| > ping-pong file), or we should at least remove them when archived,
 | |
| > which makes log archiving more complex but is perfectly doable.
 | |
| 
 | |
| Right. That seems to me a better option, since we've now got only one
 | |
| write point on the disk rather than two.
 | |
| 
 | |
| > Finally, I would love to hear why we are using the operating system
 | |
| > buffer manager at all. The OS is acting as a secondary buffer manager
 | |
| > for us. Why is that? What flaw in our I/O system does this reveal?
 | |
| 
 | |
| It's acting as a "second-level" buffer manager, yes, but to say it's
 | |
| "secondary" may be a bit misleading. On most of the systems I've set
 | |
| up, the OS buffer cache is doing the vast majority of the work, and the
 | |
| postgres buffering is fairly minimal.
 | |
| 
 | |
| There are some good (and some perhaps not-so-good) reasons to do it this
 | |
| way. I'll list them more or less in the order of best to worst:
 | |
| 
 | |
|     1. The OS knows where the blocks physically reside on disk, and
 | |
|     postgres does not. Therefore it's in the interest of postgresql to
 | |
|     dispatch write responsibility back to the OS as quickly as possible
 | |
|     so that the OS can prioritize requests appropriately. Most operating
 | |
|     systems use an "elevator" algorithm to minimize disk head movement;
 | |
|     but if the OS does not have a block that it could write while the
 | |
|     head is "on the way" to another request, it can't write it in that
 | |
|     head pass.
 | |
| 
 | |
|     2. Postgres does not know about any "bank-switching" tricks for
 | |
|     mapping more physical memory than it has address space. Thus, on
 | |
|     32-bit machines, postgres might be limited to mapping 2 or 3 GB of
 | |
|     memory, even though the machine has, say, 6 GB of physical RAM. The
 | |
|     OS can use all of the available memory for caching; postgres cannot.
 | |
| 
 | |
|     3. A lot of work has been put into the seek algorithms, read-ahead
 | |
|     algorithms, block allocation algorithms, etc. in the OS. Why
 | |
|     duplicate all that work again in postgres?
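Point 1 can be illustrated with a toy elevator scheduler in Python. This is a simplified single-sweep model with hypothetical names and made-up block numbers, not the algorithm of any particular OS.

```python
def elevator_order(head: int, pending: list) -> list:
    """Service requests at or beyond the head on the way up, the rest on the way back."""
    up = sorted(b for b in pending if b >= head)
    down = sorted((b for b in pending if b < head), reverse=True)
    return up + down


def head_travel(head: int, order: list) -> int:
    """Total distance the head moves when servicing blocks in the given order."""
    travel = 0
    for block in order:
        travel += abs(block - head)
        head = block
    return travel


requests = [95, 10, 60, 70]                  # arrival order
swept = elevator_order(50, requests)         # order chosen by the elevator
late = elevator_order(50, [95, 10, 60]) + [70]   # block 70 handed over after the pass
```

With the head at block 50, the swept order moves the head 130 blocks, against 190 for arrival order. If block 70 is held back until the pass is over, the total is 190 again, which is the cost of not handing writes to the OS early.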
 | |
| 
 | |
| When you say things like the following:
 | |
| 
 | |
| > We should only be writing blocks when they need to be on disk. We
 | |
| > should not be expecting the OS to write them "sometime later" and
 | |
| > avoid blocking (as long) for the write. If we need that, then our
 | |
| > buffer management is wrong and we need to fix it.
 | |
| 
 | |
| you appear to be making the argument that we should take the route of
 | |
| other database systems, and use raw devices and our own management of
 | |
| disk block allocation. If so, you might want first to look back through
 | |
| the archives at the discussion I and several others had about this a
 | |
| month or two ago. After looking in detail at what NetBSD, at least, does
 | |
| in terms of its disk I/O algorithms and buffering, I've pretty much come
 | |
| around, at least for the moment, to the attitude that we should stick
 | |
| with using the OS. I wouldn't mind seeing postgres be able to manage all
 | |
| of this stuff, but it's a *lot* of work for not all that much benefit
 | |
| that I can see.
 | |
| 
 | |
| > The ORACLE people were not kidding when they said that they could not
 | |
| > certify Linux for production use until it supported O_DSYNC. Can you
 | |
| > explain why that was the case?
 | |
| 
 | |
| I'm suspecting it's because Linux at the time had no raw devices, so
 | |
| O_DSYNC was the only other possible method of making sure that disk
 | |
| writes actually got to disk.
 | |
| 
 | |
| You certainly don't want to use O_DSYNC if you can use another method,
 | |
| because O_DSYNC still goes through the operating system's buffer
 | |
| cache, wasting memory and double-caching things. If you're doing your
 | |
| own management, you need either to use a raw device or open files with
 | |
| the flag that indicates that the buffer cache should not be used at all
 | |
| for reads from and writes to that file.
 | |
| 
 | |
| > However, this discussion and a search of the pgsql-hackers archives
 | |
| > reveals this problem to be the KEY area of PostgreSQL's failing, and
 | |
| > general misunderstanding, when compared to its commercial competitors.
 | |
| 
 | |
| No, I think it's just that you're under a few minor misapprehensions
 | |
| here about what postgres and the OS are actually doing. As I said, I
 | |
| went through this whole exact argument a month or two ago, on this very
 | |
| list, and I came around to the idea that what postgres is doing now
 | |
| works quite well, at least on NetBSD. (Most other OSes have disk I/O
 | |
| algorithms that are pretty much as good or better.) There might be a
 | |
| very slight advantage to doing all one's own I/O management, but it's
 | |
| a huge amount of work, and I think that much effort could be much more
 | |
| usefully applied to other areas.
 | |
| 
 | |
| Just as a side note, I've been a NetBSD developer since about '96,
 | |
| and have been delving into the details of OS design since well before
 | |
| that time, so I'm coming to this with what I hope is reasonably good
 | |
| knowledge of how disks work and how operating systems use them. (Not
 | |
| that this should stop you from pointing out holes in my arguments. :-))
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24112@postgresql.org Mon Jun 24 18:16:36 2002
 | |
| Return-path: <pgsql-hackers-owner+M24112@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OMGaF00910
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 18:16:36 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id A2EF1476475; Mon, 24 Jun 2002 16:43:38 -0400 (EDT)
 | |
| Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 16:43:38 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id BA57D476148; Mon, 24 Jun 2002 14:14:00 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 93D6A477214
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:59:17 -0400 (EDT)
 | |
| Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 13:59:17 2002
 | |
| Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | |
| 	by postgresql.org (Postfix) with ESMTP id D70AA476401
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 10:06:26 -0400 (EDT)
 | |
| Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | |
| 	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OE6J117666;
 | |
| 	Mon, 24 Jun 2002 10:06:19 -0400 (EDT)
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, "J. R. Nield" <jrnield@usol.com>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net> 
 | |
| References: <Pine.NEB.4.43.0206240907160.511-100000@angelic.cynic.net>
 | |
| Comments: In-reply-to Curt Sampson <cjs@cynic.net>
 | |
| 	message dated "Mon, 24 Jun 2002 09:09:30 +0900"
 | |
| Date: Mon, 24 Jun 2002 10:06:19 -0400
 | |
| Message-ID: <17663.1024927579@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-5.3 required=5.0
 | |
| 	tests=IN_REP_TO,X_NOT_PRESENT
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| > On Sun, 23 Jun 2002, Bruce Momjian wrote:
 | |
| >> Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| >> when we need to fsync both of them.  What I would really like is to
 | |
| >> avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| >> that would be a huge win.
 | |
| 
 | |
| I don't believe it's possible to eliminate the double I/O.  Keep in mind
 | |
| though that in the ideal case (plenty of shared buffers) you are only
 | |
| paying two writes per modified block per checkpoint interval --- one to
 | |
| the WAL during the first write of the interval, and then a write to the
 | |
| real datafile issued by the checkpoint process.  Anything that requires
 | |
| transaction commits to write data blocks will likely result in more I/O
 | |
| not less, at least for blocks that are modified by several successive
 | |
| transactions.
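A back-of-the-envelope count shows the effect. The Python sketch below is a deliberately crude model with hypothetical function names: it counts only page-sized writes within one checkpoint interval and ignores the small per-transaction log records.

```python
def page_writes_with_wal(blocks_modified: int) -> int:
    """One page image to WAL on first touch plus one datafile write at checkpoint."""
    return 2 * blocks_modified


def page_writes_at_commit(commits_per_block: list) -> int:
    """If every commit had to write the data block itself: one write per commit per block."""
    return sum(commits_per_block)


# Three blocks, touched by 5, 1 and 8 transactions during the interval.
commits = [5, 1, 8]
wal_total = page_writes_with_wal(len(commits))
commit_total = page_writes_at_commit(commits)
```

Here the WAL scheme costs 6 page writes and the write-at-commit scheme 14. Only the block touched once comes out ahead without WAL, which matches the caveat about blocks modified by several successive transactions.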
 | |
| 
 | |
| The only thing I've been able to think of that seems like it might
 | |
| improve matters is to make the WAL writing logic aware of the layout
 | |
| of buffer pages --- specifically, to know that our pages generally
 | |
| contain an uninteresting "hole" in the middle, and not write the hole.
 | |
| Optimistically this might reduce the WAL data volume by something
 | |
| approaching 50%; though pessimistically (if most pages are near full)
 | |
| it wouldn't help much.
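The idea can be sketched in Python as follows. The `lower` and `upper` offsets stand for the end of the page's front section and the start of its back section; these names, the 8K page size, and the assumption that the hole's contents do not matter (it is refilled with zeros) are all illustrative.

```python
PAGE_SIZE = 8192   # assumed page size


def strip_hole(page: bytes, lower: int, upper: int) -> bytes:
    """Drop the unused bytes between `lower` and `upper` before logging the page."""
    if not 0 <= lower <= upper <= len(page):
        raise ValueError("hole bounds must lie inside the page")
    return page[:lower] + page[upper:]


def restore_hole(stripped: bytes, lower: int, upper: int) -> bytes:
    """Rebuild the full page during replay by re-inserting a zero-filled hole."""
    return stripped[:lower] + bytes(upper - lower) + stripped[lower:]


# A page that is roughly half empty: 200 bytes in front, 3992 bytes at the back.
page = b"H" * 200 + bytes(4000) + b"T" * (PAGE_SIZE - 4200)
logged = strip_hole(page, 200, 4200)
rebuilt = restore_hole(logged, 200, 4200)
```

In this example the logged image is 4192 bytes instead of 8192. For a nearly full page, `lower` and `upper` almost meet and the saving shrinks toward zero.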
 | |
| 
 | |
| This was not very feasible when the WAL code was designed because the
 | |
| buffer manager needed to cope with both normal pages and pg_log pages,
 | |
| but as of 7.2 I think it'd be safe to assume that all pages have the
 | |
| standard layout.
 | |
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 2: you can get off all lists at once with the unregister command
 | |
|     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24116@postgresql.org Mon Jun 24 20:32:07 2002
 | |
| Return-path: <pgsql-hackers-owner+M24116@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P0W7F10985
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 20:32:07 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id EBCE547632E; Mon, 24 Jun 2002 18:54:34 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 18:54:34 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 3EB93476D85; Mon, 24 Jun 2002 17:12:18 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id EBC20476E2E
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:54:40 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 14:54:40 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 1C8874760C2
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 12:40:53 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5OGeVY06116;
 | |
| 	Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206241640.g5OGeVY06116@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <17663.1024927579@sss.pgh.pa.us>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Date: Mon, 24 Jun 2002 12:40:31 -0400 (EDT)
 | |
| cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| Tom Lane wrote:
 | |
| > > On Sun, 23 Jun 2002, Bruce Momjian wrote:
 | |
| > >> Yes, I don't see writing to two files vs. one to be any win, especially
 | |
| > >> when we need to fsync both of them.  What I would really like is to
 | |
| > >> avoid the double I/O of writing to WAL and to the data file;  improving
 | |
| > >> that would be a huge win.
 | |
| > 
 | |
| > I don't believe it's possible to eliminate the double I/O.  Keep in mind
 | |
| > though that in the ideal case (plenty of shared buffers) you are only
 | |
| > paying two writes per modified block per checkpoint interval --- one to
 | |
| > the WAL during the first write of the interval, and then a write to the
 | |
| > real datafile issued by the checkpoint process.  Anything that requires
 | |
| > transaction commits to write data blocks will likely result in more I/O
 | |
| > not less, at least for blocks that are modified by several successive
 | |
| > transactions.
 | |
| > 
 | |
| > The only thing I've been able to think of that seems like it might
 | |
| > improve matters is to make the WAL writing logic aware of the layout
 | |
| > of buffer pages --- specifically, to know that our pages generally
 | |
| > contain an uninteresting "hole" in the middle, and not write the hole.
 | |
| > Optimistically this might reduce the WAL data volume by something
 | |
| > approaching 50%; though pessimistically (if most pages are near full)
 | |
| > it wouldn't help much.
 | |
| 
 | |
| Good idea.  How about putting the page through our TOAST compression
 | |
| routine before writing it to WAL?  Should be pretty easy and fast and
 | |
| doesn't require any knowledge of the page format.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002
 | |
| Return-path: <pgsql-hackers-owner+M24114@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:54:35 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT)
 | |
| Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 16:46:24 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 950004770BC
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 14:59:46 -0400 (EDT)
 | |
| Mailbox-Line: From tgl@sss.pgh.pa.us  Mon Jun 24 14:59:46 2002
 | |
| Received: from sss.pgh.pa.us (unknown [192.204.191.242])
 | |
| 	by postgresql.org (Postfix) with ESMTP id A0756475BB7
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 13:11:41 -0400 (EDT)
 | |
| Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | |
| 	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826;
 | |
| 	Mon, 24 Jun 2002 13:11:02 -0400 (EDT)
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, "J. R. Nield" <jrnield@usol.com>,
 | |
|    Michael Loftis <mloftis@wgops.com>, mlw <markw@mohawksoft.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us> 
 | |
| References: <200206241640.g5OGeVY06116@candle.pha.pa.us>
 | |
| Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| 	message dated "Mon, 24 Jun 2002 12:40:31 -0400"
 | |
| Date: Mon, 24 Jun 2002 13:11:01 -0400
 | |
| Message-ID: <19823.1024938661@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-5.3 required=5.0
 | |
| 	tests=IN_REP_TO,X_NOT_PRESENT
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| >> The only thing I've been able to think of that seems like it might
 | |
| >> improve matters is to make the WAL writing logic aware of the layout
 | |
| >> of buffer pages --- specifically, to know that our pages generally
 | |
| >> contain an uninteresting "hole" in the middle, and not write the hole.
 | |
| >> Optimistically this might reduce the WAL data volume by something
 | |
| >> approaching 50%; though pessimistically (if most pages are near full)
 | |
| >> it wouldn't help much.
 | |
| 
 | |
| > Good idea.  How about putting the page through our TOAST compression
 | |
| > routine before writing it to WAL?  Should be pretty easy and fast and
 | |
| > doesn't require any knowledge of the page format.
 | |
| 
 | |
| Easy, maybe, but fast definitely NOT.  The compressor is not speedy.
 | |
| Given that we have to be holding various locks while we build WAL
 | |
| records, I do not think it's a good idea to add CPU time there.
 | |
| 
 | |
| Also, compressing already-compressed data is not a win ...
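The last point is easy to demonstrate. The Python sketch below uses `zlib` purely as a stand-in compressor (it is not the TOAST compression routine), and the page contents are made up.

```python
import zlib

# A made-up 8K "page" of repetitive tuple data.
page = b"".join(b"tuple %05d padding padding\n" % i for i in range(300))[:8192]

once = zlib.compress(page)    # first pass: plenty of redundancy to remove
twice = zlib.compress(once)   # second pass: input is already compressed

first_saving = len(page) - len(once)
second_saving = len(once) - len(twice)   # may be zero or negative
```

The first pass removes most of the redundancy, so the second has little or nothing left to gain while still costing CPU time. The same applies to a page that already holds compressed values.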
 | |
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 3: if posting/reading through Usenet, please send an appropriate
 | |
| subscribe-nomail command to majordomo@postgresql.org so that your
 | |
| message can get through to the mailing list cleanly
 | |
| 
 | |
| 
 | |
| 
 | |
| From jrnield@usol.com Mon Jun 24 16:49:25 2002
 | |
| Return-path: <jrnield@usol.com>
 | |
| Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 16:49:24 -0400 (EDT)
 | |
| Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100;
 | |
| 	Mon, 24 Jun 2002 16:49:18 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| In-Reply-To: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
 | |
| References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 24 Jun 2002 16:49:42 -0400
 | |
| Message-ID: <1024951786.1793.865.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Status: ORr
 | |
| 
 | |
| On Sun, 2002-06-23 at 23:40, Curt Sampson wrote:
 | |
| > On 23 Jun 2002, J. R. Nield wrote:
 | |
| > 
 | |
| > > It is impossible to do what you want. You can not protect against
 | |
| > > partial writes without writing pages twice and calling fdatasync
 | |
| > > between them while going through a generic filesystem.
 | |
| > 
 | |
| > I agree with this.
 | |
| > 
 | |
| > > The best disk array will not protect you if the operating system does
 | |
| > > not align block writes to the structure of the underlying device.
 | |
| > 
 | |
| > This I don't quite understand. Assuming you're using a SCSI drive
 | |
| > (and this mostly applies to ATAPI/IDE, too), you can do naught but
 | |
| > align block writes to the structure of the underlying device. When you
 | |
| > initiate a SCSI WRITE command, you start by telling the device at which
 | |
| > block to start writing and how many blocks you intend to write. Then you
 | |
| > start passing the data.
 | |
| > 
 | |
| 
 | |
| All I'm saying is that the entire postgresql block write must be
 | |
| converted into exactly one SCSI write command in all cases, and I don't
 | |
| know a portable way to ensure this. 
 | |
| 
 | |
| > > Even with raw devices, you need special support or knowledge of the
 | |
| > > operating system and/or the disk device to ensure that each write
 | |
| > > request will be atomic to the underlying hardware.
 | |
| > 
 | |
| > Well, so here I guess you're talking about two things:
 | |
| > 
 | |
| >     1. When you request, say, an 8K block write, will the OS really
 | |
| >     write it to disk in a single 8K or multiple of 8K SCSI write
 | |
| >     command?
 | |
| > 
 | |
| >     2. Does the SCSI device you're writing to consider these writes to
 | |
| >     be transactional? That is, if the write is interrupted before being
 | |
| >     completed, does the SCSI device guarantee that the partially-sent
 | |
| >     data is not written, and the old data is maintained? And of course,
 | |
| >     does it guarantee that, when it acknowledges a write, that write is
 | |
| >     now in stable storage and will never go away?
 | |
| > 
 | |
| > Both of these are not hard to guarantee, actually. For a BSD-based OS,
 | |
| > for example, just make sure that your filesystem block size is the
 | |
| > same as or a multiple of the database block size. BSD will never write
 | |
| > anything other than a block or a sequence of blocks to a disk in a
 | |
| > single SCSI transaction (unless you've got a really odd SCSI driver).
 | |
| > And for your disk, buy a Baydel or Clarion disk array, or something
 | |
| > similar.
 | |
| > 
 | |
| > Given that it's not hard to set up a system that meets these criteria,
 | |
| > and this is in fact commonly done for database servers, it would seem a
 | |
| > good idea for postgres to have the option to take advantage of the time
 | |
| > and money spent and adjust its performance upward appropriately.
 | |
| 
 | |
| I agree with this. My point was only that you need to know what
 | |
| guarantees your operating system/hardware combination provides on a
 | |
| case-by-case basis, and there is no standard way for a program to
 | |
| discover this. Most system administrators are not going to know this
 | |
| either, unless databases are their main responsibility.
 | |
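A minimal sketch of checking the one condition Curt names above, that the
filesystem block size is the same as or a multiple of the database block
size. The helper names and the 8K default are assumptions for illustration.
Which statvfs field matches what the kernel actually issues to the disk is
OS-specific, and the check says nothing about the driver or the device.

```python
import os


def fs_block_size(path):
    """Return the preferred block size of the filesystem holding `path`."""
    return os.statvfs(path).f_bsize


def blocks_line_up(fs_bsize, db_bsize=8192):
    """True if the filesystem block size equals the database block size
    or is a whole multiple of it (db_bsize=8192 is an assumed default)."""
    return fs_bsize >= db_bsize and fs_bsize % db_bsize == 0


if __name__ == "__main__":
    bsize = fs_block_size(".")
    print("filesystem block size:", bsize)
    print("lines up with 8K database blocks:", blocks_line_up(bsize))
```

A passing check is only a necessary condition. It does not show that a
block write reaches the device as a single command.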
| 
 | |
| > 
 | |
| > > All other systems rely on the fact that you can recover a damaged file
 | |
| > > using the log archive.
 | |
| > 
 | |
| > Not exactly. For MS SQL Server, at any rate, if it detects a page tear
 | |
| > you cannot restore based on the log file alone. You need a full or
 | |
| > partial backup that includes that entire torn block.
 | |
| > 
 | |
| 
 | |
| I should have been more specific: you need a backup of the file from
 | |
| some time ago, plus all the archived logs from then until the current
 | |
| log sequence number.
 | |
| 
 | |
| > > This means downtime in the rare case, but no data loss. Until
 | |
| > > PostgreSQL can do this, then it will not be acceptable for real
 | |
| > > critical production use.
 | |
| > 
 | |
| > It seems to me that it is doing this right now. In fact, it's more
 | |
| > reliable than some commercial systems (such as SQL Server) because it can
 | |
| > recover from a torn block with just the logfile.
 | |
| 
 | |
| Again, what I meant to say is that the commercial systems can recover
 | |
| with an old file backup + logs. How old the backup can be depends only
 | |
| on how much time you are willing to spend playing the logs forward. So
 | |
| if you do a full backup once a week, and multiplex and back up the logs,
 | |
| then even if a backup tape gets destroyed you can still survive. It just
 | |
| takes longer.
 | |
| 
 | |
| Also, PostgreSQL can't recover from any other type of block corruption,
 | |
| while the commercial systems can. That's what I meant by the "critical
 | |
| production use" comment, which was sort-of unfair.
 | |
| 
 | |
| So I would say they are equally reliable for torn pages (but not bad
 | |
| blocks), and the commercial systems let you trade potential recovery
 | |
| time for not having to write the blocks twice. You do need to back up
 | |
| the log archives though.
 | |
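The trade-off can be shown with a toy model, assuming log records are small
deltas tagged with a log sequence number (LSN) and each block remembers the
LSN of the last change it contains. The record layout and the function name
are hypothetical and do not describe any real system's format.

```python
def replay(backup, log):
    """Roll an old copy of the data forward using archived log records.

    backup: {block_no: (page_lsn, bytes)}
    log:    [(lsn, block_no, offset, data)] in ascending LSN order
    """
    blocks = {n: (lsn, bytearray(b)) for n, (lsn, b) in backup.items()}
    for lsn, block_no, offset, data in log:
        page_lsn, page = blocks[block_no]
        if lsn <= page_lsn:
            continue  # this change is already in the copy of the block
        page[offset:offset + len(data)] = data
        blocks[block_no] = (lsn, page)
    return {n: (lsn, bytes(p)) for n, (lsn, p) in blocks.items()}


log = [(1, 0, 0, b"XX"), (2, 0, 4, b"YY"), (3, 0, 6, b"ZZ")]
week_old = {0: (0, b"aaaaaaaa")}   # older backup: all three records replayed
recent = {0: (2, b"XXaaYYaa")}     # newer backup: only the last record replayed
```

Both backups roll forward to the same block. The older one just needs more
of the log, which is why the log archive itself has to be backed up.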
| 
 | |
| > 
 | |
| > > But at the end of the day, unless you have complete understanding of
 | |
| > > the I/O system from write(2) through to the disk system, the only sure
 | |
| > > ways to protect against partial writes are by "careful writes" (in
 | |
| > > the WAL log or elsewhere, writing pages twice), or by requiring (and
 | |
| > > allowing) users to do log-replay recovery when a file is corrupted by
 | |
| > > a partial write.
 | |
| > 
 | |
| > I don't understand how, without a copy of the old data that was in the
 | |
| > torn block, you can restore that block from just log file entries. Can
 | |
| > you explain this to me? Take, as an example, a block with ten tuples,
 | |
| > only one of which has been changed "recently." (I.e., only that change
 | |
| > is in the log files.)
 | |
| >
 | |
| > 
 | |
| > > If we log pages to WAL, they are useless when archived (after a
 | |
| > > checkpoint). So either we have a separate "log" for them (the
 | |
| > > ping-pong file), or we should at least remove them when archived,
 | |
| > > which makes log archiving more complex but is perfectly doable.
 | |
| > 
 | |
| > Right. That seems to me a better option, since we've now got only one
 | |
| > write point on the disk rather than two.
 | |
| 
 | |
| OK. I agree with this now.
 | |
| 
 | |
| > 
 | |
| > > Finally, I would love to hear why we are using the operating system
 | |
| > > buffer manager at all. The OS is acting as a secondary buffer manager
 | |
| > > for us. Why is that? What flaw in our I/O system does this reveal?
 | |
| > 
 | |
| > It's acting as a "second-level" buffer manager, yes, but to say it's
 | |
| > "secondary" may be a bit misleading. On most of the systems I've set
 | |
| > up, the OS buffer cache is doing the vast majority of the work, and the
 | |
| > postgres buffering is fairly minimal.
 | |
| > 
 | |
| > There are some good (and some perhaps not-so-good) reasons to do it this
 | |
| > way. I'll list them more or less in the order of best to worst:
 | |
| > 
 | |
| >     1. The OS knows where the blocks physically reside on disk, and
 | |
| >     postgres does not. Therefore it's in the interest of postgresql to
 | |
| >     dispatch write responsibility back to the OS as quickly as possible
 | |
| >     so that the OS can prioritize requests appropriately. Most operating
 | |
| >     systems use an "elevator" algorithm to minimize disk head movement;
 | |
| >     but if the OS does not have a block that it could write while the
 | |
| >     head is "on the way" to another request, it can't write it in that
 | |
| >     head pass.
 | |
| > 
 | |
| >     2. Postgres does not know about any "bank-switching" tricks for
 | |
| >     mapping more physical memory than it has address space. Thus, on
 | |
| >     32-bit machines, postgres might be limited to mapping 2 or 3 GB of
 | |
| >     memory, even though the machine has, say, 6 GB of physical RAM. The
 | |
| >     OS can use all of the available memory for caching; postgres cannot.
 | |
| > 
 | |
| >     3. A lot of work has been put into the seek algorithms, read-ahead
 | |
| >     algorithms, block allocation algorithms, etc. in the OS. Why
 | |
| >     duplicate all that work again in postgres?
 | |
| > 
 | |
| > When you say things like the following:
 | |
| > 
 | |
| > > We should only be writing blocks when they need to be on disk. We
 | |
| > > should not be expecting the OS to write them "sometime later" and
 | |
| > > avoid blocking (as long) for the write. If we need that, then our
 | |
| > > buffer management is wrong and we need to fix it.
 | |
| > 
 | |
| > you appear to be making the argument that we should take the route of
 | |
| > other database systems, and use raw devices and our own management of
 | |
| > disk block allocation. If so, you might want first to look back through
 | |
| > the archives at the discussion I and several others had about this a
 | |
| > month or two ago. After looking in detail at what NetBSD, at least, does
 | |
| > in terms of its disk I/O algorithms and buffering, I've pretty much come
 | |
| > around, at least for the moment, to the attitude that we should stick
 | |
| > with using the OS. I wouldn't mind seeing postgres be able to manage all
 | |
| > of this stuff, but it's a *lot* of work for not all that much benefit
 | |
| > that I can see.
 | |
| 
 | |
| I'll back off on that. I don't know if we want to use the OS buffer
 | |
| manager, but shouldn't we try to have our buffer manager group writes
 | |
| together by files, and pro-actively get them out to disk? Right now, it
 | |
| looks like all our write requests are delayed as long as possible and
 | |
| the order in which they are written is pretty much random, as is the
 | |
| backend that writes the block, so there is no locality of reference even
 | |
| when the blocks are adjacent on disk, and the write calls are spread out
 | |
| over all the backends.
 | |
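A rough sketch of what grouping writes by file could look like: collect the
dirty blocks, sort them by file and block number, and merge adjacent blocks
into runs. The function name and the (file, block) representation are
assumptions for illustration, not how the PostgreSQL buffer manager works.

```python
from itertools import groupby


def plan_writes(dirty):
    """Turn dirty (file_name, block_no) pairs into sorted runs.

    Returns a list of (file_name, start_block, n_blocks) tuples, one per
    run of adjacent blocks within the same file.
    """
    runs = []
    ordered = sorted(set(dirty))
    for file_name, group in groupby(ordered, key=lambda d: d[0]):
        start = prev = None
        for _, block_no in group:
            if start is None:
                start = prev = block_no
            elif block_no == prev + 1:
                prev = block_no  # extends the current run
            else:
                runs.append((file_name, start, prev - start + 1))
                start = prev = block_no
        runs.append((file_name, start, prev - start + 1))
    return runs


dirty = [("t1", 7), ("t2", 3), ("t1", 5), ("t1", 6), ("t2", 9), ("t1", 1)]
print(plan_writes(dirty))
```

Whether this beats handing the blocks to the OS in arbitrary order is the
open question in this thread. The sketch only shows the ordering step.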
| 
 | |
| Would it not be the case that things like read-ahead, grouping writes,
 | |
| and caching written data are probably best done by PostgreSQL, because
 | |
| only our buffer manager can understand when they will be useful or when
 | |
| they will thrash the cache?
 | |
| 
 | |
| I may well be wrong on this, and I haven't done any performance
 | |
| testing. I shouldn't have brought this up alongside the logging issues,
 | |
| but there seemed to be some question about whether the OS was actually
 | |
| doing all these things behind the scenes.
 | |
| 
 | |
| 
 | |
| > 
 | |
| > > The ORACLE people were not kidding when they said that they could not
 | |
| > > certify Linux for production use until it supported O_DSYNC. Can you
 | |
| > > explain why that was the case?
 | |
| > 
 | |
| > I'm suspecting it's because Linux at the time had no raw devices, so
 | |
| > O_DSYNC was the only other possible method of making sure that disk
 | |
| > writes actually got to disk.
 | |
| > 
 | |
| > You certainly don't want to use O_DSYNC if you can use another method,
 | |
| > because O_DSYNC still goes through the operating system's buffer
 | |
| > cache, wasting memory and double-caching things. If you're doing your
 | |
| > own management, you need either to use a raw device or open files with
 | |
| > the flag that indicates that the buffer cache should not be used at all
 | |
| > for reads from and writes to that file.
 | |
| 
 | |
| Would O_DSYNC|O_RSYNC turn off the cache? 
 | |
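For reference, a small sketch of opening a file with both flags. As far as
I understand the POSIX definitions, O_DSYNC and O_RSYNC only change when a
call is allowed to return; they do not ask the kernel to stop caching the
data. Bypassing the cache needs a raw device or a platform-specific flag
(O_DIRECT on Linux is one example). The helper name is hypothetical, and
both flags are looked up defensively because not every platform defines
them.

```python
import os


def open_synchronous(path):
    """Open `path` read/write with synchronized data writes where available."""
    flags = os.O_RDWR | os.O_CREAT
    # O_DSYNC: write() returns only once the OS reports the data as stored.
    flags |= getattr(os, "O_DSYNC", 0)
    # O_RSYNC: reads wait for pending writes that affect the data being read.
    flags |= getattr(os, "O_RSYNC", 0)
    return os.open(path, flags, 0o600)
```

Reads from such a descriptor can still be served from the OS buffer cache,
so this does not avoid the double caching Curt describes.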
| 
 | |
| > 
 | |
| > > However, this discussion and a search of the pgsql-hackers archives
 | |
| > > reveals this problem to be the KEY area of PostgreSQL's failing, and
 | |
| > > general misunderstanding, when compared to its commercial competitors.
 | |
| > 
 | |
| > No, I think it's just that you're under a few minor misapprehensions
 | |
| > here about what postgres and the OS are actually doing. As I said, I
 | |
| > went through this whole exact argument a month or two ago, on this very
 | |
| > list, and I came around to the idea that what postgres is doing now
 | |
| > works quite well, at least on NetBSD. (Most other OSes have disk I/O
 | |
| > algorithms that are pretty much as good or better.) There might be a
 | |
| > very slight advantage to doing all one's own I/O management, but it's
 | |
| > a huge amount of work, and I think that much effort could be much more
 | |
| > usefully applied to other areas.
 | |
| 
 | |
| I will look for that discussion in the archives.
 | |
| 
 | |
| The logging issue is a key one I think. At least I would be very nervous
 | |
| as a DBA if I were running a system where any damaged file would cause
 | |
| data loss.
 | |
| 
 | |
| Does anyone know what the major barriers to infinite log replay are in
 | |
| PostgreSQL? I'm trying to look for everything that might need to be
 | |
| changed outside xlog.c, but surely this has come up before. Searching
 | |
| the archives hasn't revealed much.
 | |
| 
 | |
| 
 | |
| 
 | |
| As to the I/O issue:
 | |
| 
 | |
| Since you know a lot about NetBSD internals, I'd be interested in
 | |
| hearing about what postgresql looks like to the NetBSD buffer manager.
 | |
| Am I right that strings of successive writes get randomized? What do our
 | |
| cache-hit percentages look like? I'm going to do some experimenting with
 | |
| this.
 | |
| 
 | |
| > 
 | |
| > Just as a side note, I've been a NetBSD developer since about '96,
 | |
| > and have been delving into the details of OS design since well before
 | |
| > that time, so I'm coming to this with what I hope is reasonably good
 | |
| > knowledge of how disks work and how operating systems use them. (Not
 | |
| > that this should stop you from pointing out holes in my arguments. :-))
 | |
| > 
 | |
| 
 | |
| This stuff is very difficult to get right. Glad to know you follow this
 | |
| list.
 | |
| 
 | |
| 
 | |
| > cjs
 | |
| > -- 
 | |
| > Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
| >     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| > 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002
 | |
| Return-path: <tgl@sss.pgh.pa.us>
 | |
| Received: from sss.pgh.pa.us (root@[192.204.191.242])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:16:05 -0400 (EDT)
 | |
| Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | |
| 	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379;
 | |
| 	Mon, 24 Jun 2002 17:16:02 -0400 (EDT)
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> 
 | |
| References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net> <1024951786.1793.865.camel@localhost.localdomain>
 | |
| Comments: In-reply-to "J. R. Nield" <jrnield@usol.com>
 | |
| 	message dated "24 Jun 2002 16:49:42 -0400"
 | |
| Date: Mon, 24 Jun 2002 17:16:01 -0400
 | |
| Message-ID: <21376.1024953361@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Status: OR
 | |
| 
 | |
| "J. R. Nield" <jrnield@usol.com> writes:
 | |
| > Also, PostgreSQL can't recover from any other type of block corruption,
 | |
| > while the commercial systems can.
 | |
| 
 | |
| Say again?
 | |
| 
 | |
| > Would it not be the case that things like read-ahead, grouping writes,
 | |
| > and caching written data are probably best done by PostgreSQL, because
 | |
| > only our buffer manager can understand when they will be useful or when
 | |
| > they will thrash the cache?
 | |
| 
 | |
| I think you have been missing the point.  No one denies that there will
 | |
| be some incremental gain if we do all that.  However, the conclusion of
 | |
| everyone who has thought much about it (and I see Curt has joined that
 | |
| group) is that the effort would be far out of proportion to the probable
 | |
| gain.  There are a lot of other things we desperately need to spend time
 | |
| on that would not amount to re-engineering large quantities of OS-level
 | |
| code.  Given that most Unixen have perfectly respectable disk management
 | |
| subsystems, we prefer to tune our code to make use of that stuff, rather
 | |
| than follow the "conventional wisdom" that databases need to bypass it.
 | |
| 
 | |
| Oracle can afford to do that sort of thing because they have umpteen
 | |
| thousand developers available.  Postgres does not.
 | |
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002
 | |
| Return-path: <pgsql-hackers-owner+M24128@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P21vF19918
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:01:57 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 540B8475B33; Mon, 24 Jun 2002 21:34:40 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 21:34:40 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 0A13F476965; Mon, 24 Jun 2002 19:30:14 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id B4F62476E4A
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 18:53:59 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 18:53:59 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 36043475BF6
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:25:28 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5OLPFG26140;
 | |
| 	Mon, 24 Jun 2002 17:25:15 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206242125.g5OLPFG26140@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| Date: Mon, 24 Jun 2002 17:25:14 -0400 (EDT)
 | |
| cc: Curt Sampson <cjs@cynic.net>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| J. R. Nield wrote:
 | |
| > > This I don't quite understand. Assuming you're using a SCSI drive
 | |
| > > (and this mostly applies to ATAPI/IDE, too), you can do naught but
 | |
| > > align block writes to the structure of the underlying device. When you
 | |
| > > initiate a SCSI WRITE command, you start by telling the device at which
 | |
| > > block to start writing and how many blocks you intend to write. Then you
 | |
| > > start passing the data.
 | |
| > > 
 | |
| > 
 | |
| > All I'm saying is that the entire postgresql block write must be
 | |
| > converted into exactly one SCSI write command in all cases, and I don't
 | |
| > know a portable way to ensure this. 
 | |
| 
 | |
| ...
 | |
| 
 | |
| > I agree with this. My point was only that you need to know what
 | |
| > guarantees your operating system/hardware combination provides on a
 | |
| > case-by-case basis, and there is no standard way for a program to
 | |
| > discover this. Most system administrators are not going to know this
 | |
| > either, unless databases are their main responsibility.
 | |
| 
 | |
| Yes, agreed.  <1% are going to know the answer to this question, so we
 | |
| have to assume worst case.
 | |
| 
 | |
| > > It seems to me that it is doing this right now. In fact, it's more
 | |
| > > reliable than some commercial systems (such as SQL Server) because it can
 | |
| > > recover from a torn block with just the logfile.
 | |
| > 
 | |
| > Again, what I meant to say is that the commercial systems can recover
 | |
| > with an old file backup + logs. How old the backup can be depends only
 | |
| > on how much time you are willing to spend playing the logs forward. So
 | |
| > if you do a full backup once a week, and multiplex and back up the logs,
 | |
| > then even if a backup tape gets destroyed you can still survive. It just
 | |
| > takes longer.
 | |
| > 
 | |
| > Also, PostgreSQL can't recover from any other type of block corruption,
 | |
| > while the commercial systems can. That's what I meant by the "critical
 | |
| > production use" comment, which was sort-of unfair.
 | |
| > 
 | |
| > So I would say they are equally reliable for torn pages (but not bad
 | |
| > blocks), and the commercial systems let you trade potential recovery
 | |
| > time for not having to write the blocks twice. You do need to back up
 | |
| > the log archives though.
 | |
| 
 | |
| Yes, good tradeoff analysis.  We recover from partial writes quicker,
 | |
| and don't require saving of log files, _but_ we don't recover from bad
 | |
| disk blocks.  Good summary.
 | |
| 
 | |
| > I'll back off on that. I don't know if we want to use the OS buffer
 | |
| > manager, but shouldn't we try to have our buffer manager group writes
 | |
| > together by files, and pro-actively get them out to disk? Right now, it
 | |
| > looks like all our write requests are delayed as long as possible and
 | |
| > the order in which they are written is pretty much random, as is the
 | |
| > backend that writes the block, so there is no locality of reference even
 | |
| > when the blocks are adjacent on disk, and the write calls are spread out
 | |
| > over all the backends.
 | |
| > 
 | |
| > Would it not be the case that things like read-ahead, grouping writes,
 | |
| > and caching written data are probably best done by PostgreSQL, because
 | |
| > only our buffer manager can understand when they will be useful or when
 | |
| > they will thrash the cache?
 | |
| 
 | |
| The OS should handle all of this.  We are doing main table writes but no
 | |
| sync until checkpoint, so the OS can keep those blocks around and write
 | |
| them at its convenience.  It knows the size of the buffer cache and when
 | |
| stuff is forced to disk.  We can't second-guess that.
 | |
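The pattern described here, plain writes with no sync until checkpoint, can
be sketched as follows. The class, its method names, and the 8K block size
are assumptions for illustration, not PostgreSQL's actual code.

```python
import os


class DataFile:
    """Toy table file: writes go to the OS, fsync happens only at checkpoint."""

    def __init__(self, path, block_size=8192):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        self.block_size = block_size
        self.unsynced = 0

    def write_block(self, block_no, data):
        # Hand the block to the OS. It may stay in the kernel's buffer
        # cache until the kernel decides to write it out.
        os.pwrite(self.fd, data, block_no * self.block_size)
        self.unsynced += 1

    def checkpoint(self):
        # Only at this point do we insist that earlier writes are on disk.
        os.fsync(self.fd)
        flushed, self.unsynced = self.unsynced, 0
        return flushed

    def close(self):
        os.close(self.fd)
```

Between checkpoints the kernel is free to order and batch the physical
writes however it likes.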
| 
 | |
| > I may well be wrong on this, and I haven't done any performance
 | |
| > testing. I shouldn't have brought this up alongside the logging issues,
 | |
| > but there seemed to be some question about whether the OS was actually
 | |
| > doing all these things behind the scenes.
 | |
| 
 | |
| It had better.  Looking at the kernel source is the way to know.
 | |
| 
 | |
| > Does anyone know what the major barriers to infinite log replay are in
 | |
| > PostgreSQL? I'm trying to look for everything that might need to be
 | |
| > changed outside xlog.c, but surely this has come up before. Searching
 | |
| > the archives hasn't revealed much.
 | |
| 
 | |
| This has been brought up.  Could we just save WAL files and get replay? 
 | |
| I believe some things have to be added to WAL to allow this, but it
 | |
| seems possible.  However, the pg_dump is just a data dump and does not
 | |
| have the file offsets and things.  Somehow you would need a tar-type
 | |
| backup of the database, and with a running db, it is hard to get a valid
 | |
| snapshot of that.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 3: if posting/reading through Usenet, please send an appropriate
 | |
| subscribe-nomail command to majordomo@postgresql.org so that your
 | |
| message can get through to the mailing list cleanly
 | |
| 
 | |
| 
 | |
| 
 | |
| From tgl@sss.pgh.pa.us Mon Jun 24 17:31:57 2002
 | |
| Return-path: <tgl@sss.pgh.pa.us>
 | |
| Received: from sss.pgh.pa.us (root@[192.204.191.242])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLVuF26684
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
 | |
| Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | |
| 	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLVu121485;
 | |
| 	Mon, 24 Jun 2002 17:31:56 -0400 (EDT)
 | |
| To: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <200206242125.g5OLPFG26140@candle.pha.pa.us> 
 | |
| References: <200206242125.g5OLPFG26140@candle.pha.pa.us>
 | |
| Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| 	message dated "Mon, 24 Jun 2002 17:25:14 -0400"
 | |
| Date: Mon, 24 Jun 2002 17:31:56 -0400
 | |
| Message-ID: <21482.1024954316@sss.pgh.pa.us>
 | |
| From: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Status: ORr
 | |
| 
 | |
| Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| >> Does anyone know what the major barriers to infinite log replay are in
 | |
| >> PostgreSQL? I'm trying to look for everything that might need to be
 | |
| >> changed outside xlog.c, but surely this has come up before. Searching
 | |
| >> the archives hasn't revealed much.
 | |
| 
 | |
| > This has been brought up.  Could we just save WAL files and get replay? 
 | |
| > I believe some things have to be added to WAL to allow this, but it
 | |
| > seems possible.
 | |
| 
 | |
| The Red Hat group has been looking at this somewhat; so far there seem
 | |
| to be some minor tweaks that would be needed, but no showstoppers.
 | |
| 
 | |
| > Somehow you would need a tar-type
 | |
| > backup of the database, and with a running db, it is hard to get a valid
 | |
| > snapshot of that.
 | |
| 
 | |
| But you don't *need* a "valid snapshot", only a correct copy of
 | |
| every block older than the first checkpoint in your WAL log series.
 | |
| Any inconsistencies in your tar dump will look like repairable damage;
 | |
| replaying the WAL log will fix 'em.
 | |
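A toy model of this argument, assuming for simplicity that every WAL record
after the checkpoint carries a full image of the block it changes. All names
here are hypothetical. The copy is taken block by block while writes keep
landing, so it mixes old and new versions of blocks, yet replay brings it to
the final state.

```python
import random


def copy_while_writing(live, writes, seed=0):
    """Copy blocks one at a time while `writes` land in between,
    like tar running against a live database. Mutates `live`."""
    rng = random.Random(seed)
    pending = list(writes)
    copy = {}
    for block_no in sorted(live):
        # A random number of writes sneak in before this block is copied.
        for _ in range(rng.randint(0, len(pending))):
            n, image = pending.pop(0)
            live[n] = image
        copy[block_no] = live[block_no]
    for n, image in pending:  # writes that land after the copy finished
        live[n] = image
    return copy


def replay_full_pages(copy, wal):
    """Apply every logged block image, in order, on top of the copy."""
    restored = dict(copy)
    for block_no, image in wal:
        restored[block_no] = image
    return restored


wal = [(0, "a1"), (2, "c1"), (0, "a2"), (1, "b1")]
live = {0: "a0", 1: "b0", 2: "c0", 3: "d0"}
backup = copy_while_writing(live, wal, seed=1)
print(replay_full_pages(backup, wal) == live)
```

Block 3 is never written after the checkpoint, so the copy of it must
already be correct. That is the only requirement placed on the backup.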
| 
 | |
| 			regards, tom lane
 | |
| 
 | |
| From pgsql-hackers-owner+M24133@postgresql.org Mon Jun 24 22:19:55 2002
 | |
| Return-path: <pgsql-hackers-owner+M24133@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P2JsF21543
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 22:19:54 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id 42391476E53; Mon, 24 Jun 2002 22:09:49 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 22:09:49 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 191654774EB; Mon, 24 Jun 2002 20:26:08 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 8EB90476101
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 19:43:19 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Mon Jun 24 19:43:19 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 08018476931
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 17:33:53 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5OLXhl26908;
 | |
| 	Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206242133.g5OLXhl26908@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <21482.1024954316@sss.pgh.pa.us>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| Date: Mon, 24 Jun 2002 17:33:43 -0400 (EDT)
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Curt Sampson <cjs@cynic.net>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| Tom Lane wrote:
 | |
| > Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | |
| > >> Does anyone know what the major barriers to infinite log replay are in
 | |
| > >> PostgreSQL? I'm trying to look for everything that might need to be
 | |
| > >> changed outside xlog.c, but surely this has come up before. Searching
 | |
| > >> the archives hasn't revealed much.
 | |
| > 
 | |
| > > This has been brought up.  Could we just save WAL files and get replay? 
 | |
| > > I believe some things have to be added to WAL to allow this, but it
 | |
| > > seems possible.
 | |
| > 
 | |
| > The Red Hat group has been looking at this somewhat; so far there seem
 | |
| > to be some minor tweaks that would be needed, but no showstoppers.
 | |
| 
 | |
| 
 | |
| Good.
 | |
| 
 | |
| > > Somehow you would need a tar-type
 | |
| > > backup of the database, and with a running db, it is hard to get a valid
 | |
| > > snapshot of that.
 | |
| > 
 | |
| > But you don't *need* a "valid snapshot", only a correct copy of
 | |
| > every block older than the first checkpoint in your WAL log series.
 | |
| > Any inconsistencies in your tar dump will look like repairable damage;
 | |
| > replaying the WAL log will fix 'em.
 | |
| 
 | |
| Yes, my point was that you need physical file backups, not pg_dump, and
 | |
| you have to be tricky about the files changing during the backup.  You
 | |
| _can_ work around changes to the files during backup.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24139@postgresql.org Tue Jun 25 00:00:22 2002
 | |
| Return-path: <pgsql-hackers-owner+M24139@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P40LF00838
 | |
| 	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 00:00:21 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id CBAE8476E94; Mon, 24 Jun 2002 23:44:51 -0400 (EDT)
 | |
| Mailbox-Line: From jrnield@usol.com  Mon Jun 24 23:44:51 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id C5076476871; Mon, 24 Jun 2002 22:25:46 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 8DF57476979
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 22:08:31 -0400 (EDT)
 | |
| Mailbox-Line: From jrnield@usol.com  Mon Jun 24 22:08:31 2002
 | |
| Received: from hades.usol.com (hades.usol.com [208.232.58.41])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 298D2476101
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 24 Jun 2002 20:27:46 -0400 (EDT)
 | |
| Received: from 08-159.024.popsite.net (08-159.024.popsite.net [66.19.4.159])
 | |
| 	by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5P0RbV01261;
 | |
| 	Mon, 24 Jun 2002 20:27:37 -0400
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| From: "J. R. Nield" <jrnield@usol.com>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: Curt Sampson <cjs@cynic.net>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
 | |
| References: <Pine.NEB.4.43.0206241150500.7326-100000@angelic.cynic.net>
 | |
| 	<1024951786.1793.865.camel@localhost.localdomain> 
 | |
| 	<21376.1024953361@sss.pgh.pa.us>
 | |
| Content-Type: text/plain
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) 
 | |
| Date: 24 Jun 2002 20:28:00 -0400
 | |
| Message-ID: <1024964884.3031.876.camel@localhost.localdomain>
 | |
| MIME-Version: 1.0
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| On Mon, 2002-06-24 at 17:16, Tom Lane wrote:
 | |
|  
 | |
| > I think you have been missing the point...  
 | |
| Yes, this appears to be the case. Thanks especially to Curt for clearing
 | |
| things up for me.
 | |
| 
 | |
| -- 
 | |
| J. R. Nield
 | |
| jrnield@usol.com
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 5: Have you checked our extensive FAQ?
 | |
| 
 | |
| http://www.postgresql.org/users-lounge/docs/faq.html
 | |
| 
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Mon Jun 24 23:32:23 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net ([63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287
 | |
| 	for <pgman@candle.pha.pa.us>; Mon, 24 Jun 2002 23:32:23 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC)
 | |
| Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: "J. R. Nield" <jrnield@usol.com>
 | |
| cc: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain>
 | |
| Message-ID: <Pine.NEB.4.43.0206251229010.17448-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: OR
 | |
| 
 | |
| On 24 Jun 2002, J. R. Nield wrote:
 | |
| 
 | |
| > All I'm saying is that the entire postgresql block write must be
 | |
| > converted into exactly one SCSI write command in all cases, and I don't
 | |
| > know a portable way to ensure this.
 | |
| 
 | |
| No, there's no portable way. All you can do is give the admin who
 | |
| is able to set things up safely the ability to turn off the now-unneeded
 | |
| (and expensive) safety-related stuff that postgres does.
 | |
| 
 | |
| > I agree with this. My point was only that you need to know what
 | |
| > guarantees your operating system/hardware combination provides on a
 | |
| > case-by-case basis, and there is no standard way for a program to
 | |
| > discover this. Most system administrators are not going to know this
 | |
| > either, unless databases are their main responsibility.
 | |
| 
 | |
| Certainly this is true of pretty much every database system out there.
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From cjs@cynic.net Tue Jun 25 01:09:02 2002
 | |
| Return-path: <cjs@cynic.net>
 | |
| Received: from academic.cynic.net (academic.cynic.net [63.144.177.3])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P591F07292
 | |
| 	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 01:09:01 -0400 (EDT)
 | |
| Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224])
 | |
| 	by academic.cynic.net (Postfix) with ESMTP
 | |
| 	id 517BEF820; Tue, 25 Jun 2002 05:09:02 +0000 (UTC)
 | |
| Date: Tue, 25 Jun 2002 14:08:59 +0900 (JST)
 | |
| From: Curt Sampson <cjs@cynic.net>
 | |
| To: Tom Lane <tgl@sss.pgh.pa.us>
 | |
| cc: "J. R. Nield" <jrnield@usol.com>, Bruce Momjian <pgman@candle.pha.pa.us>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE 
 | |
| In-Reply-To: <21376.1024953361@sss.pgh.pa.us>
 | |
| Message-ID: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: TEXT/PLAIN; charset=US-ASCII
 | |
| Status: ORr
 | |
| 
 | |
| On Mon, 24 Jun 2002, Tom Lane wrote:
 | |
| 
 | |
| > There are a lot of other things we desperately need to spend time
 | |
| > on that would not amount to re-engineering large quantities of OS-level
 | |
| > code.  Given that most Unixen have perfectly respectable disk management
 | |
| > subsystems, we prefer to tune our code to make use of that stuff, rather
 | |
| > than follow the "conventional wisdom" that databases need to bypass it.
 | |
| > ...
 | |
| > Oracle can afford to do that sort of thing because they have umpteen
 | |
| > thousand developers available.  Postgres does not.
 | |
| 
 | |
| Well, Oracle also started out, a long long time ago, on systems without
 | |
| unified buffer cache and so on, and so they *had* to write this stuff
 | |
| because otherwise data would not be cached. So Oracle can also afford to
 | |
| maintain it now because the code already exists.
 | |
| 
 | |
| cjs
 | |
| -- 
 | |
| Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
 | |
|     Don't you know, in this new Dark Age, we're all light.  --XTC
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M24154@postgresql.org Tue Jun 25 09:22:38 2002
 | |
| Return-path: <pgsql-hackers-owner+M24154@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDMbF03932
 | |
| 	for <pgman@candle.pha.pa.us>; Tue, 25 Jun 2002 09:22:37 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP
 | |
| 	id C12C3475E4A; Tue, 25 Jun 2002 09:22:32 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Tue Jun 25 09:22:32 2002
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 65471475C7A; Tue, 25 Jun 2002 09:22:23 -0400 (EDT)
 | |
| Received: from localhost.localdomain (postgresql.org [64.49.215.8])
 | |
| 	by localhost (Postfix) with ESMTP id 97C8C475A7C
 | |
| 	for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:20 -0400 (EDT)
 | |
| Mailbox-Line: From pgman@candle.pha.pa.us  Tue Jun 25 09:22:20 2002
 | |
| Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 42C0B475A64
 | |
| 	for <pgsql-hackers@postgresql.org>; Tue, 25 Jun 2002 09:22:19 -0400 (EDT)
 | |
| Received: (from pgman@localhost)
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) id g5PDM5B03772;
 | |
| 	Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
 | |
| From: Bruce Momjian <pgman@candle.pha.pa.us>
 | |
| Message-ID: <200206251322.g5PDM5B03772@candle.pha.pa.us>
 | |
| Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE
 | |
| In-Reply-To: <Pine.NEB.4.43.0206251406390.17448-100000@angelic.cynic.net>
 | |
| To: Curt Sampson <cjs@cynic.net>
 | |
| Date: Tue, 25 Jun 2002 09:22:05 -0400 (EDT)
 | |
| cc: Tom Lane <tgl@sss.pgh.pa.us>, "J. R. Nield" <jrnield@usol.com>,
 | |
|    PostgreSQL Hacker <pgsql-hackers@postgresql.org>
 | |
| X-Mailer: ELM [version 2.4ME+ PL97 (25)]
 | |
| MIME-Version: 1.0
 | |
| Content-Transfer-Encoding: 7bit
 | |
| Content-Type: text/plain; charset=US-ASCII
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Spam-Status: No, hits=-3.4 required=5.0
 | |
| 	tests=IN_REP_TO
 | |
| 	version=2.30
 | |
| Status: OR
 | |
| 
 | |
| Curt Sampson wrote:
 | |
| > On Mon, 24 Jun 2002, Tom Lane wrote:
 | |
| > 
 | |
| > > There are a lot of other things we desperately need to spend time
 | |
| > > on that would not amount to re-engineering large quantities of OS-level
 | |
| > > code.  Given that most Unixen have perfectly respectable disk management
 | |
| > > subsystems, we prefer to tune our code to make use of that stuff, rather
 | |
| > > than follow the "conventional wisdom" that databases need to bypass it.
 | |
| > > ...
 | |
| > > Oracle can afford to do that sort of thing because they have umpteen
 | |
| > > thousand developers available.  Postgres does not.
 | |
| > 
 | |
| > Well, Oracle also started out, a long long time ago, on systems without
 | |
| > unified buffer cache and so on, and so they *had* to write this stuff
 | |
| > because otherwise data would not be cached. So Oracle can also afford to
 | |
| > maintain it now because the code already exists.
 | |
| 
 | |
| Well, actually, it isn't unified buffer cache that is the issue, but
 | |
| rather the older SysV file system had pretty poor performance so
 | |
| bypassing it was a bigger win than it is today.
 | |
| 
 | |
| -- 
 | |
|   Bruce Momjian                        |  http://candle.pha.pa.us
 | |
|   pgman@candle.pha.pa.us               |  (610) 853-3000
 | |
|   +  If your life is a hard drive,     |  830 Blythe Avenue
 | |
|   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | |
| 
 | |
| 
 | |
| 
 | |
| ---------------------------(end of broadcast)---------------------------
 | |
| TIP 4: Don't 'kill -9' the postmaster
 | |
| 
 | |
| 
 | |
| 
 | |
| From pgsql-hackers-owner+M31893@postgresql.org Fri Nov 15 11:25:58 2002
 | |
| Return-path: <pgsql-hackers-owner+M31893@postgresql.org>
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id gAFHPvR10276
 | |
| 	for <pgman@candle.pha.pa.us>; Fri, 15 Nov 2002 12:25:57 -0500 (EST)
 | |
| Received: from localhost (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with ESMTP
 | |
| 	id A2D5A4774A1; Fri, 15 Nov 2002 11:34:54 -0500 (EST)
 | |
| Received: from postgresql.org (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with SMTP
 | |
| 	id 5E898477132; Fri, 15 Nov 2002 11:15:45 -0500 (EST)
 | |
| Received: from localhost (postgresql.org [64.49.215.8])
 | |
| 	by postgresql.org (Postfix) with ESMTP id 90CF1475B85
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:47 -0500 (EST)
 | |
| Received: from Curtis-Vaio (unknown [63.164.0.45])
 | |
| 	by postgresql.org (Postfix) with SMTP id C6CB1475A3F
 | |
| 	for <pgsql-hackers@postgresql.org>; Mon, 11 Nov 2002 15:33:46 -0500 (EST)
 | |
| Received: from [127.0.0.1] by Curtis-Vaio
 | |
|   (ArGoSoft Mail Server Freeware, Version 1.8 (1.8.1.7)); Mon, 11 Nov 2002 16:33:42 -0400
 | |
| From: "Curtis Faith" <curtis@galtcapital.com>
 | |
| To: <pgsql-hackers@postgresql.org>
 | |
| Subject: [HACKERS] 500 tpsQL + WAL log implementation
 | |
| Date: Mon, 11 Nov 2002 16:33:41 -0400
 | |
| Message-ID: <DMEEJMCDOJAKPPFACMPMCEBMCFAA.curtis@galtcapital.com>
 | |
| MIME-Version: 1.0
 | |
| Content-Type: text/plain;
 | |
| 	charset="iso-8859-1"
 | |
| Content-Transfer-Encoding: 7bit
 | |
| X-Priority: 3 (Normal)
 | |
| X-MSMail-Priority: Normal
 | |
| X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
 | |
| X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
 | |
| Importance: Normal
 | |
| X-Virus-Scanned: by AMaViS new-20020517
 | |
| Precedence: bulk
 | |
| Sender: pgsql-hackers-owner@postgresql.org
 | |
| X-Virus-Scanned: by AMaViS new-20020517
 | |
| Status: ORr
 | |
| 
 | |
| I have been experimenting with empirical tests of file system and device
 | |
| level writes to determine the actual constraints in order to speed up the WAL
 | |
| logging code.
 | |
| 
 | |
| Using a raw file partition and a time-based technique for determining the
 | |
| optimal write position, I am able to get 8K writes physically written to disk
 | |
| synchronously in the range of 500 to 650 writes per second using FreeBSD raw
 | |
| device partitions on IDE disks (with write cache disabled).  I will be
 | |
| testing it soon under Linux with 10,000 RPM SCSI, which should be even better.
 | |
| It is my belief that the mechanism used to achieve these speeds could be
 | |
| incorporated into the existing WAL logging code as an abstraction that looks
 | |
| to the WAL code just like the file-level access currently used. The current
 | |
| speeds are limited to one write per disk rotation. For a 7,200 RPM
 | |
| disk this is 120/second; for a 10,000 RPM disk this is 166.66/second.
 | |
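The per-rotation limits quoted above follow directly from the spindle speed. A minimal sketch of that arithmetic, with hypothetical function names:

```python
def rotations_per_second(rpm):
    """Upper bound on writes per second if each write waits one full rotation."""
    return rpm / 60.0


def rotation_time_us(rpm):
    """Duration of one full rotation in microseconds."""
    return 60_000_000 / rpm
```

For 7,200 RPM this gives 120 rotations per second and a rotation time of roughly 8,333 us, which is the figure used in the assumptions that follow.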
| 
 | |
| The mechanism works by adjusting the seek offset of the write by using
 | |
| gettimeofday to determine approximately where the disk head is in its
 | |
| rotation. The mechanism does not use any AIO calls.
 | |
| 
 | |
| Assuming the following:
 | |
| 
 | |
| 1) Disk rotation time is 8.333ms or 8333us (7200 RPM).
 | |
| 
 | |
| 2) A write at offset 1,500K completes at system time 103s 000ms 000us
 | |
| 
 | |
| 3) A new write is requested at system time 103s 004ms 166us
 | |
| 
 | |
| 4) A 390K per rotation alignment of the data on the disk.
 | |
| 
 | |
| 5) A write must be sent at least 20K ahead of the current head position to
 | |
| ensure that it is written in less than one rotation.
 | |
| 
 | |
| It can be determined from the above that a write for an offset of something
 | |
| slightly more than 195K past the last write, or offset 1,695K, will be ahead
 | |
| of the current location of the head and will therefore complete in less than
 | |
| a single rotation's time.
 | |
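The calculation in the five assumptions above can be sketched as follows. This is an illustration, not code from the proposal: the function names and the choice of kilobytes as the unit are hypothetical, and it ignores track boundaries and seek time. It also adds the 20K lead from assumption 5 on top of the estimated head position, which is one reading of the text.

```python
def estimate_head_offset_kb(last_offset_kb, last_done_us, now_us,
                            rotation_us=8333, kb_per_rotation=390):
    """Estimate where the head is, given when the previous write completed."""
    elapsed_us = now_us - last_done_us
    # Only the fraction of the current rotation matters; whole rotations
    # bring the head back to the same angular position.
    fraction = (elapsed_us % rotation_us) / rotation_us
    return last_offset_kb + fraction * kb_per_rotation


def next_write_offset_kb(head_offset_kb, lead_kb=20):
    """Pick an offset far enough ahead of the head to be written this rotation."""
    return head_offset_kb + lead_kb
```

With the numbers from the message (last write at 1,500K completing at 103.000000 s, new request at 103.004166 s), the estimated head position is about 1,695K, and adding the 20K lead gives a target of about 1,715K.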
| 
 | |
| The disk specific metrics (rotation speed, bytes per rotation, base write
 | |
| time, etc.) can be derived empirically through a tester program that would
 | |
| take a few minutes to run and which could be run at log setup time.
 | |
| 
 | |
| The obvious problem with the above mechanism is that the WAL log needs to be
 | |
| able to read from the log file in transaction order during recovery. This
 | |
| could be provided for using an abstraction that prepends the logical order
 | |
| for each block written to the disk and makes sure that the log blocks contain
 | |
| either a valid logical order number or some other marker indicating that the
 | |
| block is not being used.
 | |
| 
 | |
| A bitmap of blocks that have already been used would be kept in memory for
 | |
| quickly determining the next set of possible unused blocks but this bitmap
 | |
| would not need to be written to disk except during normal shutdown since in
 | |
| the event of a failure the bitmaps would be reconstructed by reading all the
 | |
| blocks from the disk.
 | |
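The recovery scheme described in the two paragraphs above can be sketched with a small in-memory model. Everything here is hypothetical: the "disk" is a Python list of slots, `UNUSED` stands in for the unused-block marker, and a real implementation would store the sequence number in a block header and validate it.

```python
UNUSED = None  # marker for a block that holds no log data


def write_block(disk, used, slot, seq, payload):
    """Write a log block to whichever free slot is rotationally convenient."""
    disk[slot] = (seq, payload)  # logical order number is stored with the block
    used[slot] = True            # in-memory bitmap only; never written to disk


def recover(disk):
    """Rebuild the bitmap and the logical log order by scanning every block."""
    used = [block is not UNUSED for block in disk]
    blocks = sorted((b for b in disk if b is not UNUSED), key=lambda b: b[0])
    return used, [payload for _, payload in blocks]
```

Blocks written to slots 5, 2, and 7 with sequence numbers 1, 2, and 3 come back in sequence order after `recover`, regardless of their physical position, and the bitmap is rebuilt without ever having been persisted.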
| 
 | |
| Checkpointing and something akin to log rotation could be handled using this
 | |
| mechanism as well.
 | |
| 
 | |
| So, MY REAL QUESTION is whether or not this is the sort of speed improvement
 | |
| that warrants the work of writing the required abstraction layer and making
 | |
| this very robust. The WAL code should remain essentially unchanged, with
 | |
| perhaps new calls for the five or six routines used to access the log files,
 | |
| and handle the equivalent of log rotation for raw device access. These new
 | |
| calls would either use the current file based implementation or the new
 | |
| logging mechanism depending on the configuration.
 | |
| 
 | |
| I anticipate that the extra work required for a PostgreSQL administrator to
 | |
| use the proposed logging mechanism would be to:
 | |
| 
 | |
| 1) Create a raw device partition of the appropriate size
 | |
| 2) Run the metrics tester for that device partition
 | |
| 3) Set the appropriate configuration parameters to indicate raw WAL logging
 | |
| 
 | |
| I anticipate that the additional space requirements for this system would be
 | |
| on the order of 10% to 15% beyond the current file-based implementation's
 | |
| requirements.
 | |
| 
 | |
| So, is this worth doing? Would a robust implementation likely be accepted for
 | |
| 7.4 assuming it can demonstrate speed improvements in the range of 500tps?
 | |
| 
 | |
| - Curtis
 | |
| 
 | |
| 
 | |
| 
 |