mirror of
				https://github.com/postgres/postgres.git
				synced 2025-11-03 09:13:20 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			485 lines
		
	
	
		
			22 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			485 lines
		
	
	
		
			22 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
From pgsql-hackers-owner+M5149@postgresql.org Mon Feb 26 03:32:49 2001
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA04497
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 26 Feb 2001 03:32:48 -0500 (EST)
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f1Q8TSx48319;
 | 
						|
	Mon, 26 Feb 2001 03:29:28 -0500 (EST)
 | 
						|
	(envelope-from pgsql-hackers-owner+M5149@postgresql.org)
 | 
						|
Received: from store.d.zembu.com (nat.zembu.com [209.128.96.253])
 | 
						|
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f1Q8LPx47243
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Mon, 26 Feb 2001 03:21:25 -0500 (EST)
 | 
						|
	(envelope-from ncm@zembu.com)
 | 
						|
Received: by store.d.zembu.com (Postfix, from userid 509)
 | 
						|
	id 58E39A782; Mon, 26 Feb 2001 00:21:25 -0800 (PST)
 | 
						|
Date: Mon, 26 Feb 2001 00:21:25 -0800
 | 
						|
To: pgsql-hackers@postgresql.org
 | 
						|
Subject: Re: [HACKERS] Re: [PATCHES] A patch for xlog.c
 | 
						|
Message-ID: <20010226002125.A2430@store.zembu.com>
 | 
						|
Reply-To: pgsql-hackers@postgresql.org
 | 
						|
References: <200102260200.VAA17397@candle.pha.pa.us> <22318.983161726@sss.pgh.pa.us>
 | 
						|
Mime-Version: 1.0
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Disposition: inline
 | 
						|
User-Agent: Mutt/1.2.5i
 | 
						|
In-Reply-To: <22318.983161726@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Sun, Feb 25, 2001 at 11:28:46PM -0500
 | 
						|
From: ncm@zembu.com (Nathan Myers)
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@postgresql.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane wrote:
 | 
						|
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
 | 
						|
> > It allows no backing store on disk.  
 | 
						|
 | 
						|
I.e. it allows you to map memory without an associated inode; the memory
 | 
						|
may still be swapped.  Of course, there is no problem with mapping an 
 | 
						|
inode too, so that unrelated processes can join in.  Solarix has a flag
 | 
						|
to pin the shared pages in RAM so they can't be swapped out.
 | 
						|
 | 
						|
> > It is the BSD solution to SysV
 | 
						|
> > share memory.  Here are all the BSDi flags:
 | 
						|
> 
 | 
						|
> >      MAP_ANON    Map anonymous memory not associated with any specific
 | 
						|
> >                  file.  The file descriptor used for creating MAP_ANON
 | 
						|
> >                  must be -1.  The offset parameter is ignored.
 | 
						|
> 
 | 
						|
> Hmm.  Now that I read down to the "nonstandard extensions" part of the
 | 
						|
> HPUX man page for mmap(), I find
 | 
						|
> 
 | 
						|
>      If MAP_ANONYMOUS is set in flags:
 | 
						|
> 
 | 
						|
>           o    A new memory region is created and initialized to all zeros.
 | 
						|
>                This memory region can be shared only with descendants of
 | 
						|
>                the current process.
 | 
						|
 | 
						|
This is supported on Linux and BSD, but not on Solarix 7.  It's not 
 | 
						|
necessary; you can just map /dev/zero on SysV systems that don't 
 | 
						|
have MAP_ANON.
 | 
						|
 | 
						|
> While I've said before that I don't think it's really necessary for
 | 
						|
> processes that aren't children of the postmaster to access the shared
 | 
						|
> memory, I'm not sure that I want to go over to a mechanism that makes it
 | 
						|
> *impossible* for that to be done.  Especially not if the only motivation
 | 
						|
> is to avoid having to configure the kernel's shared memory settings.
 | 
						|
 | 
						|
There are enormous advantages to avoiding the need to configure kernel 
 | 
						|
settings.  It makes PG a better citizen.  PG is much easier to drop in 
 | 
						|
and use if you don't need attention from the IT department.
 | 
						|
 | 
						|
But I don't know of any reason to avoid mapping an actual inode,
 | 
						|
so using mmap doesn't necessarily mean giving up sharing among
 | 
						|
unrelated processes.
 | 
						|
 | 
						|
> Besides, what makes you think there's not a limit on the size of shmem
 | 
						|
> allocatable via mmap()?
 | 
						|
 | 
						|
I've never seen any mmap limit documented.  Since mmap() is how 
 | 
						|
everybody implements shared libraries, such a limit would be equivalent 
 | 
						|
to a limit on how much/many shared libraries are used.  mmap() with 
 | 
						|
MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern 
 | 
						|
way to get raw storage for malloc(), so such a limit would be a limit
 | 
						|
on malloc() too.
 | 
						|
 | 
						|
The mmap architecture comes to us from the Mach microkernel memory
 | 
						|
manager, backported into BSD and then copied widely.  Since it was
 | 
						|
the fundamental mechanism for all memory operations in Mach, arbitrary
 | 
						|
limits would make no sense.  That it worked so well is the reason it 
 | 
						|
was copied everywhere else, so adding arbitrary limits while copying 
 | 
						|
it would be silly.  I don't think we'll see any systems like that.
 | 
						|
 | 
						|
Nathan Myers
 | 
						|
ncm@zembu.com
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6138@postgresql.org Mon Mar 19 07:57:59 2001
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id HAA26926
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 19 Mar 2001 07:57:59 -0500 (EST)
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f2JCug641835;
 | 
						|
	Mon, 19 Mar 2001 07:56:42 -0500 (EST)
 | 
						|
	(envelope-from pgsql-hackers-owner+M6138@postgresql.org)
 | 
						|
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
 | 
						|
	by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f2JCt7641684
 | 
						|
	for <pgsql-hackers@postgresql.org>; Mon, 19 Mar 2001 07:55:07 -0500 (EST)
 | 
						|
	(envelope-from bright@fw.wintelcom.net)
 | 
						|
Received: (from bright@localhost)
 | 
						|
	by fw.wintelcom.net (8.10.0/8.10.0) id f2JCt2325289;
 | 
						|
	Mon, 19 Mar 2001 04:55:02 -0800 (PST)
 | 
						|
Date: Mon, 19 Mar 2001 04:55:01 -0800
 | 
						|
From: Alfred Perlstein <bright@wintelcom.net>
 | 
						|
To: Rod Taylor <rod.taylor@inquent.com>
 | 
						|
Cc: Hackers List <pgsql-hackers@postgresql.org>
 | 
						|
Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
 | 
						|
Message-ID: <20010319045500.T29888@fw.wintelcom.net>
 | 
						|
References: <018301c0b070$16049a40$2205010a@jester>
 | 
						|
Mime-Version: 1.0
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Disposition: inline
 | 
						|
User-Agent: Mutt/1.2.5i
 | 
						|
In-Reply-To: <018301c0b070$16049a40$2205010a@jester>; from rod.taylor@inquent.com on Mon, Mar 19, 2001 at 07:28:21AM -0500
 | 
						|
X-all-your-base: are belong to us.
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@postgresql.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
WOOT WOOT! DANGER WILL ROBINSON!
 | 
						|
 | 
						|
> ----- Original Message -----
 | 
						|
> From: "Christian Weisgerber" <naddy@mips.inka.de>
 | 
						|
> Newsgroups: list.vorbis.dev
 | 
						|
> To: <vorbis-dev@xiph.org>
 | 
						|
> Sent: Saturday, March 17, 2001 12:01 PM
 | 
						|
> Subject: [vorbis-dev] ogg123: shared memory by mmap()
 | 
						|
> 
 | 
						|
> 
 | 
						|
> > The patch below adds:
 | 
						|
> >
 | 
						|
> > - acinclude.m4:  A new macro A_FUNC_SMMAP to check that sharing
 | 
						|
> pages
 | 
						|
> >   through mmap() works.  This is taken from Joerg Schilling's star.
 | 
						|
> > - configure.in:  A_FUNC_SMMAP
 | 
						|
> > - ogg123/buffer.c:  If we have a working mmap(), use it to create
 | 
						|
> >   a region of shared memory instead of using System V IPC.
 | 
						|
> >
 | 
						|
> > Works on BSD.  Should also work on SVR4 and offspring (Solaris),
 | 
						|
> > and Linux.
 | 
						|
 | 
						|
This is a really bad idea performance wise.  Solaris has a special
 | 
						|
code path for SYSV shared memory that doesn't require tons of swap
 | 
						|
tracking structures per-page/per-process.  FreeBSD also has this
 | 
						|
optimization (it's off by default, but should work since FreeBSD
 | 
						|
4.2 via the sysctl kern.ipc.shm_use_phys=1)
 | 
						|
 | 
						|
Both OS's use a trick of making the pages non-pageable, this allows
 | 
						|
signifigant savings in kernel space required for each attached
 | 
						|
process, as well as the use of large pages which reduce the amount
 | 
						|
of TLB faults your processes will incurr.
 | 
						|
 | 
						|
Anyhow, if you could make this a runtime option it wouldn't be so
 | 
						|
evil, but as a compile time option, it's a really bad idea for
 | 
						|
Solaris and FreeBSD.
 | 
						|
 | 
						|
--
 | 
						|
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | 
						|
 | 
						|
---------------------------(end of broadcast)---------------------------
 | 
						|
TIP 2: you can get off all lists at once with the unregister command
 | 
						|
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
 | 
						|
 | 
						|
From pgsql-hackers-owner+M6255@postgresql.org Tue Mar 20 18:46:33 2001
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02887
 | 
						|
	for <pgman@candle.pha.pa.us>; Tue, 20 Mar 2001 18:46:33 -0500 (EST)
 | 
						|
Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by mail.postgresql.org (8.11.3/8.11.1) with SMTP id f2KNjtH22390;
 | 
						|
	Tue, 20 Mar 2001 18:45:55 -0500 (EST)
 | 
						|
	(envelope-from pgsql-hackers-owner+M6255@postgresql.org)
 | 
						|
Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20])
 | 
						|
	by mail.postgresql.org (8.11.3/8.11.1) with ESMTP id f2KNiFH22033
 | 
						|
	for <pgsql-hackers@postgresql.org>; Tue, 20 Mar 2001 18:44:15 -0500 (EST)
 | 
						|
	(envelope-from bright@fw.wintelcom.net)
 | 
						|
Received: (from bright@localhost)
 | 
						|
	by fw.wintelcom.net (8.10.0/8.10.0) id f2KNiAW02417;
 | 
						|
	Tue, 20 Mar 2001 15:44:10 -0800 (PST)
 | 
						|
Date: Tue, 20 Mar 2001 15:44:10 -0800
 | 
						|
From: Alfred Perlstein <bright@wintelcom.net>
 | 
						|
To: Bruce Momjian <pgman@candle.pha.pa.us>
 | 
						|
Cc: Rod Taylor <rod.taylor@inquent.com>,
 | 
						|
        Hackers List <pgsql-hackers@postgresql.org>
 | 
						|
Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap()
 | 
						|
Message-ID: <20010320154410.H29888@fw.wintelcom.net>
 | 
						|
References: <20010319045500.T29888@fw.wintelcom.net> <200103202210.RAA23981@candle.pha.pa.us>
 | 
						|
Mime-Version: 1.0
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Disposition: inline
 | 
						|
User-Agent: Mutt/1.2.5i
 | 
						|
In-Reply-To: <200103202210.RAA23981@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Tue, Mar 20, 2001 at 05:10:33PM -0500
 | 
						|
X-all-your-base: are belong to us.
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@postgresql.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
* Bruce Momjian <pgman@candle.pha.pa.us> [010320 14:10] wrote:
 | 
						|
> > > > The patch below adds:
 | 
						|
> > > >
 | 
						|
> > > > - acinclude.m4:  A new macro A_FUNC_SMMAP to check that sharing
 | 
						|
> > > pages
 | 
						|
> > > >   through mmap() works.  This is taken from Joerg Schilling's star.
 | 
						|
> > > > - configure.in:  A_FUNC_SMMAP
 | 
						|
> > > > - ogg123/buffer.c:  If we have a working mmap(), use it to create
 | 
						|
> > > >   a region of shared memory instead of using System V IPC.
 | 
						|
> > > >
 | 
						|
> > > > Works on BSD.  Should also work on SVR4 and offspring (Solaris),
 | 
						|
> > > > and Linux.
 | 
						|
> > 
 | 
						|
> > This is a really bad idea performance wise.  Solaris has a special
 | 
						|
> > code path for SYSV shared memory that doesn't require tons of swap
 | 
						|
> > tracking structures per-page/per-process.  FreeBSD also has this
 | 
						|
> > optimization (it's off by default, but should work since FreeBSD
 | 
						|
> > 4.2 via the sysctl kern.ipc.shm_use_phys=1)
 | 
						|
> 
 | 
						|
> > 
 | 
						|
> > Both OS's use a trick of making the pages non-pageable, this allows
 | 
						|
> > signifigant savings in kernel space required for each attached
 | 
						|
> > process, as well as the use of large pages which reduce the amount
 | 
						|
> > of TLB faults your processes will incurr.
 | 
						|
> 
 | 
						|
> That is interesting.  BSDi has SysV shared memory as non-pagable, and I
 | 
						|
> always thought of that as a bug.  Seems you are saying that having it
 | 
						|
> pagable has a significant performance penalty.  Interesting.
 | 
						|
 | 
						|
Yes, having it pageable is actually sort of bad.
 | 
						|
 | 
						|
It doesn't allow you to do several important optimizations.
 | 
						|
 | 
						|
-- 
 | 
						|
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
 | 
						|
 | 
						|
 | 
						|
---------------------------(end of broadcast)---------------------------
 | 
						|
TIP 4: Don't 'kill -9' the postmaster
 | 
						|
 | 
						|
From pgsql-general-owner+M14300@postgresql.org Mon Aug 27 13:07:32 2001
 | 
						|
Return-path: <pgsql-general-owner+M14300@postgresql.org>
 | 
						|
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238])
 | 
						|
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RH7VF04800
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 27 Aug 2001 13:07:31 -0400 (EDT)
 | 
						|
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RH7Tq17721;
 | 
						|
	Mon, 27 Aug 2001 12:07:29 -0500 (CDT)
 | 
						|
	(envelope-from pgsql-general-owner+M14300@postgresql.org)
 | 
						|
Received: from svana.org (svana.org [210.9.66.30])
 | 
						|
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RFE1f13269
 | 
						|
	for <pgsql-general@postgresql.org>; Mon, 27 Aug 2001 11:14:01 -0400 (EDT)
 | 
						|
	(envelope-from kleptog@svana.org)
 | 
						|
Received: from kleptog by svana.org with local (Exim 3.12 #1 (Debian))
 | 
						|
	id 15bO5x-0000Fd-00; Tue, 28 Aug 2001 01:14:33 +1000
 | 
						|
Date: Tue, 28 Aug 2001 01:14:33 +1000
 | 
						|
From: Martijn van Oosterhout <kleptog@svana.org>
 | 
						|
To: Andrew Snow <andrew@modulus.org>
 | 
						|
cc: pgsql-general@postgresql.org
 | 
						|
Subject: Re: [GENERAL] raw partition
 | 
						|
Message-ID: <20010828011433.E32309@svana.org>
 | 
						|
Reply-To: Martijn van Oosterhout <kleptog@svana.org>
 | 
						|
References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon>
 | 
						|
MIME-Version: 1.0
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Disposition: inline
 | 
						|
User-Agent: Mutt/1.2.5i
 | 
						|
In-Reply-To: <000101c12f00$dc5814b0$fa01b5ca@avon>; from andrew@modulus.org on Tue, Aug 28, 2001 at 12:02:08AM +1000
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-general-owner@postgresql.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
On Tue, Aug 28, 2001 at 12:02:08AM +1000, Andrew Snow wrote:
 | 
						|
> 
 | 
						|
> What I think would be better would be moving postgresql to a system of
 | 
						|
> using memory-mapped I/O.  instead of the shared buffer cache, files
 | 
						|
> would be directly memory-mapped and the OS would do the caching.  I
 | 
						|
> can't see this happening though because of platform dependancy, but I
 | 
						|
> think its worth another look soon because many unix platforms support
 | 
						|
> mmap().  I think it would improve the performance of disk-intensive
 | 
						|
> tasks noticeably.
 | 
						|
 | 
						|
Well, this has other problems. Consider tables that are larger than your
 | 
						|
system memory. You'd have to continuously map and unmap different sections.
 | 
						|
That can have odd side effects (witness mozilla on linux having 15,000
 | 
						|
mapped areas or so...)
 | 
						|
 | 
						|
You would still however get the advantage that you wouldn't have to copy the
 | 
						|
data from the disk buffers to user space, you simply get the disk buffer
 | 
						|
mapped into your address space.
 | 
						|
 | 
						|
I think that for commonly used tables that are under 100K in size (most of
 | 
						|
the system tables), this is quite a workable idea. If you don't mind keeping
 | 
						|
them mapped the whole time.
 | 
						|
 | 
						|
-- 
 | 
						|
Martijn van Oosterhout <kleptog@svana.org>
 | 
						|
http://svana.org/kleptog/
 | 
						|
> It would be nice if someone came up with a certification system that
 | 
						|
> actually separated those who can barely regurgitate what they crammed over
 | 
						|
> the last few weeks from those who command secret ninja networking powers.
 | 
						|
 | 
						|
---------------------------(end of broadcast)---------------------------
 | 
						|
TIP 3: if posting/reading through Usenet, please send an appropriate
 | 
						|
subscribe-nomail command to majordomo@postgresql.org so that your
 | 
						|
message can get through to the mailing list cleanly
 | 
						|
 | 
						|
From pgsql-general-owner+M14319@postgresql.org Mon Aug 27 16:57:10 2001
 | 
						|
Return-path: <pgsql-general-owner+M14319@postgresql.org>
 | 
						|
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238])
 | 
						|
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RKv9F16849
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 27 Aug 2001 16:57:09 -0400 (EDT)
 | 
						|
Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RKv9q31456;
 | 
						|
	Mon, 27 Aug 2001 15:57:09 -0500 (CDT)
 | 
						|
	(envelope-from pgsql-general-owner+M14319@postgresql.org)
 | 
						|
Received: from sss.pgh.pa.us ([192.204.191.242])
 | 
						|
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RJrsf55472
 | 
						|
	for <pgsql-general@postgresql.org>; Mon, 27 Aug 2001 15:53:54 -0400 (EDT)
 | 
						|
	(envelope-from tgl@sss.pgh.pa.us)
 | 
						|
Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
 | 
						|
	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f7RJrGK19431;
 | 
						|
	Mon, 27 Aug 2001 15:53:16 -0400 (EDT)
 | 
						|
To: Martijn van Oosterhout <kleptog@svana.org>
 | 
						|
cc: Andrew Snow <andrew@modulus.org>, pgsql-general@postgresql.org
 | 
						|
Subject: Re: [GENERAL] raw partition 
 | 
						|
In-Reply-To: <20010828011433.E32309@svana.org> 
 | 
						|
References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon> <20010828011433.E32309@svana.org>
 | 
						|
Comments: In-reply-to Martijn van Oosterhout <kleptog@svana.org>
 | 
						|
	message dated "Tue, 28 Aug 2001 01:14:33 +1000"
 | 
						|
Date: Mon, 27 Aug 2001 15:53:15 -0400
 | 
						|
Message-ID: <19428.998941995@sss.pgh.pa.us>
 | 
						|
From: Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-general-owner@postgresql.org
 | 
						|
Status: OR
 | 
						|
 | 
						|
Martijn van Oosterhout <kleptog@svana.org> writes:
 | 
						|
> You would still however get the advantage that you wouldn't have to copy the
 | 
						|
> data from the disk buffers to user space, you simply get the disk buffer
 | 
						|
> mapped into your address space.
 | 
						|
 | 
						|
AFAICS this would be the *only* advantage.  While it's not negligible,
 | 
						|
it's quite unclear that it's worth the bookkeeping and portability
 | 
						|
headaches of managing lots of mmap'd areas, either.
 | 
						|
 | 
						|
Before I take this idea seriously at all, I'd want to see a design that
 | 
						|
addresses a couple of critical issues:
 | 
						|
 | 
						|
1. Postgres' shared buffers are *shared*, potentially across many
 | 
						|
processes.  How will you deal with buffers for files that have been
 | 
						|
mmap'd by only some of the processes?  (Maybe this means that the
 | 
						|
whole concept of shared buffers goes away, and each process does its
 | 
						|
own buffer management based on its own mmaps.  Not sure.  That would be
 | 
						|
a pretty radical restructuring though, and would completely invalidate
 | 
						|
our present approach to page-level locking.)
 | 
						|
 | 
						|
2. How do you deal with extending a file?  My system's mmap man page
 | 
						|
says
 | 
						|
     If the size of the mapped file changes after the call to mmap(), the
 | 
						|
     effect of references to portions of the mapped region that correspond
 | 
						|
     to added or removed portions of the file is unspecified.
 | 
						|
This suggests that the only portable way to cope is to issue a separate
 | 
						|
mmap for every disk page.  Will typical Unix systems perform well with
 | 
						|
umpteen thousand small mmap requests?
 | 
						|
 | 
						|
3. How do you persuade the other backends to drop their mmaps of a table
 | 
						|
you are deleting?
 | 
						|
 | 
						|
There are probably other gotchas, but without an understanding of how
 | 
						|
to address these, I doubt it's worth looking further ...
 | 
						|
 | 
						|
			regards, tom lane
 | 
						|
 | 
						|
---------------------------(end of broadcast)---------------------------
 | 
						|
TIP 5: Have you checked our extensive FAQ?
 | 
						|
 | 
						|
http://www.postgresql.org/users-lounge/docs/faq.html
 | 
						|
 | 
						|
From pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org Mon Oct  1 05:59:15 2001
 | 
						|
Return-path: <pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org>
 | 
						|
Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238] (may be forged))
 | 
						|
	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f919xF512590
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 1 Oct 2001 05:59:15 -0400 (EDT)
 | 
						|
Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
 | 
						|
	by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f919xA207817
 | 
						|
	for <pgman@candle.pha.pa.us>; Mon, 1 Oct 2001 04:59:10 -0500 (CDT)
 | 
						|
	(envelope-from pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org)
 | 
						|
Received: from mrsgntmail01.mediaring.com.sg (mserver.mediaring.com.sg [203.208.141.175])
 | 
						|
	by postgresql.org (8.11.3/8.11.4) with ESMTP id f919rE320926
 | 
						|
	for <pgsql-hackers@postgreSQL.org>; Mon, 1 Oct 2001 05:53:15 -0400 (EDT)
 | 
						|
	(envelope-from jana-reddy@mediaring.com.sg)
 | 
						|
Received: by MRSGNTMAIL01 with Internet Mail Service (5.5.2650.21)
 | 
						|
	id <PMTCM7SJ>; Mon, 1 Oct 2001 18:03:34 +0800
 | 
						|
Received: from mediaring.com.sg (10.1.0.131 [10.1.0.131]) by mrsgntmail01.mediaring.com.sg with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21)
 | 
						|
	id PMTCM7SH; Mon, 1 Oct 2001 18:03:25 +0800
 | 
						|
From: Janardhana Reddy <jana-reddy@mediaring.com.sg>
 | 
						|
To: Bruce Momjian <pgman@candle.pha.pa.us>, Tom Lane <tgl@sss.pgh.pa.us>
 | 
						|
cc: PostgreSQL-development <pgsql-hackers@postgresql.org>,
 | 
						|
   janareddy
 | 
						|
  <jana-reddy@mediaring.com.sg>
 | 
						|
Message-ID: <3BB83DF0.8946973@mediaring.com.sg>
 | 
						|
Date: Mon, 01 Oct 2001 17:57:04 +0800
 | 
						|
X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.0 i686)
 | 
						|
X-Accept-Language: en
 | 
						|
MIME-Version: 1.0
 | 
						|
Subject: Re: [HACKERS] PERFORMANCE IMPROVEMENT by mapping  WAL FILES
 | 
						|
References: <200109282137.f8SLbpm01890@candle.pha.pa.us>
 | 
						|
Content-Type: text/plain; charset=us-ascii
 | 
						|
Content-Transfer-Encoding: 7bit
 | 
						|
Precedence: bulk
 | 
						|
Sender: pgsql-hackers-owner@postgresql.org
 | 
						|
Status: ORr
 | 
						|
 | 
						|
     I have just  completed the functional testing  the WAL using mmap  , it is
 | 
						|
 | 
						|
 working  fine,  I  have tested  by commenting out the  "CreateCheckPoint "
 | 
						|
functionality so that
 | 
						|
   when i kill the postgres and restart it will redo all the records from the
 | 
						|
WAL log file  which
 | 
						|
  is updated  using mmap.
 | 
						|
     Just i need  to  clean code and to do some stress testing.
 | 
						|
 By the end of this week i should able to  complete  the stress test  and
 | 
						|
generate the patch file .
 | 
						|
    As Tom Lane mentioned  i see the  problem in portability  to all platforms,
 | 
						|
 | 
						|
      what i propose is to use mmap for only WAL  for some platforms like
 | 
						|
  linux,freebsd etc . For  other platforms we can use the existing method by
 | 
						|
slightly modifying the
 | 
						|
 write()  routine to write only the modified part of the page.
 | 
						|
 | 
						|
Regards
 | 
						|
jana
 | 
						|
 | 
						|
>
 | 
						|
>
 | 
						|
> OK, I have talked to Tom Lane about this on the phone and we have a few
 | 
						|
> ideas.
 | 
						|
>
 | 
						|
> Historically, we have avoided mmap() because of portability problems,
 | 
						|
> and because using mmap() to write to large tables could consume lots of
 | 
						|
> address space with little benefit.  However, I perhaps can see WAL as
 | 
						|
> being a good use of mmap.
 | 
						|
>
 | 
						|
> First, there is the issue of using mmap().  For OS's that have the
 | 
						|
> mmap() MAP_SHARED flag, different backends could mmap the same file and
 | 
						|
> each see the changes.  However, keep in mind we still have to fsync()
 | 
						|
> WAL, so we need to use msync().
 | 
						|
>
 | 
						|
> So, looking at the benefits of using mmap(), we have overhead of
 | 
						|
> different backends having to mmap something that now sits quite easily
 | 
						|
> in shared memory.  Now, I can see mmap reducing the copy from user to
 | 
						|
> kernel, but there are other ways to fix that.  We could modify the
 | 
						|
> write() routines to write() 8k on first WAL page write and later write
 | 
						|
> only the modified part of the page to the kernel buffers.  The old
 | 
						|
> kernel buffer is probably still around so it is unlikely to require a
 | 
						|
> read from the file system to read in the rest of the page.  This reduces
 | 
						|
> the write from 8k to something probably less than 4k which is better
 | 
						|
> than we can do with mmap.
 | 
						|
>
 | 
						|
> I will add a TODO item to this effect.
 | 
						|
>
 | 
						|
> As far as reducing the write to disk from 8k to 4k, if we have to
 | 
						|
> fsync/msync, we have to wait for the disk to spin to the proper location
 | 
						|
> and at that point writing 4k or 8k doesn't seem like much of a win.
 | 
						|
>
 | 
						|
> In summary, I think it would be nice to reduce the 8k transfer from user
 | 
						|
> to kernel on secondary page writes to only the modified part of the
 | 
						|
> page.  I am uncertain if mmap() or anything else will help the physical
 | 
						|
> write to the disk.
 | 
						|
>
 | 
						|
> --
 | 
						|
>   Bruce Momjian                        |  http://candle.pha.pa.us
 | 
						|
>   pgman@candle.pha.pa.us               |  (610) 853-3000
 | 
						|
>   +  If your life is a hard drive,     |  830 Blythe Avenue
 | 
						|
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 | 
						|
 | 
						|
---------------------------(end of broadcast)---------------------------
 | 
						|
TIP 6: Have you searched our list archives?
 | 
						|
 | 
						|
http://archives.postgresql.org
 | 
						|
 |