From owner-pgsql-hackers@hub.org Fri Nov 13 13:24:37 1998
Received: from hub.org (majordom@hub.org [209.47.148.200])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13457
	for <maillist@candle.pha.pa.us>; Fri, 13 Nov 1998 13:24:35 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02464;
	Fri, 13 Nov 1998 13:22:52 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Fri, 13 Nov 1998 13:21:14 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id NAA02331
	for pgsql-hackers-outgoing; Fri, 13 Nov 1998 13:21:12 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id NAA02316
	for <pgsql-hackers@postgreSQL.org>; Fri, 13 Nov 1998 13:21:06 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de
	for pgsql-hackers@postgreSQL.org
	id m0zeOEf-000EBPC; Fri, 13 Nov 98 19:46 MET
Message-Id: <m0zeOEf-000EBPC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: [HACKERS] shmem limits and redolog
To: pgsql-hackers@postgreSQL.org (PostgreSQL HACKERS)
Date: Fri, 13 Nov 1998 19:46:20 +0100 (MET)
Reply-To: jwieck@debis.com (Jan Wieck)
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

Hi,

    I'm currently hacking on a solution for logging all database
    operations at query level, so that a crashed database can be
    recovered from the last successful backup by redoing all the
    commands.

    I wanted it to be as flexible as possible, so I decided to make
    it configurable per database. One can say which databases are
    logged and, for a logged database, whether it is logged sync or
    async (in sync mode, every COMMIT forces an fsync of the current
    logfile and controlfiles).

    To make async mode as fast as possible, I'm using a shared
    memory segment of 32K per database (not per backend) as a
    wrap-around buffer in which the backends place their query
    information. That way the log writer can fall a little behind
    when many backends are doing different things that don't lock
    each other.

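    A minimal sketch of such a wrap-around buffer, for illustration
    only. It assumes a 4-byte length prefix per record and uses an
    in-process bytearray in place of the shmem segment; the class
    and method names are hypothetical, and the semaphore handling
    between backends and the log writer is left out.

```python
import struct


class WrapAroundBuffer:
    """Fixed-size ring buffer holding length-prefixed query records."""

    HEADER = struct.Struct("<I")  # 4-byte record length (assumed framing)

    def __init__(self, size=32 * 1024):
        self.buf = bytearray(size)  # stands in for the 32K shmem segment
        self.size = size
        self.head = 0   # next byte the log writer will read
        self.tail = 0   # next byte a backend will write
        self.used = 0   # bytes waiting for the log writer

    def _copy_in(self, data):
        for b in data:
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.size  # wrap around at the end
        self.used += len(data)

    def _copy_out(self, n):
        out = bytearray(n)
        for i in range(n):
            out[i] = self.buf[self.head]
            self.head = (self.head + 1) % self.size
        self.used -= n
        return bytes(out)

    def put(self, query):
        """Backend side: False means the log writer is too far behind."""
        payload = query.encode("utf-8")
        record = self.HEADER.pack(len(payload)) + payload
        if len(record) > self.size - self.used:
            return False  # a real backend would have to wait here
        self._copy_in(record)
        return True

    def get(self):
        """Log writer side: oldest queued query, or None if empty."""
        if self.used == 0:
            return None
        (length,) = self.HEADER.unpack(self._copy_out(self.HEADER.size))
        return self._copy_out(length).decode("utf-8")
```

    The free space in the buffer is exactly how far the log writer
    may fall behind before a backend has to wait.
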
    Now I'm a little in doubt about the shared memory limits
    reported. Was it a good decision to use shared memory? Would I
    be better off using sockets?

    The bad thing in what I have so far (it's far from complete) is
    that even if a database isn't currently logged, a redolog writer
    is started and creates the 32K shmem segment (plus a semaphore
    set with 5 semaphores). This is because I plan to create
    commands like

        ALTER DATABASE LOG MODE=ASYNC LOGDIR='/somewhere/dbname';

    that can be used at runtime (while more than one backend is
    connected to the database) to turn logging on/off, switch
    to/from backup mode (all other activity is stopped), etc.

    So every 32 databases will require another megabyte of shared
    memory (32 x 32K = 1024K). The logging master tracks which
    databases have activity and kills redolog writers after some
    time of inactivity, at which point the shmem is freed. But it
    can hurt if someone really has many, many databases that are
    all in use at the same time.

    What do the others say?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

From owner-pgsql-hackers@hub.org Wed Dec 16 15:46:41 1998
Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA00521
	for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:46:40 -0500 (EST)
Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id PAA08772 for <maillist@candle.pha.pa.us>; Wed, 16 Dec 1998 15:10:01 -0500 (EST)
Received: from localhost (majordom@localhost)
	by hub.org (8.9.1/8.9.1) with SMTP id PAA01254;
	Wed, 16 Dec 1998 15:06:56 -0500 (EST)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Dec 1998 14:58:11 +0000 (EST)
Received: (from majordom@localhost)
	by hub.org (8.9.1/8.9.1) id OAA00660
	for pgsql-hackers-outgoing; Wed, 16 Dec 1998 14:58:10 -0500 (EST)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8])
	by hub.org (8.9.1/8.9.1) with SMTP id OAA00643
	for <pgsql-hackers@postgreSQL.org>; Wed, 16 Dec 1998 14:58:05 -0500 (EST)
	(envelope-from wieck@sapserv.debis.de)
Received: by orion.SAPserv.Hamburg.dsh.de
	for pgsql-hackers@postgreSQL.org
	id m0zqNDo-000EBTC; Wed, 16 Dec 98 21:07 MET
Message-Id: <m0zqNDo-000EBTC@orion.SAPserv.Hamburg.dsh.de>
From: jwieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] redolog - for discussion
To: vadim@krs.ru (Vadim Mikheev)
Date: Wed, 16 Dec 1998 21:07:00 +0100 (MET)
Cc: jwieck@debis.com, pgsql-hackers@postgreSQL.org
Reply-To: jwieck@debis.com (Jan Wieck)
In-Reply-To: <3677B71D.C67462B3@krs.ru> from "Vadim Mikheev" at Dec 16, 98 08:35:25 pm
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: RO

Vadim wrote:

>
> Jan Wieck wrote:
> >
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> >         For  the  others, the backend starts the recovery program
> >         which  reads  the  redolog  files,  establishes  database
> >         connections  as  required  and reruns all the commands in
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         them. If a required logfile isn't  found,  it  tells  the
>           ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
>    and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

    Yepp, the order in which commands began is of no interest at
    all. Locking could already delay the execution of one command
    until another one that started later has finished and released
    the lock. It's a classic race condition.

    Thus, my plan was to log the queries just before the call to
    CommitTransactionCommand() in tcop. This has the advantage that
    queries which bail out with errors don't get into the log at
    all and need not be rerun. And I can set a static flag to false
    before starting the command, which is set to true in the buffer
    manager when a buffer is written (marked dirty), so filtering
    out queries that do no updates at all is easy.

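    A rough sketch of that filtering, for illustration only. The
    class, its method names and the per-transaction "pending" list
    are assumptions and not backend code; mark_buffer_dirty() stands
    in for the hook in the buffer manager.

```python
class RedoLogSketch:
    """Logs only queries that dirtied a buffer and reached commit."""

    def __init__(self):
        self.log = []              # stands in for the redolog file
        self.pending = []          # queries of the current transaction
        self.wrote_buffer = False  # the "static flag" from the text

    def mark_buffer_dirty(self):
        # Hypothetical hook: called whenever a buffer is marked dirty.
        self.wrote_buffer = True

    def run_query(self, text, execute):
        self.wrote_buffer = False  # reset before the command starts
        execute(self)              # may raise, may dirty buffers
        if self.wrote_buffer:      # read-only queries are filtered out
            self.pending.append(text)

    def commit(self):
        # Queries enter the log just before the commit, in commit order.
        self.log.extend(self.pending)
        self.pending = []

    def abort(self):
        # Work of a failed transaction never reaches the log.
        self.pending = []
```

    Because entries are appended at commit time, the log order
    follows the commit order rather than the order in which the
    commands began.
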
    Unfortunately, query level logging gets hit by the current
    implementation of sequence numbers. If a query that gets
    aborted somewhere in the middle (maybe by a trigger) called
    nextval() for rows processed earlier, the sequence number isn't
    advanced at recovery time, because the query is suppressed
    entirely. And sequences aren't locked, so for concurrently
    running queries getting numbers from the same sequence, the
    results aren't reproducible. If some application selects a
    value resulting from a sequence and uses it later in another
    query, how could the redolog know that this has changed? It's
    a Const in the logged query, and all that corrupts the whole
    thing.

    All that is painful, and I don't yet see any solution other
    than to hook into nextval(), log the numbers generated in
    normal operation, and get the same numbers back in redo mode.

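    A sketch of what such a nextval() hook could look like, for
    illustration only. Keying the logged numbers by a query id is
    an assumption made here so that numbers drawn by aborted
    queries can simply be skipped in redo mode; none of the names
    below exist in the backend.

```python
from collections import defaultdict, deque
import itertools


class LoggedSequence:
    """Sequence that records the numbers it hands out, per query."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self.log = defaultdict(list)  # query id -> numbers it was given
        self._redo = None             # set while replaying the log

    def start_redo(self, log):
        # Redo mode: numbers come from the log, not from the counter.
        self._redo = {qid: deque(values) for qid, values in log.items()}

    def nextval(self, query_id):
        if self._redo is not None:
            return self._redo[query_id].popleft()
        value = next(self._counter)
        self.log[query_id].append(value)
        return value
```

    If query "A" drew 1 and 3 while an aborted query "B" drew 2,
    redoing "A" alone from the log yields 1 and 3 again, although
    "B" is never rerun.
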
    The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #

From owner-pgsql-hackers@hub.org Wed Jun 16 09:29:31 1999
Received: from hub.org (hub.org [209.167.229.1])
	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA22504
	for <maillist@candle.pha.pa.us>; Wed, 16 Jun 1999 09:29:29 -0400 (EDT)
Received: from hub.org (hub.org [209.167.229.1])
	by hub.org (8.9.3/8.9.3) with ESMTP id JAA02132;
	Wed, 16 Jun 1999 09:18:20 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@hub.org)
Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 16 Jun 1999 09:14:07 +0000 (EDT)
Received: (from majordom@localhost)
	by hub.org (8.9.3/8.9.3) id JAA01318
	for pgsql-hackers-outgoing; Wed, 16 Jun 1999 09:14:06 -0400 (EDT)
	(envelope-from owner-pgsql-hackers@postgreSQL.org)
X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f
Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37])
	by hub.org (8.9.3/8.9.3) with ESMTP id JAA01278
	for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 09:13:48 -0400 (EDT)
	(envelope-from vadim@krs.ru)
Received: from krs.ru (dune.krs.ru [195.161.16.38])
	by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id VAA06276
	for <hackers@postgreSQL.org>; Wed, 16 Jun 1999 21:12:49 +0800 (KRSS)
Message-ID: <3767A2CF.E6E4A5F9@krs.ru>
Date: Wed, 16 Jun 1999 21:12:47 +0800
From: Vadim Mikheev <vadim@krs.ru>
Organization: OJSC Rostelecom (Krasnoyarsk)
X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
X-Accept-Language: ru, en
MIME-Version: 1.0
To: PostgreSQL Developers List <hackers@postgreSQL.org>
Subject: [HACKERS] Savepoints...
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-pgsql-hackers@postgreSQL.org
Precedence: bulk
Status: ROr

To have them I need to add a tuple id (6 bytes) to the heap tuple
header. Are there objections? Though it's not good to increase
the tuple header size, savepoints are, imho, a very nice feature...

The implementation is, hm, "easy":

- heap_insert/heap_delete/heap_replace/heap_mark4update will
  remember the updated tid (and current command id) in the relation
  cache and store the previously updated tid (remembered in the
  relation cache) in the additional heap header tid;
- lmgr will remember the command id when a lock was acquired;
- for a savepoint we will just store the command id at which
  the savepoint was set;
- when going to sleep due to a concurrent update of the same row,
  the backend will store MyProc and the tuple id in a shmem hash table.

When rolling back to a savepoint, the backend will:

- release locks acquired after the savepoint;
- for a relation updated after the savepoint, get the last updated
  tid from the relation cache, walk through the relation, set
  HEAP_XMIN_INVALID/HEAP_XMAX_INVALID in all tuples updated
  after the savepoint, and wake up concurrent writers blocked
  on these tuples (using the shmem hash table mentioned above).

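The command id bookkeeping above can be sketched roughly as follows.
This is a toy model for illustration, not backend code: the class and
its fields are hypothetical, a Python list ordered by command id stands
in for the chain of previously updated tids, a boolean stands in for
HEAP_XMIN_INVALID, and waking up blocked writers is left out.

```python
class TransactionSketch:
    """Toy model of savepoints identified by command id."""

    def __init__(self):
        self.command_id = 0
        self.versions = []    # [key, value, command id, valid], newest last
        self.locks = []       # (lock name, command id when acquired)
        self.savepoints = {}  # savepoint name -> command id

    def update(self, key, value):
        self.command_id += 1
        self.locks.append((key, self.command_id))
        self.versions.append([key, value, self.command_id, True])

    def savepoint(self, name):
        # A savepoint is nothing more than the current command id.
        self.savepoints[name] = self.command_id

    def rollback_to(self, name):
        cid = self.savepoints[name]
        # Release locks acquired after the savepoint.
        self.locks = [(n, c) for n, c in self.locks if c <= cid]
        # Walk back from the last update and invalidate newer versions.
        for version in reversed(self.versions):
            if version[2] <= cid:
                break
            version[3] = False

    def visible(self, key):
        for k, value, _cid, valid in reversed(self.versions):
            if k == key and valid:
                return value
        return None
```

After update, savepoint, update, rollback_to, only the first update
and its lock remain in effect.
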
The last feature (waking up concurrent writers) is the hardest
part to implement. AFAIK, Oracle 7.3 was not able to do it.
Can someone comment on whether this feature is implemented in
Oracle 8.X or other DBMSes?

Now about implicit savepoints. The backend will place them before
the execution of each user statement. In the case of failure, the
transaction state will be rolled back to the one before execution
of the query. As a side effect, this means that we'll get rid of
complaints about the entire transaction being aborted when a
mistyped statement fails with a parser error...

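A small sketch of the effect of implicit savepoints, for illustration
only. The Session class is hypothetical, and copying the whole state
before each statement is just a stand-in for the command id mechanism
described earlier.

```python
class Session:
    """Toy session where a failed statement does not abort the transaction."""

    def __init__(self):
        self.rows = {}  # uncommitted state of the current transaction

    def execute(self, statement):
        snapshot = dict(self.rows)  # implicit savepoint before the statement
        try:
            statement(self.rows)
        except Exception as exc:
            self.rows = snapshot    # roll back this statement only
            return f"ERROR: {exc}"
        return "OK"
```

A statement that fails halfway leaves the earlier work of the
transaction intact, and later statements in the same transaction
still succeed.
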
Comments?

Vadim
