mirror of
https://github.com/postgres/postgres.git
synced 2025-11-10 17:42:29 +03:00
Support an optional asynchronous commit mode, in which we don't flush WAL
before reporting a transaction committed. Data consistency is still guaranteed (unlike setting fsync = off), but a crash may lose the effects of the last few transactions. Patch by Simon, some editorialization by Tom.
@@ -1,4 +1,4 @@
-$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.5 2006/03/31 23:32:05 tgl Exp $
+$PostgreSQL: pgsql/src/backend/access/transam/README,v 1.6 2007/08/01 22:45:07 tgl Exp $
 
 The Transaction System
 ----------------------
@@ -409,4 +409,113 @@ two separate WAL records. The replay code has to remember "unfinished" split
 operations, and match them up to subsequent insertions in the parent level.
 If no matching insert has been found by the time the WAL replay ends, the
 replay code has to do the insertion on its own to restore the index to
-consistency.
+consistency. Such insertions occur after WAL is operational, so they can
+and should write WAL records for the additional generated actions.
+
+
+Asynchronous Commit
+-------------------
+
+As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
+we don't wait while the WAL record for the commit is fsync'ed.
+We perform an asynchronous commit when synchronous_commit = off. Instead
+of performing an XLogFlush() up to the LSN of the commit, we merely note
+the LSN in shared memory. The backend then continues with other work.
+We record the LSN only for an asynchronous commit, not an abort; there's
+never any need to flush an abort record, since the presumption after a
+crash would be that the transaction aborted anyway.
+
+We always force synchronous commit when the transaction is deleting
+relations, to ensure the commit record is down to disk before the relations
+are removed from the filesystem. Also, certain utility commands that have
+non-roll-backable side effects (such as filesystem changes) force sync
+commit to minimize the window in which the filesystem change has been made
+but the transaction isn't guaranteed committed.
+
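The commit-time decision described in the two paragraphs above can be sketched roughly as follows. This is an illustrative model, not the code in xact.c: `SketchLsn`, `SketchWalState`, `sketch_flush` and `sketch_record_commit` are hypothetical names, LSNs are modelled as plain 64-bit integers, and `force_sync` stands in for the cases (relation deletion, certain utility commands) that always commit synchronously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchLsn;         /* hypothetical: an LSN as a flat integer */

typedef struct SketchWalState
{
    SketchLsn flushed_upto;         /* WAL known to be on disk up to here */
    SketchLsn async_commit_lsn;     /* latest async commit LSN noted so far */
} SketchWalState;

/* Stand-in for XLogFlush(): pretend WAL is now durable through lsn. */
static void
sketch_flush(SketchWalState *wal, SketchLsn lsn)
{
    if (wal->flushed_upto < lsn)
        wal->flushed_upto = lsn;
}

/*
 * Record a commit whose WAL record ends at commit_lsn.
 * Returns true if the record was flushed before returning (sync commit),
 * false if we only noted its LSN for a later background flush (async).
 */
static bool
sketch_record_commit(SketchWalState *wal, SketchLsn commit_lsn,
                     bool synchronous_commit, bool force_sync)
{
    assert(commit_lsn != 0);

    if (synchronous_commit || force_sync)
    {
        sketch_flush(wal, commit_lsn);
        return true;
    }

    /* Async: just remember how far the WAL must eventually be flushed. */
    if (wal->async_commit_lsn < commit_lsn)
        wal->async_commit_lsn = commit_lsn;
    return false;
}
```

With `synchronous_commit` off and no forcing condition, the call only advances `async_commit_lsn`; making the record durable is left to a later flush. Aborts are deliberately absent from the sketch, since the text says their LSNs are never recorded.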
+Every wal_writer_delay milliseconds, the walwriter process performs an
+XLogBackgroundFlush(). This checks the location of the last completely
+filled WAL page. If that has moved forwards, then we write all the changed
+buffers up to that point, so that under full load we write only whole
+buffers. If there has been a break in activity and the current WAL page is
+the same as before, then we find out the LSN of the most recent
+asynchronous commit, and flush up to that point, if required (i.e.,
+if it's in the current WAL page). This arrangement in itself would
+guarantee that an async commit record reaches disk during at worst the
+second walwriter cycle after the transaction completes. However, we also
+allow XLogFlush to flush full buffers "flexibly" (ie, not wrapping around
+at the end of the circular WAL buffer area), so as to minimize the number
+of writes issued under high load when multiple WAL pages are filled per
+walwriter cycle. This makes the worst-case delay three walwriter cycles.
+
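A rough model of how the walwriter picks its flush target, under simplifying assumptions: WAL positions are flat 64-bit byte offsets, the WAL page size is fixed at 8192 bytes, and the "flexible" write behaviour is ignored. `SketchWalPos` and `sketch_background_flush_target` are hypothetical names, not functions from xlog.c.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAL_PAGE_SIZE 8192u  /* assumed WAL page size */

typedef struct SketchWalPos
{
    uint64_t written_upto;      /* WAL already written and flushed */
    uint64_t insert_upto;       /* end of WAL generated so far */
    uint64_t async_commit_lsn;  /* LSN of the most recent async commit */
} SketchWalPos;

/*
 * Decide how far one background-flush cycle should write.
 * Returns written_upto unchanged when there is nothing to do.
 */
static uint64_t
sketch_background_flush_target(const SketchWalPos *wal)
{
    uint64_t full_pages_end;

    assert(wal->written_upto <= wal->insert_upto);

    /* End of the last completely filled WAL page. */
    full_pages_end = wal->insert_upto -
        (wal->insert_upto % SKETCH_WAL_PAGE_SIZE);

    /* Under load: whole pages have filled up, so write only whole pages. */
    if (full_pages_end > wal->written_upto)
        return full_pages_end;

    /* Quiet period: flush the partial page up to the last async commit. */
    if (wal->async_commit_lsn > wal->written_upto)
        return wal->async_commit_lsn;

    return wal->written_upto;
}
```

In this model a commit record that sits in a partially filled page is not flushed in a cycle where whole pages are written; it is picked up in a later cycle, which is why the delay is measured in walwriter cycles rather than being immediate.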
+There are some other subtle points to consider with asynchronous commits.
+First, for each page of CLOG we must remember the LSN of the latest commit
+affecting the page, so that we can enforce the same flush-WAL-before-write
+rule that we do for ordinary relation pages. Otherwise the record of the
+commit might reach disk before the WAL record does. Again, abort records
+need not factor into this consideration.
+
+In fact, we store more than one LSN for each clog page. This relates to
+the way we set transaction status hint bits during visibility tests.
+We must not set a transaction-committed hint bit on a relation page and
+have that record make it to disk prior to the WAL record of the commit.
+Since visibility tests are normally made while holding buffer share locks,
+we do not have the option of changing the page's LSN to guarantee WAL
+synchronization. Instead, we defer the setting of the hint bit if we have
+not yet flushed WAL as far as the LSN associated with the transaction.
+This requires tracking the LSN of each unflushed async commit. It is
+convenient to associate this data with clog buffers: because we will flush
+WAL before writing a clog page, we know that we do not need to remember a
+transaction's LSN longer than the clog page holding its commit status
+remains in memory. However, the naive approach of storing an LSN for each
+clog position is unattractive: the LSNs are 32x bigger than the two-bit
+commit status fields, and so we'd need 256K of additional shared memory for
+each 8K clog buffer page. We choose instead to store a smaller number of
+LSNs per page, where each LSN is the highest LSN associated with any
+transaction commit in a contiguous range of transaction IDs on that page.
+This saves storage at the price of some possibly-unnecessary delay in
+setting transaction hint bits.
+
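The deferral rule for hint bits can be pictured as below. This is a sketch, not the visibility code itself: `sketch_can_set_commit_hint` is a hypothetical helper, LSNs are plain integers, and 0 plays the role of InvalidXLogRecPtr (no unflushed async commit is known for the group).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_LSN ((uint64_t) 0)   /* stand-in for InvalidXLogRecPtr */

/*
 * May we set a "transaction committed" hint bit right now?
 *
 * group_lsn is the cached LSN for the group of transaction IDs that the
 * committed transaction belongs to; wal_flushed_upto is how far WAL is
 * known to be on disk.
 */
static bool
sketch_can_set_commit_hint(uint64_t group_lsn, uint64_t wal_flushed_upto)
{
    /* No async commit recorded for this group: nothing to wait for. */
    if (group_lsn == SKETCH_INVALID_LSN)
        return true;

    /* Otherwise the hint is safe only once WAL is flushed that far. */
    return group_lsn <= wal_flushed_upto;
}
```

Because the cached value is the highest LSN in the group, a transaction whose own commit record is already on disk can still be refused here if a later transaction in the same group has not been flushed yet. That is the "possibly-unnecessary delay" the text mentions.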
+How many transactions should share the same cached LSN (N)? If the
+system's workload consists only of small async-commit transactions, then
+it's reasonable to have N similar to the number of transactions per
+walwriter cycle, since that is the granularity with which transactions will
+become truly committed (and thus hintable) anyway. The worst case is where
+a sync-commit xact shares a cached LSN with an async-commit xact that
+commits a bit later; even though we paid to sync the first xact to disk,
+we won't be able to hint its outputs until the second xact is sync'd, up to
+three walwriter cycles later. This argues for keeping N (the group size)
+as small as possible. For the moment we are setting the group size to 32,
+which makes the LSN cache space the same size as the actual clog buffer
+space (independently of BLCKSZ).
+
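The storage figures quoted above (256K for one LSN per transaction, and an LSN cache equal in size to the clog page at a group size of 32) follow from simple arithmetic, which the sketch below reproduces. It assumes an 8-byte LSN and two status bits per transaction, as the text states; `sketch_lsn_cache_bytes` is a hypothetical helper, not part of clog.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_XACTS_PER_BYTE 4     /* two status bits per transaction */
#define SKETCH_LSN_BYTES      8     /* assumed size of one LSN */

/*
 * Bytes of LSN cache needed for one clog page of clog_page_bytes bytes
 * when xacts_per_group transactions share each cached LSN.
 */
static size_t
sketch_lsn_cache_bytes(size_t clog_page_bytes, size_t xacts_per_group)
{
    size_t xacts_per_page = clog_page_bytes * SKETCH_XACTS_PER_BYTE;

    assert(xacts_per_group > 0);
    assert(xacts_per_page % xacts_per_group == 0);

    return (xacts_per_page / xacts_per_group) * SKETCH_LSN_BYTES;
}
```

For an 8K page, a group size of 1 gives 32768 * 8 = 262144 bytes (256K), while a group size of 32 gives 1024 * 8 = 8192 bytes. Since 32 transactions occupy 8 bytes of clog, the cache always matches the page size, whatever BLCKSZ is.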
+It is useful that we can run both synchronous and asynchronous commit
+transactions concurrently, but the safety of this is perhaps not
+immediately obvious. Assume we have two transactions, T1 and T2. The Log
+Sequence Number (LSN) is the point in the WAL sequence where a transaction
+commit is recorded, so LSN1 and LSN2 are the commit records of those
+transactions. If T2 can see changes made by T1 then when T2 commits it
+must be true that LSN2 follows LSN1. Thus when T2 commits it is certain
+that all of the changes made by T1 are also now recorded in the WAL. This
+is true whether T1 was asynchronous or synchronous. As a result, it is
+safe for asynchronous commits and synchronous commits to work concurrently
+without endangering data written by synchronous commits. Sub-transactions
+are not important here since the final write to disk only occurs at the
+commit of the top level transaction.
+
+Changes to data blocks cannot reach disk unless WAL is flushed up to the
+point of the LSN of the data blocks. Any attempt to write unsafe data to
+disk will trigger a write which ensures the safety of all data written by
+that and prior transactions. Data blocks and clog pages are both protected
+by LSNs.
+
+Changes to a temp table are not WAL-logged, hence could reach disk in
+advance of T1's commit, but we don't care since temp table contents don't
+survive crashes anyway.
+
+Database writes made via any of the paths we have introduced to avoid WAL
+overhead for bulk updates are also safe. In these cases it's entirely
+possible for the data to reach disk before T1's commit, because T1 will
+fsync it down to disk without any sort of interlock, as soon as it finishes
+the bulk update. However, all these paths are designed to write data that
+no other transaction can see until after T1 commits. The situation is thus
+not different from ordinary WAL-logged updates.

@@ -14,17 +14,19 @@
  * CLOG page is initialized to zeroes. Other writes of CLOG come from
  * recording of transaction commit or abort in xact.c, which generates its
  * own XLOG records for these events and will re-perform the status update
- * on redo; so we need make no additional XLOG entry here. Also, the XLOG
- * is guaranteed flushed through the XLOG commit record before we are called
- * to log a commit, so the WAL rule "write xlog before data" is satisfied
- * automatically for commits, and we don't really care for aborts. Therefore,
- * we don't need to mark CLOG pages with LSN information; we have enough
- * synchronization already.
+ * on redo; so we need make no additional XLOG entry here. For synchronous
+ * transaction commits, the XLOG is guaranteed flushed through the XLOG commit
+ * record before we are called to log a commit, so the WAL rule "write xlog
+ * before data" is satisfied automatically. However, for async commits we
+ * must track the latest LSN affecting each CLOG page, so that we can flush
+ * XLOG that far and satisfy the WAL rule. We don't have to worry about this
+ * for aborts (whether sync or async), since the post-crash assumption would
+ * be that such transactions failed anyway.
  *
  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.42 2007/01/05 22:19:23 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/clog.c,v 1.43 2007/08/01 22:45:07 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -57,6 +59,13 @@
 #define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
 #define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
 
+/* We store the latest async LSN for each group of transactions */
+#define CLOG_XACTS_PER_LSN_GROUP 32 /* keep this a power of 2 */
+#define CLOG_LSNS_PER_PAGE (CLOG_XACTS_PER_PAGE / CLOG_XACTS_PER_LSN_GROUP)
+
+#define GetLSNIndex(slotno, xid) ((slotno) * CLOG_LSNS_PER_PAGE + \
+    ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_LSN_GROUP)
+
 
 /*
  * Link to shared-memory data structures for CLOG control
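To make the index arithmetic of GetLSNIndex concrete, here is a standalone re-creation with worked values. It assumes BLCKSZ = 8192 and four transactions per clog byte (two bits each); the `SK_` names and `sk_lsn_index` are hypothetical stand-ins written for this example, not the macros from clog.c.

```c
#include <assert.h>
#include <stdint.h>

#define SK_BLCKSZ               8192    /* assumed block size */
#define SK_XACTS_PER_BYTE       4       /* two status bits per transaction */
#define SK_XACTS_PER_PAGE       (SK_BLCKSZ * SK_XACTS_PER_BYTE)
#define SK_XACTS_PER_LSN_GROUP  32
#define SK_LSNS_PER_PAGE        (SK_XACTS_PER_PAGE / SK_XACTS_PER_LSN_GROUP)

/*
 * Index into a flat group_lsn[] array holding SK_LSNS_PER_PAGE entries
 * per buffer slot: first skip to the slot's block of entries, then pick
 * the group that the xid falls into within its page.
 */
static int
sk_lsn_index(int slotno, uint32_t xid)
{
    assert(slotno >= 0);

    return slotno * SK_LSNS_PER_PAGE +
        (int) ((xid % SK_XACTS_PER_PAGE) / SK_XACTS_PER_LSN_GROUP);
}
```

With these assumptions a page holds 32768 transactions in 1024 groups. Transaction IDs 0 through 31 share entry 0 of their slot, ID 32 starts entry 1, and ID 32768 wraps back to entry 0 because it lives on the next clog page.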
@@ -75,11 +84,16 @@ static void WriteTruncateXlogRec(int pageno);
 /*
  * Record the final state of a transaction in the commit log.
  *
+ * lsn must be the WAL location of the commit record when recording an async
+ * commit. For a synchronous commit it can be InvalidXLogRecPtr, since the
+ * caller guarantees the commit record is already flushed in that case. It
+ * should be InvalidXLogRecPtr for abort cases, too.
+ *
  * NB: this is a low-level routine and is NOT the preferred entry point
  * for most uses; TransactionLogUpdate() in transam.c is the intended caller.
  */
 void
-TransactionIdSetStatus(TransactionId xid, XidStatus status)
+TransactionIdSetStatus(TransactionId xid, XidStatus status, XLogRecPtr lsn)
 {
     int pageno = TransactionIdToPage(xid);
     int byteno = TransactionIdToByte(xid);
@@ -94,7 +108,16 @@ TransactionIdSetStatus(TransactionId xid, XidStatus status)
 
     LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
 
-    slotno = SimpleLruReadPage(ClogCtl, pageno, xid);
+    /*
+     * If we're doing an async commit (ie, lsn is valid), then we must wait
+     * for any active write on the page slot to complete. Otherwise our
+     * update could reach disk in that write, which will not do since we
+     * mustn't let it reach disk until we've done the appropriate WAL flush.
+     * But when lsn is invalid, it's OK to scribble on a page while it is
+     * write-busy, since we don't care if the update reaches disk sooner than
+     * we think. Hence, pass write_ok = XLogRecPtrIsInvalid(lsn).
+     */
+    slotno = SimpleLruReadPage(ClogCtl, pageno, XLogRecPtrIsInvalid(lsn), xid);
     byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
 
     /* Current state should be 0, subcommitted or target state */
@@ -110,22 +133,48 @@ TransactionIdSetStatus(TransactionId xid, XidStatus status)
 
     ClogCtl->shared->page_dirty[slotno] = true;
 
+    /*
+     * Update the group LSN if the transaction completion LSN is higher.
+     *
+     * Note: lsn will be invalid when supplied during InRecovery processing,
+     * so we don't need to do anything special to avoid LSN updates during
+     * recovery. After recovery completes the next clog change will set the
+     * LSN correctly.
+     */
+    if (!XLogRecPtrIsInvalid(lsn))
+    {
+        int lsnindex = GetLSNIndex(slotno, xid);
+
+        if (XLByteLT(ClogCtl->shared->group_lsn[lsnindex], lsn))
+            ClogCtl->shared->group_lsn[lsnindex] = lsn;
+    }
+
     LWLockRelease(CLogControlLock);
 }
 
 /*
  * Interrogate the state of a transaction in the commit log.
  *
+ * Aside from the actual commit status, this function returns (into *lsn)
+ * an LSN that is late enough to be able to guarantee that if we flush up to
+ * that LSN then we will have flushed the transaction's commit record to disk.
+ * The result is not necessarily the exact LSN of the transaction's commit
+ * record! For example, for long-past transactions (those whose clog pages
+ * already migrated to disk), we'll return InvalidXLogRecPtr. Also, because
+ * we group transactions on the same clog page to conserve storage, we might
+ * return the LSN of a later transaction that falls into the same group.
+ *
  * NB: this is a low-level routine and is NOT the preferred entry point
  * for most uses; TransactionLogFetch() in transam.c is the intended caller.
  */
 XidStatus
-TransactionIdGetStatus(TransactionId xid)
+TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
 {
     int pageno = TransactionIdToPage(xid);
     int byteno = TransactionIdToByte(xid);
     int bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
     int slotno;
+    int lsnindex;
     char *byteptr;
     XidStatus status;
 
@@ -136,6 +185,9 @@ TransactionIdGetStatus(TransactionId xid)
 
     status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
 
+    lsnindex = GetLSNIndex(slotno, xid);
+    *lsn = ClogCtl->shared->group_lsn[lsnindex];
+
     LWLockRelease(CLogControlLock);
 
     return status;
@@ -148,14 +200,14 @@ TransactionIdGetStatus(TransactionId xid)
 Size
 CLOGShmemSize(void)
 {
-    return SimpleLruShmemSize(NUM_CLOG_BUFFERS);
+    return SimpleLruShmemSize(NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE);
 }
 
 void
 CLOGShmemInit(void)
 {
     ClogCtl->PagePrecedes = CLOGPagePrecedes;
-    SimpleLruInit(ClogCtl, "CLOG Ctl", NUM_CLOG_BUFFERS,
+    SimpleLruInit(ClogCtl, "CLOG Ctl", NUM_CLOG_BUFFERS, CLOG_LSNS_PER_PAGE,
                   CLogControlLock, "pg_clog");
 }
 
@@ -240,7 +292,7 @@ StartupCLOG(void)
         int slotno;
         char *byteptr;
 
-        slotno = SimpleLruReadPage(ClogCtl, pageno, xid);
+        slotno = SimpleLruReadPage(ClogCtl, pageno, false, xid);
         byteptr = ClogCtl->shared->page_buffer[slotno] + byteno;
 
         /* Zero so-far-unused positions in the current byte */

@@ -42,7 +42,7 @@
  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.23 2007/01/05 22:19:23 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/multixact.c,v 1.24 2007/08/01 22:45:07 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -749,7 +749,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
      * enough that a MultiXactId is really involved. Perhaps someday we'll
      * take the trouble to generalize the slru.c error reporting code.
      */
-    slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, multi);
+    slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
     offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
     offptr += entryno;
 
@@ -773,7 +773,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 
         if (pageno != prev_pageno)
         {
-            slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, multi);
+            slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
             prev_pageno = pageno;
         }
 
@@ -993,7 +993,7 @@ retry:
     pageno = MultiXactIdToOffsetPage(multi);
     entryno = MultiXactIdToOffsetEntry(multi);
 
-    slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, multi);
+    slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
     offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
     offptr += entryno;
     offset = *offptr;
@@ -1025,7 +1025,7 @@ retry:
         entryno = MultiXactIdToOffsetEntry(tmpMXact);
 
         if (pageno != prev_pageno)
-            slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, tmpMXact);
+            slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, tmpMXact);
 
         offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
         offptr += entryno;
@@ -1061,7 +1061,7 @@ retry:
 
         if (pageno != prev_pageno)
         {
-            slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, multi);
+            slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, multi);
             prev_pageno = pageno;
         }
 
@@ -1289,8 +1289,8 @@ MultiXactShmemSize(void)
             mul_size(sizeof(MultiXactId) * 2, MaxBackends))
 
     size = SHARED_MULTIXACT_STATE_SIZE;
-    size = add_size(size, SimpleLruShmemSize(NUM_MXACTOFFSET_BUFFERS));
-    size = add_size(size, SimpleLruShmemSize(NUM_MXACTMEMBER_BUFFERS));
+    size = add_size(size, SimpleLruShmemSize(NUM_MXACTOFFSET_BUFFERS, 0));
+    size = add_size(size, SimpleLruShmemSize(NUM_MXACTMEMBER_BUFFERS, 0));
 
     return size;
 }
@@ -1306,10 +1306,10 @@ MultiXactShmemInit(void)
     MultiXactMemberCtl->PagePrecedes = MultiXactMemberPagePrecedes;
 
     SimpleLruInit(MultiXactOffsetCtl,
-                  "MultiXactOffset Ctl", NUM_MXACTOFFSET_BUFFERS,
+                  "MultiXactOffset Ctl", NUM_MXACTOFFSET_BUFFERS, 0,
                   MultiXactOffsetControlLock, "pg_multixact/offsets");
     SimpleLruInit(MultiXactMemberCtl,
-                  "MultiXactMember Ctl", NUM_MXACTMEMBER_BUFFERS,
+                  "MultiXactMember Ctl", NUM_MXACTMEMBER_BUFFERS, 0,
                   MultiXactMemberControlLock, "pg_multixact/members");
 
     /* Initialize our shared state struct */
@@ -1442,7 +1442,7 @@ StartupMultiXact(void)
         int slotno;
         MultiXactOffset *offptr;
 
-        slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, multi);
+        slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
         offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
         offptr += entryno;
 
@@ -1472,7 +1472,7 @@ StartupMultiXact(void)
         int slotno;
         TransactionId *xidptr;
 
-        slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, offset);
+        slotno = SimpleLruReadPage(MultiXactMemberCtl, pageno, true, offset);
         xidptr = (TransactionId *) MultiXactMemberCtl->shared->page_buffer[slotno];
         xidptr += entryno;
 

@@ -41,7 +41,7 @@
  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/access/transam/slru.c,v 1.40 2007/01/05 22:19:23 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/slru.c,v 1.41 2007/08/01 22:45:07 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -140,6 +140,8 @@ static SlruErrorCause slru_errcause;
 static int slru_errno;
 
 
+static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
+static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
 static bool SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno);
 static bool SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
                                   SlruFlush fdata);
@@ -152,7 +154,7 @@ static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
  */
 
 Size
-SimpleLruShmemSize(int nslots)
+SimpleLruShmemSize(int nslots, int nlsns)
 {
     Size sz;
 
@@ -165,18 +167,21 @@ SimpleLruShmemSize(int nslots)
     sz += MAXALIGN(nslots * sizeof(int)); /* page_lru_count[] */
     sz += MAXALIGN(nslots * sizeof(LWLockId)); /* buffer_locks[] */
 
+    if (nlsns > 0)
+        sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */
+
     return BUFFERALIGN(sz) + BLCKSZ * nslots;
 }
 
 void
-SimpleLruInit(SlruCtl ctl, const char *name, int nslots,
+SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
               LWLockId ctllock, const char *subdir)
 {
     SlruShared shared;
     bool found;
 
     shared = (SlruShared) ShmemInitStruct(name,
-                                          SimpleLruShmemSize(nslots),
+                                          SimpleLruShmemSize(nslots, nlsns),
                                           &found);
 
     if (!IsUnderPostmaster)
@@ -193,6 +198,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots,
         shared->ControlLock = ctllock;
 
         shared->num_slots = nslots;
+        shared->lsn_groups_per_page = nlsns;
 
         shared->cur_lru_count = 0;
 
@@ -212,8 +218,14 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots,
         offset += MAXALIGN(nslots * sizeof(int));
         shared->buffer_locks = (LWLockId *) (ptr + offset);
         offset += MAXALIGN(nslots * sizeof(LWLockId));
-        ptr += BUFFERALIGN(offset);
 
+        if (nlsns > 0)
+        {
+            shared->group_lsn = (XLogRecPtr *) (ptr + offset);
+            offset += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));
+        }
+
+        ptr += BUFFERALIGN(offset);
         for (slotno = 0; slotno < nslots; slotno++)
         {
             shared->page_buffer[slotno] = ptr;
@@ -266,15 +278,37 @@ SimpleLruZeroPage(SlruCtl ctl, int pageno)
     /* Set the buffer to zeroes */
     MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
 
+    /* Set the LSNs for this new page to zero */
+    SimpleLruZeroLSNs(ctl, slotno);
+
     /* Assume this page is now the latest active page */
     shared->latest_page_number = pageno;
 
     return slotno;
 }
 
+/*
+ * Zero all the LSNs we store for this slru page.
+ *
+ * This should be called each time we create a new page, and each time we read
+ * in a page from disk into an existing buffer. (Such an old page cannot
+ * have any interesting LSNs, since we'd have flushed them before writing
+ * the page in the first place.)
+ */
+static void
+SimpleLruZeroLSNs(SlruCtl ctl, int slotno)
+{
+    SlruShared shared = ctl->shared;
+
+    if (shared->lsn_groups_per_page > 0)
+        MemSet(&shared->group_lsn[slotno * shared->lsn_groups_per_page], 0,
+               shared->lsn_groups_per_page * sizeof(XLogRecPtr));
+}
+
 /*
  * Wait for any active I/O on a page slot to finish. (This does not
- * guarantee that new I/O hasn't been started before we return, though.)
+ * guarantee that new I/O hasn't been started before we return, though.
+ * In fact the slot might not even contain the same page anymore.)
  *
  * Control lock must be held at entry, and will be held at exit.
  */
@@ -305,8 +339,7 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
             /* indeed, the I/O must have failed */
             if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
                 shared->page_status[slotno] = SLRU_PAGE_EMPTY;
-            else
-                /* write_in_progress */
+            else /* write_in_progress */
             {
                 shared->page_status[slotno] = SLRU_PAGE_VALID;
                 shared->page_dirty[slotno] = true;
@@ -320,6 +353,11 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
  * Find a page in a shared buffer, reading it in if necessary.
  * The page number must correspond to an already-initialized page.
  *
+ * If write_ok is true then it is OK to return a page that is in
+ * WRITE_IN_PROGRESS state; it is the caller's responsibility to be sure
+ * that modification of the page is safe. If write_ok is false then we
+ * will not return the page until it is not undergoing active I/O.
+ *
  * The passed-in xid is used only for error reporting, and may be
  * InvalidTransactionId if no specific xid is associated with the action.
  *
@@ -329,7 +367,8 @@ SimpleLruWaitIO(SlruCtl ctl, int slotno)
  * Control lock must be held at entry, and will be held at exit.
  */
 int
-SimpleLruReadPage(SlruCtl ctl, int pageno, TransactionId xid)
+SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
+                  TransactionId xid)
 {
     SlruShared shared = ctl->shared;
 
@@ -346,8 +385,13 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, TransactionId xid)
         if (shared->page_number[slotno] == pageno &&
             shared->page_status[slotno] != SLRU_PAGE_EMPTY)
         {
-            /* If page is still being read in, we must wait for I/O */
-            if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
+            /*
+             * If page is still being read in, we must wait for I/O. Likewise
+             * if the page is being written and the caller said that's not OK.
+             */
+            if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
+                (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
+                 !write_ok))
             {
                 SimpleLruWaitIO(ctl, slotno);
                 /* Now we must recheck state from the top */
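The effect of the new write_ok argument can be isolated into a small predicate, sketched below. The enum and both functions are hypothetical stand-ins written for this example (the real code tests `shared->page_status[slotno]` inline), and LSN 0 again plays the role of InvalidXLogRecPtr.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the SLRU page states named in the diff. */
typedef enum SketchPageStatus
{
    SKETCH_PAGE_EMPTY,
    SKETCH_PAGE_READ_IN_PROGRESS,
    SKETCH_PAGE_VALID,
    SKETCH_PAGE_WRITE_IN_PROGRESS
} SketchPageStatus;

/* Must the caller wait for I/O before using a page found in a buffer? */
static bool
sketch_must_wait_for_io(SketchPageStatus status, bool write_ok)
{
    /* A page still being read in is never usable yet. */
    if (status == SKETCH_PAGE_READ_IN_PROGRESS)
        return true;

    /* A page being written is usable only if the caller said so. */
    return status == SKETCH_PAGE_WRITE_IN_PROGRESS && !write_ok;
}

/*
 * How the clog status update chooses write_ok: only an async commit
 * (valid lsn) needs to keep its update out of an in-flight write.
 */
static bool
sketch_write_ok_for_status_update(uint64_t lsn)
{
    return lsn == 0;    /* 0 stands in for InvalidXLogRecPtr */
}
```

Under these assumptions, sync commits and aborts (invalid lsn) behave as before the patch and may modify a write-busy page, while an async commit waits so that its status bits cannot reach disk ahead of the WAL flush.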
@@ -383,6 +427,9 @@ SimpleLruReadPage(SlruCtl ctl, int pageno, TransactionId xid)
         /* Do the read */
         ok = SlruPhysicalReadPage(ctl, pageno, slotno);
 
+        /* Set the LSNs for this newly read-in page to zero */
+        SimpleLruZeroLSNs(ctl, slotno);
+
         /* Re-acquire control lock and update page state */
         LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
 
@@ -443,7 +490,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
     LWLockRelease(shared->ControlLock);
     LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
 
-    return SimpleLruReadPage(ctl, pageno, xid);
+    return SimpleLruReadPage(ctl, pageno, true, xid);
 }
 
 /*
@@ -621,6 +668,47 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
     char path[MAXPGPATH];
     int fd = -1;
 
+    /*
+     * Honor the write-WAL-before-data rule, if appropriate, so that we do
+     * not write out data before associated WAL records. This is the same
+     * action performed during FlushBuffer() in the main buffer manager.
+     */
+    if (shared->group_lsn != NULL)
+    {
+        /*
+         * We must determine the largest async-commit LSN for the page.
+         * This is a bit tedious, but since this entire function is a slow
+         * path anyway, it seems better to do this here than to maintain
+         * a per-page LSN variable (which'd need an extra comparison in the
+         * transaction-commit path).
+         */
+        XLogRecPtr max_lsn;
+        int lsnindex, lsnoff;
+
+        lsnindex = slotno * shared->lsn_groups_per_page;
+        max_lsn = shared->group_lsn[lsnindex++];
+        for (lsnoff = 1; lsnoff < shared->lsn_groups_per_page; lsnoff++)
+        {
+            XLogRecPtr this_lsn = shared->group_lsn[lsnindex++];
+
+            if (XLByteLT(max_lsn, this_lsn))
+                max_lsn = this_lsn;
+        }
+
+        if (!XLogRecPtrIsInvalid(max_lsn))
+        {
+            /*
+             * As noted above, elog(ERROR) is not acceptable here, so if
+             * XLogFlush were to fail, we must PANIC. This isn't much of
+             * a restriction because XLogFlush is just about all critical
+             * section anyway, but let's make sure.
+             */
+            START_CRIT_SECTION();
+            XLogFlush(max_lsn);
+            END_CRIT_SECTION();
+        }
+    }
+
     /*
      * During a Flush, we may already have the desired file open.
      */

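The "largest async-commit LSN for the page" scan added to SlruPhysicalWritePage can be tried in isolation with the sketch below. It assumes LSNs are plain 64-bit integers with 0 meaning invalid, and `sketch_max_group_lsn` is a hypothetical helper rather than a function from slru.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the highest cached LSN among the groups belonging to one buffer
 * slot. group_lsn is the flat array shared by all slots, laid out as
 * lsn_groups_per_page consecutive entries per slot. A result of 0 means
 * no WAL flush is needed before writing the page.
 */
static uint64_t
sketch_max_group_lsn(const uint64_t *group_lsn, int slotno,
                     int lsn_groups_per_page)
{
    int         base;
    int         i;
    uint64_t    max_lsn;

    assert(slotno >= 0);
    assert(lsn_groups_per_page > 0);

    base = slotno * lsn_groups_per_page;
    max_lsn = group_lsn[base];
    for (i = 1; i < lsn_groups_per_page; i++)
    {
        if (max_lsn < group_lsn[base + i])
            max_lsn = group_lsn[base + i];
    }
    return max_lsn;
}
```

The design choice noted in the patch comment is visible here: the commit path only updates one group entry, and the full scan is paid for on the already slow page-write path.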
@@ -22,7 +22,7 @@
  * Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.18 2007/01/05 22:19:23 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/subtrans.c,v 1.19 2007/08/01 22:45:07 tgl Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -78,7 +78,7 @@ SubTransSetParent(TransactionId xid, TransactionId parent)
 
     LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
 
-    slotno = SimpleLruReadPage(SubTransCtl, pageno, xid);
+    slotno = SimpleLruReadPage(SubTransCtl, pageno, true, xid);
     ptr = (TransactionId *) SubTransCtl->shared->page_buffer[slotno];
     ptr += entryno;
 
@@ -165,14 +165,14 @@ SubTransGetTopmostTransaction(TransactionId xid)
 Size
 SUBTRANSShmemSize(void)
 {
-    return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS);
+    return SimpleLruShmemSize(NUM_SUBTRANS_BUFFERS, 0);
 }
 
 void
 SUBTRANSShmemInit(void)
 {
     SubTransCtl->PagePrecedes = SubTransPagePrecedes;
-    SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS,
+    SimpleLruInit(SubTransCtl, "SUBTRANS Ctl", NUM_SUBTRANS_BUFFERS, 0,
                   SubtransControlLock, "pg_subtrans");
     /* Override default assumption that writes should be fsync'd */
     SubTransCtl->do_fsync = false;

@@ -8,7 +8,7 @@
  *
  *
  * IDENTIFICATION
- * $PostgreSQL: pgsql/src/backend/access/transam/transam.c,v 1.69 2007/01/05 22:19:23 momjian Exp $
+ * $PostgreSQL: pgsql/src/backend/access/transam/transam.c,v 1.70 2007/08/01 22:45:07 tgl Exp $
  *
  * NOTES
  * This file contains the high level access-method interface to the
@@ -27,14 +27,17 @@
 
 static XidStatus TransactionLogFetch(TransactionId transactionId);
 static void TransactionLogUpdate(TransactionId transactionId,
-                     XidStatus status);
+                     XidStatus status, XLogRecPtr lsn);
 
-/* ----------------
- * Single-item cache for results of TransactionLogFetch.
- * ----------------
+/*
+ * Single-item cache for results of TransactionLogFetch.
  */
 static TransactionId cachedFetchXid = InvalidTransactionId;
 static XidStatus cachedFetchXidStatus;
+static XLogRecPtr cachedCommitLSN;
+
+/* Handy constant for an invalid xlog recptr */
+static const XLogRecPtr InvalidXLogRecPtr = {0, 0};
 
 
 /* ----------------------------------------------------------------
@@ -52,6 +55,7 @@ static XidStatus
|
||||
TransactionLogFetch(TransactionId transactionId)
|
||||
{
|
||||
XidStatus xidstatus;
|
||||
XLogRecPtr xidlsn;
|
||||
|
||||
/*
|
||||
* Before going to the commit log manager, check our single item cache to
|
||||
@@ -73,9 +77,9 @@ TransactionLogFetch(TransactionId transactionId)
|
||||
}
|
||||
|
||||
/*
|
||||
* Get the status.
|
||||
* Get the transaction status.
|
||||
*/
|
||||
xidstatus = TransactionIdGetStatus(transactionId);
|
||||
xidstatus = TransactionIdGetStatus(transactionId, &xidlsn);
|
||||
|
||||
/*
|
||||
* DO NOT cache status for unfinished or sub-committed transactions! We
|
||||
@@ -84,8 +88,9 @@ TransactionLogFetch(TransactionId transactionId)
|
||||
if (xidstatus != TRANSACTION_STATUS_IN_PROGRESS &&
|
||||
xidstatus != TRANSACTION_STATUS_SUB_COMMITTED)
|
||||
{
|
||||
TransactionIdStore(transactionId, &cachedFetchXid);
|
||||
cachedFetchXid = transactionId;
|
||||
cachedFetchXidStatus = xidstatus;
|
||||
cachedCommitLSN = xidlsn;
|
||||
}
|
||||
|
||||
return xidstatus;
|
||||
@@ -93,16 +98,19 @@ TransactionLogFetch(TransactionId transactionId)
|
||||
|
||||
/* --------------------------------
|
||||
* TransactionLogUpdate
|
||||
*
|
||||
* Store the new status of a transaction. The commit record LSN must be
|
||||
* passed when recording an async commit; else it should be InvalidXLogRecPtr.
|
||||
* --------------------------------
|
||||
*/
|
||||
static void
|
||||
TransactionLogUpdate(TransactionId transactionId, /* trans id to update */
|
||||
XidStatus status) /* new trans status */
|
||||
static inline void
|
||||
TransactionLogUpdate(TransactionId transactionId,
|
||||
XidStatus status, XLogRecPtr lsn)
|
||||
{
|
||||
/*
|
||||
* update the commit log
|
||||
*/
|
||||
TransactionIdSetStatus(transactionId, status);
|
||||
TransactionIdSetStatus(transactionId, status, lsn);
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -111,15 +119,16 @@ TransactionLogUpdate(TransactionId transactionId, /* trans id to update */
|
||||
* Update multiple transaction identifiers to a given status.
|
||||
* Don't depend on this being atomic; it's not.
|
||||
*/
|
||||
static void
|
||||
TransactionLogMultiUpdate(int nxids, TransactionId *xids, XidStatus status)
|
||||
static inline void
|
||||
TransactionLogMultiUpdate(int nxids, TransactionId *xids,
|
||||
XidStatus status, XLogRecPtr lsn)
|
||||
{
|
||||
int i;
|
||||
|
||||
Assert(nxids != 0);
|
||||
|
||||
for (i = 0; i < nxids; i++)
|
||||
TransactionIdSetStatus(xids[i], status);
|
||||
TransactionIdSetStatus(xids[i], status, lsn);
|
||||
}
|
||||
|
||||
/* ----------------------------------------------------------------
|
||||
@@ -269,31 +278,49 @@ TransactionIdDidAbort(TransactionId transactionId)
|
||||
void
|
||||
TransactionIdCommit(TransactionId transactionId)
|
||||
{
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_COMMITTED);
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_COMMITTED,
|
||||
InvalidXLogRecPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
* TransactionIdAsyncCommit
|
||||
* Same as above, but for async commits. The commit record LSN is needed.
|
||||
*/
|
||||
void
|
||||
TransactionIdAsyncCommit(TransactionId transactionId, XLogRecPtr lsn)
|
||||
{
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_COMMITTED, lsn);
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* TransactionIdAbort
|
||||
* Aborts the transaction associated with the identifier.
|
||||
*
|
||||
* Note:
|
||||
* Assumes transaction identifier is valid.
|
||||
* No async version of this is needed.
|
||||
*/
|
||||
void
|
||||
TransactionIdAbort(TransactionId transactionId)
|
||||
{
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_ABORTED);
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_ABORTED,
|
||||
InvalidXLogRecPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
* TransactionIdSubCommit
|
||||
* Marks the subtransaction associated with the identifier as
|
||||
* sub-committed.
|
||||
*
|
||||
* Note:
|
||||
* No async version of this is needed.
|
||||
*/
|
||||
void
|
||||
TransactionIdSubCommit(TransactionId transactionId)
|
||||
{
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_SUB_COMMITTED);
|
||||
TransactionLogUpdate(transactionId, TRANSACTION_STATUS_SUB_COMMITTED,
|
||||
InvalidXLogRecPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -309,9 +336,23 @@ void
|
||||
TransactionIdCommitTree(int nxids, TransactionId *xids)
|
||||
{
|
||||
if (nxids > 0)
|
||||
TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_COMMITTED);
|
||||
TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_COMMITTED,
|
||||
InvalidXLogRecPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
* TransactionIdAsyncCommitTree
|
||||
* Same as above, but for async commits. The commit record LSN is needed.
|
||||
*/
|
||||
void
|
||||
TransactionIdAsyncCommitTree(int nxids, TransactionId *xids, XLogRecPtr lsn)
|
||||
{
|
||||
if (nxids > 0)
|
||||
TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_COMMITTED,
|
||||
lsn);
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* TransactionIdAbortTree
|
||||
* Marks all the given transaction ids as aborted.
|
||||
@@ -323,7 +364,8 @@ void
|
||||
TransactionIdAbortTree(int nxids, TransactionId *xids)
|
||||
{
|
||||
if (nxids > 0)
|
||||
TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_ABORTED);
|
||||
TransactionLogMultiUpdate(nxids, xids, TRANSACTION_STATUS_ABORTED,
|
||||
InvalidXLogRecPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -389,3 +431,43 @@ TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
|
||||
diff = (int32) (id1 - id2);
|
||||
return (diff >= 0);
|
||||
}
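The tail of TransactionIdFollowsOrEquals shown above relies on modulo-2^32 arithmetic so that XID comparisons keep working across wraparound. A minimal standalone sketch of that idea follows. It assumes both XIDs are normal (not special) XIDs; the typedef and the function name `xid_follows_or_equals` are local stand-ins for illustration, not the backend's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;			/* stand-in for the backend typedef */

#define FirstNormalTransactionId ((TransactionId) 3)

/*
 * Returns true if id1 is logically >= id2, treating the XID space as
 * circular. Only meaningful when both arguments are normal XIDs.
 */
static bool
xid_follows_or_equals(TransactionId id1, TransactionId id2)
{
	int32_t		diff;

	assert(id1 >= FirstNormalTransactionId && id2 >= FirstNormalTransactionId);

	/* unsigned subtraction wraps; the signed cast recovers the direction */
	diff = (int32_t) (id1 - id2);
	return diff >= 0;
}
```

With this comparison, an XID just after wraparound (for example 10) is treated as following an XID just before it (for example 4294967290), which a plain unsigned comparison would get backwards.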
|
||||
|
||||
/*
|
||||
* TransactionIdGetCommitLSN
|
||||
*
|
||||
* This function returns an LSN that is late enough to be able
|
||||
* to guarantee that if we flush up to the LSN returned then we
|
||||
* will have flushed the transaction's commit record to disk.
|
||||
*
|
||||
* The result is not necessarily the exact LSN of the transaction's
|
||||
* commit record! For example, for long-past transactions (those whose
|
||||
* clog pages already migrated to disk), we'll return InvalidXLogRecPtr.
|
||||
* Also, because we group transactions on the same clog page to conserve
|
||||
* storage, we might return the LSN of a later transaction that falls into
|
||||
* the same group.
|
||||
*/
|
||||
XLogRecPtr
|
||||
TransactionIdGetCommitLSN(TransactionId xid)
|
||||
{
|
||||
XLogRecPtr result;
|
||||
|
||||
/*
|
||||
* Currently, all uses of this function are for xids that were just
|
||||
* reported to be committed by TransactionLogFetch, so we expect that
|
||||
* checking TransactionLogFetch's cache will usually succeed and avoid an
|
||||
* extra trip to shared memory.
|
||||
*/
|
||||
if (TransactionIdEquals(xid, cachedFetchXid))
|
||||
return cachedCommitLSN;
|
||||
|
||||
/* Special XIDs are always known committed */
|
||||
if (!TransactionIdIsNormal(xid))
|
||||
return InvalidXLogRecPtr;
|
||||
|
||||
/*
|
||||
* Get the transaction status.
|
||||
*/
|
||||
(void) TransactionIdGetStatus(xid, &result);
|
||||
|
||||
return result;
|
||||
}
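The interaction between TransactionLogFetch's single-item cache and TransactionIdGetCommitLSN can be modelled outside the backend. The sketch below is a simplified illustration only: `slow_get_status` is a hypothetical stand-in for the clog lookup (it fabricates a status and an LSN from the XID), and the other lowercase names are invented for the example. It shows the two properties the diff depends on: unfinished and sub-committed states are never cached, and a commit-LSN request for the XID just fetched is answered from the cache.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;
typedef enum { XS_IN_PROGRESS, XS_COMMITTED, XS_ABORTED, XS_SUB_COMMITTED } XidStatus;

#define InvalidTransactionId ((TransactionId) 0)

static TransactionId cached_xid = InvalidTransactionId;
static XidStatus cached_status;
static XLogRecPtr cached_lsn;
static int	slow_lookups = 0;	/* counts trips to the "shared" store */

/* Hypothetical stand-in for the clog lookup: even XIDs are committed. */
static XidStatus
slow_get_status(TransactionId xid, XLogRecPtr *lsn)
{
	slow_lookups++;
	lsn->xlogid = 0;
	lsn->xrecoff = xid * 16;
	return (xid % 2 == 0) ? XS_COMMITTED : XS_IN_PROGRESS;
}

static XidStatus
fetch_status(TransactionId xid)
{
	XLogRecPtr	lsn;
	XidStatus	status;

	assert(xid != InvalidTransactionId);
	if (xid == cached_xid)
		return cached_status;

	status = slow_get_status(xid, &lsn);

	/* never cache states that can still change */
	if (status != XS_IN_PROGRESS && status != XS_SUB_COMMITTED)
	{
		cached_xid = xid;
		cached_status = status;
		cached_lsn = lsn;
	}
	return status;
}

static XLogRecPtr
get_commit_lsn(TransactionId xid)
{
	XLogRecPtr	lsn;

	if (xid == cached_xid)
		return cached_lsn;		/* usual case: status was just fetched */
	(void) slow_get_status(xid, &lsn);
	return lsn;
}
```

Calling `fetch_status(4)` followed by `get_commit_lsn(4)` costs one lookup in this model, while repeated `fetch_status(5)` calls each pay for a lookup because an in-progress result is never cached.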
|
||||
|
||||
@@ -7,7 +7,7 @@
|
||||
* Portions Copyright (c) 1994, Regents of the University of California
|
||||
*
|
||||
* IDENTIFICATION
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/twophase.c,v 1.31 2007/05/27 03:50:39 tgl Exp $
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/twophase.c,v 1.32 2007/08/01 22:45:07 tgl Exp $
|
||||
*
|
||||
* NOTES
|
||||
* Each global transaction is associated with a global transaction
|
||||
@@ -1706,7 +1706,11 @@ RecordTransactionCommitPrepared(TransactionId xid,
|
||||
XLOG_XACT_COMMIT_PREPARED | XLOG_NO_TRAN,
|
||||
rdata);
|
||||
|
||||
/* we don't currently try to sleep before flush here ... */
|
||||
/*
|
||||
* We don't currently try to sleep before flush here ... nor is there
|
||||
* any support for async commit of a prepared xact (the very idea is
|
||||
* probably a contradiction)
|
||||
*/
|
||||
|
||||
/* Flush XLOG to disk */
|
||||
XLogFlush(recptr);
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
*
|
||||
*
|
||||
* IDENTIFICATION
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.245 2007/06/07 21:45:58 tgl Exp $
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/xact.c,v 1.246 2007/08/01 22:45:07 tgl Exp $
|
||||
*
|
||||
*-------------------------------------------------------------------------
|
||||
*/
|
||||
@@ -55,6 +55,8 @@ int XactIsoLevel;
|
||||
bool DefaultXactReadOnly = false;
|
||||
bool XactReadOnly;
|
||||
|
||||
bool XactSyncCommit = true;
|
||||
|
||||
int CommitDelay = 0; /* precommit delay in microseconds */
|
||||
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
|
||||
|
||||
@@ -174,6 +176,11 @@ static TimestampTz xactStopTimestamp;
|
||||
*/
|
||||
static char *prepareGID;
|
||||
|
||||
/*
|
||||
* Some commands want to force synchronous commit.
|
||||
*/
|
||||
static bool forceSyncCommit = false;
|
||||
|
||||
/*
|
||||
* Private context for transaction-abort work --- we reserve space for this
|
||||
* at startup to ensure that AbortTransaction and AbortSubTransaction can work
|
||||
@@ -554,6 +561,18 @@ CommandCounterIncrement(void)
|
||||
AtStart_Cache();
|
||||
}
|
||||
|
||||
/*
|
||||
* ForceSyncCommit
|
||||
*
|
||||
* Interface routine to allow commands to force a synchronous commit of the
|
||||
* current top-level transaction
|
||||
*/
|
||||
void
|
||||
ForceSyncCommit(void)
|
||||
{
|
||||
forceSyncCommit = true;
|
||||
}
|
||||
|
||||
|
||||
/* ----------------------------------------------------------------
|
||||
* StartTransaction stuff
|
||||
@@ -724,6 +743,7 @@ RecordTransactionCommit(void)
|
||||
{
|
||||
TransactionId xid = GetCurrentTransactionId();
|
||||
bool madeTCentries;
|
||||
bool isAsyncCommit = false;
|
||||
XLogRecPtr recptr;
|
||||
|
||||
/* Tell bufmgr and smgr to prepare for commit */
|
||||
@@ -810,21 +830,44 @@ RecordTransactionCommit(void)
|
||||
if (MyXactMadeXLogEntry)
|
||||
{
|
||||
/*
|
||||
* Sleep before flush! So we can flush more than one commit
|
||||
* record per single fsync. (The idea is some other backend may
|
||||
* do the XLogFlush while we're sleeping. This needs work still,
|
||||
* because on most Unixen, the minimum select() delay is 10msec or
|
||||
* more, which is way too long.)
|
||||
*
|
||||
* We do not sleep if enableFsync is not turned on, nor if there
|
||||
* are fewer than CommitSiblings other backends with active
|
||||
* transactions.
|
||||
* If the user has set synchronous_commit = off, and we're
|
||||
* not doing cleanup of any rels nor committing any command
|
||||
* that wanted to force sync commit, then we can defer fsync.
|
||||
*/
|
||||
if (CommitDelay > 0 && enableFsync &&
|
||||
CountActiveBackends() >= CommitSiblings)
|
||||
pg_usleep(CommitDelay);
|
||||
if (XactSyncCommit || forceSyncCommit || nrels > 0)
|
||||
{
|
||||
/*
|
||||
* Synchronous commit case.
|
||||
*
|
||||
* Sleep before flush! So we can flush more than one commit
|
||||
* record per single fsync. (The idea is some other backend
|
||||
* may do the XLogFlush while we're sleeping. This needs work
|
||||
* still, because on most Unixen, the minimum select() delay
|
||||
* is 10msec or more, which is way too long.)
|
||||
*
|
||||
* We do not sleep if enableFsync is not turned on, nor if
|
||||
* there are fewer than CommitSiblings other backends with
|
||||
* active transactions.
|
||||
*/
|
||||
if (CommitDelay > 0 && enableFsync &&
|
||||
CountActiveBackends() >= CommitSiblings)
|
||||
pg_usleep(CommitDelay);
|
||||
|
||||
XLogFlush(recptr);
|
||||
XLogFlush(recptr);
|
||||
}
|
||||
else
|
||||
{
|
||||
/*
|
||||
* Asynchronous commit case.
|
||||
*/
|
||||
isAsyncCommit = true;
|
||||
|
||||
/*
|
||||
* Report the latest async commit LSN, so that
|
||||
* the WAL writer knows to flush this commit.
|
||||
*/
|
||||
XLogSetAsyncCommitLSN(recptr);
|
||||
}
|
||||
}
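The choice between the two commit paths above reduces to a single predicate. The following sketch isolates it as a pure function so the cases can be checked in isolation; the enum and the function name `choose_commit_mode` are invented for illustration, and the three parameters mirror XactSyncCommit, forceSyncCommit and nrels from the diff.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { COMMIT_SYNC_FLUSH, COMMIT_ASYNC_DEFER } CommitMode;

/*
 * Decide whether the commit record must be flushed before returning.
 * A flush is deferred only when the user asked for asynchronous commit,
 * no command forced a synchronous commit, and no relations are being
 * deleted at commit.
 */
static CommitMode
choose_commit_mode(bool xact_sync_commit, bool force_sync_commit, int nrels)
{
	assert(nrels >= 0);

	if (xact_sync_commit || force_sync_commit || nrels > 0)
		return COMMIT_SYNC_FLUSH;
	return COMMIT_ASYNC_DEFER;
}
```

Only the combination `choose_commit_mode(false, false, 0)` yields the deferred path; any one of the three conditions is enough to force a flush.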
|
||||
|
||||
/*
|
||||
@@ -835,12 +878,24 @@ RecordTransactionCommit(void)
|
||||
* emitted an XLOG record for our commit, and so in the event of a
|
||||
* crash the clog update might be lost. This is okay because no one
|
||||
* else will ever care whether we committed.
|
||||
*
|
||||
* The recptr here refers to the last xlog entry by this transaction
|
||||
* so is the correct value to use for setting the clog.
|
||||
*/
|
||||
if (madeTCentries || MyXactMadeTempRelUpdate)
|
||||
{
|
||||
TransactionIdCommit(xid);
|
||||
/* to avoid race conditions, the parent must commit first */
|
||||
TransactionIdCommitTree(nchildren, children);
|
||||
if (isAsyncCommit)
|
||||
{
|
||||
TransactionIdAsyncCommit(xid, recptr);
|
||||
/* to avoid race conditions, the parent must commit first */
|
||||
TransactionIdAsyncCommitTree(nchildren, children, recptr);
|
||||
}
|
||||
else
|
||||
{
|
||||
TransactionIdCommit(xid);
|
||||
/* to avoid race conditions, the parent must commit first */
|
||||
TransactionIdCommitTree(nchildren, children);
|
||||
}
|
||||
}
|
||||
|
||||
/* Checkpoint can proceed now */
|
||||
@@ -1406,6 +1461,7 @@ StartTransaction(void)
|
||||
FreeXactSnapshot();
|
||||
XactIsoLevel = DefaultXactIsoLevel;
|
||||
XactReadOnly = DefaultXactReadOnly;
|
||||
forceSyncCommit = false;
|
||||
|
||||
/*
|
||||
* reinitialize within-transaction counters
|
||||
|
||||
@@ -7,7 +7,7 @@
|
||||
* Portions Copyright (c) 1996-2007, PostgreSQL Global Development Group
|
||||
* Portions Copyright (c) 1994, Regents of the University of California
|
||||
*
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.275 2007/07/24 04:54:08 tgl Exp $
|
||||
* $PostgreSQL: pgsql/src/backend/access/transam/xlog.c,v 1.276 2007/08/01 22:45:08 tgl Exp $
|
||||
*
|
||||
*-------------------------------------------------------------------------
|
||||
*/
|
||||
@@ -305,6 +305,7 @@ typedef struct XLogCtlData
|
||||
XLogwrtResult LogwrtResult;
|
||||
uint32 ckptXidEpoch; /* nextXID & epoch of latest checkpoint */
|
||||
TransactionId ckptXid;
|
||||
XLogRecPtr asyncCommitLSN; /* LSN of newest async commit */
|
||||
|
||||
/* Protected by WALWriteLock: */
|
||||
XLogCtlWrite Write;
|
||||
@@ -1643,6 +1644,22 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
|
||||
Write->LogwrtResult = LogwrtResult;
|
||||
}
|
||||
|
||||
/*
|
||||
* Record the LSN for an asynchronous transaction commit.
|
||||
* (This should not be called for aborts, nor for synchronous commits.)
|
||||
*/
|
||||
void
|
||||
XLogSetAsyncCommitLSN(XLogRecPtr asyncCommitLSN)
|
||||
{
|
||||
/* use volatile pointer to prevent code rearrangement */
|
||||
volatile XLogCtlData *xlogctl = XLogCtl;
|
||||
|
||||
SpinLockAcquire(&xlogctl->info_lck);
|
||||
if (XLByteLT(xlogctl->asyncCommitLSN, asyncCommitLSN))
|
||||
xlogctl->asyncCommitLSN = asyncCommitLSN;
|
||||
SpinLockRelease(&xlogctl->info_lck);
|
||||
}
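XLogSetAsyncCommitLSN only ever moves the shared value forward, so a backend reporting an older commit cannot pull it back. The sketch below models that pattern in standalone C. It is an approximation under stated assumptions: a pthread mutex stands in for the info_lck spinlock, the two-field XLogRecPtr is redeclared locally, and the lowercase names (`lsn_less_than`, `set_async_commit_lsn`, `get_async_commit_lsn`) are invented for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t xlogid; uint32_t xrecoff; } XLogRecPtr;

/* Lexicographic comparison on (xlogid, xrecoff), in the spirit of XLByteLT */
static bool
lsn_less_than(XLogRecPtr a, XLogRecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff < b.xrecoff);
}

/* Assumption: a mutex replaces the backend's spinlock in this model */
static pthread_mutex_t info_lock = PTHREAD_MUTEX_INITIALIZER;
static XLogRecPtr async_commit_lsn = {0, 0};

/* Advance the shared LSN if the caller's LSN is newer; never move it back */
static void
set_async_commit_lsn(XLogRecPtr lsn)
{
	pthread_mutex_lock(&info_lock);
	if (lsn_less_than(async_commit_lsn, lsn))
		async_commit_lsn = lsn;
	/* postcondition: the stored value is not behind the caller's LSN */
	assert(!lsn_less_than(async_commit_lsn, lsn));
	pthread_mutex_unlock(&info_lock);
}

/* Read the value a background flusher would use as its flush target */
static XLogRecPtr
get_async_commit_lsn(void)
{
	XLogRecPtr	result;

	pthread_mutex_lock(&info_lock);
	result = async_commit_lsn;
	pthread_mutex_unlock(&info_lock);
	return result;
}
```

Because the update is a guarded maximum, callers may report their commit LSNs in any order and the reader still sees the newest one.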
|
||||
|
||||
/*
|
||||
* Ensure that all XLOG data through the given position is flushed to disk.
|
||||
*
|
||||
@@ -1797,19 +1814,17 @@ XLogBackgroundFlush(void)
|
||||
/* back off to last completed page boundary */
|
||||
WriteRqstPtr.xrecoff -= WriteRqstPtr.xrecoff % XLOG_BLCKSZ;
|
||||
|
||||
#ifdef NOT_YET /* async commit patch is still to come */
|
||||
/* if we have already flushed that far, consider async commit records */
|
||||
if (XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
|
||||
{
|
||||
/* use volatile pointer to prevent code rearrangement */
|
||||
volatile XLogCtlData *xlogctl = XLogCtl;
|
||||
|
||||
SpinLockAcquire(&xlogctl->async_commit_lck);
|
||||
SpinLockAcquire(&xlogctl->info_lck);
|
||||
WriteRqstPtr = xlogctl->asyncCommitLSN;
|
||||
SpinLockRelease(&xlogctl->async_commit_lck);
|
||||
SpinLockRelease(&xlogctl->info_lck);
|
||||
flexible = false; /* ensure it all gets written */
|
||||
}
|
||||
#endif
|
||||
|
||||
/* Done if already known flushed */
|
||||
if (XLByteLE(WriteRqstPtr, LogwrtResult.Flush))
|
||||
@@ -1841,6 +1856,23 @@ XLogBackgroundFlush(void)
|
||||
END_CRIT_SECTION();
|
||||
}
|
||||
|
||||
/*
|
||||
* Flush any previous asynchronously-committed transactions' commit records.
|
||||
*/
|
||||
void
|
||||
XLogAsyncCommitFlush(void)
|
||||
{
|
||||
XLogRecPtr WriteRqstPtr;
|
||||
/* use volatile pointer to prevent code rearrangement */
|
||||
volatile XLogCtlData *xlogctl = XLogCtl;
|
||||
|
||||
SpinLockAcquire(&xlogctl->info_lck);
|
||||
WriteRqstPtr = xlogctl->asyncCommitLSN;
|
||||
SpinLockRelease(&xlogctl->info_lck);
|
||||
|
||||
XLogFlush(WriteRqstPtr);
|
||||
}
|
||||
|
||||
/*
|
||||
* Test whether XLOG data has been flushed up to (at least) the given position.
|
||||
*
|
||||
@@ -5466,7 +5498,7 @@ ShutdownXLOG(int code, Datum arg)
|
||||
(errmsg("database system is shut down")));
|
||||
}
|
||||
|
||||
/*
|
||||
/*
|
||||
* Log start of a checkpoint.
|
||||
*/
|
||||
static void
|
||||
@@ -5481,7 +5513,7 @@ LogCheckpointStart(int flags)
|
||||
(flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
|
||||
}
|
||||
|
||||
/*
|
||||
/*
|
||||
* Log end of a checkpoint.
|
||||
*/
|
||||
static void
|
||||
@@ -5523,7 +5555,7 @@ LogCheckpointEnd(void)
|
||||
* flags is a bitwise OR of the following:
|
||||
* CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
|
||||
* CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
|
||||
* ignoring checkpoint_completion_target parameter.
|
||||
* ignoring checkpoint_completion_target parameter.
|
||||
* CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
|
||||
* since the last one (implied by CHECKPOINT_IS_SHUTDOWN).
|
||||
*
|
||||
|
||||