
Split work of bgwriter between 2 processes: bgwriter and checkpointer.

bgwriter is now a much less important process, responsible for page
cleaning duties only. checkpointer is now responsible for checkpoints
and so has a key role in shutdown. Later patches will correct doc
references to the now-outdated idea that bgwriter performs checkpoints.

Has a beneficial effect on performance at high write rates, but this is
mainly refactoring to more easily allow changes for power reduction,
by simplifying the previously tortuous code required to allow page
cleaning and checkpointing to time-slice in the same process.

Patch by me, Review by Dickson Guedes
Author: Simon Riggs
Date:   2011-11-01 17:14:47 +00:00
Parent: 589adb86ee
Commit: 806a2aee37

11 changed files with 124 additions and 987 deletions

src/backend/storage/smgr/md.c
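
To picture the new division of labour, here is a minimal, self-contained C sketch (illustrative names only, not PostgreSQL source): each duty gets a dedicated process with a simple loop of its own, rather than one loop that time-slices both.

#include <stdbool.h>
#include <unistd.h>

static void clean_dirty_pages(void) { /* write out a few dirty buffers */ }
static void run_checkpoint(void)    { /* write and fsync everything */ }
static bool checkpoint_due(void)    { return false; }

/* bgwriter after this commit: page-cleaning duties only. */
static void bgwriter_loop(void)
{
    for (;;)
    {
        clean_dirty_pages();
        sleep(1);               /* think bgwriter_delay */
    }
}

/* checkpointer: owns checkpoints, hence its key role in shutdown. */
static void checkpointer_loop(void)
{
    for (;;)
    {
        if (checkpoint_due())
            run_checkpoint();
        sleep(1);
    }
}

int main(void)
{
    if (fork() == 0)            /* postmaster-style: one process per duty */
        bgwriter_loop();
    else
        checkpointer_loop();
    return 0;
}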

@@ -38,7 +38,7 @@
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
- * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
  * See comments there before making changes here.
  */
@@ -77,7 +77,7 @@
  * Inactive segments are those that once contained data but are currently
  * not needed because of an mdtruncate() operation. The reason for leaving
  * them present at size zero, rather than unlinking them, is that other
- * backends and/or the bgwriter might be holding open file references to
+ * backends and/or the checkpointer might be holding open file references to
  * such segments. If the relation expands again after mdtruncate(), such
  * that a deactivated segment becomes active again, it is important that
  * such file references still be valid --- else data might get written
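
A tiny POSIX sketch (hypothetical file name, not PostgreSQL code) shows why truncating to zero is friendlier than unlinking: a descriptor held across ftruncate() stays valid, and its later writes are visible to any process that re-opens the file by name, which would not be true after unlink().

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("segment.1", O_RDWR | O_CREAT, 0600);

    /* mdtruncate()-style deactivation: keep the file, drop its data. */
    (void) ftruncate(fd, 0);

    /* If the relation later expands again, the old descriptor is still
     * usable, and the data lands in the file everyone else can see. */
    (void) write(fd, "page", 4);

    close(fd);
    return 0;
}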
@@ -111,7 +111,7 @@ static MemoryContext MdCxt; /* context for all md.c allocations */
 /*
- * In some contexts (currently, standalone backends and the bgwriter process)
+ * In some contexts (currently, standalone backends and the checkpointer process)
  * we keep track of pending fsync operations: we need to remember all relation
  * segments that have been written since the last checkpoint, so that we can
  * fsync them down to disk before completing the next checkpoint. This hash
@@ -123,7 +123,7 @@ static MemoryContext MdCxt; /* context for all md.c allocations */
  * a hash table, because we don't expect there to be any duplicate requests.
  *
  * (Regular backends do not track pending operations locally, but forward
- * them to the bgwriter.)
+ * them to the checkpointer.)
  */
 typedef struct
 {
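
As a rough illustration of the pending-operations scheme described above (assumed names, with a flat array standing in for the real hash table): every segment written since the last checkpoint is remembered, duplicates are collapsed, and the whole set is fsync'd at checkpoint time.

#include <stdio.h>

#define MAX_PENDING 1024

typedef struct
{
    unsigned rel;       /* stand-in for the real RelFileNode key */
    unsigned segno;     /* segment number within the relation */
} PendingFsync;

static PendingFsync pending[MAX_PENDING];
static int npending = 0;

/* Duplicate requests are collapsed, which is why the real code keeps a
 * hash table rather than a plain list. */
static void remember_fsync_request(unsigned rel, unsigned segno)
{
    for (int i = 0; i < npending; i++)
        if (pending[i].rel == rel && pending[i].segno == segno)
            return;                 /* identical request already queued */
    if (npending < MAX_PENDING)
        pending[npending++] = (PendingFsync){ rel, segno };
}

/* At checkpoint time: fsync everything remembered, then start over. */
static void sync_pending(void)
{
    for (int i = 0; i < npending; i++)
        printf("fsync rel %u seg %u\n", pending[i].rel, pending[i].segno);
    npending = 0;
}

int main(void)
{
    remember_fsync_request(16384, 0);
    remember_fsync_request(16384, 0);   /* duplicate, collapsed */
    remember_fsync_request(16385, 2);
    sync_pending();
    return 0;
}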
@@ -194,7 +194,7 @@ mdinit(void)
  * Create pending-operations hashtable if we need it. Currently, we need
  * it if we are standalone (not under a postmaster) OR if we are a
  * bootstrap-mode subprocess of a postmaster (that is, a startup or
- * bgwriter process).
+ * checkpointer process).
  */
 if (!IsUnderPostmaster || IsBootstrapProcessingMode())
 {
@@ -214,10 +214,10 @@ mdinit(void)
 }
 /*
- * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
  * already created the pendingOpsTable during initialization of the startup
  * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to bgwriter.
+ * subsequent requests will be forwarded to checkpointer.
  */
 void
 SetForwardFsyncRequests(void)
@@ -765,9 +765,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
  * NOTE: this assumption could only be wrong if another backend has
  * truncated the relation. We rely on higher code levels to handle that
  * scenario by closing and re-opening the md fd, which is handled via
- * relcache flush. (Since the bgwriter doesn't participate in relcache
+ * relcache flush. (Since the checkpointer doesn't participate in relcache
  * flush, it could have segment chain entries for inactive segments;
- * that's OK because the bgwriter never needs to compute relation size.)
+ * that's OK because the checkpointer never needs to compute relation size.)
  */
 while (v->mdfd_chain != NULL)
 {
@@ -957,7 +957,7 @@ mdsync(void)
 elog(ERROR, "cannot sync without a pendingOpsTable");
 /*
- * If we are in the bgwriter, the sync had better include all fsync
+ * If we are in the checkpointer, the sync had better include all fsync
  * requests that were queued by backends up to this point. The tightest
  * race condition that could occur is that a buffer that must be written
  * and fsync'd for the checkpoint could have been dumped by a backend just
@@ -1033,7 +1033,7 @@ mdsync(void)
 int failures;
 /*
- * If in bgwriter, we want to absorb pending requests every so
+ * If in checkpointer, we want to absorb pending requests every so
  * often to prevent overflow of the fsync request queue. It is
  * unspecified whether newly-added entries will be visited by
  * hash_seq_search, but we don't care since we don't need to
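
In miniature, the absorb-every-so-often pattern could look like this (stand-in function names; md.c drives it with a countdown of the same shape as the FSYNCS_PER_ABSORB counter here):

#include <stdio.h>

#define FSYNCS_PER_ABSORB 10    /* assumed interval for this sketch */

/* Hypothetical stand-ins for the real operations. */
static void absorb_fsync_requests(void) { /* drain the shared queue */ }
static void fsync_one(int i)            { printf("fsync file %d\n", i); }

static void sync_all(int nfiles)
{
    int absorb_counter = FSYNCS_PER_ABSORB;

    for (int i = 0; i < nfiles; i++)
    {
        /* Every few fsyncs, drain queued requests so backends never
         * find the shared queue full and fall back to fsyncing for
         * themselves. */
        if (--absorb_counter <= 0)
        {
            absorb_fsync_requests();
            absorb_counter = FSYNCS_PER_ABSORB;
        }
        fsync_one(i);
    }
}

int main(void)
{
    sync_all(25);
    return 0;
}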
@@ -1070,9 +1070,9 @@ mdsync(void)
  * say "but an unreferenced SMgrRelation is still a leak!" Not
  * really, because the only case in which a checkpoint is done
  * by a process that isn't about to shut down is in the
- * bgwriter, and it will periodically do smgrcloseall(). This
+ * checkpointer, and it will periodically do smgrcloseall(). This
  * fact justifies our not closing the reln in the success path
- * either, which is a good thing since in non-bgwriter cases
+ * either, which is a good thing since in non-checkpointer cases
  * we couldn't safely do that.) Furthermore, in many cases
  * the relation will have been dirtied through this same smgr
  * relation, and so we can save a file open/close cycle.
@@ -1301,7 +1301,7 @@ register_unlink(RelFileNodeBackend rnode)
 else
 {
 /*
- * Notify the bgwriter about it. If we fail to queue the request
+ * Notify the checkpointer about it. If we fail to queue the request
  * message, we have to sleep and try again, because we can't simply
  * delete the file now. Ugly, but hopefully won't happen often.
  *
@@ -1315,10 +1315,10 @@ register_unlink(RelFileNodeBackend rnode)
 }
 /*
- * RememberFsyncRequest() -- callback from bgwriter side of fsync request
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
  *
  * We stuff most fsync requests into the local hash table for execution
- * during the bgwriter's next checkpoint. UNLINK requests go into a
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
  * separate linked list, however, because they get processed separately.
  *
  * The range of possible segment numbers is way less than the range of
@@ -1460,20 +1460,20 @@ ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
 else if (IsUnderPostmaster)
 {
 /*
- * Notify the bgwriter about it. If we fail to queue the revoke
+ * Notify the checkpointer about it. If we fail to queue the revoke
  * message, we have to sleep and try again ... ugly, but hopefully
  * won't happen often.
  *
  * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
  * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the bgwriter
+ * which would be bad, so I'm inclined to assume that the checkpointer
  * will always empty the queue soon.
  */
 while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
 pg_usleep(10000L); /* 10 msec seems a good number */
 /*
- * Note we don't wait for the bgwriter to actually absorb the revoke
+ * Note we don't wait for the checkpointer to actually absorb the revoke
  * message; see mdsync() for the implications.
  */
 }
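
The queue-full fallback seen in this last hunk is a plain sleep-and-retry loop; a stripped-down sketch with hypothetical names:

#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in: enqueue a message for the checkpointer;
 * returns false when the shared request queue is full. */
static bool forward_request(void)
{
    return true;        /* pretend the queue had room this time */
}

/* Same shape as the loop above: if the queue is full, sleep briefly
 * and retry, trusting the checkpointer to drain it soon. */
static void forward_or_wait(void)
{
    while (!forward_request())
        usleep(10000);  /* 10 ms, matching pg_usleep(10000L) */
}

int main(void)
{
    forward_or_wait();
    return 0;
}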