Commit 31966b15 invented a way for functions dealing with relation
extension to accept a Relation in online code and an SMgrRelation in
recovery code. It seems highly likely that future bufmgr.c interfaces
will face the same problem, and need to do something similar.
Generalize the names so that each interface doesn't have to re-invent
the wheel.
Back-patch to 16. Since extension AM authors might start using the
constructor macros once 16 ships, we agreed to do the rename in 16
rather than waiting for 17.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKG%2B6tLD2BhpRWycEoti6LVLyQq457UL4ticP5xd8LqHySA%40mail.gmail.com
Due to the bug LimitAdditionalPins() could return 0, violating
LimitAdditionalPins()'s API ("One additional pin is always allowed"). This
could be hit when setting shared_buffers very low and using a fair amount of
concurrency.
This bug was introduced in 31966b151e6a.
Author: "Anton A. Melnikov" <aamelnikov@inbox.ru>
Reported-by: "Anton A. Melnikov" <aamelnikov@inbox.ru>
Reported-by: Victoria Shepard
Discussion: https://postgr.es/m/ae46f2fb-5586-3de0-b54b-1bb0f6410ebd@inbox.ru
Backpatch: 16-
The following changes are done:
- Addition of WaitEventBufferPin and WaitEventExtension, that hold a
list of wait events related to each category.
- Addition of two functions that encapsulate the list of wait events for
each category.
- Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an
underscore, requiring a specific rule in the automation script).
These changes make a bit easier the automatic generation of all the code
and documentation related to wait events, as all the wait event
categories are now controlled by consistent structures and functions.
Author: Bertrand Drouvot
Discussion: https://postgr.es/m/c6f35117-4b20-4c78-1df5-d3056010dcf5@gmail.com
Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com
The assertion checked that the size of the relation is not "too large" - but
the code is explicitly dealing with the possibility of another backend
extending the relation concurrently. In that case the new relation size could
be bigger than what the current backend needs, wrongly triggering an assertion
failure.
Unfortunately it is hard to write a reliable and affordable regression tests
for this, as a lot of concurrency is needed to encounter the bug.
Introduced in 31966b151e6a.
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
28e626bde00 added the concept of IOOps but neglected to include writeback
operations. ac8d53dae5 added time spent doing these I/O operations. Without
counting writeback, checkpointer write time in the log often differed
substantially from that in pg_stat_io. To fix this, add IOOp IOOP_WRITEBACK
and track writeback in pg_stat_io.
Bumps catversion.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reported-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
For clarity of review, renaming the function parameter "context" in
ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to
"wb_context" is a separate commit. The next commit adds an "io_context"
parameter and "wb_context" makes it more clear which is which.
Author: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_acc6iL4M3hvOTeztf_ZPpsB3Pqio5aVHgZ5q=Pi3BZKg@mail.gmail.com
This fixes many spelling mistakes in comments, but a few references to
invalid parameter names, function names and option names too in comments
and also some in string constants
Also, fix an #undef that was undefining the incorrect definition
Author: Alexander Lakhin
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com
For some reason I had not implemented RBM_ZERO_AND_CLEANUP_LOCK support in
ExtendBufferedRelTo(), likely thinking it not being reachable. But it is
reachable, e.g. when replaying a WAL record for a page in a relation that
subsequently is truncated (likely only reachable when doing crash recovery or
PITR, not during ongoing streaming replication).
As now all of the RBM_* modes are supported, remove assertions checking mode.
As we had no test coverage for this scenario, add a new TAP test. There's
plenty more that ought to be tested in this area...
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
Discussion: https://postgr.es/m/0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. These
inconsistencies were all introduced relatively recently, after the code
base had parameter name mismatches fixed in bulk (see commits starting
with commits 4274dc22 and 035ce1fe).
pg_bsd_indent still has a couple of similar inconsistencies, which I
(pgeoghegan) have left untouched for now.
Like all earlier commits that cleaned up function parameter names, this
commit was written with help from clang-tidy.
Provide a way to ask the kernel to use O_DIRECT (or local equivalent)
where available for data and WAL files, to avoid or minimize kernel
caching. This hurts performance currently and is not intended for end
users yet. Later proposed work would introduce our own I/O clustering,
read-ahead, etc to replace the facilities the kernel disables with this
option.
The only user-visible change, if the developer-only GUC is not used, is
that this commit also removes the obscure logic that would activate
O_DIRECT for the WAL when wal_sync_method=open_[data]sync and
wal_level=minimal (which also requires max_wal_senders=0). Those are
non-default and unlikely settings, and this behavior wasn't (correctly)
documented. The same effect can be achieved with io_direct=wal.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a
later commit, we need the addresses of user space buffers to be well
aligned. The exact requirements vary by OS and file system (typically
sectors and/or memory pages). The address alignment size is set to
4096, which is enough for currently known systems: it matches modern
sectors and common memory page size. There is no standard governing
O_DIRECT's requirements so we might eventually have to reconsider this
with more information from the field or future systems.
Aligning I/O buffers on memory pages is also known to improve regular
buffered I/O performance.
Three classes of I/O buffers for regular data pages are adjusted:
(1) Heap buffers are now allocated with the new palloc_aligned() or
MemoryContextAllocAligned() functions introduced by commit 439f6175.
(2) Stack buffers now use a new struct PGIOAlignedBlock to respect
PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer
pool is also aligned in shared memory.
WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for
XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus
for O_DIRECT WAL writes to fail to be well aligned, but that's a
pre-existing condition and will be addressed by a later commit.
BufFiles are not yet addressed (there's no current plan to use O_DIRECT
for those, but they could potentially get some incidental speedup even
in plain buffered I/O operations through better alignment).
If we can't align stack objects suitably using the compiler extensions
we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to
0. This avoids the need to consider systems that have O_DIRECT but
can't align stack objects the way we want; such systems could in theory
be supported with more work but we don't currently know of any such
machines, so it's easier to pretend there is no O_DIRECT support
instead. That's an existing and tested class of system.
Add assertions that all buffers passed into smgrread(), smgrwrite() and
smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack
alignment tricks may be unavailable) or the block size has been set too
small to allow arrays of buffers to be all aligned.
Author: Thomas Munro <thomas.munro@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
a9c70b46dbe and 8aaa04b32S added counting of IO operations to a new view,
pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to
pg_stat_io as well.
This combines the tracking for pgBufferUsage with the tracking for pg_stat_io
into a new function pgstat_count_io_op_time(). This should make it a bit
easier to avoid the somewhat costly instr_time conversion done for
pgBufferUsage.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
Both pgstat_database and pgBufferUsage count IO timing for reads of temporary
relation blocks into local buffers. However, both failed to count write IO
timing for flushes of dirty local buffers. Fix.
Additionally, FlushRelationBuffers() seems to have omitted counting write
IO (both count and timing) stats for both pgstat_database and
pgBufferUsage. Fix.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de
The same condition accidentally was copied to both branches. Manual testing
confirms that otherwise the error recovery path works fine.
Found while reviewing the logical-decoding-on-standby patch.
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers to ReadBuffer() need to acquire the extension lock. To address that,
this this commit moves the responsibility for acquiring the extension lock
into bufmgr.c functions. That allows to acquire the relation extension lock
for just the required time. This will also allow us to improve relation
extension further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (largely works, except for CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2), by using the
recently introduced smgrzeroextend(), which extends the relation, without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not yet lead to a meaningful performance or
scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be
converted to ExtendBuffered*(), which will be done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.
This commit increases the size of the ResourceOwnerData struct, which appears
to have a just about measurable overhead in very extreme workloads. Medium
term we are planning to substantially shrink the size of
ResourceOwnerData. Short term the increase is small enough to not worry about
it for now.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations
needing to hold the extension lock while acquiring a victim buffer is
painful.But it also creates a bottleneck for workloads that just involve
reads.
Now we instead first acquire a victim buffer and write it out, if
necessary. Then we remove that buffer from the old partition with just the old
partition's partition lock held and insert it into the new partition with just
that partition's lock held.
By separating out the victim buffer acquisition, future commits will be able
to change relation extensions to scale better.
On my workstation, a micro-benchmark exercising buffered reads strenuously and
under a lot of concurrency, sees a >2x improvement.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
This adds a few more assertions against a buffer being local in places we
don't expect, and extracts the check for a buffer being pinned exactly once
from LockBufferForCleanup() into its own function. Later commits will use this
function.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
The trace point was using the relfileno / fork / block for the to-be-read-in
buffer. Some upcoming work would make that more expensive to provide. We still
have buffer-flush-start/done, which can serve most tracing needs that
buffer-write-dirty could serve.
Discussion: https://postgr.es/m/f5164e7a-eef6-8972-75a3-8ac622ed0c6e@iki.fi
Among other things, this should make it easier to calculate a useful cache hit
ratio by excluding buffer reads via buffer access strategies. As buffer access
strategies reuse buffers (and thus evict the prior buffer contents), it is
normal to see reads on repeated scans of the same data.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CAAKRu_beMa9Hzih40%3DXPYqhDVz6tsgUGTrhZXRo%3Dunp%2Bszb%3DUA%40mail.gmail.com
This function has been a no-op for over a decade. Even if bufmgr
regains a need to be called during commit, it seems unlikely that
the most appropriate call points would be precisely here, so it's not
doing us much good as a placeholder either. Now, removing it probably
doesn't save any noticeable number of cycles --- but the main call is
inside the commit critical section, and the less work done there the
better.
Matthias van de Meent
Discussion: https://postgr.es/m/CAEze2Wi1=tLKbxZnXzcD+8fYKyKqBtivVakLQC_mYBsP4Y8qVA@mail.gmail.com
RelationCopyStorageUsingBuffer() did not free the strategies used to access
the source / target relation. They memory was released at the end of the
transaction, but when using a template database with a lot of relations, the
temporary leak can become big prohibitively big.
RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
zero. Use RBM_ZERO_AND_LOCK instead.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
Backpatch: 15-, where STRATEGY WAL_LOG was introduced
The setting of the process title could be seen on profiles of very
fast-to-execute queries. In many locations where we call
set_ps_display() we pass along a string constant, the length of which is
known during compilation. Here we effectively rename set_ps_display() to
set_ps_display_with_len() and then add a static inline function named
set_ps_display() which calls strlen() on the given string. This allows
the compiler to optimize away the strlen() call when dealing with
call sites passing a string constant. We can then also use memcpy()
instead of strlcpy() to copy the string into the destination buffer.
That's significantly faster than strlcpy's byte-at-a-time way of
copying.
Here we also take measures to improve some code which was adjusting the
process title to add a " waiting" suffix to it. Call sites which require
this can now just call set_ps_display_suffix() to add or adjust the suffix
and call set_ps_display_remove_suffix() to remove it again.
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/CAApHDvocBvvk-0gWNA2Gohe+sv9fMcv+fK_G+siBKJrgDG4O7g@mail.gmail.com
Commit 28e626bde00 introduced the infrastructure for tracking more detailed IO
statistics. This commit adds the actual collection of the new IO statistics
for relations and temporary relations. See aforementioned commit for goals and
high-level design.
The changes in this commit are fairly straight-forward. The bulk of the change
is to passing sufficient information to the callsites of pgstat_count_io_op().
A somewhat unsightly detail is that it currently is hard to find a better
place to count fsyncs than in md.c, whereas the other pgstat_count_io_op()
calls are in bufmgr.c/localbuf.c. As the number of fsyncs is tied to md.c
implementation details, it's not obvious there is a better answer.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
Some of these appear to be leftovers from when hash_search() took a
char * argument (changed in 5999e78fc45dcb91784b64b6e9ae43f4e4f68ca2).
Since after this there is some more horizontal space available, do
some light reformatting where suitable.
Reviewed-by: Corey Huinker <corey.huinker@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/fd9adf5d-b1aa-e82f-e4c7-263c30145807%40enterprisedb.com
These are all not necessary from a correctness POV. However, in the near
future instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfciua@awork3.anarazel.de
This is similar to 7d25958, and this commit takes care of all the
remaining inconsistencies between the initial value used in the C
variable associated to a GUC and its default value stored in the GUC
tables (as of pg_settings.boot_val).
Some of the initial values of the GUCs updated rely on a compile-time
default. These are refactored so as the GUC table and its C declaration
use the same values. This makes everything consistent with other
places, backend_flush_after, bgwriter_flush_after, port,
checkpoint_flush_after doing so already, for example.
Extracted from a larger patch by Peter Smith. The spots updated in the
modules are from me.
Author: Peter Smith, Michael Paquier
Reviewed-by: Nathan Bossart, Tom Lane, Justin Pryzby
Discussion: https://postgr.es/m/CAHut+PtHE0XSfjjRQ6D4v7+dqzCw=d+1a64ujra4EX8aoc_Z+w@mail.gmail.com
The last caller of UnpinBuffer() that did not want to adjust
CurrentResourceOwner was removed in 2d115e4, and nothing has been
introduced in bufmgr.c to do the same thing since. This simplifies 10
code paths.
Author: Aleksander Alekseev
Reviewed-by: Nathan Bossart, Zhang Mingli, Bharath Rupireddy
Discussion: https://postgr.es/m/CAJ7c6TOmmFpb6ohurLhTC7hKNJWGzdwf8s4EAtAZxD48g-e6Jw@mail.gmail.com
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in storage, catalog,
access method, executor, and logical replication code, as well as in
miscellaneous utility/library code.
Like other recent commits that cleaned up function parameter names, this
commit was written with help from clang-tidy. Later commits will do the
same for other parts of the codebase.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
This is preparatory work for a project to increase the number of bits
in a RelFileNumber from 32 to 56.
Along the way, introduce static inline accessor functions for a couple
of BufferTag fields.
Dilip Kumar, reviewed by me. The overall patch series has also had
review at various times from Andres Freund, Ashutosh Sharma, Hannu
Krosing, Vignesh C, Álvaro Herrera, and Tom Lane.
Discussion: http://postgr.es/m/CAFiTN-trubju5YbWAq-BSpZ90-Z6xCVBQE8BVqXqANOZAF1Znw@mail.gmail.com
In a similar effort to f01592f91, here we're targetting fixing the
warnings that -Wshadow=compatible-local produces that we can fix by moving
a variable to an inner scope to stop that variable from being shadowed by
another variable declared somewhere later in the function.
All of the warnings being fixed here are changing the scope of variables
which are being used as an iterator for a "for" loop. In each instance,
the fix happens to be changing the for loop to use the C99 type
initialization. Much of this code likely pre-dates our use of C99.
Reducing the scope of the outer scoped variable seems like the safest way
to fix these. Renaming seems more likely to risk patches using the wrong
variable. Reducing the scope is more likely to result in a compilation
failure after applying some future patch rather than introducing bugs with
it.
By my count, this takes the warning count from 129 down to 114.
Author: Justin Pryzby
Discussion: https://postgr.es/m/CAApHDvrwLGBP%2BYw9vriayyf%3DXR4uPWP5jr6cQhP9au_kaDUhbA%40mail.gmail.com
If an error occurs before we close the fake relcache entry, the the
fake relcache entry will be destroyed by the SmgrRelation will
survive until end of transaction. Its smgr_owner pointer ends up
pointing to already-freed memory.
The original reason for using a fake relcache entry here was to try
to avoid reusing an SMgrRelation across a relevant invalidation. To
avoid that problem, just call smgropen() again each time we need a
reference to it. Hopefully someday we will come up with a more
elegant approach, but accessing uninitialized memory is bad so let's
do this for now.
Dilip Kumar, reviewed by Andres Freund and Tom Lane. Report by
Justin Pryzby.
Discussion: http://postgr.es/m/20220802175043.GA13682@telsasoft.com
Discussion: http://postgr.es/m/CAFiTN-vSFeE6_W9z698XNtFROOA_nSqUXWqLcG0emob_kJ+dEQ@mail.gmail.com
RelationCopyStorageUsingBuffer thought it could skip copying
empty pages, but of course that does not work at all, because
subsequent blocks will be out of place.
Also fix it to acquire share lock on the source buffer. It *might*
be safe to not do that, but it's not very certain, and I don't think
this code deserves any benefit of the doubt.
Dilip Kumar, per complaint from me
Discussion: https://postgr.es/m/3679800.1659654066@sss.pgh.pa.us
It incorrectly used GetBufferDescriptor instead of
GetLocalBufferDescriptor, causing it to not find the correct buffer in
most cases, and performing an out-of-bounds memory read in the corner
case that temp_buffers > shared_buffers.
It also bumped the usage-count on the buffer, even if it was
previously pinned. That won't lead to crashes or incorrect results,
but it's different from what the shared-buffer case does, and
different from the usual code in LocalBufferAlloc. Fix that too, and
make the code ordering match LocalBufferAlloc() more closely, so that
it's easier to verify that it's doing the same thing.
Currently, ReadRecentBuffer() is only used with non-temp relations, in
WAL redo, so the broken code is currently dead code. However, it could
be used by extensions.
Backpatch-through: 14
Discussion: https://www.postgresql.org/message-id/2d74b46f-27c9-fb31-7f99-327a87184cc0%40iki.fi
Reviewed-by: Thomas Munro, Zhang Mingli, Richard Guo
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
Before commit 412ad7a55639516f284cd0ef9757d6ae5c7abd43, delayChkpt
was a Boolean. Now it's an integer. Extensions using it need to be
appropriately updated, so let's rename the field to make sure that
a hard compilation failure occurs.
Replacing delayChkpt with delayChkptFlags made a few comments extend
past 80 characters, so I reflowed them and changed some wording very
slightly.
The back-branches will need a different change to restore compatibility
with existing minor releases; this is just for master.
Per suggestion from Tom Lane.
Discussion: http://postgr.es/m/a7880f4d-1d74-582a-ada7-dad168d046d1@enterprisedb.com
Introduce a new GUC recovery_prefetch. When enabled, look ahead in the
WAL and try to initiate asynchronous reading of referenced data blocks
that are not yet cached in our buffer pool. For now, this is done with
posix_fadvise(), which has several caveats. Since not all OSes have
that system call, "try" is provided so that it can be enabled where
available. Better mechanisms for asynchronous I/O are possible in later
work.
Set to "try" for now for test coverage. Default setting to be finalized
before release.
The GUC wal_decode_buffer_size limits the distance we can look ahead in
bytes of decoded data.
The existing GUC maintenance_io_concurrency is used to limit the number
of concurrent I/Os allowed, based on pessimistic heuristics used to
infer that I/Os have begun and completed. We'll also not look more than
maintenance_io_concurrency * 4 block references ahead.
Reviewed-by: Julien Rouhaud <rjuju123@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (earlier version)
Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (earlier version)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (earlier version)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> (earlier version)
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> (earlier version)
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
Previously the statistics collector received statistics updates via UDP and
shared statistics data by writing them out to temporary files regularly. These
files can reach tens of megabytes and are written out up to twice a
second. This has repeatedly prevented us from adding additional useful
statistics.
Now statistics are stored in shared memory. Statistics for variable-numbered
objects are stored in a dshash hashtable (backed by dynamic shared
memory). Fixed-numbered stats are stored in plain shared memory.
The header for pgstat.c contains an overview of the architecture.
The stats collector is not needed anymore, remove it.
By utilizing the transactional statistics drop infrastructure introduced in a
prior commit statistics entries cannot "leak" anymore. Previously leaked
statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On
systems with many small relations pgstat_vacuum_stat() could be quite
expensive.
Now that replicas drop statistics entries for dropped objects, it is not
necessary anymore to reset stats when starting from a cleanly shut down
replica.
Subsequent commits will perform some further code cleanup, adapt docs and add
tests.
Bumps PGSTAT_FILE_FORMAT_ID.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Thomas Munro <thomas.munro@gmail.com>
Reviewed-By: Justin Pryzby <pryzby@telsasoft.com>
Reviewed-By: "David G. Johnston" <david.g.johnston@gmail.com>
Reviewed-By: Tomas Vondra <tomas.vondra@2ndquadrant.com> (in a much earlier version)
Reviewed-By: Arthur Zakirov <a.zakirov@postgrespro.ru> (in a much earlier version)
Reviewed-By: Antonin Houska <ah@cybertec.at> (in a much earlier version)
Discussion: https://postgr.es/m/20220303021600.hs34ghqcw6zcokdh@alap3.anarazel.de
Discussion: https://postgr.es/m/20220308205351.2xcn6k4x5yivcxyd@alap3.anarazel.de
Discussion: https://postgr.es/m/20210319235115.y3wz7hpnnrshdyv6@alap3.anarazel.de