postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-06 07:49:08 +03:00

Author	SHA1	Message	Date
Amit Kapila	cc32ec24fd	Revert the commits related to allowing page lock to conflict among parallel group members. This commit reverts the work done by commits `3ba59ccc89` and `72e78d831a`. Those commits were incorrect in asserting that we never acquire any other heavy-weight lock after acquring page lock other than relation extension lock. We can acquire a lock on catalogs while doing catalog look up after acquring page lock. This won't impact any existing feature but we need to think some other way to achieve this before parallelizing other write operations or even improving the parallelism in vacuum (like allowing multiple workers for an index). Reported-by: Jaime Casanova Author: Amit Kapila Backpatch-through: 13 Discussion: https://postgr.es/m/CAJKUy5jffnRKNvRHKQ0LynRb0RJC-o4P8Ku3x9vGAVLwDBWumQ@mail.gmail.com	2023-07-06 08:52:10 +05:30
Michael Paquier	fa88928470	Generate automatically code and documentation related to wait events The documentation and the code is generated automatically from a new file called wait_event_names.txt, formatted in sections dedicated to each wait event class (Timeout, Lock, IO, etc.) with three tab-separated fields: - C symbol in enums - Format in the system views - Description in the docs Using this approach has several advantages, as we have proved to be rather bad in maintaining this area of the tree across the years: - The order of each item in the documentation and the code, which should be alphabetical, has become incorrect multiple times, and the script generating the code and documentation has a few rules to enforce that, making the maintenance a no-brainer. - Some wait events were added to the code, but not documented, so this cannot be missed now. - The order of the tables for each wait event class is enforced in the documentation (the input .txt file does so as well for clarity, though this is not mandatory). - Less code, shaving 1.2k lines from the tree, with 1/3 of the savings coming from the code, the rest from the documentation. The wait event types "Lock" and "LWLock" still have their own code path for their code, hence only the documentation is created for them. These classes are listed with a special marker called WAIT_EVENT_DOCONLY in the input file. Adding a new wait event now requires only an update of wait_event_names.txt, with "Lock" and "LWLock" treated as exceptions. This commit has been tested with configure/Makefile, the CI and VPATH build. clean, distclean and maintainer-clean were working fine. Author: Bertrand Drouvot, Michael Paquier Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com	2023-07-05 10:53:11 +09:00
Heikki Linnakangas	4b4798e138	Ensure that creation of an empty relfile is fsync'd at checkpoint. If you create a table and don't insert any data into it, the relation file is never fsync'd. You don't lose data, because an empty table doesn't have any data to begin with, but if you crash and lose the file, subsequent operations on the table will fail with "could not open file" error. To fix, register an fsync request in mdcreate(), like we do for mdwrite(). Per discussion, we probably should also fsync the containing directory after creating a new file. But that's a separate and much wider issue. Backpatch to all supported versions. Reviewed-by: Andres Freund, Thomas Munro Discussion: https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi	2023-07-04 17:57:03 +03:00
Michael Paquier	2aeaf80e57	Refactor some code related to wait events "BufferPin" and "Extension" The following changes are done: - Addition of WaitEventBufferPin and WaitEventExtension, that hold a list of wait events related to each category. - Addition of two functions that encapsulate the list of wait events for each category. - Rename BUFFER_PIN to BUFFERPIN (only this wait event class used an underscore, requiring a specific rule in the automation script). These changes make a bit easier the automatic generation of all the code and documentation related to wait events, as all the wait event categories are now controlled by consistent structures and functions. Author: Bertrand Drouvot Discussion: https://postgr.es/m/c6f35117-4b20-4c78-1df5-d3056010dcf5@gmail.com Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com	2023-07-03 11:01:02 +09:00
Thomas Munro	4f49b3f849	Trust signalfd on illumos, again. Commit `3ab4fc5d` avoided choosing signalfd by default on illumos, because it triggered kernel panics. That was fixed, so we can remove a kludge from our code. Users/packagers can still override the default choice at compile time if desired, and we'll leave the back-branches unchanged so they keep choosing self-pipe by default, but we'll default to signalfd (like we do for Linux) in 17. Fixed kernels should be everywhere by the time 17 ships. The illumos issues were: * https://www.illumos.org/issues/13700 * https://www.illumos.org/issues/14892 Discussion: https://postgr.es/m/CA+hUKG+NK-K_G_i1H3OpDTwYPEsiwQi_jw58PGcW2H+-N2eVCA@mail.gmail.com	2023-07-02 15:28:48 +12:00
Peter Eisentraut	39a584dc90	Error message wording improvements	2023-06-29 09:14:55 +02:00
Peter Eisentraut	3ad5f07c0f	Error message refactoring Take some untranslatable things out of the message and replace by format placeholders, to reduce translatable strings and reduce translation mistakes.	2023-06-23 16:36:17 +02:00
Tom Lane	b334612b8a	Pre-beta2 mechanical code beautification. Run pgindent and pgperltidy. It seems we're still some ways away from all committers doing this automatically. Now that we have a buildfarm animal that will whine about poorly-indented code, we'll try to keep the tree more tidy. Discussion: https://postgr.es/m/3156045.1687208823@sss.pgh.pa.us	2023-06-20 09:50:43 -04:00
Andres Freund	0d369ac650	fd.c: Retry after EINTR in more places Starting with `4d330a61bb` we can use posix_fallocate() to extend files. Unfortunately in some situation, e.g. on tmpfs filesystems, EINTR may be returned. See also `4518c798b2`. To fix, add a retry path to FileFallocate(). In contrast to `4518c798b2` the amount we extend by is limited and the extending may happen at a high frequency, so disabling signals does not appear to be the correct path here. Also add retry paths to other file operations currently lacking them (around fdatasync(), fsync(), ftruncate(), posix_fadvise(), sync_file_range(), truncate()) - they are all documented or have been observed to return EINTR. Even though most of these functions used in the back branches, it does not seem worth the risk to backpatch - outside of the new-to-16 case of posix_fallocate() I am not aware of problem reports due to the lack of retries. Reported-by: Christoph Berg <myon@debian.org> Discussion: https://postgr.es/m/ZEZDj1H61ryrmY9o@msg.df7cb.de Backpatch: -	2023-06-19 14:11:32 -07:00
Masahiko Sawada	4327f6c748	Fix typo in comment. Introduced in `4d330a61bb`. Author: Masahiko Sawada Reviewed-by: Michael Paquier Discussion: https://postgr.es/m/CAD21AoDg8rTWJkrNJg9UTP89vS8smfib2c55DVqKrCn8zR-GYA@mail.gmail.com	2023-06-14 13:28:41 +09:00
Andres Freund	e3cb1a586c	Report stats when replaying XLOG_RUNNING_XACTS Previously stats in the startup process would only get reported during shutdown of the startup process. It has been that way for a long time, but became a lot more noticeable with the new pg_stat_io view, which separates out IO done by different backend types... While replaying after every XLOG_RUNNING_XACTS isn't the prettiest approach, it has the advantage of being quite easy. Given that we're well past feature freeze... It's not a problem that we don't report stats more frequently with wal_level=minimal, in that case stats can't be read before the stats process has shut down. Besides the above, this commit also changes pgstat_report_stat() to acquire the timestamp with GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp(). Thanks to Melih Mutlu, Kyotaro Horiguchi for prototypes of other approaches to solving this issue. Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com> Discussion: https://postgr.es/m/5315aedc-fbca-1556-c5de-dc2e00b23a14@oss.nttdata.com	2023-06-12 15:06:39 -07:00
Heikki Linnakangas	548d726030	Remove a few unused global variables and declarations. - Commit `3eb77eba5a`, which moved the pending ops queue from md.c to sync.c, introduced a duplicate, unused 'pendingOpsCxt' variable. (I'm surprised none of the compilers or static analysis tools have complained about that.) - Commit `c2fe139c20` moved the 'synchronize_seqscans' variable and introduced an extern declaration in tableam.h, making the one in guc_tables.c unnecessary. - Commit `6f0cf87872` removed the 'pgstat_temp_directory' GUC, but forgot to remove the corresponding global variable. - Commit `1b4e729eaa` removed the 'pg_krb_realm' GUC, and its global variable, but forgot the declaration in auth.h. Spotted all these by reading the code.	2023-06-12 16:25:37 +03:00
Andres Freund	eabb22525e	Remove over-eager assertion in ExtendBufferedRelTo() The assertion checked that the size of the relation is not "too large" - but the code is explicitly dealing with the possibility of another backend extending the relation concurrently. In that case the new relation size could be bigger than what the current backend needs, wrongly triggering an assertion failure. Unfortunately it is hard to write a reliable and affordable regression tests for this, as a lot of concurrency is needed to encounter the bug. Introduced in `31966b151e`. Reported-by: Melanie Plageman <melanieplageman@gmail.com>	2023-05-21 09:53:49 -07:00
Tom Lane	0245f8db36	Pre-beta mechanical code beautification. Run pgindent, pgperltidy, and reformat-dat-files. This set of diffs is a bit larger than typical. We've updated to pg_bsd_indent 2.1.2, which properly indents variable declarations that have multi-line initialization expressions (the continuation lines are now indented one tab stop). We've also updated to perltidy version 20230309 and changed some of its settings, which reduces its desire to add whitespace to lines to make assignments etc. line up. Going forward, that should make for fewer random-seeming changes to existing code. Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql	2023-05-19 17:24:48 -04:00
Peter Eisentraut	803b4a26ca	Remove stray mid-sentence tabs in comments	2023-05-19 16:13:16 +02:00
Peter Eisentraut	4c9deebd37	Move mdwriteback() to better place The previous order in the file didn't make sense and matched neither the header file nor the smgr API. Discussion: https://www.postgresql.org/message-id/flat/22fed8ba-01c3-2008-a256-4ea912d68fab%40enterprisedb.com	2023-05-19 13:42:06 +02:00
Peter Eisentraut	0b8ace8d77	Reindent some comments Most (older) comments in md.c and smgr.c are indented with a leading tab on all lines, which isn't the current style and makes updating the comments a bit annoying. This reindents all these lines with a single space, as is the normal style. This issue exists in various shapes throughout the code but it's pretty consistent here, and since there is a patch pending to refresh some of the comments in these files, it seems sensible to clean this up here separately. Discussion: https://www.postgresql.org/message-id/flat/22fed8ba-01c3-2008-a256-4ea912d68fab%40enterprisedb.com	2023-05-19 10:52:04 +02:00
Andres Freund	093e5c57d5	Add writeback to pg_stat_io `28e626bde0` added the concept of IOOps but neglected to include writeback operations. `ac8d53dae5` added time spent doing these I/O operations. Without counting writeback, checkpointer write time in the log often differed substantially from that in pg_stat_io. To fix this, add IOOp IOOP_WRITEBACK and track writeback in pg_stat_io. Bumps catversion. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de	2023-05-17 11:18:35 -07:00
Andres Freund	52676dc2e0	Update parameter name context to wb_context For clarity of review, renaming the function parameter "context" in ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to "wb_context" is a separate commit. The next commit adds an "io_context" parameter and "wb_context" makes it more clear which is which. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_acc6iL4M3hvOTeztf_ZPpsB3Pqio5aVHgZ5q=Pi3BZKg@mail.gmail.com	2023-05-17 11:18:30 -07:00
Thomas Munro	319bae9a8d	Rename io_direct to debug_io_direct. Give the new GUC introduced by `d4e71df6` a name that is clearly not intended for mainstream use quite yet. Future proposals would drop the prefix only after adding infrastructure to make it efficient. Having the switch in the tree sooner is good because it might lead to new discoveries about the hazards awaiting us on a wide range of systems, but that name was too enticing and could lead to cross-version confusion in future, per complaints from Noah and Justin. Suggested-by: Noah Misch <noah@leadboat.com> Reviewed-by: Noah Misch <noah@leadboat.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> (the idea, not the patch) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (ditto) Discussion: https://postgr.es/m/20230430041106.GA2268796%40rfd.leadboat.com	2023-05-15 10:31:14 +12:00
Michael Paquier	58f5edf849	Fix typo with wait event for SLRU buffer of commit timestamps This wait event was documented as "CommitTsBuffer" since its introduction, but the code named it "CommitTSBuffer". This commit fixes the code to follow the term documented, which is also more consistent with the naming of the other wait events used for commit timestamps. Introduced by `5da1493`. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com Backpatch-through: 13	2023-05-05 21:25:44 +09:00
Michael Paquier	8961cb9a03	Fix typos in comments The changes done in this commit impact comments with no direct user-visible changes, with fixes for incorrect function, variable or structure names. Author: Alexander Lakhin Discussion: https://postgr.es/m/e8c38840-596a-83d6-bd8d-cebc51111572@gmail.com	2023-05-02 12:23:08 +09:00
Andres Freund	1118cd37eb	Remove vacuum_defer_cleanup_age vacuum_defer_cleanup_age was introduced before hot_standby_feedback and replication slots existed. It is hard to use reasonably - commonly it will either be set too low (not preventing recovery conflicts, while still causing some bloat), or too high (causing a lot of bloat). The alternatives do not have that issue. That on its own might not be sufficient reason to remove vacuum_defer_cleanup_age, but it also complicates computation of xid horizons. See e.g. the bug fixed in `be504a3e97`. It also is untested. This commit removes TransactionIdRetreatSafely(), as there are no users anymore. There might be potential future users, hence noting that here. Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230317230930.nhsgk3qfk7f4axls@awork3.anarazel.de	2023-04-24 12:21:02 -07:00
David Rowley	3f58a4e296	Fix various typos and incorrect/outdated name references Author: Alexander Lakhin Discussion: https://postgr.es/m/699beab4-a6ca-92c9-f152-f559caf6dc25@gmail.com	2023-04-19 13:50:33 +12:00
David Rowley	b4dbf3e924	Fix various typos This fixes many spelling mistakes in comments, but a few references to invalid parameter names, function names and option names too in comments and also some in string constants Also, fix an #undef that was undefining the incorrect definition Author: Alexander Lakhin Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com	2023-04-18 13:23:23 +12:00
Andres Freund	43a33ef54e	Support RBM_ZERO_AND_CLEANUP_LOCK in ExtendBufferedRelTo(), add tests For some reason I had not implemented RBM_ZERO_AND_CLEANUP_LOCK support in ExtendBufferedRelTo(), likely thinking it not being reachable. But it is reachable, e.g. when replaying a WAL record for a page in a relation that subsequently is truncated (likely only reachable when doing crash recovery or PITR, not during ongoing streaming replication). As now all of the RBM_* modes are supported, remove assertions checking mode. As we had no test coverage for this scenario, add a new TAP test. There's plenty more that ought to be tested in this area... Reported-by: Tom Lane <tgl@sss.pgh.pa.us> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us Discussion: https://postgr.es/m/0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com	2023-04-14 11:30:33 -07:00
Peter Geoghegan	d6f0f95a6b	Harmonize some more function parameter names. Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced relatively recently, after the code base had parameter name mismatches fixed in bulk (see commits starting with commits `4274dc22` and `035ce1fe`). pg_bsd_indent still has a couple of similar inconsistencies, which I (pgeoghegan) have left untouched for now. Like all earlier commits that cleaned up function parameter names, this commit was written with help from clang-tidy.	2023-04-13 10:15:20 -07:00
Andres Freund	26669757b6	Handle logical slot conflicts on standby During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts: 1) Using the information added in `6af1793954`, logical slots are invalidated if required rows are removed 2) wal_level on the primary server is reduced to below logical Uses the infrastructure introduced in the prior commit. FIXME: add commit reference. Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot. See `6af1793954` for an overall design of logical decoding on a standby. Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Fabrízio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de	2023-04-08 00:05:44 -07:00
Thomas Munro	d4e71df6d7	Add io_direct setting (developer-only). Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com	2023-04-08 16:35:07 +12:00
Thomas Munro	faeedbcefd	Introduce PG_IO_ALIGN_SIZE and align all I/O buffers. In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit `439f6175`. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com	2023-04-08 16:34:50 +12:00
Andres Freund	ac8d53dae5	Track IO times in pg_stat_io `a9c70b46db` and 8aaa04b32S added counting of IO operations to a new view, pg_stat_io. Now, add IO timing for reads, writes, extends, and fsyncs to pg_stat_io as well. This combines the tracking for pgBufferUsage with the tracking for pg_stat_io into a new function pgstat_count_io_op_time(). This should make it a bit easier to avoid the somewhat costly instr_time conversion done for pgBufferUsage. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com	2023-04-07 17:04:56 -07:00
Andres Freund	704261ecc6	Improve IO accounting for temp relation writes Both pgstat_database and pgBufferUsage count IO timing for reads of temporary relation blocks into local buffers. However, both failed to count write IO timing for flushes of dirty local buffers. Fix. Additionally, FlushRelationBuffers() seems to have omitted counting write IO (both count and timing) stats for both pgstat_database and pgBufferUsage. Fix. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20230321023451.7rzy4kjj2iktrg2r%40awork3.anarazel.de	2023-04-07 13:24:26 -07:00
Andres Freund	21d7c05a5c	Fix copy-paste bug in `12f3867f55` triggering an assert after a write error The same condition accidentally was copied to both branches. Manual testing confirms that otherwise the error recovery path works fine. Found while reviewing the logical-decoding-on-standby patch.	2023-04-07 01:02:46 -07:00
David Rowley	1cbbee0338	Add VACUUM/ANALYZE BUFFER_USAGE_LIMIT option Add new options to the VACUUM and ANALYZE commands called BUFFER_USAGE_LIMIT to allow users more control over how large to make the buffer access strategy that is used to limit the usage of buffers in shared buffers. Larger rings can allow VACUUM to run more quickly but have the drawback of VACUUM possibly evicting more buffers from shared buffers that might be useful for other queries running on the database. Here we also add a new GUC named vacuum_buffer_usage_limit which controls how large to make the access strategy when it's not specified in the VACUUM/ANALYZE command. This defaults to 256KB, which is the same size as the access strategy was prior to this change. This setting also controls how large to make the buffer access strategy for autovacuum. Per idea by Andres Freund. Author: Melanie Plageman Reviewed-by: David Rowley Reviewed-by: Andres Freund Reviewed-by: Justin Pryzby Reviewed-by: Bharath Rupireddy Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb@awork3.anarazel.de	2023-04-07 11:40:31 +12:00
Andres Freund	fcdda1e4b5	Use ExtendBufferedRelTo() in {vm,fsm}_extend() This uses ExtendBufferedRelTo(), introduced in `31966b151e`, to extend the visibilitymap and freespacemap to the size needed. It also happens to fix a warning introduced in `3d6a98457d`, reported by Tom Lane. Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de Discussion: https://postgr.es/m/2194723.1680736788@sss.pgh.pa.us	2023-04-05 17:50:09 -07:00
Andres Freund	31966b151e	bufmgr: Introduce infrastructure for faster relation extension The primary bottlenecks for relation extension are: 1) The extension lock is held while acquiring a victim buffer for the new page. Acquiring a victim buffer can require writing out the old page contents including possibly needing to flush WAL. 2) When extending via ReadBuffer() et al, we write a zero page during the extension, and then later write out the actual page contents. This can nearly double the write rate. 3) The existing bulk relation extension infrastructure in hio.c just amortized the cost of acquiring the relation extension lock, but none of the other costs. Unfortunately 1) cannot currently be addressed in a central manner as the callers to ReadBuffer() need to acquire the extension lock. To address that, this this commit moves the responsibility for acquiring the extension lock into bufmgr.c functions. That allows to acquire the relation extension lock for just the required time. This will also allow us to improve relation extension further, without changing callers. The reason we write all-zeroes pages during relation extension is that we hope to get ENOSPC errors earlier that way (largely works, except for CoW filesystems). It is easier to handle out-of-space errors gracefully if the page doesn't yet contain actual tuples. This commit addresses 2), by using the recently introduced smgrzeroextend(), which extends the relation, without dirtying the kernel page cache for all the extended pages. To address 3), this commit introduces a function to extend a relation by multiple blocks at a time. There are three new exposed functions: ExtendBufferedRel() for extending the relation by a single block, ExtendBufferedRelBy() to extend a relation by multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up to a certain size. To avoid duplicating code between ReadBuffer(P_NEW) and the new functions, ReadBuffer(P_NEW) now implements relation extension with ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the relation lock is already held. Note that this commit does not yet lead to a meaningful performance or scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be converted to ExtendBuffered*(), which will be done in subsequent commits. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-05 16:21:09 -07:00
Andres Freund	12f3867f55	bufmgr: Support multiple in-progress IOs by using resowner A future patch will add support for extending relations by multiple blocks at once. To be concurrency safe, the buffers for those blocks need to be marked as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from an IO error for a single buffer. This commit extends that infrastructure to multiple buffers by using the resource owner infrastructure. This commit increases the size of the ResourceOwnerData struct, which appears to have a just about measurable overhead in very extreme workloads. Medium term we are planning to substantially shrink the size of ResourceOwnerData. Short term the increase is small enough to not worry about it for now. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de	2023-04-05 14:17:55 -07:00
Andres Freund	dad50f677c	bufmgr: Acquire and clean victim buffer separately Previously we held buffer locks for two buffer mapping partitions at the same time to change the identity of buffers. Particularly for extending relations needing to hold the extension lock while acquiring a victim buffer is painful.But it also creates a bottleneck for workloads that just involve reads. Now we instead first acquire a victim buffer and write it out, if necessary. Then we remove that buffer from the old partition with just the old partition's partition lock held and insert it into the new partition with just that partition's lock held. By separating out the victim buffer acquisition, future commits will be able to change relation extensions to scale better. On my workstation, a micro-benchmark exercising buffered reads strenuously and under a lot of concurrency, sees a >2x improvement. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-05 13:47:46 -07:00
Andres Freund	794f259447	bufmgr: Add Pin/UnpinLocalBuffer() So far these were open-coded in quite a few places, without a good reason. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-05 10:42:17 -07:00
Andres Freund	819b69a81d	bufmgr: Add some more error checking [infrastructure] around pinning This adds a few more assertions against a buffer being local in places we don't expect, and extracts the check for a buffer being pinned exactly once from LockBufferForCleanup() into its own function. Later commits will use this function. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi	2023-04-05 10:42:17 -07:00
Andres Freund	4d330a61bb	Add smgrzeroextend(), FileZero(), FileFallocate() smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple blocks. When extending by a small number of blocks, use FileZero() instead, as using posix_fallocate() for small numbers of blocks is inefficient for some file systems / operating systems. FileZero() is also used as the fallback for FileFallocate() on platforms / filesystems that don't support fallocate. A big advantage of using posix_fallocate() is that it typically won't cause dirty buffers in the kernel pagecache. So far the most common pattern in our code is that we smgrextend() a page full of zeroes and put the corresponding page into shared buffers, from where we later write out the actual contents of the page. If the kernel, e.g. due to memory pressure or elapsed time, already wrote back the all-zeroes page, this can lead to doubling the amount of writes reaching storage. There are no users of smgrzeroextend() as of this commit. That will follow in future commits. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de	2023-04-05 10:06:39 -07:00
Andres Freund	3d6a98457d	Don't initialize page in {vm,fsm}_extend(), not needed The read path needs to be able to initialize pages anyway, as relation extensions are not durable. By avoiding initializing pages, we can, in a future patch, extend the relation by multiple blocks at once. Using smgrextend() for {vm,fsm}_extend() is not a good idea in general, as at least one page of the VM/FSM will be read immediately after, always causing a cache miss, requiring us to read content we just wrote. Discussion: https://postgr.es/m/20230301223515.pucbj7nb54n4i4nv@awork3.anarazel.de	2023-04-05 08:19:39 -07:00
Andres Freund	8a2b1b1477	bufmgr: Remove buffer-write-dirty tracepoints The trace point was using the relfileno / fork / block for the to-be-read-in buffer. Some upcoming work would make that more expensive to provide. We still have buffer-flush-start/done, which can serve most tracing needs that buffer-write-dirty could serve. Discussion: https://postgr.es/m/f5164e7a-eef6-8972-75a3-8ac622ed0c6e@iki.fi	2023-04-03 18:02:41 -07:00
David Rowley	8d928e3a9f	Rename BufferAccessStrategyData.ring_size to nbuffers The new name better reflects what the field is - the size of the buffers[] array. ring_size sounded more like it is in units of bytes. An upcoming commit allows a BufferAccessStrategy of custom sizes, so it seems relevant to improve this beforehand. Author: Melanie Plageman Reviewed-by: David Rowley Discussion: https://postgr.es/m/CAAKRu_YefVbhg4rAxU2frYFdTap61UftH-kUNQZBvAs%2BYK81Zg%40mail.gmail.com	2023-04-03 23:31:16 +12:00
Andres Freund	8aaa04b32d	Track shared buffer hits in pg_stat_io Among other things, this should make it easier to calculate a useful cache hit ratio by excluding buffer reads via buffer access strategies. As buffer access strategies reuse buffers (and thus evict the prior buffer contents), it is normal to see reads on repeated scans of the same data. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAAKRu_beMa9Hzih40%3DXPYqhDVz6tsgUGTrhZXRo%3Dunp%2Bszb%3DUA%40mail.gmail.com	2023-03-30 19:24:21 -07:00
Andres Freund	ca7b3c4c00	pg_stat_wal: Accumulate time as instr_time instead of microseconds In instr_time.h it is stated that: * When summing multiple measurements, it's recommended to leave the * running sum in instr_time form (ie, use INSTR_TIME_ADD or * INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end. The reason for that is that converting to microseconds is not cheap, and can loose precision. Therefore this commit changes 'PendingWalStats' to use 'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time' and 'wal_sync_time'. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/1feedb83-7aa9-cb4b-5086-598349d3f555@gmail.com	2023-03-30 14:23:14 -07:00
Andres Freund	f9054b7a7c	Fix format code in fd.c debugging infrastructure These were not sufficiently adjusted in `2d4f1ba6cf`.	2023-03-30 10:26:10 -07:00
Andres Freund	558cf80387	bufmgr: Fix undefined behaviour with, unrealistically, large temp_buffers Quoting Melanie: > Since if buffer is INT_MAX, then the -(buffer + 1) version invokes > undefined behavior while the -buffer - 1 version doesn't. All other places were already using the correct version. I (Andres), copied the code into more places in a patch. Melanie caught it in review, but to prevent more people from copying the bad code, fix it. Even if it is a theoretical issue. We really ought to wrap these accesses in a helper function... As this is a theoretical issue, don't backpatch. Reported-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_aW2SX_LWtwHgfnqYpBrunMLfE9PD6-ioPpkh92XH0qpg@mail.gmail.com	2023-03-30 10:26:10 -07:00
Peter Eisentraut	261cf8962b	Fix incorrect format placeholders	2023-03-30 08:33:43 +02:00
Tom Lane	58c9600a9f	Remove empty function BufmgrCommit(). This function has been a no-op for over a decade. Even if bufmgr regains a need to be called during commit, it seems unlikely that the most appropriate call points would be precisely here, so it's not doing us much good as a placeholder either. Now, removing it probably doesn't save any noticeable number of cycles --- but the main call is inside the commit critical section, and the less work done there the better. Matthias van de Meent Discussion: https://postgr.es/m/CAEze2Wi1=tLKbxZnXzcD+8fYKyKqBtivVakLQC_mYBsP4Y8qVA@mail.gmail.com	2023-03-29 09:13:57 -04:00

1 2 3 4 5 ...

2534 Commits