postgres

mirror of https://github.com/postgres/postgres.git synced 2025-08-30 06:01:21 +03:00

Author	SHA1	Message	Date
Alvaro Herrera	242dfcbafa	Avoid killing btree items that are already dead _bt_killitems marks btree items dead when a scan leaves the page where they live, but it does so with only share lock (to improve concurrency). This was historicall okay, since killing a dead item has no consequences. However, with the advent of data checksums and wal_log_hints, this action incurs a WAL full-page-image record of the page. Multiple concurrent processes would write the same page several times, leading to WAL bloat. The probability of this happening can be reduced by only killing items if they're not already dead, so change the code to do that. The problem could eliminated completely by having _bt_killitems upgrade to exclusive lock upon seeing a killable item, but that would reduce concurrency so it's considered a cure worse than the disease. Backpatch all the way back to 9.5, since wal_log_hints was introduced in 9.4. Author: Masahiko Sawada <masahiko.sawada@2ndquadrant.com> Discussion: https://postgr.es/m/CA+fd4k6PeRj2CkzapWNrERkja5G0-6D-YQiKfbukJV+qZGFZ_Q@mail.gmail.com	2020-05-15 16:50:34 -04:00
Tom Lane	5da14938f7	Rename SLRU structures and associated LWLocks. Originally, the names assigned to SLRUs had no purpose other than being shmem lookup keys, so not a lot of thought went into them. As of v13, though, we're exposing them in the pg_stat_slru view and the pg_stat_reset_slru function, so it seems advisable to take a bit more care. Rename them to names based on the associated on-disk storage directories (which fortunately we did think about, to some extent; since those are also visible to DBAs, consistency seems like a good thing). Also rename the associated LWLocks, since those names are likewise user-exposed now as wait event names. For the most part I only touched symbols used in the respective modules' SimpleLruInit() calls, not the names of other related objects. This renaming could have been taken further, and maybe someday we will do so. But for now it seems undesirable to change the names of any globally visible functions or structs, so some inconsistency is unavoidable. (But I did terminate "oldserxid" with prejudice, as I found that name both unreadable and not descriptive of the SLRU's contents.) Table 27.12 needs re-alphabetization now, but I'll leave that till after the other LWLock renamings I have in mind. Discussion: https://postgr.es/m/28683.1589405363@sss.pgh.pa.us	2020-05-15 14:28:25 -04:00
Tom Lane	5cbfce562f	Initial pgindent and pgperltidy run for v13. Includes some manual cleanup of places that pgindent messed up, most of which weren't per project style anyway. Notably, it seems some people didn't absorb the style rules of commit `c9d297751`, because there were a bunch of new occurrences of function calls with a newline just after the left paren, all with faulty expectations about how the rest of the call would get indented.	2020-05-14 13:06:50 -04:00
Tom Lane	29c3e2dd5a	Collect built-in LWLock tranche names statically, not dynamically. There is little point in using the LWLockRegisterTranche mechanism for built-in tranche names. It wastes cycles, it creates opportunities for bugs (since failing to register a tranche name is a very hard-to-detect problem), and the lack of any centralized list of names encourages sloppy nonconformity in name choices. Moreover, since we have a centralized list of the tranches anyway in enum BuiltinTrancheIds, we're certainly not buying any flexibility in return for these disadvantages. Hence, nuke all the backend-internal LWLockRegisterTranche calls, and instead provide a const array of the builtin tranche names. (I have in mind to change a bunch of these names shortly, but this patch is just about getting them into one place.) Discussion: https://postgr.es/m/9056.1589419765@sss.pgh.pa.us	2020-05-14 11:10:31 -04:00
Alvaro Herrera	17cc133f01	Dial back -Wimplicit-fallthrough to level 3 The additional pain from level 4 is excessive for the gain. Also revert all the source annotation changes to their original wordings, to avoid back-patching pain. Discussion: https://postgr.es/m/31166.1589378554@sss.pgh.pa.us	2020-05-13 15:31:14 -04:00
Tom Lane	81ca868630	Improve management of SLRU statistics collection. Instead of re-identifying which statistics bucket to use for a given SLRU on every counter increment, do it once during shmem initialization. This saves a fair number of cycles, and there's no real cost because we could not have a bucket assignment that varies over time or across backends anyway. Also, get rid of the ill-considered decision to let pgstat.c pry directly into SLRU's shared state; it's cleaner just to have slru.c pass the stats bucket number. In consequence of these changes, there's no longer any need to store an SLRU's LWLock tranche info in shared memory, so get rid of that, making this a net reduction in shmem consumption. (That partly reverts fe702a7b3.) This is basically code review for `28cac71bd`, so I also cleaned up some comments, removed a dangling extern declaration, fixed some things that should be static and/or const, etc. Discussion: https://postgr.es/m/3618.1589313035@sss.pgh.pa.us	2020-05-13 13:08:23 -04:00
Alvaro Herrera	850196b610	Adjust walsender usage of xlogreader, simplify APIs * Have both physical and logical walsender share a 'xlogreader' state struct for tracking state. This replaces the existing globals sendSeg and sendCxt. * Change WALRead not to receive XLogReaderState->seg and ->segcxt as separate arguments anymore; just use the ones from 'state'. This is made possible by the above change. * have the XLogReader segment_open contract require the callbacks to install the file descriptor in the state struct themselves instead of returning it. xlogreader was already ignoring any possible failed return from the callbacks, relying solely on them never returning. (This point is not altogether excellent, as it means the callbacks have to know more of XLogReaderState; but to really improve on that we would have to pass back error info from the callbacks to xlogreader. And the complexity would not be saved but instead just transferred to the callers of WALRead, which would have to learn how to throw errors from the open_segment callback in addition of, as currently, from pg_pread.) * segment_open no longer receives the 'segcxt' as a separate argument, since it's part of the XLogReaderState argument. Per comments from Kyotaro Horiguchi. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20200511203336.GA9913@alvherre.pgsql	2020-05-13 12:17:08 -04:00
Alvaro Herrera	3e9744465d	Add -Wimplicit-fallthrough to CFLAGS and CXXFLAGS Use it at level 4, a bit more restrictive than the default level, and tweak our commanding comments to FALLTHROUGH. (However, leave zic.c alone, since it's external code; to avoid the warnings that would appear there, change CFLAGS for that file in the Makefile.) Author: Julien Rouhaud <rjuju123@gmail.com> Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/20200412081825.qyo5vwwco3fv4gdo@nol Discussion: https://postgr.es/m/flat/E1fDenm-0000C8-IJ@gemulon.postgresql.org	2020-05-12 16:07:30 -04:00
Michael Paquier	078c9cd258	Fix comment in xlogutils.c The existing callers of XLogReadDetermineTimeline() performing recovery need to check a replay LSN position when determining on which timeline to read a WAL page. A portion of the comment describing this function said exactly that, while referring to a routine for fetching a write LSN, something not available in recovery. Author: Kyotaro Horiguchi Discussion: https://postgr.es/m/20200511.101619.2043820539323292957.horikyota.ntt@gmail.com	2020-05-12 14:43:57 +09:00
Peter Geoghegan	624686abcf	Adjust "root of to-be-deleted subtree" function. Restructure the function that locates the root of the to-be-deleted subtree during nbtree page deletion. Handle the conditions that make page deletion unsafe in a slightly more uniform way, and acknowledge the fact that the behavior with incomplete splits on internal pages is different (as pointed out in the nbtree README as of commit `35bc0ec7`). Also invent new terminology that avoids ambiguity around which pages are about to be deleted. Consistently use the term "to-be-deleted subtree", not the ambiguous term "branch". We were calling the subtree parent page the "top parent page", but that was quite misleading. The top parent page usually refers to a page unlinked from its siblings and marked deleted (during the second stage of page deletion). There was one kind of top parent page that we merely removed a downlink from, and another kind of top parent page that we actually marked deleted. Eliminate the ambiguity by inventing a new term ("subtree parent page") that refers to the former kind of page only.	2020-05-11 11:01:07 -07:00
Alvaro Herrera	b060dbe000	Rework XLogReader callback system Code review for `0dc8ead463`, prompted by a bug closed by `91c40548d5`. XLogReader's system for opening and closing segments had gotten too complicated, with callbacks being passed at both the XLogReaderAllocate level (read_page) as well as at the WALRead level (segment_open). This was confusing and hard to follow, so restructure things so that these callbacks are passed together at XLogReaderAllocate time, and add another callback to the set (segment_close) to make it a coherent whole. Also, ensure XLogReaderState is an argument to all the callbacks, so that they can grab at the ->private data if necessary. Document the whole arrangement more clearly. Author: Álvaro Herrera <alvherre@alvh.no-ip.org> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20200422175754.GA19858@alvherre.pgsql	2020-05-08 15:40:11 -04:00
Fujii Masao	f2ff203596	Report missing wait event for timeline history file. TimelineHistoryRead and TimelineHistoryWrite wait events are reported during waiting for a read and write of a timeline history file, respectively. However, previously, TimelineHistoryRead wait event was not reported while readTimeLineHistory() was reading a timeline history file. Also TimelineHistoryWrite was not reported while writeTimeLineHistory() was writing one line with the details of the timeline split, at the end. This commit fixes these issues. Back-patch to v10 where wait events for a timeline history file was added. Author: Masahiro Ikeda Reviewed-by: Michael Paquier, Fujii Masao Discussion: https://postgr.es/m/d11b0c910b63684424e06772eb844ab5@oss.nttdata.com	2020-05-08 10:36:40 +09:00
Peter Geoghegan	cd8c73a38a	Refactor nbtree deletion INCOMPLETE_SPLIT check. Factor out code common to _bt_lock_branch_parent() and _bt_pagedel() into a new utility function. This new function is used to check that the left sibling of a deletion target page does not have the INCOMPLETE_SPLIT page flag set. If it is set then deletion is unsafe; there won't be a usable pivot tuple (with a downlink) in the parent page that points to the deletion target page. The page deletion algorithm is not prepared to deal with that. Also restructure an existing, related utility function that checks if the right sibling of the target page has the ISHALFDEAD page flag set. This organization highlights the symmetry between the two cases. The goal is to make the design of page deletion clearer. Both functions involve a sibling page with a flag that indicates that there was an interrupted operation (a page split or a page deletion) that resulted in a page pointed to by sibling pages, but not pointed to in the parent. And, both functions indicate if page deletion is unsafe due to the absence of a particular downlink in the parent page.	2020-05-07 16:08:54 -07:00
Amit Kapila	69bfaf2e1d	Change the display of WAL usage statistics in Explain. In commit `33e05f89c5`, we have added the option to display WAL usage statistics in Explain and auto_explain. The display format used two spaces between each field which is inconsistent with Buffer usage statistics which is using one space between each field. Change the format to make WAL usage statistics consistent with Buffer usage statistics. This commit also changed the usage of "full page writes" to "full page images" for WAL usage statistics to make it consistent with other parts of code and docs. Author: Julien Rouhaud, Amit Kapila Reviewed-by: Justin Pryzby, Kyotaro Horiguchi and Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com	2020-05-05 08:00:53 +05:30
Peter Geoghegan	9dc7251417	Refactor btvacuumpage(). Remove one of the arguments to btvacuumpage(), and give up on the idea that it's a recursive function. We now use the term "backtracking" to refer to the case where an earlier block must be visited to make sure no tuples that need to be removed were missed. Advertising btvacuumpage() as a recursive function was unhelpful. In reality the function always simulates recursion with a loop (it doesn't actually call itself). This wasn't just necessary as a precaution (per the comments mentioning tail recursion), though. There is no reliable natural limit on the number of times we can backtrack. There are important behavioral difference when "recursing"/backtracking, mostly related to page deletion. We don't perform page deletion when backtracking due to the extra complexity. And when we recurse, we're not performing a physical order scan anymore, so we expect fairly different conditions to hold for the page. Structuring the code like this makes it clearer how _bt_pagedel() cooperates with btvacuumpage() and btvacuumscan() (as established in commit `b0229f26` and commit `73a076b0`). Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzmRGMDWiLMcb+zagG9652PboNN4Gfcq1Gc_wJL6A716MA@mail.gmail.com	2020-05-02 14:04:33 -07:00
Peter Geoghegan	69cf853fe7	Clear up issue with FSM and oldest bpto.xact. On further reflection, code comments added by commit `b0229f26` slightly misrepresented how we determine the oldest bpto.xact for the index. btvacuumpage() does not treat the bpto.xact of a page that it put in the FSM as a candidate to be the oldest deleted page (the delete-marked page that has the oldest bpto.xact XID among all pages encountered). The definition of a deleted page for the purposes of the bpto.xact calculation is different from the definition used by the bulk delete statistics. The bulk delete statistics don't distinguish between pages that were deleted by the current VACUUM, pages deleted by a previous VACUUM operation but not yet recyclable/reusable, and pages that are reusable (though reusable pages are counted separately). Backpatch: 11-, just like commit `b0229f26`.	2020-05-01 12:19:44 -07:00
Peter Geoghegan	4e21f8b633	Reorder function prototypes for consistency.	2020-05-01 10:03:38 -07:00
Peter Geoghegan	73a076b03f	Fix undercounting in VACUUM VERBOSE output. The logic for determining how many nbtree pages in an index are deleted pages sometimes undercounted pages. Pages that were deleted by the current VACUUM operation (as opposed to some previous VACUUM operation whose deleted pages have yet to be reused) were sometimes overlooked. The final count is exposed to users through VACUUM VERBOSE's "%u index pages have been deleted" output. btvacuumpage() avoided double-counting when _bt_pagedel() deleted more than one page by assuming that only one page was deleted, and that the additional deleted pages would get picked up during a future call to btvacuumpage() by the same VACUUM operation. _bt_pagedel() can legitimately delete pages that the btvacuumscan() scan will not visit again, though, so that assumption was slightly faulty. Fix the accounting by teaching _bt_pagedel() about its caller's requirements. It now only reports on pages that it knows btvacuumscan() won't visit again (including the current btvacuumpage() page), so everything works out in the end. This bug has been around forever. Only backpatch to v11, though, to keep _bt_pagedel() is sync on the branches that have today's bugfix commit `b0229f26da`. Note that this commit changes the signature of _bt_pagedel(), just like commit `b0229f26da`. Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com Backpatch: 11-	2020-05-01 09:51:09 -07:00
Peter Geoghegan	b0229f26da	Fix bug in nbtree VACUUM "skip full scan" feature. Commit `857f9c36cd` (which taught nbtree VACUUM to skip a scan of the index from btcleanup in situations where it doesn't seem worth it) made VACUUM maintain the oldest btpo.xact among all deleted pages for the index as a whole. It failed to handle all the details surrounding pages that are deleted by the current VACUUM operation correctly (though pages deleted by some previous VACUUM operation were processed correctly). The most immediate problem was that the special area of the page was examined without a buffer pin at one point. More fundamentally, the handling failed to account for the full range of _bt_pagedel() behaviors. For example, _bt_pagedel() sometimes deletes internal pages in passing, as part of deleting an entire subtree with btvacuumpage() caller's page as the leaf level page. The original leaf page passed to _bt_pagedel() might not be the page that it deletes first in cases where deletion can take place. It's unclear how disruptive this bug may have been, or what symptoms users might want to look out for. The issue was spotted during unrelated code review. To fix, push down the logic for maintaining the oldest btpo.xact to _bt_pagedel(). btvacuumpage() is now responsible for pages that were fully deleted by a previous VACUUM operation, while _bt_pagedel() is now responsible for pages that were deleted by the current VACUUM operation (this includes half-dead pages from a previous interrupted VACUUM operation that become fully deleted in _bt_pagedel()). Note that _bt_pagedel() should never encounter an existing deleted page. This commit theoretically breaks the ABI of a stable release by changing the signature of _bt_pagedel(). However, if any third party extension is actually affected by this, then it must already be completely broken (since there are numerous assumptions made in _bt_pagedel() that cannot be met outside of VACUUM). It seems highly unlikely that such an extension actually exists, in any case. Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com Backpatch: 11-, where the "skip full scan" feature was introduced.	2020-05-01 08:39:52 -07:00
Peter Geoghegan	dd1f645cc8	Fix AddressSanitizer use-after-scope complaint. XLogRegisterBufData() does not copy data pointed to by caller's pointer argument. Oversight in commit `0d861bbb70`. Author: Peter Eisentraut Reported-By: Peter Eisentraut Discussion: https://postgr.es/m/21800dbe-a13e-22f7-d423-b81db9d249f5@2ndquadrant.com	2020-04-30 12:31:56 -07:00
Peter Geoghegan	ab2343d4cb	Remove redundant _bt_killitems() buffer check. _bt_getbuf() cannot return an invalid buffer. Oversight in commit `2ed5b87f96`.	2020-04-29 18:17:49 -07:00
Michael Paquier	641b76d9d1	Fix some typos Author: Justin Pryzby Discussion: https://postgr.es/m/20200408165653.GF2228@telsasoft.com	2020-04-27 14:59:36 +09:00
Peter Eisentraut	f057980149	Fix typo from `303640199d`	2020-04-26 13:48:33 +02:00
Peter Geoghegan	7154aa16a6	Fix another minor page deletion buffer lock issue. Avoid accessing the leaf page's top parent tuple without a buffer lock held during the second phase of nbtree page deletion. The old approach was safe, though only because VACUUM never drops its buffer pin (and because only VACUUM itself can modify a half-dead page). Even still, it seems like a good idea to be strict here. Tighten things up by copying the top parent page's block number to a local variable before releasing the buffer lock on the leaf page -- not after. This is a follow-up to commit `fa7ff642`, which fixed a similar issue in the first phase of nbtree page deletion. Update some related comments in passing. Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com	2020-04-25 16:45:20 -07:00
Peter Geoghegan	fa7ff642c2	Fix minor nbtree page deletion buffer lock issue. Avoid accessing the deletion target page's special area during nbtree page deletion at a point where there is no buffer lock held. This issue was detected by a patch that extends Valgrind's memcheck tool to mark nbtree pages that are unsafe to access (due to not having a buffer lock or buffer pin) as NOACCESS. We do hold a buffer pin at this point, and only access the special area, so the old approach was safe. Even still, it seems like a good idea to tighten up the rules in this area. There is no reason to not simply insist on always holding a buffer lock (not just a pin) when accessing nbtree pages. Update some related comments in passing. Discussion: https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com	2020-04-25 14:17:02 -07:00
Michael Paquier	4e87c4836a	Fix handling of WAL segments ready to be archived during crash recovery `78ea8b5` has fixed an issue related to the recycling of WAL segments on standbys depending on archive_mode. However, it has introduced a regression with the handling of WAL segments ready to be archived during crash recovery, causing those files to be recycled without getting archived. This commit fixes the regression by tracking in shared memory if a live cluster is either in crash recovery or archive recovery as the handling of WAL segments ready to be archived is different in both cases (those WAL segments should not be removed during crash recovery), and by using this new shared memory state to decide if a segment can be recycled or not. Previously, it was not possible to know if a cluster was in crash recovery or archive recovery as the shared state was able to track only if recovery was happening or not, leading to the problem. A set of TAP tests is added to close the gap here, making sure that WAL segments ready to be archived are correctly handled when a cluster is in archive or crash recovery with archive_mode set to "on" or "always", for both standby and primary. Reported-by: Benoît Lobréau Author: Jehan-Guillaume de Rorthais Reviewed-by: Kyotaro Horiguchi, Fujii Masao, Michael Paquier Discussion: https://postgr.es/m/20200331172229.40ee00dc@firost Backpatch-through: 9.5	2020-04-24 08:48:28 +09:00
Peter Eisentraut	aaf069aa34	Remove HEAPDEBUGALL This has been broken since PostgreSQL 12 and was probably never really used. PostgreSQL 12 added an analogous HEAPAMSLOTDEBUGALL, which still works right now, but it's also not very useful, so remove that as well. Discussion: https://www.postgresql.org/message-id/flat/645c0646-4218-d4c3-409a-a7003a0c108d%402ndquadrant.com	2020-04-22 08:35:33 +02:00
Tom Lane	d12bdba77b	Fix possible crash during FATAL exit from reindexing. index.c supposed that it could just use a PG_TRY block to clean up the state associated with an active REINDEX operation. However, that code doesn't run if we do a FATAL exit --- for example, due to a SIGTERM shutdown signal --- while the REINDEX is happening. And that state does get consulted during catalog accesses, which makes it problematic if we do any catalog accesses during shutdown --- for example, to clean up any temp tables created in the session. If this combination of circumstances occurred, we could find ourselves trying to access already-freed memory. In debug builds that'd fairly reliably cause an assertion failure. In production we might often get away with it, but with some bad luck it could cause a core dump. Another possible bad outcome is an erroneous conclusion that an index-to-be-accessed is being reindexed; but it looks like that would be unlikely to have any consequences worse than failing to drop temp tables right away. (They'd still get dropped by the next session that uses that temp schema.) To fix, get rid of the use of PG_TRY here, and instead hook into the transaction abort mechanisms to clean up reindex state. Per bug #16378 from Alexander Lakhin. This has been wrong for a very long time, so back-patch to all supported branches. Discussion: https://postgr.es/m/16378-7a70ca41b3ec2009@postgresql.org	2020-04-21 15:58:42 -04:00
Peter Geoghegan	1542e16f2c	Consider outliers in split interval calculation. Commit `0d861bbb`, which introduced deduplication to nbtree, added some logic to take large posting list tuples into account when choosing a split point. We subtract firstright posting list overhead from the projected new high key size when calculating leftfree/rightfree values for an affected candidate split point. Posting list tuples aren't special to nbtsplitloc.c, but taking them into account like this makes a huge difference in practice. Posting list tuples are frequently tuple size outliers. However, commit `0d861bbb` missed a closely related issue: split interval itself is calculated based on the assumption that tuples on the page being split are roughly equisized. That assumption was acceptable back when commit `fab25024` taught the logic for choosing a split point about suffix truncation, but it's pretty questionable now that very large tuple sizes are common. This oversight led to unbalanced page splits in low cardinality multi-column indexes when deduplication was used: page splits that don't give sufficient weight to how unbalanced the split is when the interval happens to include some large posting list tuples (and when most other tuples on the page are not so large). Nail this down by calculating an initial split interval in a way that's attuned to the actual cost that we want to keep under control (not a fuzzy proxy for the cost): apply a leftfree + rightfree evenness test to each candidate split point that actually gets included in the split interval (for the default strategy). This replaces logic that used a percentage of all legal split points for the page as the basis of the initial split interval. Discussion: https://postgr.es/m/CAH2-WznJt5aT2uUB2Bs+JBLdwe0XTX67+xeLFcaNvCKxO=QBVQ@mail.gmail.com	2020-04-21 09:59:24 -07:00
Peter Geoghegan	f0ca378d4c	Slightly simplify nbtree split point choice loop. Spotted during post-commit review of the nbtree deduplication commit (commit `0d861bbb`).	2020-04-15 15:47:26 -07:00
Peter Geoghegan	4a05a64095	Remove obsolete "hole in center of page" comment. A comment from the Berkeley days incorrectly claimed that the page management code cares about the contents of the hole in the center of the page (at least in the case of the left half of an nbtree page split). Commit `8fa30f906b` added an addendum that stated that the original comment was "probably obsolete". It's definitely obsolete, though, so remove the original comment plus the addendum.	2020-04-14 14:38:28 -07:00
Peter Geoghegan	80634e3b18	Rearrange _bt_insertonpg() "update metapage" code. Nest the "update metapage as part of insert into root-like page" branch inside the broader "insert into internal page" branch. This improves readability.	2020-04-14 09:33:18 -07:00
Peter Geoghegan	f762b2feba	Add defensive "split_only_page" nbtree assertion. Clearly it's not okay for nbtree to split a page that is the only page on its level, and then find that it has to split the parent one level up in turn. There is simply no code to handle the split_only_page case in the _bt_insertonpg() "newitem won't fit" branch (only the "newitem fits" branch handles split_only_page). Add a defensive assertion that will fail if a split_only_page call to _bt_insertonpg() somehow ends up splitting the target/parent page. I (pgeoghegan) believe that we don't need split_only_page handling for the "newitem won't fit" branch because anybody calling _bt_insertonpg() like this would have to hold a lock on the same one and only child page.	2020-04-13 21:11:03 -07:00
Amit Kapila	a6fea120a7	Comments and doc fixes for commit `40d964ec99`. Reported-by: Justin Pryzby Author: Justin Pryzby, with few changes by me Reviewed-by: Amit Kapila and Sawada Masahiko Discussion: https://postgr.es/m/20200322021801.GB2563@telsasoft.com	2020-04-14 08:10:27 +05:30
Peter Geoghegan	826ee1a019	Make _bt_insertonpg() more like _bt_split(). It seems like a good idea for nbtree's retail insert code to be absolutely consistent with nbtree's page split code for anything that naturally requires equivalent handling. Anything that concerns inserting newitem (which is handled as part of the page split atomic action when a page split is required) should work in exactly the same way. With that in mind, make _bt_insertonpg() handle 'cbuf' in a way that matches _bt_split().	2020-04-13 19:26:41 -07:00
Peter Geoghegan	bc3087b626	Harmonize nbtree page split point code. An nbtree split point can be thought of as a point between two adjoining tuples from an imaginary version of the page being split that includes the incoming/new item (in addition to the items that really are on the page). These adjoining tuples are called the lastleft and firstright tuples. The variables that represent split points contained a field called firstright, which is an offset number of the first data item from the original page that goes on the new right page. The corresponding tuple from origpage was usually the same thing as the actual firstright tuple, but not always: the firstright tuple is sometimes the new/incoming item instead. This situation seems unnecessarily confusing. Make things clearer by renaming the origpage offset returned by _bt_findsplitloc() to "firstrightoff". We now have a firstright tuple and a firstrightoff offset number which are comparable to the newitem/lastleft tuples and the newitemoff/lastleftoff offset numbers respectively. Also make sure that we are consistent about how we describe nbtree page split point state. Push the responsibility for dealing with pg_upgrade'd !heapkeyspace indexes down to lower level code, relieving _bt_split() from dealing with it directly. This means that we always have a palloc'd left page high key on the leaf level, no matter what. This enables simplifying some of the code (and code comments) within _bt_split(). Finally, restructure the page split code to make it clearer why suffix truncation (which only takes place during leaf page splits) is completely different to the first data item truncation that takes place during internal page splits. Tuples are marked as having fewer attributes stored in both cases, and the firstright tuple is truncated in both cases, so it's easy to imagine somebody missing the distinction.	2020-04-13 16:39:55 -07:00
Amit Kapila	ef08ca113f	Cosmetic fixups for WAL usage work. Reported-by: Justin Pryzby and Euler Taveira Author: Justin Pryzby and Julien Rouhaud Reviewed-by: Amit Kapila Discussion: https://postgr.es/m/CAB-hujrP8ZfUkvL5OYETipQwA=e3n7oqHFU=4ZLxWS_Cza3kQQ@mail.gmail.com	2020-04-13 15:31:16 +05:30
Amit Kapila	5c71362174	Allow parallel create index to accumulate buffer usage stats. Currently, we don't account for buffer usage incurred by parallel workers for parallel create index. This commit allows each worker to record the buffer usage stats and leader backend to accumulate that stats at the end of the operation. This will allow pg_stat_statements to display correct buffer usage stats for (parallel) create index command. Reported-by: Julien Rouhaud Author: Sawada Masahiko Reviewed-by: Dilip Kumar, Julien Rouhaud and Amit Kapila Backpatch-through: 11, where this was introduced Discussion: https://postgr.es/m/20200328151721.GB12854@nol	2020-04-09 09:49:30 +05:30
Thomas Munro	d140f2f3e2	Rationalize GetWalRcv{Write,Flush}RecPtr(). GetWalRcvWriteRecPtr() previously reported the latest flushed location. Adopt the conventional terminology used elsewhere in the tree by renaming it to GetWalRcvFlushRecPtr(), and likewise for some related variables that used the term "received". Add a new definition of GetWalRcvWriteRecPtr(), which returns the latest written value. This will allow later patches to use the value for non-data-integrity purposes, without having to wait for the flush pointer to advance. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com	2020-04-08 23:45:09 +12:00
Alexander Korotkov	1aac32df89	Revert `0f5ca02f53` `0f5ca02f53` introduces 3 new keywords. It appears to be too much for relatively small feature. Given now we past feature freeze, it's already late for discussion of the new syntax. So, revert. Discussion: https://postgr.es/m/28209.1586294824%40sss.pgh.pa.us	2020-04-08 11:37:27 +03:00
David Rowley	02a2e8b442	Modify additional power 2 calculations to use new helper functions 2nd pass of modifying various places which obtain the next power of 2 of a number and make them use the new functions added in `f0705bb62`. In passing, also modify num_combinations(). This can be implemented using simple bitshifting rather than looping. Reviewed-by: John Naylor Discussion: https://postgr.es/m/20200114173553.GE32763%40fetter.org	2020-04-08 18:29:51 +12:00
David Rowley	d025cf88ba	Modify various power 2 calculations to use new helper functions First pass of modifying various places that obtain the next power of 2 of a number and make them use the new functions added in pg_bitutils.h instead. This also removes the _hash_log2() function. There are no longer any callers in core. Other users can swap their _hash_log2(n) call to make use of pg_ceil_log2_32(n). Author: David Fetter, with some minor adjustments by me Reviewed-by: John Naylor, Jesse Zhang Discussion: https://postgr.es/m/20200114173553.GE32763%40fetter.org	2020-04-08 16:55:03 +12:00
Andres Freund	75848bc744	snapshot scalability: Move delayChkpt from PGXACT to PGPROC. The goal of separating hotly accessed per-backend data from PGPROC into PGXACT is to make accesses fast (GetSnapshotData() in particular). But delayChkpt is not actually accessed frequently; only when starting a checkpoint. As it is frequently modified (multiple times in the course of a single transaction), storing it in the same cacheline as hotly accessed data unnecessarily dirties a contended cacheline. Therefore move delayChkpt to PGPROC. This is part of a larger series of patches intending to improve GetSnapshotData() scalability. It is committed and pushed separately, as it is independently beneficial (small but measurable win, limited by the other frequent modifications of PGXACT). Author: Andres Freund Reviewed-By: Robert Haas, Thomas Munro, David Rowley Discussion: https://postgr.es/m/20200301083601.ews6hz5dduc3w2se@alap3.anarazel.de	2020-04-07 17:36:23 -07:00
Tomas Vondra	2b88fdde30	Track SLRU page hits in SimpleLruReadPage_ReadOnly SLRU page hits were tracked only in SimpleLruReadPage, but that's not enough because we may hit the page in SimpleLruReadPage_ReadOnly in which case we don't call SimpleLruReadPage at all. Reported-by: Kuntal Ghosh Discussion: https://postgr.es/m/20200119143707.gyinppnigokesjok@development	2020-04-08 02:15:47 +02:00
Andres Freund	91c40548d5	Fix XLogReader FD leak that makes backends unusable after 2PC usage. Before the fix every 2PC commit/abort leaked a file descriptor. As the files are opened using BasicOpenFile(), that quickly leads to the backend running out of file descriptors. Once enough 2PC abort/commit have caused enough FDs to leak, any IO in the backend will fail with "Too many open files", as BasicOpenFilePerm() will have triggered all open files known to fd.c to be closed. The leak causing the problem at hand is a consequence of `0dc8ead46`, but is only exascerbated by it. Previously most XLogPageReadCB callbacks used static variables to cache one open file, but after the commit the cache is private to each XLogReader instance. There never was infrastructure to close FDs at the time of XLogReaderFree, but the way XLogReader was used limited the leak to one FD. This commit just closes the during XLogReaderFree() if the FD is stored in XLogReaderState.seg.ws_segno. This may not be the way to solve this medium/long term, but at least unbreaks 2PC. Discussion: https://postgr.es/m/20200406025651.fpzdb5yyb7qyhqko@alap3.anarazel.de	2020-04-07 17:03:04 -07:00
Peter Geoghegan	60cbd7751c	Remove nbtree BTreeTupleSetAltHeapTID() function. Since heap TID is supposed to be just another key attribute to the implementation, it doesn't make much sense to have separate BTreeTupleSetNAtts() and BTreeTupleSetAltHeapTID() functions. Merge the two functions together. This slightly simplifies _bt_truncate().	2020-04-07 15:56:52 -07:00
Alvaro Herrera	c655077639	Allow users to limit storage reserved by replication slots Replication slots are useful to retain data that may be needed by a replication system. But experience has shown that allowing them to retain excessive data can lead to the primary failing because of running out of space. This new feature allows the user to configure a maximum amount of space to be reserved using the new option max_slot_wal_keep_size. Slots that overrun that space are invalidated at checkpoint time, enabling the storage to be released. Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp	2020-04-07 18:35:00 -04:00
Alexander Korotkov	0f5ca02f53	Implement waiting for given lsn at transaction start This commit adds following optional clause to BEGIN and START TRANSACTION commands. WAIT FOR LSN lsn [ TIMEOUT timeout ] New clause pospones transaction start till given lsn is applied on standby. This clause allows user be sure, that changes previously made on primary would be visible on standby. New shared memory struct is used to track awaited lsn per backend. Recovery process wakes up backend once required lsn is applied. Author: Ivan Kartyshov, Anna Akenteva Reviewed-by: Craig Ringer, Thomas Munro, Robert Haas, Kyotaro Horiguchi Reviewed-by: Masahiko Sawada, Ants Aasma, Dmitry Ivanov, Simon Riggs Reviewed-by: Amit Kapila, Alexander Korotkov Discussion: https://postgr.es/m/0240c26c-9f84-30ea-fca9-93ab2df5f305%40postgrespro.ru	2020-04-07 23:51:10 +03:00
Fujii Masao	4bd0ad9e44	Prevent archive recovery from scanning non-existent WAL files. Previously when there were multiple timelines listed in the history file of the recovery target timeline, archive recovery searched all of them, starting from the newest timeline to the oldest one, to find the segment to read. That is, archive recovery had to continuously fail scanning the segment until it reached the timeline that the segment belonged to. These scans for non-existent segment could be harmful on the recovery performance especially when archival area was located on the remote storage and each scan could take a long time. To address the issue, this commit changes archive recovery so that it skips scanning the timeline that the segment to read doesn't belong to. Author: Kyotaro Horiguchi, tweaked a bit by Fujii Masao Reviewed-by: David Steele, Pavel Suderevsky, Grigory Smolkin Discussion: https://postgr.es/m/16159-f5a34a3a04dc67e0@postgresql.org Discussion: https://postgr.es/m/20200129.120222.1476610231001551715.horikyota.ntt@gmail.com	2020-04-08 00:49:29 +09:00
Thomas Munro	aeec457de8	Add SQL type xid8 to expose FullTransactionId to users. Similar to xid, but 64 bits wide. This new type is suitable for use in various system views and administration functions. Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Takao Fujii <btfujiitkp@oss.nttdata.com> Reviewed-by: Yoshikazu Imai <imai.yoshikazu@fujitsu.com> Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://postgr.es/m/20190725000636.666m5mad25wfbrri%40alap3.anarazel.de	2020-04-07 12:03:59 +12:00

1 2 3 4 5 ...

3813 Commits