An upcoming patch that adds deduplication to the nbtree AM will rely on
_bt_keep_natts_fast() understanding that differences in TOAST input
state can never affect its answer. In particular, two opclass-equal
datums (with opclasses deemed safe for deduplication) should never be
treated as unequal by _bt_keep_natts_fast() due to TOAST input
differences.
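To illustrate the invariant (this is only a sketch of the idea, not the
actual _bt_keep_natts_fast() code): two varlena datums that differ only
in TOAST input state can be compared by detoasting both values before
performing a binary comparison.

    /*
     * Sketch: binary-equality comparison of two varlena datums that is
     * insensitive to TOAST input state (compression, external storage,
     * short varlena headers).
     */
    static bool
    varlena_image_eq(Datum d1, Datum d2)
    {
        struct varlena *v1 = PG_DETOAST_DATUM(d1);
        struct varlena *v2 = PG_DETOAST_DATUM(d2);
        bool        result;

        result = (VARSIZE_ANY_EXHDR(v1) == VARSIZE_ANY_EXHDR(v2) &&
                  memcmp(VARDATA_ANY(v1), VARDATA_ANY(v2),
                         VARSIZE_ANY_EXHDR(v1)) == 0);

        /* free detoasted copies, if any were made */
        if ((Pointer) v1 != DatumGetPointer(d1))
            pfree(v1);
        if ((Pointer) v2 != DatumGetPointer(d2))
            pfree(v2);

        return result;
    }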
This also seems like a good idea on general principle. nbtsplitloc.c
will now occasionally make better decisions about where to split a leaf
page. The behavior of _bt_keep_natts_fast() is now somewhat closer to
the behavior of _bt_keep_natts().
Discussion: https://postgr.es/m/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com
nbtree index builds once stashed the "minimum key" for a page, which was
used as the basis of the pivot tuple that gets placed in the next level
up (i.e. the tuple that stores the downlink to the page in question).
It doesn't quite work that way anymore, so the "minimum key" terminology
now seems misleading (these days the minimum key is actually a straight
copy of the high key from the left sibling, which is a distinct thing in
subtle but important ways). Rename this concept to "low key". This
name is a lot clearer given that there is now a sharp distinction
between pivot and non-pivot tuples. Also remove comments that describe
obsolete details about how the minimum key concept used to work.
Rather than generating the minus infinity item for the leftmost page on
a level by copying the new item and truncating that copy, simply
allocate a small buffer. The old approach confusingly created the
impression that the new item had some kind of significance. This was
another artifact of how things used to work before commits 8224de4f and
dd299df8.
When maintaining or merging patches, one of the most common sources
of conflicts is the list of objects in makefiles. Especially when
the split across lines has been changed on both sides, which is
somewhat common due to attempting to stay below 80 columns, those
conflicts are unnecessarily laborious to resolve.
By splitting OBJS-style lines into one object per line and sorting them
alphabetically, conflicts should be less frequent, and easier to
resolve when they still occur.
Author: Andres Freund
Discussion: https://postgr.es/m/20191029200901.vww4idgcxv74cwes@alap3.anarazel.de
Assert that _bt_binsrch() binary searches with scantid set in insertion
scankey cannot be performed on leaf pages. Leaf-level binary searches
where scantid is set must use _bt_binsrch_insert() instead.
_bt_binsrch_insert() is likely to have additional responsibilities in
the future, such as searching within GIN-style posting lists using
scantid. It seems like a good idea to tighten things up now.
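The new assertion amounts to something like the following sketch, where
key is the insertion scankey (BTScanInsert) and opaque is the
BTPageOpaque of the page being searched (exact placement within
_bt_binsrch() not shown):

    /*
     * scantid-guided searches of leaf pages must go through
     * _bt_binsrch_insert() instead
     */
    Assert(!P_ISLEAF(opaque) || key->scantid == NULL);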
Commit efada2b8e9, which made the nbtree page deletion algorithm more
robust, removed the concept of a half-dead internal page. Remove a
comment about half-dead parent pages that was overlooked.
The Postgres approach to coupling locks during an ascent of the tree is
slightly different to the approach taken by Lehman and Yao. Add a new
paragraph to the "Differences to the Lehman & Yao algorithm" section of
the nbtree README that explains the similarities and differences.
Adjust the struct comment that describes how page splits use their
descent stack to cascade up the tree from the leaf level.
In passing, fix up some unrelated nbtree comments that had typos or were
obsolete.
The initial value of the nbtree stack downlink block number field
recorded during an initial descent of the tree wasn't actually used.
Both _bt_getstackbuf() callers overwrote the value with their own value.
Remove the block number field from the stack struct, and add a child
block number argument to _bt_getstackbuf() in its place. This makes the
overall design of _bt_getstackbuf() clearer.
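After the change, the function's signature looks roughly like this (the
argument name is illustrative):

    Buffer
    _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);

Callers now pass the block number of the child page whose parent
downlink needs to be located, instead of relying on a value stashed in
the stack during the initial descent.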
Author: Peter Geoghegan
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wzmx+UbXt2YNOUCZ-a04VdXU=S=OHuAuD7Z8uQq-PXTYUg@mail.gmail.com
Commit d2086b08b0 removed almost all cases where nbtree must release a
read buffer lock and acquire a write buffer lock instead, so remaining
cases in which that's still necessary are not notable enough to appear
in the nbtree README.
More importantly, holding on to a buffer pin in cases where nbtree must
trade a read lock for a write lock is very unlikely to save any I/O.
This seems to have been a long overlooked throwback to a time when
nbtree cared about write-ordering dependencies, and performed
synchronous buffer writes. It hasn't worked that way in many years.
Use the PageIndexTupleOverwrite() bufpage.c routine within nbtree
instead of deleting a tuple and re-inserting its replacement. This
makes the intent of affected code slightly clearer. It also makes
CREATE INDEX slightly faster, since there is no longer a need to shift
every leaf page's line pointer array back and forth during index builds.
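For example, replacing a page's high key during an index build can now
be a single in-place overwrite rather than a delete followed by a
re-insert. A sketch of the difference (offsets and error handling are
simplified):

    /*
     * Before: delete the old high key, then re-add its replacement,
     * shifting the line pointer array twice
     */
    PageIndexTupleDelete(page, P_HIKEY);
    if (PageAddItem(page, (Item) hikey, hikeysz, P_HIKEY,
                    false, false) == InvalidOffsetNumber)
        elog(ERROR, "failed to add high key");

    /* After: overwrite the existing line pointer's tuple in place */
    if (!PageIndexTupleOverwrite(page, P_HIKEY, (Item) hikey, hikeysz))
        elog(ERROR, "failed to overwrite high key");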
Author: Peter Geoghegan, Anastasia Lubennikova
Reviewed-By: Anastasia Lubennikova
Discussion: https://postgr.es/m/CAH2-Wz=Zk=B9+Vwm376WuO7YTjFc2SSskifQm4Nme3RRRPtOSQ@mail.gmail.com
Commit 857f9c36cd, which taught nbtree VACUUM to avoid unnecessary
index scans, bumped the nbtree version number from 2 to 3, while adding
the ability for nbtree indexes to be upgraded on-the-fly. Various
assertions that assumed that an nbtree index was always on version 2 had
to be changed to accept any supported version (version 2 or 3 on
Postgres 11).
However, a few assertions were missed in the initial commit, all of
which were in code paths that cache a local copy of the metapage
metadata, where the index had been expected to be on the current version
(no longer version 2) as a generic sanity check. Rather than simply
update the assertions, follow-up commit 0a64b45152 intentionally made
the metapage caching code update the per-backend cached metadata version
without changing the on-disk version at the same time. This could even
happen when the planner needed to determine the height of a B-Tree for
costing purposes. The assertions only fail on Postgres v11 when
upgrading from v10, because on v12 they were adjusted to use the
authoritative shared memory metapage by commit dd299df8.
To fix, remove the cache-only upgrade mechanism entirely, and update the
assertions themselves to accept any supported version (go back to using
the cached version in v12). The fix is almost a full revert of commit
0a64b45152 on the v11 branch.
VACUUM only considers the authoritative metapage, and never bothers with
a locally cached version, whereas code everywhere else isn't interested
in the metapage fields that were added by commit 857f9c36cd. It seems
unlikely that this bug has affected any user on v11.
Reported-By: Christoph Berg
Bug: #15896
Discussion: https://postgr.es/m/15896-5b25e260fdb0b081%40postgresql.org
Backpatch: 11-, where VACUUM was taught to avoid unnecessary index scans.
This is take 7, and addresses a set of issues including:
- Fixes for typos and incorrect reference names.
- Removal of unneeded comments.
- Removal of unreferenced functions and structures.
- Fixes regarding variable name consistency.
Author: Alexander Lakhin
Discussion: https://postgr.es/m/10bfd4ac-3e7c-40ab-2b2e-355ed15495e8@gmail.com
The logic just added by commit e3899ffd falls back on a 50:50 page split
in the event of a new item that's just to the right of our provisional
"many duplicates" split point. Fix a comment that incorrectly claimed
that the new item had to be just to the left of our provisional split
point.
Backpatch: 12-, just like commit e3899ffd.
Specific ever-decreasing insertion patterns could cause successive
unbalanced nbtree page splits. Problem cases involve a large group of
duplicates to the left, and ever-decreasing insertions to the right.
To fix, detect the situation by considering the newitem offset before
performing a split using nbtsplitloc.c's "many duplicates" strategy. If
the new item was inserted just to the right of our provisional "many
duplicates" split point, infer ever-decreasing insertions and fall back
on a 50:50 (space delta optimal) split. This seems to barely affect
cases that already had acceptable space utilization.
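The check itself is small. A sketch (variable names are illustrative,
not the exact nbtsplitloc.c code):

    /*
     * If newitem would be the first item to the right of the provisional
     * "many duplicates" split point, infer ever-decreasing insertions and
     * use the default, space-delta-optimal strategy instead.
     */
    if (strategy == SPLIT_MANY_DUPLICATES &&
        newitemoff == manydupes_firstrightoff)
        strategy = SPLIT_DEFAULT;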
An alternative fix also seems possible. Instead of changing
nbtsplitloc.c split choice logic, we could instead teach _bt_truncate()
to generate a new value for new high keys by interpolating from the
lastleft and firstright key values. That would certainly be a more
elegant fix, but it isn't suitable for backpatching.
Discussion: https://postgr.es/m/CAH2-WznCNvhZpxa__GqAa1fgQ9uYdVc=_apArkW2nc-K3O7_NA@mail.gmail.com
Backpatch: 12-, where the nbtree page split enhancements were introduced.
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
It's not safe for nbtree VACUUM to attempt to delete a target page whose
right sibling is already half-dead, since that would fail the
cross-check when VACUUM attempts to re-find a downlink to the right
sibling in the parent page. Logic to prevent this from happening was
added by commit 8da3183780, which addressed a bug in the overhaul of
page deletion that went into PostgreSQL 9.4 (commit efada2b8e9).
VACUUM was made to check the right sibling page, and back off when it
happened to be half-dead already.
However, it is only truly necessary to do the right sibling check on the
leaf level, since that transitively determines if the deletion target's
parent's right sibling page is itself undergoing deletion. Remove the
internal page level check, and add a comment explaining why the leaf
level check alone suffices.
The extra check is also unnecessary due to the fact that internal pages
that are marked half-dead are generally considered corrupt. Commit
efada2b8e9 established the principle that there should never be
half-dead internal pages (internal pages pending deletion are possible,
but that status is never directly represented in the internal page).
VACUUM will complain about corruption when it encounters half-dead
internal pages, so VACUUM is bound to raise an error one way or another
when an nbtree index has a half-dead internal page (contrib/amcheck will
also report that the page is corrupt).
It's possible that a pg_upgrade'd 9.3 database will still have half-dead
internal pages, so it may seem like there is an argument for leaving the
check in place to reliably get a cleaner error message that advises the
user to REINDEX. However, leaf pages are also deleted in the first
phase of deletion prior to PostgreSQL 9.4, so I believe we won't even
attempt to re-find the parent page anyway (we won't have the fully
deleted leaf page as the right sibling of our target page, so we won't
even try to find a downlink for it).
Discussion: https://postgr.es/m/CAH2-Wzm_ntmqJjWLRyKzimFmFvk+BnVAvUpaA4s1h9Ja58woaQ@mail.gmail.com
Remove a Berkeley-era comment above _bt_insertonpg() that admonishes the
reader to grok Lehman and Yao's paper before making any changes. This
made a certain amount of sense back when _bt_insertonpg() was
responsible for most of the things that are now spread across
_bt_insertonpg(), _bt_findinsertloc(), _bt_insert_parent(), and
_bt_split(), but it doesn't work like that anymore.
I believe that this comment alludes to the need to "couple" or "crab"
buffer locks as we ascend the tree as page splits cascade upwards. The
nbtree README already explains this in detail, which seems sufficient.
Besides, the changes to page splits made by commit 40dae7ec53 altered
the exact details of how buffer locks are retained during splits; Lehman
and Yao's original algorithm seems to release the lock on the left child
page/buffer slightly earlier than _bt_insertonpg()/_bt_insert_parent()
can.
Commit fab25024, which taught nbtree to choose candidate split points
more carefully, had _bt_findsplitloc() record all possible split points
in an initial pass over a page that is about to be split. The order
that candidate split points were processed and stored in was assumed to
match the offset number order of split points on an imaginary version of
the page that contains the same items as the original, but also fits
newitem (the item that provoked the split precisely because it didn't
fit).
However, the order of split points in the final array was not quite what
was expected: the split point that makes newitem the firstright item
came after the split point that makes newitem the lastleft item -- not
before. As a result, _bt_findsplitloc() could get confused about the
leftmost and rightmost tuples among all possible split points recorded
for the page. This seems to have no appreciable impact on the quality
of the final split point chosen by _bt_findsplitloc(), but it's still
wrong.
To fix, switch the order in which the newitem candidate split points are
recorded. This also makes it possible to describe candidate split points in
terms of which pair of adjoining tuples enclose the split point within
_bt_findsplitloc(), making it clearer why it's generally safe for
_bt_split() to expect lastleft and firstright tuples.
The term "item pointer" should not be used to refer to ItemIdData
variables, since that is needlessly ambiguous. Only
ItemPointerData/ItemPointer variables should be called item pointers.
To fix, establish the convention that ItemIdData variables should always
be referred to either as "item identifiers" or "line pointers". The
term "item identifier" already predominates in docs and translatable
messages, and so should be the preferred alternative there.
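For reference, the two structs that tend to get conflated (paraphrasing
their definitions in storage/itemid.h and storage/itemptr.h):

    /* line pointer / item identifier: a slot in a page's line pointer array */
    typedef struct ItemIdData
    {
        unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                    lp_flags:2,     /* state of line pointer */
                    lp_len:15;      /* byte length of tuple */
    } ItemIdData;

    /* item pointer (TID): block number plus offset number of a tuple */
    typedef struct ItemPointerData
    {
        BlockIdData     ip_blkid;
        OffsetNumber    ip_posid;
    } ItemPointerData;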
Discussion: https://postgr.es/m/CAH2-Wz=c=MZQjUzde3o9+2PLAPuHTpVZPPdYxN=E4ndQ2--8ew@mail.gmail.com
Commit 8fa30f906b reduced the elevel of a number of "can't happen"
_bt_split() errors from PANIC to ERROR. At the same time, the new right
page buffer for the split could continue to be acquired well before the
critical section. This was possible because it was relatively
straightforward to make sure that _bt_split() could not throw an error,
with a few specific exceptions. The exceptional cases were safe because
they involved specific, well understood errors, making it possible to
consistently zero the right page before actually raising an error using
elog(). There was no danger of leaving around a junk page, provided
_bt_split() stuck to this coding rule.
Commit 8224de4f, which introduced INCLUDE indexes, added code to make
_bt_split() truncate away non-key attributes. This happened at a point
that broke the rule around zeroing the right page in _bt_split(). If
truncation failed (perhaps due to palloc() failure), that would result
in an errant right page buffer with junk contents. This could confuse
VACUUM when it attempted to delete the page, and should be avoided on
general principle.
To fix, reorganize _bt_split() so that truncation occurs before the new
right page buffer is even acquired. A junk page/buffer will not be left
behind if _bt_nonkey_truncate()/_bt_truncate() raise an error.
Discussion: https://postgr.es/m/CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Backpatch: 11-, where INCLUDE indexes were introduced.
Commit dd299df818, which added suffix truncation to nbtree, simplified
the WAL record format used by page splits. It became necessary to
explicitly WAL-log the new high key for the left half of a split in all
cases, which relieved the REDO routine from having to reconstruct a new
high key for the left page by copying the first item from the right
page. Remove a comment that referred to the previous practice.
It is no longer possible under any circumstances for nbtree code to
reconstruct a strict lower bound key (parent page's pivot tuple key) for
a right sibling page by retrieving the first item in the right sibling
page.
Commit 3f342839 corrected obsolete comments about buffer locks at the
main _bt_insert_parent() call site, but missed similar obsolete comments
above _bt_insert_parent() itself. Both sets of comments were rendered
obsolete by commit 40dae7ec53, which made the nbtree page split
algorithm more robust. Fix the comments that were missed the first time
around now.
In passing, refine a related _bt_insert_parent() comment about
re-finding the parent page to insert a new downlink.
Commit dd299df818, which made heap TID a tiebreaker nbtree index
column, introduced new rules on page space management to make suffix
truncation safe. In general, suffix truncation needs to have a small
amount of extra space available on the new left page when splitting a
leaf page. This is needed in case it turns out that truncation cannot
even "truncate away the heap TID column", resulting in a
larger-than-firstright leaf high key with an explicit heap TID
representation.
Despite all this, CREATE INDEX/nbtsort.c did not account for the
possible need for extra heap TID space on leaf pages when deciding
whether or not a new item could fit on the current page. This could lead to
"failed to add item to the index page" errors when CREATE
INDEX/nbtsort.c tried to finish off a leaf page that lacked space for a
larger-than-firstright leaf high key (it only had space for the
firstright tuple, which was just short of what was needed following
"truncation").
Several conditions needed to be met all at once for CREATE INDEX to
fail. The problem was in the hard limit on what will fit on a page,
which tends to be masked by the soft fillfactor-wise limit. The easiest
way to recreate the problem seems to be a CREATE INDEX on a low
cardinality text column, with tuples that are of non-uniform width,
using a fillfactor of 100.
To fix, bring nbtsort.c in line with nbtsplitloc.c, which already
pessimistically assumes that all leaf page splits will have high keys
that have a heap TID appended.
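In other words, when deciding whether an incoming item still fits on the
current leaf page, reserve room for a high key that retains an explicit
heap TID. A sketch of the idea (not the exact nbtsort.c code; isleaf and
pgspc are assumed to be the obvious local variables):

    Size        itupsz = MAXALIGN(IndexTupleSize(itup));

    /* pessimistically assume the eventual high key keeps its heap TID */
    if (isleaf)
        itupsz += MAXALIGN(sizeof(ItemPointerData));

    if (pgspc < itupsz)
    {
        /* finish off the current page and start a new one */
    }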
Reported-By: Andreas Joseph Krogh
Discussion: https://postgr.es/m/VisenaEmail.c5.3ee7fe277d514162.16a6d785bea@tc7-visena
Remove a comment that refers to a coding practice that was fully removed
by commit a8b8f4db, which introduced MarkBufferDirty(). It looks like
the comment was even obsolete before then, since it concerns
write-ordering dependencies with synchronous buffer writes.
Commit dd299df8 made nbtree treat heap TID as a tiebreaker column,
establishing the principle that there is only one correct location (page
and page offset number) for every index tuple, no matter what.
Insertions of tuples into non-unique indexes proceed as if heap TID
(scan key's scantid) is just another user-attribute value, but
insertions into unique indexes are more delicate. The TID value in
scantid must initially be omitted to ensure that the unique index
insertion visits every leaf page that duplicates could be on. The
scantid is set once again after unique checking finishes successfully,
which can force _bt_findinsertloc() to step right one or more times, to
locate the leaf page that the new tuple must be inserted on.
Stepping right within _bt_findinsertloc() was assumed to occur no more
frequently than stepping right within _bt_check_unique(), but there was
one important case where that assumption was incorrect: inserting a
"duplicate" with NULL values. Since _bt_check_unique() didn't do any
real work in this case, it wasn't appropriate for _bt_findinsertloc() to
behave as if it was finishing off a conventional unique insertion, where
any existing physical duplicate must be dead or recently dead.
_bt_findinsertloc() might have to grovel through a substantial portion
of all of the leaf pages in the index to insert a single tuple, even
when there were no dead tuples.
To fix, treat insertions of tuples with NULLs into a unique index as if
they were insertions into a non-unique index: never unset scantid before
calling _bt_search() to descend the tree, and bypass _bt_check_unique()
entirely. _bt_check_unique() is no longer responsible for incoming
tuples with NULL values.
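A sketch of the decision in _bt_doinsert() (simplified; field and
variable names follow the BTScanInsert layout, but the exact code
differs):

    /* only perform unique checking when no key attribute is NULL */
    checkingunique = (checkUnique != UNIQUE_CHECK_NO &&
                      !itup_key->anynullkeys);

    if (checkingunique)
        itup_key->scantid = NULL;   /* restored after _bt_check_unique() */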
Discussion: https://postgr.es/m/CAH2-Wzm08nr+JPx4jMOa9CGqxWYDQ-_D4wtPBiKghXAUiUy-nQ@mail.gmail.com
_bt_check_unique() failed to invalidate binary search bounds in the
event of a live conflict following commit e5adcb78. This resulted in
problems after waiting for the conflicting xact to commit or abort. The
subsequent call to _bt_check_unique() would restore the initial binary
search bounds, rather than starting a new search. Fix by explicitly
invalidating bounds when it becomes clear that there is a live conflict
that insertion will have to wait to resolve.
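The fix amounts to clearing the cached bounds at the point where a live
conflict is detected, so that the retry after the conflicting
transaction finishes performs a fresh binary search. A sketch (assuming
the v12-style insertion state struct, which caches the bounds along with
a bounds_valid flag):

    /* live conflict: caller must wait for xwait, then search again */
    insertstate->bounds_valid = false;
    return xwait;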
Ashutosh Sharma, with a few additional tweaks by me.
Author: Ashutosh Sharma
Reported-By: Ashutosh Sharma
Diagnosed-By: Ashutosh Sharma
Discussion: https://postgr.es/m/CAE9k0PnQp-qr-UYKMSCzdC2FBzdE4wKP41hZrZvvP26dKLonLg@mail.gmail.com
This uses the progress reporting infrastructure added by c16dc1aca5,
adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.
There are two pieces to this: one is index-AM-agnostic, and the other is
AM-specific. The latter is fairly elaborate for btrees, including
reportage for parallel index builds and the separate phases that btree
index creation uses; other index AMs, which are much simpler in their
building procedures, have simplistic reporting only, but that seems
sufficient, at least for non-concurrent builds.
The index-AM-agnostic part is fairly complete, providing insight into
the CONCURRENTLY wait phases as well as block-based progress during the
index validation table scan. (The index validation index scan requires
patching each AM, which has not been included here.)
Reviewers: Rahila Syed, Pavan Deolasee, Tatsuro Yamada
Discussion: https://postgr.es/m/20181220220022.mg63bhk26zdpvmcj@alvherre.pgsql
Commit 29b64d1d mishandled skipping over truncated high key attributes
during row comparisons. The row comparison key matching loop would loop
forever when a truncated attribute was encountered for a row compare
subkey. Fix by following the example of other code in the loop: advance
the current subkey, or break out of the loop when the last subkey is
reached.
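A sketch of the corrected handling inside _bt_check_rowcompare()'s
subkey loop (simplified; the real loop also deals with NULLs and scan
direction):

    if (subkey->sk_attno > tupnatts)
    {
        /*
         * Attribute is truncated away in this pivot tuple.  Advance to
         * the next subkey, or stop at the last one, rather than spinning
         * on the same subkey forever.
         */
        if (subkey->sk_flags & SK_ROW_END)
            break;
        subkey++;
        continue;
    }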
Add test coverage for the relevant _bt_check_rowcompare() code path.
The new test case is somewhat tied to nbtree implementation details,
which isn't ideal, but seems unavoidable.
To support building indexes over tables of different AMs, the scans to
do so need to be routed through the table AM. While moving a fair
amount of code, nearly all the changes are just moving code to below a
callback.
Currently the range-based interface wouldn't make much sense for
non-block-based table AMs. But that seems acceptable for now.
Author: Andres Freund
Discussion: https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de
Previously the xid horizon was only computed during WAL replay. That
had two major problems:
1) It relied on knowing what the table pointed to by the index looks
   like. That was easy enough before the introduction of tableam (we
   knew it had to be heap, although some trickery around logging the
   heap relfilenodes was required). But to properly handle table AMs we
   need per-database catalog access to look up the AM handler, which
   recovery doesn't allow.
2) Not knowing the xid horizon also makes it hard to support logical
decoding on standbys. When on a catalog table, we need to be able
to conflict with slots that have an xid horizon that's too old. But
computing the horizon by visiting the heap only works once
consistency is reached, whereas we always need to be able to detect
conflicts.
There's also a secondary problem, in that the current method performs
redundant work on every standby. But that's counterbalanced by
potentially computing the value when not necessary (either because
there's no standby, or because there are no connected backends).
Solve 1) and 2) by moving computation of the xid horizon to the
primary and by involving tableam in the computation of the horizon.
To address the potentially increased overhead, increase the efficiency
of the xid horizon computation for heap by sorting the tids, and
eliminating redundant buffer accesses. When prefetching is available,
additionally perform prefetching of buffers. As this is more of a
maintenance task, rather than something routinely done in every read
only query, we add an arbitrary 10 to the effective concurrency,
thereby using IO concurrency even when it is not globally enabled.
That's possibly not the perfect formula, but it seems good enough for now.
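A sketch of the heap-side idea (simplified; the tableam callback and
prefetching are not shown, and itemptr_cmp is a hypothetical comparator
that orders TIDs by block number):

    qsort(tids, ntids, sizeof(ItemPointerData), itemptr_cmp);

    for (int i = 0; i < ntids; i++)
    {
        BlockNumber blkno = ItemPointerGetBlockNumber(&tids[i]);

        if (blkno != lastblkno)
        {
            /* each heap block is read at most once */
            if (BufferIsValid(buf))
                ReleaseBuffer(buf);
            buf = ReadBuffer(rel, blkno);
            lastblkno = blkno;
        }
        /* examine the tuple at tids[i] to maintain latestRemovedXid */
    }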
Bumps WAL format, as latestRemovedXid is now part of the records, and
the heap's relfilenode isn't anymore.
Author: Andres Freund, Amit Khandekar, Robert Haas
Reviewed-By: Robert Haas
Discussion:
https://postgr.es/m/20181212204154.nsxf3gzqv3gesl32@alap3.anarazel.de
https://postgr.es/m/20181214014235.dal5ogljs3bmlq44@alap3.anarazel.de
https://postgr.es/m/20180703070645.wchpu5muyto5n647@alap3.anarazel.de