This is required before the creation of a new branch. pgindent is
clean, as well as is reformat-dat-files.
perltidy version is v20230309, as documented in pgindent's README.
Fix two issues in parent_key validation in posting trees:
* It's not enough to check stack->parentblk is valid to determine if the
parentkey is valid. It's possible parentblk is set to a valid block
number, but parentkey is invalid. So check parentkey directly.
* We don't need to invalidate parentkey for all child pages of the
rightmost page. It's enough to invalidate it for the rightmost child
only, which means we can check more cases (less false negatives).
Issues reported by Arseniy Mukhin, along with a proposed patch. Review
by Andrey M. Borodin, cleanup and improvements by me.
Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAE7r3MJ611B9TE=YqBBncewp7-k64VWs+sjk7XF6fJUX77uFBA@mail.gmail.com
The checks introduced by commit 14ffaece0f did not get the parent key
checks quite right, missing some data corruption cases. In particular:
* The "rightlink" check was not working as intended, because rightlink
is a BlockNumber, and InvalidBlockNumber is 0xFFFFFFFF, so
!GinPageGetOpaque(page)->rightlink
almost always evaluates to false (except for rightlink=0). So in most
cases parenttup was left NULL, preventing any checks against parent.
* Use GinGetDownlink() to retrieve child blkno to avoid triggering
Assert, same as the core GIN code.
Issues reported by Arseniy Mukhin, along with a proposed patch. Review
by Andrey M. Borodin, cleanup and improvements by me.
Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAE7r3MJ611B9TE=YqBBncewp7-k64VWs+sjk7XF6fJUX77uFBA@mail.gmail.com
This tightens a couple checks in checking GIN indexes, which might have
resulted in incorrect results (false positives/negatives).
* The code skipped ordering checks if the entries were for different
attributes (for multi-column GIN indexes), possibly missing some cases
of data corruption. But the attribute number is part of the ordering,
so we can check that.
* The root page was skipped when checking entry order, but that is
unnecessary. The root page is subject to the same ordering rules, we
can process it just like any other page.
* The high key on the right-most page was not checked, but that is
needed only for inner pages (we don't store the high key for those).
For leaf pages we can check the high key just fine.
* Correct the detection of split pages. If the page gets split, the
cached parent key is greater than the current child key (not less, as
the code incorrectly expected).
Issues reported by Arseniy Mukhin, along with a proposed patch. Review
by Andrey M. Borodin, cleanup and improvements by me.
Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CAE7r3MJ611B9TE=YqBBncewp7-k64VWs+sjk7XF6fJUX77uFBA@mail.gmail.com
Several read stream users asserted that the read stream was exhausted
after looping on that very condition. It was pointed out in an a
review of an as-of-yet uncommitted read stream user [1] that this was
confusing and could lead the reader to think there was a possibility of
some kind of race condition. Remove these asserts.
[1] https://postgr.es/m/F9ACE8D0-B807-4A17-B6BD-87EF0717983D%40yesql.se
Submitting IO in larger batches can be more efficient than doing so
one-by-one, particularly for many small reads. It does, however, require
the ReadStreamBlockNumberCB callback to abide by the restrictions of AIO
batching (c.f. pgaio_enter_batchmode()). Basically, the callback may not:
a) block without first calling pgaio_submit_staged(), unless a
to-be-waited-on lock cannot be part of a deadlock, e.g. because it is
never held while waiting for IO.
b) directly or indirectly start another batch pgaio_enter_batchmode()
As this requires care and is nontrivial in some cases, batching is only
used with explicit opt-in.
This patch adds an explicit flag (READ_STREAM_USE_BATCHING) to read_stream and
uses it where appropriate.
There are two cases where batching would likely be beneficial, but where we
aren't using it yet:
1) bitmap heap scans, because the callback reads the VM
This should soon be solved, because we are planning to remove the use of
the VM, due to that not being sound.
2) The first phase of heap vacuum
This could be made to support batchmode, but would require some care.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Adds a new function, validating two kinds of invariants on a GIN index:
- parent-child consistency: Paths in a GIN graph have to contain
consistent keys. Tuples on parent pages consistently include tuples
from child pages; parent tuples do not require any adjustments.
- balanced-tree / graph: Each internal page has at least one downlink,
and can reference either only leaf pages or only internal pages.
The GIN verification is based on work by Grigory Kryachko, reworked by
Heikki Linnakangas and with various improvements by Andrey Borodin.
Investigation and fixes for multiple bugs by Kirill Reshke.
Author: Grigory Kryachko <GSKryachko@gmail.com>
Author: Heikki Linnakangas <hlinnaka@iki.fi>
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-By: José Villanova <jose.arthur@gmail.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Nikolay Samokhvalov <samokhvalov@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-By: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/45AC9B0A-2B45-40EE-B08F-BDCF5739D1E1%40yandex-team.ru
Before performing checks on an index, we need to take some safety
measures that apply to all index AMs. This includes:
* verifying that the index can be checked - Only selected AMs are
supported by amcheck (right now only B-Tree). The index has to be
valid and not a temporary index from another session.
* changing (and then restoring) user's security context
* obtaining proper locks on the index (and table, if needed)
* discarding GUC changes from the index functions
Until now this was implemented in the B-Tree amcheck module, but it's
something every AM will have to do. So relocate the code into a new
module verify_common for reuse.
The shared steps are implemented by amcheck_lock_relation_and_check(),
receiving the AM-specific verification as a callback. Custom parameters
may be supplied using a pointer.
Author: Andrey Borodin <amborodin@acm.org>
Reviewed-By: José Villanova <jose.arthur@gmail.com>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Nikolay Samokhvalov <samokhvalov@gmail.com>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/45AC9B0A-2B45-40EE-B08F-BDCF5739D1E1%40yandex-team.ru
It seems potentially useful to label our shared libraries with version
information, now that a facility exists for retrieving that. This
patch labels them with the PG_VERSION string. There was some
discussion about using semantic versioning conventions, but that
doesn't seem terribly helpful for modules with no SQL-level presence;
and for those that do have SQL objects, we typically expect them
to support multiple revisions of the SQL definitions, so it'd still
not be very helpful.
I did not label any of src/test/modules/. It seems unnecessary since
we don't install those, and besides there ought to be someplace that
still provides test coverage for the original PG_MODULE_MAGIC macro.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/dd4d1b59-d0fe-49d5-b28f-1e463b68fa32@gmail.com
Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro
(which accepts a "page" argument) into an object-like macro that can be
used from code that doesn't have convenient access to an nbtree page.
Preparation for an upcoming patch that adds skip scan to nbtree.
Parallel index scans that use skip scan will serialize datums (not just
SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize
will be used by btestimateparallelscan to determine how much DSM to
request.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com
Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign into a uint8, which stores the number
of bytes to align the column's value by in the tuple.
This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type. Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.
The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
Remove the 'whenTaken' and 'lsn' fields from SnapshotData. After the
removal of the "snapshot too old" feature, they were never set to a
non-zero value.
This largely reverts commit 3e2f3c2e42, which added the
OldestActiveSnapshot tracking, and the init_toast_snapshot()
function. That was only required for setting the 'whenTaken' and 'lsn'
fields. SnapshotToast is now a constant again, like SnapshotSelf and
SnapshotAny. I kept a thin get_toast_snapshot() wrapper around
SnapshotToast though, to check that you have a registered or active
snapshot. That's still a useful sanity check.
Reviewed-by: Nathan Bossart, Andres Freund, Tom Lane
Discussion: https://www.postgresql.org/message-id/cd4b4f8c-e63a-41c0-95f6-6e6cd9b83f6d@iki.fi
TAP tests can write
$node->init(no_data_checksums => 1);
to initialize a cluster explicitly without checksums. Currently, this
is the default, but this change allows running all tests with
checksums enabled, like
PG_TEST_INITDB_EXTRA_OPTS=--data-checksums meson test ...
And this also prepares the tests for when we switch the default to
checksums enabled.
The pg_checksums tests need to disable checksums so it can test its
own functionality of enabling checksums. The amcheck/pg_amcheck tests
need to disable checksums because they manually introduce corruption
that they want to detect, but with checksums enabled, the checksum
verification will fail before they even get to their work.
Author: Greg Sabino Mullane <greg@turnstep.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/CAKAnmmKwiMHik5AHmBEdf5vqzbOBbcwEPHo4-PioWeAbzwcTOQ@mail.gmail.com
Currently, when amcheck validates a unique constraint, it visits the heap for
each index tuple. This commit implements skipping keys, which have only one
non-dedeuplicated index tuple (quite common case for unique indexes). That
gives substantial economy on index checking time.
Reported-by: Noah Misch
Discussion: https://postgr.es/m/20240325020323.fd.nmisch%40google.com
Author: Alexander Korotkov, Pavel Borisov
This commit removes unused variables and routines from some perl code
that have accumulated across the years. This touches the following
areas:
- Wait event generation script.
- AdjustUpgrade.pm.
- TAP perl code
Author: Alexander Lakhin
Reviewed-by: Dagfinn Ilmari Mannsåker
Discussion: https://postgr.es/m/70b340bc-244a-589d-ef8b-d8aebb707a84@gmail.com
* Don't forget to pfree() the right page when it's to be ignored.
* Report error on unexpected non-leaf right page even if this page is not
to be ignored. This restores the logic which was unintendedly changed
in 97e5b0026f.
Reported-by: Pavel Borisov
This is a very unlikely condition during checking a B-tree unique constraint,
meaning that the index structure is violated badly, and we shouldn't continue
checking to avoid endless loops, etc. So it's worth immediately throwing an
error.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzk%2B2116uOXdOViA27SHcr31WKPgmjsxXLBs_aTxAeThzg%40mail.gmail.com
Author: Pavel Borisov
5ae2087202 implemented a cross-page unique constraint check by loading
the right sibling to the BtreeCheckState.target variable. This is wrong,
because bt_target_page_check() shouldn't change the target page. Also,
BtreeCheckState.target shouldn't be changed alone without
BtreeCheckState.targetblock.
However, the above didn't cause any visible bugs for the following reasons.
1. We do a cross-page unique constraint check only for leaf index pages.
2. The only way target page get accessed after a cross-page unique constraint
check is loading of the lowkey.
3. The only place lowkey is used is bt_child_highkey_check(), and that applies
only to non-leafs.
The reasons above don't diminish the fact that changing BtreeCheckState.target
for a cross-page unique constraint check is wrong. This commit changes this
check to temporarily store the right sibling to the local variable.
Reported-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAH2-Wzk%2B2116uOXdOViA27SHcr31WKPgmjsxXLBs_aTxAeThzg%40mail.gmail.com
Author: Pavel Borisov
This commit introduces a new data structure BtreeLastVisibleEntry comprising
information about the last visible heap entry with the current value of key.
Usage of this data structure allows us to avoid passing all this information
as individual function arguments.
Reported-by: Alexander Korotkov
Discussion: https://www.postgresql.org/message-id/CAPpHfdsVbB9ToriaB1UHuOKwjKxiZmTFQcEF%3DjuzzC_nby31uA%40mail.gmail.com
Author: Pavel Borisov, Alexander Korotkov
It might happen that the varlena value wasn't compressed by index_form_tuple()
due to current storage parameters. If compression is currently enabled, we
need to compress such values to match index tuple coming from the heap.
Backpatch to all supported versions.
Discussion: https://postgr.es/m/flat/7bdbe559-d61a-4ae4-a6e1-48abdf3024cc%40postgrespro.ru
Author: Andrey Borodin
Reviewed-by: Alexander Lakhin, Michael Zhilin, Jian He, Alexander Korotkov
Backpatch-through: 12
In the heap, tuples may contain short varlena datum with both 1B header and 4B
headers. But the corresponding index tuple should always have such varlena's
with 1B headers. So, for fingerprinting, we need to convert.
Backpatch to all supported versions.
Discussion: https://postgr.es/m/flat/7bdbe559-d61a-4ae4-a6e1-48abdf3024cc%40postgrespro.ru
Author: Michael Zhilin
Reviewed-by: Alexander Lakhin, Andrey Borodin, Jian He, Alexander Korotkov
Backpatch-through: 12
Use GUC_ACTION_SAVE rather than GUC_ACTION_SET, necessary for working
with parallel query.
Now that the call requires more arguments, wrap the call in a new
function to avoid code duplication and offer a place for a comment.
Discussion: https://postgr.es/m/E1rhJpO-0027Wf-9L@gemulon.postgresql.org
There are a lot of Perl scripts in the tree, mostly code generation
and TAP tests. Occasionally, these scripts produce warnings. These
are probably always mistakes on the developer side (true positives).
Typical examples are warnings from genbki.pl or related when you make
a mess in the catalog files during development, or warnings from tests
when they massage a config file that looks different on different
hosts, or mistakes during merges (e.g., duplicate subroutine
definitions), or just mistakes that weren't noticed because there is a
lot of output in a verbose build.
This changes all warnings into fatal errors, by replacing
use warnings;
by
use warnings FATAL => 'all';
in all Perl files.
Discussion: https://www.postgresql.org/message-id/flat/06f899fd-1826-05ab-42d6-adeb1fd5e200%40eisentraut.org
Teach _bt_binsrch (and related helper routines like _bt_search and
_bt_compare) about the initial positioning requirements of backward
scans. Routines like _bt_binsrch already know all about "nextkey"
searches, so it seems natural to teach them about "goback"/backward
searches, too. These concepts are closely related, and are much easier
to understand when discussed together.
Now that certain implementation details are hidden from _bt_first, it's
straightforward to add a new optimization: backward scans using the <
strategy now avoid extra leaf page accesses in certain "boundary cases".
Consider the following example, which uses the tenk1 table (and its
tenk1_hundred index) from the standard regression tests:
SELECT * FROM tenk1 WHERE hundred < 12 ORDER BY hundred DESC LIMIT 1;
Before this commit, nbtree would scan two leaf pages, even though it was
only really necessary to scan one leaf page. We'll now descend straight
to the leaf page containing a (12, -inf) high key instead. The scan
will locate matching non-pivot tuples with "hundred" values starting
from the value 11. The scan won't waste a page access on the right
sibling leaf page, which cannot possibly contain any matching tuples.
You can think of the optimization added by this commit as disabling an
optimization (the _bt_compare "!pivotsearch" behavior that was added to
Postgres 12 in commit dd299df8) for a small subset of cases where it was
always counterproductive.
Equivalently, you can think of the new optimization as extending the
"pivotsearch" behavior that page deletion by VACUUM has long required
(since the aforementioned Postgres 12 commit went in) to other, similar
cases. Obviously, this isn't strictly necessary for these new cases
(unlike VACUUM, _bt_first is prepared to move the scan to the left once
on the leaf level), but the underlying principle is the same.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=XPzM8HzaLPq278Vms420mVSHfgs9wi5tjFKHcapZCEw@mail.gmail.com
This prevents false-positive reports about "the first child of leftmost
target page is not leftmost of its level", "block %u is not leftmost"
and "left link/right link pair". They appeared if amcheck ran before
VACUUM cleaned things, after a cluster exited recovery between the
first-stage and second-stage WAL records of a deletion. Back-patch to
v11 (all supported versions).
Reviewed by Peter Geoghegan.
Discussion: https://postgr.es/m/20231005025232.c7.nmisch@google.com
Since C99, there can be a trailing comma after the last value in an
enum definition. A lot of new code has been introducing this style on
the fly. Some new patches are now taking an inconsistent approach to
this. Some add the last comma on the fly if they add a new last
value, some are trying to preserve the existing style in each place,
some are even dropping the last comma if there was one. We could
nudge this all in a consistent direction if we just add the trailing
commas everywhere once.
I omitted a few places where there was a fixed "last" value that will
always stay last. I also skipped the header files of libpq and ecpg,
in case people want to use those with older compilers. There were
also a small number of cases where the enum type wasn't used anywhere
(but the enum values were), which ended up confusing pgindent a bit,
so I left those alone.
Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org