Teach nbtree and heapam to cooperate in order to eagerly remove
duplicate tuples representing dead MVCC versions. This is "bottom-up
deletion". Each bottom-up deletion pass is triggered lazily in response
to a flood of versions on an nbtree leaf page. This usually involves a
"logically unchanged index" hint (these are produced by the executor
mechanism added by commit 9dc718bd).
The immediate goal of bottom-up index deletion is to avoid "unnecessary"
page splits caused entirely by version duplicates. It naturally has an
even more useful effect, though: it acts as a backstop against
accumulating an excessive number of index tuple versions for any given
_logical row_. Bottom-up index deletion complements what we might now
call "top-down index deletion": index vacuuming performed by VACUUM.
Bottom-up index deletion responds to the immediate local needs of
queries, while leaving it up to autovacuum to perform infrequent clean
sweeps of the index. The overall effect is to avoid certain
pathological performance issues related to "version churn" from UPDATEs.
The previous tableam interface used by index AMs to perform tuple
deletion (the table_compute_xid_horizon_for_tuples() function) has been
replaced with a new interface that supports certain new requirements.
Many (perhaps all) of the capabilities added to nbtree by this commit
could also be extended to other index AMs. That is left as work for a
later commit.
Extend deletion of LP_DEAD-marked index tuples in nbtree by adding logic
to consider extra index tuples (that are not LP_DEAD-marked) for
deletion in passing. This increases the number of index tuples deleted
significantly in many cases. The LP_DEAD deletion process (which is now
called "simple deletion" to clearly distinguish it from bottom-up
deletion) won't usually need to visit any extra table blocks to check
these extra tuples. We have to visit the same table blocks anyway to
generate a latestRemovedXid value (at least in the common case where the
index deletion operation's WAL record needs such a value).
Testing has shown that the "extra tuples" simple deletion enhancement
increases the number of index tuples deleted with almost any workload
that has LP_DEAD bits set in leaf pages. That is, it almost never fails
to delete at least a few extra index tuples. It helps most of all in
cases that happen to naturally have a lot of delete-safe tuples. It's
not uncommon for an individual deletion operation to end up deleting an
order of magnitude more index tuples compared to the old naive approach
(e.g., custom instrumentation of the patch shows that this happens
fairly often when the regression tests are run).
Add a further enhancement that augments simple deletion and bottom-up
deletion in indexes that make use of deduplication: Teach nbtree's
_bt_delitems_delete() function to support granular TID deletion in
posting list tuples. It is now possible to delete individual TIDs from
posting list tuples provided the TIDs have a tableam block number of a
table block that gets visited as part of the deletion process (visiting
the table block can be triggered directly or indirectly). Setting the
LP_DEAD bit of a posting list tuple is still an all-or-nothing thing,
but that matters much less now that deletion only needs to start out
with the right _general_ idea about which index tuples are deletable.
Bump XLOG_PAGE_MAGIC because xl_btree_delete changed.
No bump in BTREE_VERSION, since there are no changes to the on-disk
representation of nbtree indexes. Indexes built on PostgreSQL 12 or
PostgreSQL 13 will automatically benefit from bottom-up index deletion
(i.e. no reindexing required) following a pg_upgrade. The enhancement
to simple deletion is available with all B-Tree indexes following a
pg_upgrade, no matter what PostgreSQL version the user upgrades from.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzm+maE3apHB8NOtmM=p-DO65j2V5GzAWCOEEuy3JZgb2g@mail.gmail.com
Add an executor aminsert() hint mechanism that informs index AMs that
the incoming index tuple (the tuple that accompanies the hint) is not
being inserted by execution of an SQL statement that logically modifies
any of the index's key columns.
The hint is received by indexes when an UPDATE takes place that does not
apply an optimization like heapam's HOT (though only for indexes where
all key columns are logically unchanged). Any index tuple that receives
the hint on insert is expected to be a duplicate of at least one
existing older version that is needed for the same logical row. Related
versions will typically be stored on the same index page, at least
within index AMs that apply the hint.
Recognizing the difference between MVCC version churn duplicates and
true logical row duplicates at the index AM level can help with cleanup
of garbage index tuples. Cleanup can intelligently target tuples that
are likely to be garbage, without wasting too many cycles on less
promising tuples/pages (index pages with little or no version churn).
This is infrastructure for an upcoming commit that will teach nbtree to
perform bottom-up index deletion. No index AM actually applies the hint
just yet.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=CEKFa74EScx_hFVshCOn6AA5T-ajFASTdzipdkLTNQQ@mail.gmail.com
Instead of making many block-sized write() calls to fill a new WAL file
with zeroes, make a smaller number of pwritev() calls (or various
emulations). The actual number depends on the OS's IOV_MAX, which
PG_IOV_MAX currently caps at 32. That means we'll write 256kB per call
on typical systems. We may want to tune the number later with more
experience.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
Bring heap and hash rmgr desc output in line with nbtree and GiST desc
output by using the name latestRemovedXid for all fields that output the
contents of the latestRemovedXid field from the WAL record's C struct
(stop using local variants).
This seems like a clear improvement because latestRemovedXid is a symbol
name that already appears across many different source files, and so is
probably much more recognizable.
Discussion: https://postgr.es/m/CAH2-Wzkt_Rs4VqPSCk87nyjPAAEmWL8STU9zgET_83EF5YfrLw@mail.gmail.com
If the substring start index and length overflow when added together,
substring() misbehaved, either throwing a bogus "negative substring
length" error on a case that should succeed, or failing to complain that
a negative length is negative (and instead returning the whole string,
in most cases). Unsurprisingly, the text, bytea, and bit variants of
the function all had this issue. Rearrange the logic to ensure that
negative lengths are always rejected, and add an overflow check to
handle the other case.
Also install similar guards into detoast_attr_slice() (nee
heap_tuple_untoast_attr_slice()), since it's far from clear that
no other code paths leading to that function could pass it values
that would overflow.
Patch by myself and Pavel Stehule, per bug #16804 from Rafi Shamim.
Back-patch to v11. While these bugs are old, the common/int.h
infrastructure for overflow-detecting arithmetic didn't exist before
commit 4d6ad3125, and it doesn't seem like these misbehaviors are bad
enough to justify developing a standalone fix for the older branches.
Discussion: https://postgr.es/m/16804-f4eeeb6c11ba71d4@postgresql.org
On further reflection it seems better to call PageGetMaxOffsetNumber()
after acquiring a buffer lock on the page. This shouldn't really
matter, but doing it this way is cleaner.
Follow-up to commit 42288174.
Backpatch: 12-, just like commit 42288174
The logic for determining the latest removed XID for the purposes of
generating recovery conflicts in REDO routines was subtly broken. It
failed to follow links from HOT chains, and so failed to consider all
relevant heap tuple headers in some cases.
To fix, expand the loop that deals with LP_REDIRECT line pointers to
also deal with HOT chains. The new version of the loop is loosely based
on a similar loop from heap_prune_chain().
The impact of this bug is probably quite limited, since the horizon code
necessarily deals with heap tuples that are pointed to by LP_DEAD-set
index tuples. The process of setting LP_DEAD index tuples (e.g. within
the kill_prior_tuple mechanism) is highly correlated with opportunistic
pruning of pointed-to heap tuples. Plus the question of generating a
recovery conflict usually comes up some time after index tuple LP_DEAD
bits were initially set, unlike heap pruning, where a latestRemovedXid
is generated at the point of the pruning operation (heap pruning has no
deferred "would-be page split" style processing that produces conflicts
lazily).
Only backpatch to Postgres 12, the first version where this logic runs
during original execution (following commit 558a9165e08). The index
latestRemovedXid mechanism has had the same bug since it first appeared
over 10 years ago (in commit a760893d), but backpatching to all
supported versions now seems like a bad idea on balance. Running the
new improved code during recovery seems risky, especially given the lack
of complaints from the field.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wz=Eib393+HHcERK_9MtgNS7Ew1HY=RDC_g6GL46zM5C6Q@mail.gmail.com
Backpatch: 12-
The patch needs test cases, reorganization, and cfbot testing.
Technically reverts commits 5c31afc49d..e35b2bad1a (exclusive/inclusive)
and 08db7c63f3..ccbe34139b.
Reported-by: Tom Lane, Michael Paquier
Discussion: https://postgr.es/m/E1ktAAG-0002V2-VB@gemulon.postgresql.org
This adds a key management system that stores (currently) two data
encryption keys of length 128, 192, or 256 bits. The data keys are
AES256 encrypted using a key encryption key, and validated via GCM
cipher mode. A command to obtain the key encryption key must be
specified at initdb time, and will be run at every database server
start. New parameters allow a file descriptor open to the terminal to
be passed. pg_upgrade support has also been added.
Discussion: https://postgr.es/m/CA+fd4k7q5o6Nc_AaX6BcYM9yqTbC6_pnH-6nSD=54Zp6NBQTCQ@mail.gmail.com
Discussion: https://postgr.es/m/20201202213814.GG20285@momjian.us
Author: Masahiko Sawada, me, Stephen Frost
Revert ac22929a26, as well as the followup fix 113d3591b8. Because it broke
the assumption that the startup process waiting for the recovery conflict
on buffer pin should be waken up only by buffer unpin or the timeout enabled
in ResolveRecoveryConflictWithBufferPin(). It caused, for example,
SIGHUP signal handler or walreceiver process to wake that startup process
up unnecessarily frequently.
Additionally, add the comments about why that dedicated latch that
the reverted patch tried to get rid of should not be removed.
Thanks to Kyotaro Horiguchi for the discussion.
Author: Fujii Masao
Discussion: https://postgr.es/m/d8c0c608-021b-3c73-fffd-3240829ee986@oss.nttdata.com
"Ordering stuff" refered to a _bt_first() call to _bt_orderkeys().
However, the _bt_orderkeys() function was renamed to
_bt_preprocess_keys() by commit fa5c8a055a0.
_bt_preprocess_keys() is directly referenced just after the removed
comment already, which seems sufficient.
Invent a new flag bit HASH_STRINGS to specify C-string hashing, which
was formerly the default; and add assertions insisting that exactly
one of the bits HASH_STRINGS, HASH_BLOBS, and HASH_FUNCTION be set.
This is in hopes of preventing recurrences of the type of oversight
fixed in commit a1b8aa1e4 (i.e., mistakenly omitting HASH_BLOBS).
Also, when HASH_STRINGS is specified, insist that the keysize be
more than 8 bytes. This is a heuristic, but it should catch
accidental use of HASH_STRINGS for integer or pointer keys.
(Nearly all existing use-cases set the keysize to NAMEDATALEN or
more, so there's little reason to think this restriction should
be problematic.)
Tweak hash_create() to insist that the HASH_ELEM flag be set, and
remove the defaults it had for keysize and entrysize. Since those
defaults were undocumented and basically useless, no callers
omitted HASH_ELEM anyway.
Also, remove memset's zeroing the HASHCTL parameter struct from
those callers that had one. This has never been really necessary,
and while it wasn't a bad coding convention it was confusing that
some callers did it and some did not. We might as well save a few
cycles by standardizing on "not".
Also improve the documentation for hash_create().
In passing, improve reinit.c's usage of a hash table by storing
the key as a binary Oid rather than a string; and, since that's
a temporary hash table, allocate it in CurrentMemoryContext for
neatness.
Discussion: https://postgr.es/m/590625.1607878171@sss.pgh.pa.us
This is done for end-of-recovery and shutdown checkpoints/restartpoints
(end-of-recovery restartpoints don't exist) rather than all types of
checkpoints, in cases where it may not be possible to rely on
pg_stat_activity to get a status from the startup or checkpointer
processes.
For example, at the end of a crash recovery, this is useful to know if a
checkpoint is running in the startup process, while previously the ps
display may only show some information about "recovering" something,
that can be confusing while a checkpoint runs.
Author: Justin Pryzby
Reviewed-by: Nathan Bossart, Kirk Jamison, Fujii Masao, Michael Paquier
Discussion: https://postgr.es/m/20200818225238.GP17022@telsasoft.com
pg_database.datfrozenxid gets updated using an in-place update at the
end of vacuum or autovacuum. Since 96cdeae, as pg_database has a toast
relation, it is possible for a pg_database tuple to have toast values
if there is a large set of ACLs in place. In such a case, the in-place
update would fail because of the flattening of the toast values done for
the catcache entry fetched. Instead of using a copy from the catcache,
this changes the logic to fetch the copy of the tuple by directly
scanning pg_database.
Per the lack of complaints on the matter, no backpatch is done. Note
that before 96cdeae, attempting to insert such a tuple to pg_database
would cause a "row is too big" error, so the end-of-vacuum problem was
not reachable.
Author: Ashwin Agrawal, Junfeng Yang
Discussion: https://postgr.es/m/DM5PR0501MB38800D9E4605BCA72DD35557CCE10@DM5PR0501MB3880.namprd05.prod.outlook.com
User-visible log messages should go through ereport(), so they are
subject to translation. Many remaining elog(LOG) calls are really
debugging calls.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://www.postgresql.org/message-id/flat/92d6f545-5102-65d8-3c87-489f71ea0a37%40enterprisedb.com
Previously pg_stat_progress_cluster view reported the current block
number in heap scan as the number of heap blocks scanned (i.e.,
heap_blks_scanned). This reported number could be incorrect when
synchronize_seqscans is enabled, because it allowed the heap scan to
start at block in middle. This could result in wraparounds in the
heap_blks_scanned column when the heap scan wrapped around.
This commit fixes the bug by calculating the number of blocks from
the block that the heap scan starts at to the current block in scan,
and reporting that number in the heap_blks_scanned column.
Also, in pg_stat_progress_cluster view, previously heap_blks_scanned
could not reach heap_blks_total at the end of heap scan phase
if the last pages scanned were empty. This commit fixes the bug by
manually updating heap_blks_scanned to the same value as
heap_blks_total when the heap scan phase finishes.
Back-patch to v12 where pg_stat_progress_cluster view was introduced.
Reported-by: Matthias van de Meent
Author: Matthias van de Meent
Reviewed-by: Fujii Masao
Discussion: https://postgr.es/m/CAEze2WjCBWSGkVfYag001Rc4+-nNLDpWM7QbyD6yPvuhKs-gYQ@mail.gmail.com
While ereport() and elog() themselves are quite cheap when the
error message level is too low to be printed, some places need to do
substantial work before they can call those macros at all. To allow
optimizing away such setup work when nothing is to be printed, make
elog.c export a new function message_level_is_interesting(elevel)
that reports whether ereport/elog will do anything. Make use of that
in various places that had ad-hoc direct tests of log_min_messages etc.
Also teach ProcSleep to use it to avoid some work. (There may well
be other places that could usefully use this; I didn't search hard.)
Within elog.c, refactor a little bit to avoid having duplicate copies
of the policy-setting logic. When that code was written, we weren't
relying on the availability of inline functions; so it had some
duplications in the name of efficiency, which I got rid of.
Alvaro Herrera and Tom Lane
Discussion: https://postgr.es/m/129515.1606166429@sss.pgh.pa.us
Historically these were called >^ and <^, but that is inconsistent
with the similar box, polygon, and circle operators, which are named
|>> and <<| respectively. Worse, the >^ and <^ names are used for
*not* strict above/below tests for the box type.
Hence, invent new operators following the more common naming. The
old operators remain available for now, and are still accepted by
the relevant index opclasses too. But there's a deprecation notice,
so maybe we can get rid of them someday.
Emre Hasegeli, reviewed by Pavel Borisov
Discussion: https://postgr.es/m/24348.1587444160@sss.pgh.pa.us
Previously, when restore_command claimed to succeed but failed to restore
the file with the right name, for example, due to mis-configuration of
restore_command, no log message was reported. Then the recovery failed
later with an error message not directly related to the issue.
This commit changes the recovery so that a log message is emitted in
this error case. This would enable us to investigate what happened in
this case more easily.
Author: Jeff Janes, Fujii Masao
Reviewed-by: Pavel Borisov, Kyotaro Horiguchi
Discussion: https://postgr.es/m/CAMkU=1xkFs3Omp4JR4wMYWdam_KLuj6LXnTYfU8u3T0h=PLLMQ@mail.gmail.com
Streamline handling of the various strategies that we have to avoid a
page split in nbtinsert.c. When it looks like a leaf page is about to
overflow, we now perform deleting LP_DEAD items and deduplication in one
central place. This greatly simplifies _bt_findinsertloc().
This has an independently useful consequence: nbtree no longer relies on
the BTP_HAS_GARBAGE page level flag/hint for anything important. We
still set and unset the flag in the same way as before, but it's no
longer treated as a gating condition when considering if we should check
for already-set LP_DEAD bits. This happens at the point where the page
looks like it might have to be split anyway, so simply checking the
LP_DEAD bits in passing is practically free. This avoids missing
LP_DEAD bits just because the page-level hint is unset, which is
probably reasonably common (e.g. it happens when VACUUM unsets the
page-level flag without actually removing index tuples whose LP_DEAD-bit
was set recently, after the VACUUM operation began but before it reached
the leaf page in question).
Note that this isn't a big behavioral change compared to PostgreSQL 13.
We were already checking for set LP_DEAD bits regardless of whether the
BTP_HAS_GARBAGE page level flag was set before we considered doing a
deduplication pass. This commit only goes slightly further by doing the
same check for all indexes, even indexes where deduplication won't be
performed.
We don't completely remove the BTP_HAS_GARBAGE flag. We still rely on
it as a gating condition with pg_upgrade'd indexes from before B-tree
version 4/PostgreSQL 12. That makes sense because we sometimes have to
make a choice among pages full of duplicates when inserting a tuple with
pre version 4 indexes. It probably still pays to avoid accessing the
line pointer array of a page there, since it won't yet be clear whether
we'll insert on to the page in question at all, let alone split it as a
result.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz%3DYpc1PDdk8OVJDChGJBjT06%3DA0Mbv9HyTLCsOknGcUFg%40mail.gmail.com
Stop naming special area/opaque pointer variables 'lpageop' in contexts
where it doesn't make sense. This is a holdover from a time when logic
that performs tasks that are now spread across _bt_insertonpg(),
_bt_findinsertloc(), and _bt_split() was more centralized. 'lpageop'
denotes "left page", which doesn't make sense outside of contexts in
which there isn't also a right page.
Also acquire page flag variables up front within _bt_insertonpg(). This
makes it closer to _bt_split() following refactoring commit bc3087b626d.
This allows the page split and retail insert paths to both make use of
the same variables.
This commit changes the startup process in the standby server so that
it handles the interrupt signals after waiting for wal_retrieve_retry_interval
on the latch and resetting it, before entering another wait on the latch.
This change causes the standby server to promptly handle interrupt signals.
Otherwise, previously, there was the case where the standby needs to
wait extra five seconds to shutdown when the shutdown request arrived
while the startup process was waiting for wal_retrieve_retry_interval
on the latch.
Author: Fujii Masao, but implementation idea is from Soumyadeep Chakraborty
Reviewed-by: Soumyadeep Chakraborty
Discussion: https://postgr.es/m/9d7e6ab0-8a53-ddb9-63cd-289bcb25fe0e@oss.nttdata.com
Only a basic logic bug in a _bt_insertonpg() caller could lead to a
violation of this invariant (index corruption won't do it). A "can't
happen" error seems inappropriate (it is arbitrary at best).
Demote the error to a simple assertion. This matches similar nearby
sanity checks.
Commit 8bf74967dab moved some of the code from brin_new_memtuple to
brin_memtuple_initialize, but this resulted in some of the code being
duplicate. Fix by removing the duplicate lines and backpatch to 10.
Author: Tomas Vondra
Backpatch-through: 10
Discussion: https://postgr.es/m/5eb50c97-9a8e-b691-8c40-1b2a55611c4c%40enterprisedb.com
Introduce TimestampDifferenceMilliseconds() to simplify callers
that would rather have the difference in milliseconds, instead of
the select()-oriented seconds-and-microseconds format. This gets
rid of at least one integer division per call, and it eliminates
some apparently-easy-to-mess-up arithmetic.
Two of these call sites were in fact wrong:
* pg_prewarm's autoprewarm_main() forgot to multiply the seconds
by 1000, thus ending up with a delay 1000X shorter than intended.
That doesn't quite make it a busy-wait, but close.
* postgres_fdw's pgfdw_get_cleanup_result() thought it needed to compute
microseconds not milliseconds, thus ending up with a delay 1000X longer
than intended. Somebody along the way had noticed this problem but
misdiagnosed the cause, and imposed an ad-hoc 60-second limit rather
than fixing the units. This was relatively harmless in context, because
we don't care that much about exactly how long this delay is; still,
it's wrong.
There are a few more callers of TimestampDifference() that don't
have a direct need for seconds-and-microseconds, but can't use
TimestampDifferenceMilliseconds() either because they do need
microsecond precision or because they might possibly deal with
intervals long enough to overflow 32-bit milliseconds. It might be
worth inventing another API to improve that, but that seems outside
the scope of this patch; so those callers are untouched here.
Given the fact that we are fixing some bugs, and the likelihood
that future patches might want to back-patch code that uses this
new API, back-patch to all supported branches.
Alexey Kondratov and Tom Lane
Discussion: https://postgr.es/m/3b1c053a21c07c1ed5e00be3b2b855ef@postgrespro.ru
Remove a CHECK_FOR_INTERRUPTS() call that could never actually handle an
interrupt. We always have a heap page buffer lock at this point.
Having a useless CHECK_FOR_INTERRUPTS() call is harmless but misleading.
It is probably possible to work around the immediate problem by moving
the CHECK_FOR_INTERRUPTS() to before the heap page buffer lock is
acquired. That isn't enough to make the function responsive to
interrupts, though. The index AM caller will still hold an exclusive
buffer lock of its own.
Specifically, this blocks DECLARE ... WITH HOLD and firing of deferred
triggers within index expressions and materialized view queries. An
attacker having permission to create non-temp objects in at least one
schema could execute arbitrary SQL functions under the identity of the
bootstrap superuser. One can work around the vulnerability by disabling
autovacuum and not manually running ANALYZE, CLUSTER, REINDEX, CREATE
INDEX, VACUUM FULL, or REFRESH MATERIALIZED VIEW. (Don't restore from
pg_dump, since it runs some of those commands.) Plain VACUUM (without
FULL) is safe, and all commands are fine when a trusted user owns the
target object. Performance may degrade quickly under this workaround,
however. Back-patch to 9.5 (all supported versions).
Reviewed by Robert Haas. Reported by Etienne Stalmans.
Security: CVE-2020-25695
The description of how LP_DEAD bit setting by index scans works
following commit 2ed5b87f was rather unclear. Clean that up a bit.
Also refer to LP_DEAD bit setting within _bt_check_unique() at the start
of the same section. This mechanism may actually be more important than
the generic kill_prior_tuple mechanism that the section focuses on, so
it at least deserves to be mentioned in passing.
* Avoid pointlessly highlighting that an index vacuum was executed by a
parallel worker; user doesn't care.
* Don't give the impression that a non-concurrent reindex of an invalid
index on a TOAST table would work, because it wouldn't.
* Add a "translator:" comment for a mysterious message.
Discussion: https://postgr.es/m/20201107034943.GA16596@alvherre.pgsql
Reviewed-by: Michael Paquier <michael@paquier.xyz>
brin_form_tuple failed to consider the values may be toasted, inserting
the toast pointer into the index. This may easily result in index
corruption, as the toast data may be deleted and cleaned up by vacuum.
The cleanup however does not care about indexes, leaving invalid toast
pointers behind, which triggers errors like this:
ERROR: missing chunk number 0 for toast value 16433 in pg_toast_16426
A less severe consequence are inconsistent failures due to the index row
being too large, depending on whether brin_form_tuple operated on plain
or toasted version of the row. For example
CREATE TABLE t (val TEXT);
INSERT INTO t VALUES ('... long value ...')
CREATE INDEX idx ON t USING brin (val);
would likely succeed, as the row would likely include toast pointer.
Switching the order of INSERT and CREATE INDEX would likely fail:
ERROR: index row size 8712 exceeds maximum 8152 for index "idx"
because this happens before the row values are toasted.
The bug exists since PostgreSQL 9.5 where BRIN indexes were introduced.
So backpatch all the way back.
Author: Tomas Vondra
Reviewed-by: Alvaro Herrera
Backpatch-through: 9.5
Discussion: https://postgr.es/m/20201001184133.oq5uq75sb45pu3aw@development
Discussion: https://postgr.es/m/20201104010544.zexj52mlldagzowv%40development
wal_consistency_checking indicated an inconsistency in certain cases
involving nbtree page deletion. The underlying issue is that there was
a minor difference between the page image produced after a REDO routine
ran and the corresponding page image following original execution.
This harmless inconsistency has been around forever. We more or less
expect total consistency among even deleted nbtree pages these days,
though, so this won't do anymore.
To fix, tweak the REDO routine to match original execution.
Oversight in commit f47b5e13.
Logic for counting heap TIDs from posting list tuples (added by commit
0d861bbb) was faulty. It didn't count any TIDs/index tuples in the
event of no callback being set. This meant that we incorrectly counted
no index tuples in clean-up only VACUUMs, which could lead to
pg_class.reltuples being spuriously set to 0 in affected indexes.
To fix, go back to counting items from the page in cases where there is
no callback. This approach isn't very accurate, but it works well
enough in practice while avoiding the expense of accessing every index
tuple during cleanup-only VACUUMs.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
https://postgr.es/m/20201023174451.69e358f1@firost
Backpatch: 13-, where nbtree deduplication was introduced
Commit dee663f7 intended to drop any queued up fsync requests before
unlinking segment files, but missed a code path. Fix, by centralizing
the forget-and-unlink code into a single function.
Reported-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Discussion: https://postgr.es/m/20201104013205.icogbi773przyny5%40development
Commit ac22929a26 changed recoveryWakeupLatch so that it's reset to
NULL at the end of recovery. This change could cause a segmentation fault
in the buildfarm member 'elver'.
Previously the latch was reset to NULL after calling ShutdownWalRcv().
But there could be a window between ShutdownWalRcv() and the actual
exit of walreceiver. If walreceiver set the latch during that window,
the segmentation fault could happen.
To fix the issue, this commit changes walreceiver so that it sets
the latch only when the latch has not been reset to NULL yet.
Author: Fujii Masao
Discussion: https://postgr.es/m/5c1f8a85-747c-7bf9-241e-dd467d8a3586@iki.fi
This commit gets rid of the dedicated latch for signaling the startup
process in favor of using its procLatch, since that comports better
with possible generic signal handlers using that latch.
Commit 1e53fe0e70 changed background processes so that they use standard
SIGHUP handler. Like that, this commit also makes the startup process use
standard SIGHUP handler to simplify the code.
Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Michael Paquier
Discussion: https://postgr.es/m/CALj2ACXPorUqePswDtOeM_s82v9RW32E1fYmOPZ5NuE+TWKj_A@mail.gmail.com
Some places were using PG_GETARG_UINT32 where PG_GETARG_TRANSACTIONID
would be more appropriate. (Of course, they are the same internally,
so there is no externally visible effect.) To do that, export
PG_GETARG_TRANSACTIONID outside of xid.c. We also export
PG_RETURN_TRANSACTIONID for symmetry, even though there are currently
no external users.
Author: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
Discussion: https://www.postgresql.org/message-id/flat/d8f6bdd536df403b9b33816e9f7e0b9d@G08CNEXMBPEKD05.g08.fujitsu.local
Certain background workers initiate parallel queries while
debug_query_string==NULL, at which point they attempted strlen(NULL) and
died to SIGSEGV. Older debug_query_string observers allow NULL, so do
likewise in these newer ones. Back-patch to v11, where commit
7de4a1bcc56f494acbd0d6e70781df877dc8ecb5 introduced the first of these.
Discussion: https://postgr.es/m/20201014022636.GA1962668@rfd.leadboat.com
Because of this, if you tried to create an operator family with the new
sortsupport function, you got an error:
ERROR: support function number 11 is invalid for access method gist
We missed this in commit 16fa9b2b30 that added the sortsupport function,
because it only added sortsupport to a built-in operator family.
Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/3520A18A-5C38-4697-A2E3-F3BDE3496CD5%40yandex-team.ru
This fixes a bug in the edge case where, for a temp table, heap_page_prune()
can end up with a different horizon than heap_vacuum_rel(). Which can trigger
errors like "ERROR: cannot freeze committed xmax ...".
The bug was introduced due to interaction of a7212be8b9e "Set cutoff xmin more
aggressively when vacuuming a temporary table." with dc7420c2c92 "snapshot
scalability: Don't compute global horizons while building snapshots.".
The problem is caused by lazy_scan_heap() assuming that the only reason its
HeapTupleSatisfiesVacuum() call would return HEAPTUPLE_DEAD is if the tuple is
a HOT tuple, or if the tuple's inserting transaction has aborted since the
heap_page_prune() call. But after a7212be8b9e that was also possible in other
cases for temp tables, because heap_page_prune() uses a different visibility
test after dc7420c2c92.
The fix is fairly simple: Move the special case logic for temp tables from
vacuum_set_xid_limits() to the infrastructure introduced in dc7420c2c92. That
ensures that the horizon used for pruning is at least as aggressive as the one
used by lazy_scan_heap(). The concrete horizon used for temp tables is
slightly different than the logic in dc7420c2c92, but should always be as
aggressive as before (see comments).
A significant benefit to centralizing the logic procarray.c is that now the
more aggressive horizons for temp tables does not just apply to VACUUM but
also to e.g. HOT pruning and the nbtree killtuples logic.
Because isTopLevel is not needed by vacuum_set_xid_limits() anymore, I
undid the the related changes from a7212be8b9e.
This commit also adds an isolation test ensuring that the more aggressive
vacuuming and pruning of temp tables keeps working.
Debugged-By: Amit Kapila <amit.kapila16@gmail.com>
Debugged-By: Tom Lane <tgl@sss.pgh.pa.us>
Debugged-By: Ashutosh Sharma <ashu.coek88@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/20201014203103.72oke6hqywcyhx7s@alap3.anarazel.de
Discussion: https://postgr.es/m/20201015083735.derdzysdtqdvxshp@alap3.anarazel.de