The contents of the line whose parsing failed was not reported in the
error message produced by generate-wait_event_types.pl, making harder
than necessary the debugging of incorrectly-shaped entries in the file.
Reported-by: Andres Freund
Discussion: https://postgr.es/m/ZK9S3jFEV1X797Ll@paquier.xyz
The double quotes used for the wait event names are not required, as
the values quoted are made of single words. The files generated by
generate-wait_event_types.pl (pgstat_wait_event.c, wait_event_types.h
and wait_event_types.sgml) are exactly the same before and after this
commit, hence the wait event names and the enum elements have the same
names as before.
Discussion: https://postgr.es/m/ZK9S3jFEV1X797Ll@paquier.xyz
Until now, when DROP DATABASE got interrupted in the wrong moment, the removal
of the pg_database row would also roll back, even though some irreversible
steps have already been taken. E.g. DropDatabaseBuffers() might have thrown
out dirty buffers, or files could have been unlinked. But we continued to
allow connections to such a corrupted database.
To fix this, mark databases invalid with an in-place update, just before
starting to perform irreversible steps. As we can't add a new column in the
back branches, we use pg_database.datconnlimit = -2 for this purpose.
An invalid database cannot be connected to anymore, but can still be
dropped.
Unfortunately we can't easily add output to psql's \l to indicate that some
database is invalid, it doesn't fit in any of the existing columns.
Add tests verifying that a interrupted DROP DATABASE is handled correctly in
the backend and in various tools.
Reported-by: Evgeny Morozov <postgresql3@realityexists.net>
Author: Andres Freund <andres@anarazel.de>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20230509004637.cgvmfwrbht7xm7p6@awork3.anarazel.de
Discussion: https://postgr.es/m/20230314174521.74jl6ffqsee5mtug@awork3.anarazel.de
Backpatch: 11-, bug present in all supported versions
When vac_truncate_clog() encounters bogus datfrozenxid / datminmxid values, it
returns early. Unfortunately, until now, it did not release
WrapLimitsVacuumLock. If the backend later tries to acquire
WrapLimitsVacuumLock, the session / autovacuum worker hangs in an
uncancellable way. Similarly, other sessions will hang waiting for the
lock. However, if the backend holding the lock exited or errored out for some
reason, the lock was released.
The bug was introduced as a side effect of 566372b3d6.
It is interesting that there are no production reports of this problem. That
is likely due to a mix of bugs leading to bogus values having gotten less
common, process exit releasing locks and instances of hangs being hard to
debug for "normal" users.
Discussion: https://postgr.es/m/20230621221208.vhsqgduwfpzwxnpg@awork3.anarazel.de
Commit 89e46da5e allowed REPLICA IDENTITY FULL tables to use an index
on the subscriber during apply of update/delete. This commit clarifies
in the documentation that the leftmost field of candidate indexes must
be a column (not an expression) that references the published relation
column.
The source code comments are also updated accordingly.
Reviewed-by: Peter Smith, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoDJjffEvUFKXT27Q5U8-UU9JHv4rrJ9Ke8Zkc5UPWHLvA@mail.gmail.com
Backpatch-through: 16
A CaseTestExpr is currently being put into
JsonValueExpr.formatted_expr as placeholder for the result of
evaluating JsonValueExpr.raw_expr, which in turn is evaluated
separately. Though, there's no need for this indirection if
raw_expr itself can be embedded into formatted_expr and evaluated
as part of evaluating the latter, especially as there is no
special reason to evaluate it separately. So this commit makes it
so. As a result, JsonValueExpr.raw_expr no longer needs to be
evaluated in ExecInterpExpr(), eval_const_exprs_mutator() etc. and
is now only used for displaying the original "unformatted"
expression in ruleutils.c.
While at it, this also removes the function makeCaseTestExpr(),
because the code in makeJsonConstructorExpr() looks more readable
without it IMO and isn't used by anyone else either.
Finally, a note is added in the comment above CaseTestExpr's
definition that JsonConstructorExpr is also using it.
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CA+HiwqE4XTdfb1nW=Ojoy_tQSRhYt-q_kb6i5d4xcKyrLC1Nbg@mail.gmail.com
The first check on the enum values was not necessary as the values set
in wait_event_names.txt for the classes LWLock and Lock were able to
satisfy the check. The second check when generating the C and header
files is now based on a match of the class names, making it simpler to
understand.
Author: Masahiro Ikeda, Michael Paquier
Discussion: https://postgr.es/m/eaf82a85c0ef1b55dc3b651d3f7b867a@oss.nttdata.com
This commit adds two columns: indexes_total and indexes_processed, to
pg_stat_progress_vacuum system view to show the index vacuum
progress. These numbers are reported in the "vacuuming indexes" and
"cleaning up indexes" phases.
This uses the new parallel message type for progress reporting added
by be06506e7.
Bump catversion because this changes the definition of
pg_stat_progress_vacuum.
Author: Sami Imseih
Reviewed by: Masahiko Sawada, Michael Paquier, Nathan Bossart, Andres Freund
Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com
This commit adds a new type of parallel message 'P' to allow a
parallel worker to poke at a leader to update the progress.
Currently it supports only incremental progress reporting but it's
possible to allow for supporting of other backend progress APIs in the
future.
There are no users of this new message type as of this commit. That
will follow in future commits.
Idea from Andres Freund.
Author: Sami Imseih
Reviewed by: Michael Paquier, Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/5478DFCD-2333-401A-B2F0-0D186AB09228@amazon.com
Windows has similar functions with leading underscores. Previously, we
provided the rename via a macro in win32_port.h. In fact its functions
are not always good replacements for the Unix functions, since they
can't deal with UTF-8. They are only currently used by pg_locale.c,
which is careful to redirect to other Windows routines for UTF-8. Given
that portability hazard, it seem unlikely to be a good idea to encourage
any other code to think of these functions as being available outside
pg_locale.c. Any code that thinks it wants these functions probably
wants our wchar2char() or char2wchar() routines instead, or it won't
actually work on Windows in UTF-8 databases.
Furthermore, some major libc implementations including glibc don't have
them (they only have the standard variants without _l), so external code
is very unlikely to require them to exist.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CA%2BhUKG%2Bt_CHPzEoPnKyARJBJgE9-GxNajJo6ZuSfRK_KWFO%2B6w%40mail.gmail.com
Since PostgresMain calls sigsetjmp, any local variables that are not
marked "volatile" have a risk of unspecified behavior. In practice
this means that when control returns via longjmp, such variables might
get reset to their values as of the time of sigsetjmp, depending on
whether the compiler chose to put them in registers or on the stack.
We were careful about this for "send_ready_for_query", but not the
other local variables.
In the case of the timeout_enabled flags, resetting them to
their initial "false" states is actually good, since we do
"disable_all_timeouts()" in the longjmp cleanup code path. If that
does not happen, we risk uselessly calling "disable_timeout()" later,
which is harmless but a little bit expensive. Let's explicitly reset
these flags so that the behavior is correct and platform-independent.
(This change means that we really don't need the new "volatile"
markings after all, but let's install them anyway since any change
in this logic could re-introduce a problem.)
There is no issue for "firstchar" and "input_message" because those
are explicitly reinitialized each time through the query processing
loop. To make that clearer, move them to be declared inside the loop.
That leaves us with all the function-lifespan locals except the
sigjmp_buf itself marked as volatile, which seems like a good policy
to have going forward.
Because of the possibility of extra disable_timeout() calls, this
seems worth back-patching.
Sergey Shinderuk and Tom Lane
Discussion: https://postgr.es/m/2eda015b-7dff-47fd-d5e2-f1a9899b90a6@postgrespro.ru
changeDependencyFor() returns the number of pg_depend entries changed,
or 0 if there is a problem. The callers of this routine expect only one
dependency to change, but they did not check for the result returned.
The following code paths gain checks:
- Namespace for extensions.
- Namespace for various object types (see AlterObjectNamespace).
- Planner support function for a function.
Some existing error messages related to all that are reworded to be more
consistent with the project style, and the new error messages added
follow the same style. This change has exposed one bug fixed a bit
earlier with bd5ddbe.
Reviewed-by: Heikki Linnakangas, Akshat Jaimini
Discussion: https://postgr.es/m/ZJzD/rn+UbloKjB7@paquier.xyz
As coded, the code would use as a base comparison the namespace OID from
the first object scanned in pg_depend when switching its namespace
dependency entry to the new one, and use it as a base of comparison for
any follow-up checks. It would also be used as the old namespace OID to
switch *from* for the extension's pg_depend entry. Hence, if the first
object scanned has a namespace different than the one stored in the
extension, we would finish by:
- Not checking that the extension objects map with the extension's
schema.
- Not switching the extension -> namespace dependency entry to the new
namespace provided by the user, making ALTER EXTENSION ineffective.
This issue exists since this command has been introduced in d9572c4 for
relocatable extension, so backpatch all the way down to 11. The test
case has been provided by Heikki, that I have tweaked a bit to show the
effects on pg_depend for the extension.
Reported-by: Heikki Linnakangas
Author: Michael Paquier, Heikki Linnakangas
Discussion: https://postgr.es/m/20eea594-a05b-4c31-491b-007b6fceef28@iki.fi
Backpatch-through: 11
Comments in src/backend/libpq/auth.c say: (after successfully finding
the final DN to check the user-supplied password against)
/* Unbind and disconnect from the LDAP server */
and later
/*
* Need to re-initialize the LDAP connection, so that we can bind to
* it with a different username.
*/
But the protocol actually permits multiple subsequent authentications
("binds") over a single connection.
So, it seems like the whole connection re-initialization thing was
just a confusion and can be safely removed, thus saving quite a few
network round-trips, especially for the case of ldaps/starttls.
Author: Anatoly Zaretsky <anatoly.zaretsky@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CALbq6kmJ-1+58df4B51ctPfTOSyPbY8Qi2=ct8oR=i4TamkUoQ@mail.gmail.com
locale_t is defined by POSIX.1-2008 and SUSv4, and available on all
targeted systems. For Windows, win32_port.h redirects to a partial
implementation called _locale_t. We can now remove a lot of
compile-time tests for HAVE_LOCALE_T, and associated comments and dead
code branches that were needed for older computers.
Since configure + MinGW builds didn't detect locale_t but now we assume
that all systems have it, further inconsistencies among the 3 Windows build
systems were revealed. With this commit, we no longer define
HAVE_WCSTOMBS_L and HAVE_MBSTOWCS_L on any Windows build system, but
we have logic to deal with that so that replacements are available where
appropriate.
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Tristan Partin <tristan@neon.tech>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/CA%2BhUKGLg7_T2GKwZFAkEf0V7vbnur-NfCjZPKZb%3DZfAXSV1ORw%40mail.gmail.com
Commit 19d8e2308b allowed a weaker check for HOT with summarizing
indexes, but it did not update README.HOT. So do that now.
Patch by Matthias van de Meent, minor changes by me. Backpatch to 16,
where the optimization was introduced.
Author: Matthias van de Meent
Reviewed-by: Tomas Vondra
Backpatch-through: 16
Discussion: https://postgr.es/m/CAEze2WiEOm8V+c9kUeYp2BPhbEc5s473fUf51xNeqvSFGv44Ew@mail.gmail.com
We create a file, so we better WAL-log it. In practice, all the
built-in index AMs and all extensions that I'm aware of write a
metapage to the init fork, which is WAL-logged, and replay of the
metapage implicitly creates the fork too. But if ambuildempty() didn't
write any page, we would miss it.
This can be seen with dummy_index_am. Set up replication, create a
'dummy_index_am' index on an unlogged table, and look at the files
created in the replica: the init fork is not created on the
replica. Dummy_index_am doesn't do anything with the relation files,
however, so it doesn't lead to any user-visible errors.
Backpatch to all supported versions.
Reviewed-by: Robert Haas
Discussion: https://www.postgresql.org/message-id/6e5bbc08-cdfc-b2b3-9e23-1a914b9850a9%40iki.fi
This is useful to show the allocation state of huge pages when setting
up a server with "huge_pages = try", where allocating huge pages would
be attempted but the server would continue its startup sequence even if
the allocation fails. The effective status of huge pages is not easily
visible without OS-level tools (or for instance, a lookup at
/proc/N/smaps), and the environments where Postgres runs may not
authorize that. Like the other GUCs related to huge pages, this works
for Linux and Windows.
This GUC can report as values:
- "on", if huge pages were allocated.
- "off", if huge pages were not allocated.
- "unknown", a special state that could only be seen when using for
example postgres -C because it is only possible to know if the shared
memory allocation worked after we can check for the GUC values, even if
checking a runtime-computed GUC. This value should never be seen when
querying for the GUC on a running server. An assertion is added to
check that.
The discussion has also turned around having a new function to grab this
status, but this would have required more tricks for -DEXEC_BACKEND,
something that GUCs already handle.
Noriyoshi Shinoda has initiated the thread that has led to the result of
this commit.
Author: Justin Pryzby
Reviewed-by: Nathan Bossart, Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/TU4PR8401MB1152EBB0D271F827E2E37A01EECC9@TU4PR8401MB1152.NAMPRD84.PROD.OUTLOOK.COM
The header file wait_event_types.h was generated without a newline at
its end, which was inconsistent with all the other things generated
automatically.
Per offline gripe from Nathan Bossart.
This commit reverts the work done by commits 3ba59ccc89 and 72e78d831a.
Those commits were incorrect in asserting that we never acquire any other
heavy-weight lock after acquring page lock other than relation extension
lock. We can acquire a lock on catalogs while doing catalog look up after
acquring page lock.
This won't impact any existing feature but we need to think some other way
to achieve this before parallelizing other write operations or even
improving the parallelism in vacuum (like allowing multiple workers
for an index).
Reported-by: Jaime Casanova
Author: Amit Kapila
Backpatch-through: 13
Discussion: https://postgr.es/m/CAJKUy5jffnRKNvRHKQ0LynRb0RJC-o4P8Ku3x9vGAVLwDBWumQ@mail.gmail.com
This commit comes as a continuation of the discussion that has led to
d522b05, as \v was handled inconsistently when parsing array values or
anything going through the parsers, and changing a parser behavior in
stable branches is a scary thing to do. The parsing of array values now
uses the more central scanner_isspace() and array_isspace() is removed.
As pointing out by Peter Eisentraut, fix a confusing reference to
horizontal space in the parsers with the term "horiz_space". \f was
included in this set since 3cfdd8f from 2000, but it is not horizontal.
"horiz_space" is renamed to "non_newline_space", to refer to all
whitespace characters except newlines.
The changes impact the parsers for the backend, psql, seg, cube, ecpg
and replication commands. Note that JSON should not escape \v, as per
RFC 7159, so these are not touched.
Reviewed-by: Peter Eisentraut, Tom Lane
Discussion: https://postgr.es/m/ZJKcjNwWHHvw9ksQ@paquier.xyz
llvm_release_context() called llvm_enter_fatal_on_oom(), but was missing
the corresponding llvm_leave_fatal_on_oom() call. As a result, if JIT was
used at all, we were almost always in the "fatal-on-oom" state.
It only makes a difference if you use an extension written in C++, and
run out of memory in a C++ 'new' call. In that case, you would get a
PostgreSQL FATAL error, instead of the default behavior of throwing a
C++ exception.
Back-patch to all supported versions.
Reviewed-by: Daniel Gustafsson
Discussion: https://www.postgresql.org/message-id/54b78cca-bc84-dad8-4a7e-5b56f764fab5@iki.fi
Most, but not all, of our "/*----" style comments have an end-guard
line with dashes at the end of the comment. However, pgindent doesn't
care about the end-guards, so they mostly just waste screen
space. Going forward, let's not require end-guards.
Remove a broken end-guard in a comment in walsender.c that led me to
think about this. Remove the end guard from another comment in
walsender.c for consistency, so that we use the same style in all
comments in the file.
However, we have thousands of existing "/*----" comments the repository,
so it's not worth the code churn to change them all.
Discussion: https://www.postgresql.org/message-id/fb083c91-d490-3b65-25f3-05e9118b6b0d%40iki.fi
BuildEventTriggerCache sets up a context "EventTriggerCache" which
house a hash named "Event Trigger Cache", which in turn creates a
context with the table name. This generated log output for memory
context dumps like below:
LOG: level: 2; EventTriggerCache: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
LOG: level: 3; Event Trigger Cache: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
This renames the hash to ensure that the hash context has a unique
name for easier log reading and debugging.
Discussion: https://postgr.es/m/5EDC969E-CAE3-4CBD-965E-3B8A1294CFA4@yesql.se
Commit 7b64e4b3 taught DropSubscription() to drop stats entry of
subscription that is not associated with a replication slot for apply
worker at DROP SUBSCRIPTION but missed covering the case where the
subscription is not associated with replication slots for both apply
worker and tablesync worker.
Also add a test to verify that the stats for slot-less subscription is
removed at DROP SUBSCRIPTION time.
Backpatch down to 15.
Author: Masahiko Sawada
Reviewed-by: Nathan Bossart, Hayato Kuroda, Melih Mutlu, Amit Kapila
Discussion: https://postgr.es/m/CAD21AoB71zkP7uPT7JDPsZcvp0749ExEQnOJxeNKPDFisHar+w@mail.gmail.com
Backpatch-through: 15
The documentation and the code is generated automatically from a new
file called wait_event_names.txt, formatted in sections dedicated to
each wait event class (Timeout, Lock, IO, etc.) with three tab-separated
fields:
- C symbol in enums
- Format in the system views
- Description in the docs
Using this approach has several advantages, as we have proved to be
rather bad in maintaining this area of the tree across the years:
- The order of each item in the documentation and the code, which should
be alphabetical, has become incorrect multiple times, and the script
generating the code and documentation has a few rules to enforce that,
making the maintenance a no-brainer.
- Some wait events were added to the code, but not documented, so this
cannot be missed now.
- The order of the tables for each wait event class is enforced in the
documentation (the input .txt file does so as well for clarity, though
this is not mandatory).
- Less code, shaving 1.2k lines from the tree, with 1/3 of the savings
coming from the code, the rest from the documentation.
The wait event types "Lock" and "LWLock" still have their own code path
for their code, hence only the documentation is created for them. These
classes are listed with a special marker called WAIT_EVENT_DOCONLY in
the input file.
Adding a new wait event now requires only an update of
wait_event_names.txt, with "Lock" and "LWLock" treated as exceptions.
This commit has been tested with configure/Makefile, the CI and VPATH
build. clean, distclean and maintainer-clean were working fine.
Author: Bertrand Drouvot, Michael Paquier
Discussion: https://postgr.es/m/77a86b3a-c4a8-5f5d-69b9-d70bbf2e9b98@gmail.com
Clear any potential stale next_phase_at value from the snapshot
builder which otherwise may trip an assertion check ensuring
that there is no next_phase_at value.
This can be reproduced by running 80 concurrent sessions like
the below where $c is a loop counter (assumes there has been
1..$c databases created) :
echo "
CREATE TABLE replication_example(id SERIAL PRIMARY KEY,
somedata int,
text varchar(120));
SELECT 'init' FROM
pg_create_logical_replication_slot('regression_slot_$c',
'test_decoding');
SELECT data FROM
pg_logical_slot_get_changes('regression_slot_$c', NULL,
NULL, 'include-xids', '0',
'skip-empty-xacts', '1');
" | psql -d regress_$c >>psql.log &
Backpatch down to v16.
Bug: #17695
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: bowenshi <zxwsbg@qq.com>
Discussion: https://postgr.es/m/17695-6be9277c9295985f@postgresql.org
Backpatch-through: v16
If you create a table and don't insert any data into it, the relation file
is never fsync'd. You don't lose data, because an empty table doesn't have
any data to begin with, but if you crash and lose the file, subsequent
operations on the table will fail with "could not open file" error.
To fix, register an fsync request in mdcreate(), like we do for mdwrite().
Per discussion, we probably should also fsync the containing directory
after creating a new file. But that's a separate and much wider issue.
Backpatch to all supported versions.
Reviewed-by: Andres Freund, Thomas Munro
Discussion: https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi
Previously an "amcanorderbyop" index would only be used when the index
could provide sorted results which satisfied all query_pathkeys. Here
we relax this so that we also allow these indexes to be considered by the
planner when they only provide partially sorted results. This allows the
planner to later consider making use of an Incremental Sort to satisfy the
remaining pathkeys. This change is particularly useful for KNN-type
queries which contain a LIMIT clause and an additional ORDER BY clause for
a non-indexed column.
Author: Miroslav Bendik
Reviewed-by: Richard Guo, David Rowley
Discussion: https://postgr.es/m/CAPoEpV0QYDtzjwamwWUBqyWpaCVbJV2d6qOD7Uy09bWn47PJtw%40mail.gmail.com
It's OK to be lazy about re-binning memory segments when allocating,
because that can only leave segments in a bin that's too high. We'll
search higher bins if necessary while allocating next time, and
also eventually re-bin, so no memory can become unreachable that way.
However, when freeing memory, the largest contiguous range of free pages
might go up, so we should re-bin eagerly to make sure we don't leave the
segment in a bin that is too low for get_best_segment() to find.
The re-binning code is moved into a function of its own, so it can be
called whenever free pages are returned to the segment's free page map.
Back-patch to all supported releases.
Author: Dongming Liu <ldming101@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com> (earlier version)
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CAL1p7e8LzB2LSeAXo2pXCW4%2BRya9s0sJ3G_ReKOU%3DAjSUWjHWQ%40mail.gmail.com
Prior to this, Bitmapsets could contain trailing words which had no
members set. Many places in bitmapset.c had to loop over trailing words
to check for empty words. If we ensure we always remove these trailing
zero words then we can perform various optimizations such as fast
pathing bms_is_subset to return false when 'a' has more words than 'b'.
A similar optimization is possible in bms_equal. Both of these together
can yield quite significant performance increases in the query planner
when querying a partitioned table with around 100 or more partitions.
While we're at it, since the minimum number of words a Bitmapset can
contain is 1, we can make use of do/while loops instead of for loops
when looping over all words in a set. This means checking the loop
condition 1 less time, which for single-word sets cuts the loop
condition checks in half.
Author: David Rowley
Reviewed-by: Yuya Watari
Discussion: https://postgr.es/m/CAApHDvr5O41MuUjw0DQKqmAnv7QqvmLqXReEd5o4nXTzWp8-+w@mail.gmail.com
This commit increases the size of the bgw_library_name member of
the BackgroundWorker struct from BGW_MAXLEN (96) bytes to MAXPGPATH
(default of 1024) bytes so that it can store longer file names
(e.g., absolute paths).
Author: Yurii Rashkovskii
Reviewed-by: Daniel Gustafsson, Aleksander Alekseev
Discussion: https://postgr.es/m/CA%2BRLCQyjFV5Y8tG5QgUb6gjteL4S3p%2B1gcyqWTqigyM93WZ9Pg%40mail.gmail.com
The ginfast.c code previously checked for conflicts in before locking
the relevant buffer, leaving a window where a RW conflict could be
missed. Re-order.
There was also a place where buffer ID and block number were confused
while trying to predicate-lock a page, noted by visual inspection.
Back-patch to all supported releases. Fixes one more problem discovered
with the reproducer from bug #17949, in this case when Dmitry tried
other index types.
Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reported-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org
When performing a bitmap heap scan, we don't want to miss concurrent
writes that occurred after we observed the heap's rs_nblocks, but before
we took predicate locks on index pages. Therefore, we can't skip
fetching any heap tuples that are referenced by the index, because we
need to test them all with CheckForSerializableConflictOut(). The
old optimization that would ignore any references to blocks >=
rs_nblocks gets in the way of that requirement, because it means that
concurrent writes in that window are ignored.
Removing that optimization shouldn't affect correctness at any isolation
level, because any new tuples shouldn't be visible to an MVCC snapshot.
There also shouldn't be any error-causing references to heap blocks past
the end, because we should have held at least an AccessShareLock on the
table before the index scan. It can't get smaller while our transaction
is running. For now, though, we'll keep the optimization at lower
levels to avoid making unnecessary changes in a bug fix.
Back-patch to all supported releases. In release 11, the code is in a
different place but not fundamentally different. Fixes one aspect of
bug #17949.
Reported-by: Artem Anisimov <artem.anisimov.255@gmail.com>
Reviewed-by: Dmitry Dolgov <9erthalion6@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/17949-a0f17035294a55e2%40postgresql.org