mirror of https://github.com/postgres/postgres.git synced 2025-11-09 06:21:09 +03:00
Commit Graph

5341 Commits

Author SHA1 Message Date
Michael Paquier
ad25744f43 Add wal_fpi_bytes to VACUUM and ANALYZE logs
The new wal_fpi_bytes counter calculates the total amount of full page
images inserted in WAL records, in bytes.  This commit adds this
information to VACUUM and ANALYZE logs alongside the existing counters,
building upon f9a09aa295.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aQMMSSlFXy4Evxn3@paquier.xyz
2025-11-03 19:42:03 +09:00
Peter Geoghegan
b8f1c62807 Document nbtree row comparison design.
Add comments explaining when and where it is safe for nbtree to treat
row compare keys as if they were simple scalar inequality keys on the
row's most significant column.  This is particularly important within
_bt_advance_array_keys, which deals with required inequality keys in a
general and uniform way, without any special handling for row compares.

Also spell out the implications of _bt_check_rowcompare's approach of
_conditionally_ evaluating lower-order row compare subkeys, particularly
when one of its lower-order subkeys might see NULL index tuple values
(these may or may not affect whether the qual as a whole is satisfied).
The behavior in this area isn't particularly intuitive, so these issues
seem worth going into.

In passing, add a few more defensive/documenting row comparison related
assertions to _bt_first and _bt_check_rowcompare.

Follow-up to commits bd3f59fd and ec986020.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com
Backpatch-through: 18
2025-11-02 15:27:05 -05:00
Peter Geoghegan
4f08586c7a Remove obsolete nbtree equality key comments.
_bt_first reliably uses the same equality key (on each index column) for
initial positioning purposes as the one that _bt_checkkeys can use to
end the scan following commit f09816a0.  _bt_first no longer applies its
own independent rules to determine which initial positioning key to use
on each column (for equality and inequality keys alike).  Preprocessing
is now fully in control of determining which keys start and end each
scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior.

Remove obsolete comments that described why _bt_first was expected to
use at least one of the available required equality keys for initial
positioning purposes.  The rules in this area are now maximally strict
and uniform, so there's no reason to draw attention to equality keys.
Any column with a required equality key cannot have a redundant required
inequality key (nor can it have a redundant required equality key).

Oversight in commit f09816a0, which removed similar comments from
_bt_first, but missed these comments.

Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
2025-11-02 13:34:18 -05:00
Peter Eisentraut
8a27d418f8 Mark function arguments of type "Datum *" as "const Datum *" where possible
Several functions in the codebase accept "Datum *" parameters but do
not modify the pointed-to data.  These have been updated to take
"const Datum *" instead, improving type safety and making the
interfaces clearer about their intent.  This change helps the compiler
catch accidental modifications and better documents immutability of
arguments.

Most "Datum *" parameters have a paired "bool *isnull" parameter;
these are constified as well.

No functional behavior is changed by this patch.

Author: Chao Li <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com
2025-10-31 10:47:25 +01:00
Peter Eisentraut
e1ac846f3d Mark ItemPointer arguments as const throughout
This is a follow-up to 991295f.  I searched through src/ and made
ItemPointer arguments const wherever possible.

Note: We cut out from the original patch the pieces that would have
created incompatibilities in the index or table AM APIs.  Those could
be considered separately.

Author: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2nBaypg16Z5ciHuKw66pk850RFWw9ACS2DqqJ_AkKeRsw%40mail.gmail.com
2025-10-30 14:12:06 +01:00
Peter Eisentraut
8ce795fcb7 Fix some confusing uses of const
There are a few places where we have

    typedef struct FooData { ... } FooData;
    typedef FooData *Foo;

and then function declarations with

    bar(const Foo x)

which isn't incorrect but probably meant

    bar(const FooData *x)

meaning that the thing x points to is immutable, not x itself.

This patch makes those changes where appropriate.  In one
case (execGrouping.c), the thing being pointed to was not immutable,
so in that case remove the const altogether, to avoid further
confusion.
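
The difference between the two spellings can be seen in a minimal
sketch.  FooData and Foo are the placeholder names used above; the two
function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder types mirroring the FooData/Foo pattern described above. */
typedef struct FooData
{
    int         value;
} FooData;

typedef FooData *Foo;

/*
 * "const Foo x" makes the pointer x itself const, not the struct it
 * points to, so this write through x compiles without complaint.
 */
void
bump_via_const_typedef(const Foo x)
{
    assert(x != NULL);
    x->value++;
}

/*
 * "const FooData *x" makes the pointed-to struct read-only through x.
 * Assigning to x->value here would be rejected by the compiler.
 */
int
read_via_pointer_to_const(const FooData *x)
{
    assert(x != NULL);
    return x->value;
}
```

Calling bump_via_const_typedef() on a struct changes its value field
even though the parameter is declared const, which is the confusion
this patch is meant to avoid.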

Co-authored-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://www.postgresql.org/message-id/CAEoWx2m2E0xE8Kvbkv31ULh_E%2B5zph-WA_bEdv3UR9CLhw%2B3vg%40mail.gmail.com
Discussion: https://www.postgresql.org/message-id/CAEoWx2kTDz%3Db6T2xHX78vy_B_osDeCC5dcTCi9eG0vXHp5QpdQ%40mail.gmail.com
2025-10-30 11:20:04 +01:00
Michael Paquier
d3111cb753 Fix correctness issue with computation of FPI size in WAL stats
XLogRecordAssemble() may be called multiple times before inserting a
record in XLogInsertRecord(), and the amount of FPIs generated inside
a record whose insertion is attempted multiple times may vary.

The logic added in f9a09aa295 touched directly pgWalUsage in
XLogRecordAssemble(), meaning that it could be possible for pgWalUsage
to be incremented multiple times for a single record.  This commit
changes the code to use the same logic as the number of FPIs added to a
record, where XLogRecordAssemble() returns this information and feeds it
to XLogInsertRecord(), updating pgWalUsage only when a record is
inserted.
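
The accounting pattern described above can be sketched as follows.
This is a simplified illustration, not the committed code: the struct,
the global, and both function names are hypothetical stand-ins for
pgWalUsage, XLogRecordAssemble() and XLogInsertRecord().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the pgWalUsage counter. */
typedef struct WalUsageSketch
{
    uint64_t    wal_fpi_bytes;
} WalUsageSketch;

WalUsageSketch wal_usage_sketch = {0};

/*
 * Hypothetical assemble step.  It may run several times for one record,
 * so it only reports the FPI size of this attempt to the caller and
 * never touches the global counter.
 */
void
assemble_record_sketch(uint64_t fpi_size_this_attempt, uint64_t *fpi_bytes)
{
    assert(fpi_bytes != NULL);
    *fpi_bytes = fpi_size_this_attempt;
}

/*
 * Hypothetical insert step.  Returns false when the caller has to
 * re-assemble and retry.  The counter is updated only when the record
 * is actually inserted.
 */
bool
insert_record_sketch(bool can_insert, uint64_t fpi_bytes)
{
    if (!can_insert)
        return false;
    wal_usage_sketch.wal_fpi_bytes += fpi_bytes;
    return true;
}
```

With this split, a failed attempt followed by a successful one adds
only the FPI size of the attempt that was inserted.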

Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com
2025-10-29 09:13:31 +09:00
Michael Paquier
f9a09aa295 Add wal_fpi_bytes to pg_stat_wal and pg_stat_get_backend_wal()
This new counter, called "wal_fpi_bytes", tracks the total amount in
bytes of full page images (FPIs) generated in WAL.  This data becomes
available globally via pg_stat_wal, and for backend statistics via
pg_stat_get_backend_wal().

Previously, this information could only be retrieved with pg_waldump or
pg_walinspect, which may not be available depending on the environment,
and are expensive to execute.  It offers hints about how much FPIs
contribute to the WAL generated, which could be a large percentage for
some workloads, as well as the effects of wal_compression or page holes.

Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in
PgStat_WalCounters.

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
2025-10-28 16:21:51 +09:00
Peter Eisentraut
10b5bb3bff Add some const qualifications
Add some const qualifications afforded by the previous change that
added a const qualification to PageAddItemExtended().

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
2025-10-27 09:55:59 +01:00
Peter Eisentraut
76acf4b722 Remove Item type
This type is just char * underneath, it provides no real value, no
type safety, and just makes the code one level more mysterious.  It is
more idiomatic to refer to blobs of memory by a combination of void *
and size_t, so change it to that.

Also, since this type hides the pointerness, we can't apply qualifiers
to what is pointed to, which requires some unconstify nonsense.  This
change allows fixing that.

Extension code that uses the Item type can change its code to use
void * to be backward compatible.
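
As a rough illustration of the void * plus size_t idiom, here is a
minimal sketch.  The function name and its parameters are hypothetical
and are not part of the page API; the point is only that the payload
can now be declared "const void *".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy "size" bytes of an item into a page buffer
 * at "offset".  Returns the number of bytes copied, or 0 if the item
 * does not fit.  Because the item is a plain pointer rather than a
 * typedef that hides the pointer, it can be qualified as pointing to
 * const data.
 */
size_t
add_item_sketch(char *page, size_t page_size, size_t offset,
                const void *item, size_t size)
{
    assert(page != NULL && item != NULL);
    if (offset > page_size || size > page_size - offset)
        return 0;
    memcpy(page + offset, item, size);
    return size;
}
```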

Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
2025-10-27 09:55:59 +01:00
Fujii Masao
f33e60a53a Make invalid primary_slot_name follow standard GUC error reporting.
Previously, if primary_slot_name was set to an invalid slot name and
the configuration file was reloaded, both the postmaster and all other
backend processes reported a WARNING. With many processes running,
this could produce a flood of duplicate messages. The problem was that
the GUC check hook for primary_slot_name reported errors at WARNING
level via ereport().

This commit changes the check hook to use GUC_check_errdetail() and
GUC_check_errhint() for error reporting. As with other GUC parameters,
this causes non-postmaster processes to log the message at DEBUG3,
so by default, only the postmaster's message appears in the log file.

Backpatch to all supported versions.

Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com
Backpatch-through: 13
2025-10-22 20:09:43 +09:00
David Rowley
2470ca435c Use CompactAttribute more often, when possible
5983a4cff added CompactAttribute for storing commonly used fields from
FormData_pg_attribute.  5983a4cff didn't go to the trouble of adjusting
every location where we can use CompactAttribute rather than
FormData_pg_attribute, so here we change the remaining ones.

There are some locations where I've left the code using
FormData_pg_attribute.  These are mostly in the ALTER TABLE code.  Using
CompactAttribute here seems more risky as often the TupleDesc is being
changed and those changes may not have been flushed to the
CompactAttribute yet.

I've also left record_recv(), record_send(), record_cmp(), record_eq()
and record_image_eq() alone as it's not clear to me that accessing the
CompactAttribute is a win here due to the FormData_pg_attribute still
having to be accessed for most cases.  Switching the relevant parts to
use CompactAttribute would result in having to access both for common
cases.  Careful benchmarking may reveal that something can be done to
make this better, but in absence of that, the safer option is to leave
these alone.

In ReorderBufferToastReplace(), there was a check to skip attnums < 0
while looping over the TupleDesc.  Doing this is redundant since
TupleDescs don't store < 0 attnums.  Removing that code allows us to
move to using CompactAttribute.

The change in validateDomainCheckConstraint() just moves fetching the
FormData_pg_attribute into the ERROR path, which is cold due to calling
errstart_cold() and results in code being moved out of the common path.

Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com
2025-10-22 11:36:26 +13:00
Nathan Bossart
e94a7afe44 Re-pgindent brin.c.
Backpatch-through: 13
2025-10-21 09:56:26 -05:00
David Rowley
9fd29d7ff4 Fix BRIN 32-bit counter wrap issue with huge tables
A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange
to when the table contains close to 2^32 pages.  At worst, this could
result in a cancellable infinite loop during the BRIN index scan with
power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and
scanning of unneeded heap blocks for non power-of-2 pagesPerRange.
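
The hazard can be illustrated with a small sketch.  This is not the
committed fix; the function names are hypothetical, and widening the
loop counter to 64 bits is shown only as one common way to avoid the
wraparound.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* stand-in for PostgreSQL's BlockNumber */

/*
 * The hazard: adding pages_per_range in 32 bits wraps modulo 2^32, so
 * the "next range start" can come out smaller than the current block.
 */
BlockNumber
next_range_start_32(BlockNumber blk, BlockNumber pages_per_range)
{
    return blk + pages_per_range;
}

/*
 * Hypothetical helper: count the page ranges in [start, nblocks).  The
 * loop counter is 64 bits wide, so "blk += pages_per_range" cannot wrap
 * even when nblocks is close to 2^32.
 */
uint64_t
count_ranges(BlockNumber start, BlockNumber nblocks,
             BlockNumber pages_per_range)
{
    uint64_t    nranges = 0;

    assert(pages_per_range > 0);
    for (uint64_t blk = start; blk < nblocks; blk += pages_per_range)
        nranges++;
    return nranges;
}
```

With a 32-bit counter and a power-of-2 pages_per_range, the counter in
a loop like the one in count_ranges() would wrap to zero near the end
of a huge table and the loop would never reach its exit condition.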

Backpatch to all supported versions.

Author: sunil s <sunilfeb26@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com
Backpatch-through: 13
2025-10-21 20:46:14 +13:00
Peter Eisentraut
dd3ae37830 Add log_autoanalyze_min_duration
The log output functionality of log_autovacuum_min_duration applies to
both VACUUM and ANALYZE, so it is not possible to separate the VACUUM
and ANALYZE log output thresholds. Logs are likely to be output only for
VACUUM and not for ANALYZE.

Therefore, we decided to separate the threshold for log output of VACUUM
by autovacuum (log_autovacuum_min_duration) and the threshold for log
output of ANALYZE by autovacuum (log_autoanalyze_min_duration).

Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Kasahara Tatsuhito <kasaharatt@oss.nttdata.com>
Discussion: https://www.postgresql.org/message-id/flat/CAOzEurQtfV4MxJiWT-XDnimEeZAY+rgzVSLe8YsyEKhZcajzSA@mail.gmail.com
2025-10-15 14:31:12 +02:00
Melanie Plageman
3e4705484e Make heap_page_is_all_visible independent of LVRelState
This function only requires a few fields from LVRelState, so pass them
in individually.

This change allows calling heap_page_is_all_visible() from code such as
pruneheap.c, which does not have access to an LVRelState.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 17:43:41 -04:00
Melanie Plageman
43b05b38ea Inline TransactionIdFollows/Precedes[OrEquals]()
These functions appeared prominently in a profile of a patch that sets
the visibility map on-access. Inline them to remove call overhead and
make them cheaper to use in hot paths.
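
For context, the comparison these functions perform is modulo-2^32
arithmetic on transaction IDs, which is small enough to be a good
inlining candidate.  The sketch below is a simplified illustration and
not the committed code: the names are hypothetical, and
FIRST_NORMAL_XID is assumed to mirror FirstNormalTransactionId.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId; /* stand-in for PostgreSQL's type */

/* Assumed to mirror FirstNormalTransactionId; IDs below it are special. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/*
 * Returns true if id1 logically precedes id2.  Special (non-normal) IDs
 * are compared as plain unsigned values.  Normal IDs are compared by
 * the sign of their 32-bit difference, which handles wraparound.  The
 * cast to int32_t relies on two's complement conversion.
 */
static inline bool
xid_precedes_sketch(TransactionId id1, TransactionId id2)
{
    int32_t     diff;

    if (id1 < FIRST_NORMAL_XID || id2 < FIRST_NORMAL_XID)
        return id1 < id2;

    diff = (int32_t) (id1 - id2);
    return diff < 0;
}
```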

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 17:03:48 -04:00
Melanie Plageman
c8dd6542ba Add helper for freeze determination to heap_page_prune_and_freeze
After scanning the line pointers on a heap page during the first phase
of vacuum, we use the information collected to decide whether to use
the assembled freeze plans.

Move this decision logic into a helper function to improve readability.

While here, rename a PruneState member and disambiguate some local
variables in heap_page_prune_and_freeze().

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/2wk7jo4m4qwh5sn33pfgerdjfujebbccsmmlownybddbh6nawl%40mdyyqpqzxjek
2025-10-14 15:08:50 -04:00
Melanie Plageman
add323da40 Eliminate XLOG_HEAP2_VISIBLE from vacuum phase III
Instead of emitting a separate XLOG_HEAP2_VISIBLE WAL record for each
page that becomes all-visible in vacuum's third phase, specify the VM
changes in the already emitted XLOG_HEAP2_PRUNE_VACUUM_CLEANUP record.

Visibility checks are now performed before marking dead items unused.
This is safe because the heap page is held under exclusive lock for the
entire operation.

This reduces the number of WAL records generated by VACUUM phase III by
up to 50%.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-10-13 18:01:06 -04:00
Peter Geoghegan
7a662a46eb Remove unused nbtree array advancement variable.
Remove a variable that is no longer in use following commit 9a2e2a28.
It's not immediately clear why there were no compiler warnings about
this oversight.

Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
2025-10-12 14:04:08 -04:00
Peter Geoghegan
843e50208a Remove overzealous _bt_killitems assertion.
An assertion in _bt_killitems expected the scan's currPos state to
contain a valid LSN, saved from when currPos's page was initially read.
The assertion failed to account for the fact that even logged relations
can have leaf pages with an invalid LSN when built with wal_level set to
"minimal".  Remove the faulty assertion.

Oversight in commit e6eed40e (though note that the assertion was
backpatched to stable branches before 18 by commit 7c319f54).

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Matthijs van der Vleuten <postgresql@zr40.nl>
Bug: #19082
Discussion: https://postgr.es/m/19082-628e62160dbbc1c1@postgresql.org
Backpatch-through: 13
2025-10-10 14:52:25 -04:00
Michael Paquier
3a36543d7d Fix two typos in xlogstats.h and xlogstats.c
Issue found while browsing this area of the code, introduced and
copy-pasted around by 2258e76f90.

Backpatch-through: 15
2025-10-10 11:51:45 +09:00
Melanie Plageman
d96f87332b Eliminate COPY FREEZE use of XLOG_HEAP2_VISIBLE
Instead of emitting a separate WAL XLOG_HEAP2_VISIBLE record for setting
bits in the VM, specify the VM block changes in the
XLOG_HEAP2_MULTI_INSERT record.

This halves the number of WAL records emitted by COPY FREEZE.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
2025-10-09 16:29:01 -04:00
Andres Freund
3baae90013 bufmgr: fewer calls to BufferDescriptorGetContentLock
We're planning to merge buffer content locks into BufferDesc.state. To reduce
the size of that patch, centralize calls to BufferDescriptorGetContentLock().

The biggest part of the change is in assertions, by introducing
BufferIsLockedByMe[InMode]() (and removing BufferIsExclusiveLocked()). This
seems like an improvement even without aforementioned plans.

Additionally replace some direct calls to LWLockAcquire() with calls to
LockBuffer().

Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff
2025-10-08 16:06:19 -04:00
Michael Paquier
138da727a1 Improve description of some WAL records for GIN
The following information is added in the description of some GIN
records:
- In INSERT_LISTPAGE, the number of tuples and the right link block.
- In UPDATE_META_PAGE, the number of tuples, the previous tail block,
and the right link block.
- In SPLIT, the left and right children blocks.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
2025-10-08 14:02:26 +09:00
Amit Kapila
035b09131d Fix typo in function header comment.
Reported-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+TgmoZYh_nw-2j_Fi9y6ZAvrpN+W1aSOFNM7Rus2Q-zTkCsQw@mail.gmail.com
2025-10-08 03:17:05 +00:00
Masahiko Sawada
771cfe22a0 Avoid unnecessary GinFormTuple() calls for incompressible posting lists.
Previously, we attempted to form a posting list tuple even when
ginCompressPostingList() failed to compress the posting list due to
its size. While there was no functional failure, it always wasted one
GinFormTuple() call when item pointers didn't fit in a posting list
tuple.

This commit ensures that a GIN index tuple is formed only when all
item pointers in the posting list are successfully compressed.

Author: Arseniy Mukhin <arseniy.mukhin.dev@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAE7r3M+C=jcpTD93f_RBHrQp3C+=TAXFs+k4tTuZuuxboK8AvA@mail.gmail.com
2025-10-06 14:02:01 -07:00
Michael Paquier
7072a8855e Remove block information from description of some WAL records for GIN
The WAL records XLOG_GIN_INSERT and XLOG_GIN_VACUUM_DATA_LEAF_PAGE
included some information about the blocks added to the record.

This information is already provided by XLogRecGetBlockRefInfo() with
much more details about the blocks included in each record, like the
compression information, for example.  This commit removes the block
information that existed in the record descriptions specific to GIN.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgk=9WRoXhZy5fdk+T1hiau7qbL_vn94w_L1N=gtEdbsg@mail.gmail.com
2025-10-06 16:14:59 +09:00
Álvaro Herrera
1a8b5b11e4 Don't include access/htup_details.h in executor/tuptable.h
This is not at all needed; I suspect it was a simple mistake in commit
5408e233f0.  It causes htup_details.h to bleed into a huge number of
places via execnodes.h.  Remove it and fix fallout.

Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql
2025-10-05 18:00:38 +02:00
Álvaro Herrera
1b6f61bd89 Don't include execnodes.h in brin.h or gin.h
These headers don't need execnodes.h for anything.  I think they never
have.

Discussion: https://postgr.es/m/202510021240.ptc2zl5cvwen@alvherre.pgsql
2025-10-05 17:35:25 +02:00
John Naylor
54ab748651 Fix reuse-after-free hazard in dead_items_reset
In a similar vein to commit ccc8194e42, a reset instance of a shared
memory TID store happened to occupy the same private memory as the old
one for the entry point, since the chunk freed after the last round
of index vacuuming was put on the context's freelist. The failure
to update the vacrel->dead_items pointer was evident by nudging the
system to allocate memory in a different area. This was not discovered
at the time of the earlier commit since our regression tests didn't
cover multiple index passes with parallel vacuum.

Backpatch to v17, when TidStore came in.

Author: Kevin Oommen Anish <kevin.o@zohocorp.com>
Reviewed-by: Richard Guo <guofenglinux@gmail.com>
Tested-by: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/199a07cbdfc.7a1c4aac25838.1675074408277594551%40zohocorp.com
Backpatch-through: 17
2025-10-03 16:05:02 +07:00
Michael Paquier
3f431109dc Remove useless pointer update in ginxlog.c
Oversight in 2c03216d83, when the redo code of GIN got refactored for
the new WAL format where block information has been standardized, as the
payload data got tracked for each block after the change, and not in the
whole record.  This is just a cleanup.

Author: Kirill Reshke <reshkekirill@gmail.com>
Reviewed-by: Andrey Borodin <x4mmm@yandex-team.ru>
Discussion: https://postgr.es/m/CALdSSPgnAt5L=D_xGXRXLYO5FK1H31_eYEESxdU1n-r4g+6GqA@mail.gmail.com
2025-10-02 17:16:20 +09:00
Michael Paquier
bb68cde413 Reorder XLogNeedsFlush() checks to be more consistent
During recovery, XLogNeedsFlush() checks the minimum recovery LSN point
instead of the flush LSN point.  The same condition checks are used when
updating the minimum recovery point in UpdateMinRecoveryPoint(), but are
written in reverse order.

This commit makes the order of the checks consistent between
XLogNeedsFlush() and UpdateMinRecoveryPoint(), improving the code
clarity.  Note that the second check (as ordered by this commit) relies
on InRecovery, which is true only in the startup process.  So this makes
XLogNeedsFlush() cheaper in the startup process with the first check
acting as a shortcut while doing crash recovery, where
LocalMinRecoveryPoint is an invalid LSN.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Discussion: https://postgr.es/m/aMIHNRTP6Wj6vw1s%40paquier.xyz
2025-09-30 09:38:32 +09:00
Michael Paquier
85e0ff62b6 Improve stability of btree page split on ERRORs
This improves the stability of VACUUM when processing btree indexes,
which was previously able to trigger an assertion failure in
_bt_lock_subtree_parent() when an error was previously thrown outside
the scope of _bt_split() when splitting a btree page.  VACUUM would
consider the index as in a corrupted state as the right page would not
be zeroed for the error thrown (allocation failure is one pattern).

In a non-assert build, VACUUM is able to succeed, reporting what it sees
as a corruption while attempting to fix the index.  This would manifest
as a LOG message such as:
LOG: failed to re-find parent key in index "idx" for deletion target
page N
CONTEXT:  while vacuuming index "idx" of relation "public.tab"

This commit improves the code to rely on two PGAlignedBlocks that are
used as a temporary space for the left and right pages.  The main change
concerns the right page, whose contents are now copied into the
"temporary" PGAlignedBlock page while its original space is zeroed.  Its
contents are moved from the PGAlignedBlock page back to the page once we
enter the critical section used for the split.  This simplifies the
split logic, as it is not necessary to zero the right page before
throwing an error anymore.  Hence errors can now be thrown outside the
split code.  For the left page, this shaves one allocation, with
PageGetTempPage() being previously used.

The previous logic originates from commit 8fa30f906b, at a point where
PGAlignedBlock did not exist yet.  This could be argued as something
that should be backpatched, but the lack of complaints indicates that it
may not be necessary.

Author: Konstantin Knizhnik <knizhnik@garret.ru>
Discussion: https://postgr.es/m/566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
2025-09-26 08:41:06 +09:00
Daniel Gustafsson
0b3ce7878a Remove preprocessor guards from injection points
When defining an injection point there is no need to wrap the definition
with USE_INJECTION_POINT guards, since the INJECTION_POINT macro is
available in all builds.  Remove them to make the code consistent.

Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
Discussion: https://postgr.es/m/OSCPR01MB14966C8015DEB05ABEF2CE077F51FA@OSCPR01MB14966.jpnprd01.prod.outlook.com
Backpatch-through: 17
2025-09-25 15:27:33 +02:00
Álvaro Herrera
7e638d7f50 Don't include execnodes.h in replication/conflict.h
... which silently propagates a lot of headers into many places
via pgstat.h, as evidenced by the variety of headers that this patch
needs to add to seemingly random places.  Add a minimum of typedefs to
conflict.h to be able to remove execnodes.h, and fix the fallout.

Backpatch to 18, where conflict.h first appeared.

Discussion: https://postgr.es/m/202509191927.uj2ijwmho7nv@alvherre.pgsql
2025-09-25 14:52:41 +02:00
Álvaro Herrera
81fc3e28e3 Update some more forward declarations to use typedef
As commit d4d1fc527b.

Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/202509191025.22agk3fvpilc@alvherre.pgsql
2025-09-25 14:33:19 +02:00
Melanie Plageman
ae8ea7278c Correct prune WAL record opcode name in comment
f83d709760 incorrectly refers to an XLOG_HEAP2_PRUNE_FREEZE WAL record
opcode. No such code exists. The relevant opcodes are
XLOG_HEAP2_PRUNE_ON_ACCESS, XLOG_HEAP2_PRUNE_VACUUM_SCAN, and
XLOG_HEAP2_PRUNE_VACUUM_CLEANUP. Correct it.

Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/yn4zp35kkdsjx6wf47zcfmxgexxt4h2og47pvnw2x5ifyrs3qc%407uw6jyyxuyf7
2025-09-24 12:29:56 -04:00
Fujii Masao
7fcb32ad02 Fix incorrect and inconsistent comments in tableam.h and heapam.c.
This commit corrects several issues in function comments:

* The parameter "rel" was incorrectly referred to as "relation" in the comments
   for table_tuple_delete(), table_tuple_update(), and table_tuple_lock().
* In table_tuple_delete(), "changingPart" was listed as an output parameter
   in the comments but is actually input.
* In table_tuple_update(), "slot" was listed as an input parameter
   in the comments but is actually output.
* The comment for "update_indexes" in table_tuple_update() was mis-indented.
* The comments for heap_lock_tuple() incorrectly referenced a non-existent
   "tid" parameter.

Author: Chao Li <lic@highgo.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAEoWx2nB6Ay8g=KEn7L3qbYX_4+sLk9XOMkV0XZqHR4cTY8ZvQ@mail.gmail.com
2025-09-25 00:51:59 +09:00
Peter Eisentraut
a5b35fcedb Remove PointerIsValid()
This doesn't provide any value over the standard style of checking the
pointer directly or comparing against NULL.

Also remove related:
- AllocPointerIsValid() [unused]
- IndexScanIsValid() [had one user]
- HeapScanIsValid() [unused]
- InvalidRelation [unused]

Leaving HeapTupleIsValid(), ItemIdIsValid(), PortalIsValid(),
RelationIsValid for now, to reduce code churn.

Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/ad50ab6b-6f74-4603-b099-1cd6382fb13d%40eisentraut.org
Discussion: https://www.postgresql.org/message-id/CA+hUKG+NFKnr=K4oybwDvT35dW=VAjAAfiuLxp+5JeZSOV3nBg@mail.gmail.com
Discussion: https://www.postgresql.org/message-id/bccf2803-5252-47c2-9ff0-340502d5bd1c@iki.fi
2025-09-24 15:17:20 +02:00
Michael Paquier
deb208df45 Make XLogFlush() and XLogNeedsFlush() decision-making more consistent
When deciding which code path to use depending on the state of recovery,
XLogFlush() and XLogNeedsFlush() have been relying on different
criteria:
- XLogFlush() relied on XLogInsertAllowed().
- XLogNeedsFlush() relied on RecoveryInProgress().

Currently, the checkpointer is allowed to insert WAL records while
RecoveryInProgress() returns true for an end-of-recovery checkpoint,
where XLogInsertAllowed() matters.  Using RecoveryInProgress() in
XLogNeedsFlush() did not really matter for its existing callers, as the
checkpointer only called XLogFlush().  However, a feature under
discussion, by Melanie Plageman, needs XLogNeedsFlush() to be able to
work in more contexts, the end-of-recovery checkpoint being one.

This commit changes XLogNeedsFlush() to use XLogInsertAllowed() instead
of RecoveryInProgress(), making the checks in both routines more
consistent.  While on it, an assertion based on XLogNeedsFlush() is
added at the end of XLogFlush(), triggered when flushing a physical
position (not for the normal recovery path that checks for updates of
the minimum recovery point).  This assertion would fail for example in
the recovery test 015_promotion_pages if XLogNeedsFlush() is changed to
use RecoveryInProgress().  This should be hopefully enough to ensure
that the checks done in both routines remain consistent.

Author: Melanie Plageman <melanieplageman@gmail.com>
Co-authored-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAAKRu_a1vZRZRWO3_jv_X13RYoqLRVipGO0237g5PKzPa2YX6g@mail.gmail.com
2025-09-19 13:47:28 +09:00
Peter Geoghegan
7d9cd2df5f Teach nbtree to avoid evaluating row compare keys.
Add logic to _bt_set_startikey that determines whether row compare keys
are guaranteed to be satisfied by every tuple on a page that is about to
be read by _bt_readpage.  This works in essentially the same way as the
existing scalar inequality logic.  Testing has shown that the new logic
improves performance to about the same degree as the existing scalar
inequality logic (compared to the unoptimized case).  In other words,
the new logic makes many row compare scans significantly faster.

Note that the new row compare inequality logic is only effective when
the same individual row member is the deciding subkey for all tuples on
the page (obviously, all tuples have to satisfy the row compare, too).
This is what makes the new row compare logic very similar to the
existing logic for scalar inequalities.  Note, in particular, that this
makes it safe to ignore whether all row compare members are against
either ASC or DESC index attributes (i.e. it doesn't matter if
individual subkeys don't all use the same inequality strategy).
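
As a rough illustration of what "deciding subkey" means here, the
following standalone C sketch evaluates a row comparison such as
(a, b) > (x, y) over plain integers.  It is not the nbtree code: the
name row_compare_gt and its parameters are hypothetical, and NULL
handling and per-column ASC/DESC strategies are left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical sketch, not PostgreSQL code: evaluate ROW(tuple) > ROW(bound)
 * for nmembers integer columns.  The first member whose values differ decides
 * the result, and its index is returned through *deciding.  If every member
 * is equal, the last member decides (and "greater than" is false).
 */
bool
row_compare_gt(const int *tuple, const int *bound, size_t nmembers,
               size_t *deciding)
{
    for (size_t i = 0; i < nmembers; i++)
    {
        if (tuple[i] != bound[i] || i == nmembers - 1)
        {
            *deciding = i;
            return tuple[i] > bound[i];
        }
    }

    /* Only reached when nmembers == 0. */
    *deciding = 0;
    return false;
}
```

With a bound of (5, 10), the tuples (7, 1) and (9, 3) both satisfy the
comparison and are both decided by the first member, which is the kind
of page-wide uniformity described above.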

Also stop refusing to set pstate.startikey to an offset beyond any
nonrequired key (don't add logic that'll do that for an individual row
compare subkey, either).  We can fully rely on our firstchangingattnum
tests instead.  This will do the right thing when a page has a group of
tuples with NULLs in a lower-order attribute that makes the tuples fail
to satisfy a row compare key -- we won't incorrectly conclude that all
tuples must satisfy the row compare, just because firsttup and lasttup
happen to.  Our firstchangingattnum test prevents that from happening.
(Note that the original "avoid evaluating nbtree scan keys" mechanism
added by commit e0b1ee17 couldn't support row compares due to issues
with tuples that contain NULLs in a lower-order subkey's attribute.
That original mechanism relied on requiredness markings, which the
replacement _bt_set_startikey mechanism never really needed.)

Follow up to commit 8a510275, which added the _bt_set_startikey
optimization.  _bt_set_startikey is now feature complete; there's no
remaining kind of nbtree scan key that it still doesn't support.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznL6Z3H_GTQze9d8T_Ls=cYbnd-_9f-Jo7aYgTGRUD58g@mail.gmail.com
2025-09-15 16:56:49 -04:00
Peter Eisentraut
748caa9dcb Some stylistic improvements in toast_save_datum()
Move some variables to a smaller scope.  Initialize chunk_data before
storing a pointer to it; this avoids compiler warnings on clang-21, or
alternatively having to work around them by initializing it to zero
before the variable is used (as was done in commit e92677e863).

Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org
2025-09-15 07:43:23 +02:00
Peter Geoghegan
454c046094 nbtree: Always set skipScan flag on rescan.
The TimescaleDB extension expects to be able to change an nbtree scan's
keys across rescans.  The issue arises in the extension's implementation
of loose index scan.  This is arguably a misuse of the index AM API,
though apparently it worked until recently.  It stopped working when the
skipScan flag was added to BTScanOpaqueData by commit 8a510275, though.
The flag wouldn't reliably track whether the scan (actually, the current
rescan) has any skip arrays, leading to confusion in _bt_set_startikey.

nbtree preprocessing will now defensively initialize the scan's skipScan
flag in all cases, including the case where _bt_preprocess_array_keys
returns early due to the (re)scan not using arrays.  While nbtree isn't
obligated to support this use case (at least not according to my reading
of the index AM API), it still seems like a good idea to be consistent
here, on general robustness grounds.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/CAJumhcirfMojbk20+W0YimbNDkwdECvJprQGQ-XqK--ph09nQw@mail.gmail.com
Backpatch-through: 18
2025-09-13 21:01:33 -04:00
Nathan Bossart
7e9c216b52 Re-pgindent nbtpreprocesskeys.c after commit 796962922e.
Backpatch-through: 18
2025-09-13 14:50:02 -05:00
Peter Geoghegan
796962922e Always commute strategy when preprocessing DESC keys.
A recently added nbtree preprocessing step failed to account for the
fact that DESC columns already had their B-Tree strategy number commuted
at this point in preprocessing.  As a result, preprocessing could output
a set of scan keys where one or more keys had the correct strategy
number, but used the wrong comparison routine.

To fix, make the faulty code path that looks up a more restrictive
replacement operator/comparison routine commute its requested inequality
strategy (while outputting the transformed strategy number as before).
This makes the final transformed scan key comport with the approach
preprocessing has always used to deal with DESC columns (which is
described by comments above _bt_fix_scankey_strategy).
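
For readers unfamiliar with commuting, the standalone C sketch below
shows the idea: each inequality strategy maps to its mirror image,
while equality maps to itself.  The enum and function names are
hypothetical stand-ins chosen for this sketch; only the 1..5 numbering
is meant to mirror PostgreSQL's B-Tree strategy numbers.

```c
/*
 * Hypothetical stand-ins that mirror the B-Tree strategy numbering
 * (less = 1 ... greater = 5).  These are not the PostgreSQL definitions.
 */
enum sketch_strategy
{
    SK_LESS = 1,
    SK_LESS_EQUAL = 2,
    SK_EQUAL = 3,
    SK_GREATER_EQUAL = 4,
    SK_GREATER = 5,
    SK_MAX = 5
};

/*
 * Commute a strategy number: "<" becomes ">", "<=" becomes ">=", and
 * "=" stays "=".  On a DESC column the comparison routine must be looked
 * up with the commuted strategy, which is the step this commit fixes.
 */
int
sketch_commute_strategy(int strat)
{
    return SK_MAX + 1 - strat;
}
```

Commuting twice returns the original strategy, so a code path that
commutes a strategy that was already commuted ends up pairing a key
with the wrong comparison routine.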

Oversight in commit b3f1a13f, which made nbtree preprocessing
perform transformations on skip array inequalities that can reduce the
total number of index searches.

Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Natalya Aksman <natalya@timescale.com>
Discussion: https://postgr.es/m/19049-b7df801e71de41b2@postgresql.org
Backpatch-through: 18
2025-09-12 13:23:00 -04:00
Peter Eisentraut
e92677e863 Silence compiler warnings on clang 21
Clang 21 shows some new compiler warnings, for example:

warning: variable 'dstsize' is uninitialized when passed as a const pointer argument here [-Wuninitialized-const-pointer]

The fix is to initialize the variables when they are defined.  This is
similar to, for example, the existing situation in gistKeyIsEQ().
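
A minimal sketch of the pattern, assuming a helper that takes a const
pointer (peek_size and initialized_at_definition are hypothetical
names, not functions from the PostgreSQL tree):

```c
#include <stddef.h>

/* Hypothetical helper that only reads through its const pointer argument. */
size_t
peek_size(const size_t *size)
{
    return *size;
}

size_t
initialized_at_definition(void)
{
    /*
     * Initializing at the point of definition means the variable already
     * holds a defined value when its address is passed as a const pointer,
     * so the compiler has nothing to warn about.
     */
    size_t dstsize = 0;

    return peek_size(&dstsize);
}
```

Declaring dstsize without the initializer and passing &dstsize in the
same way is the shape of code that the warning quoted above is aimed
at.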

Discussion: https://www.postgresql.org/message-id/flat/6604ad6e-5934-43ac-8590-15113d6ae4b1%40eisentraut.org
2025-09-12 07:28:32 +02:00
Michael Paquier
528dadf691 Add more information for WAL records of hash index AMs
hashdesc.c was missing a couple of fields in its record descriptions,
namely:
- is_prev_bucket_same_wrt for SQUEEZE_PAGE.
- procid for INIT_META_PAGE.
- old_bucket_flag and new_bucket_flag for SPLIT_ALLOCATE_PAGE.

The author has noted the first hole, and I have spotted the others while
double-checking this area of the code.  Note that the only data missing
now are the offsets stored in VACUUM_ONE_PAGE.  We could perhaps add
them, if somebody sees value in this data, even if it makes the output
larger.  These are discarded here.

Author: Kirill Reshke <reshkekirill@gmail.com>
Discussion: https://postgr.es/m/CALdSSPjc-OVwtZH0Xrkvg7n=2ZwdbMJzqrm_ed_CfjiAzuKVGg@mail.gmail.com
2025-09-12 10:29:02 +09:00
Michael Paquier
8c8f7b199d Fix leak with SMgrRelations in startup process
The startup process does not process shared invalidation messages, only
sending them, and never calls AtEOXact_SMgr() which cleans up any
unpinned SMgrRelations.  Hence, it is never able to free SMgrRelations
on a periodic basis, bloating its hashtable over time.

Like the checkpointer and the bgwriter, this commit takes a conservative
approach by periodically freeing SMgrRelations when replaying a
checkpoint record, either online or shutdown, so that the startup process
has a way to perform a periodic cleanup.

Issue caused by 21d9c3ee4e, so backpatch down to v17.

Author: Jingtang Zhang <mrdrivingduck@gmail.com>
Reviewed-by: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/28C687D4-F335-417E-B06C-6612A0BD5A10@gmail.com
Backpatch-through: 17
2025-09-10 07:23:05 +09:00
Melanie Plageman
8ec97e78a7 Add error codes when vacuum discovers VM corruption
Commit fd6ec93bf8 and other previous work established the
principle that when an error is potentially reachable in case of on-disk
corruption but is not expected to be reached otherwise,
ERRCODE_DATA_CORRUPTED should be used. This allows log monitoring
software to search for evidence of corruption by filtering on the error
code.

Enhance the existing log messages emitted when the heap page is found to
be inconsistent with the VM by adding this error code.
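
As a sketch of the kind of filtering this enables, the standalone C
function below flags log lines that carry SQLSTATE XX001, the code
that corresponds to ERRCODE_DATA_CORRUPTED.  It assumes the server is
configured so that the SQLSTATE appears in each log line (for example
via the %e escape in log_line_prefix); the function name is
hypothetical, and a real monitor would parse the prefix field rather
than search the whole line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical log filter: report whether a log line mentions SQLSTATE
 * XX001 (data_corrupted).  This is a plain substring search, so it assumes
 * the SQLSTATE was written into the line by the logging configuration.
 */
bool
line_reports_data_corruption(const char *log_line)
{
    if (log_line == NULL)
        return false;

    return strstr(log_line, "XX001") != NULL;
}
```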

Suggested-by: Andrey Borodin <x4mmm@yandex-team.ru>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/87DD95AA-274F-4F4F-BAD9-7738E5B1F905%40yandex-team.ru
2025-09-08 17:13:31 -04:00