Test failure on buildfarm member prion:
The test failed due to an unexpected LOCATION: line appearing between the
WARNING and ERROR messages. This occurred because the prion machine uses
log_error_verbosity = verbose, which includes additional context in error
messages. The test was originally checking for both WARNING and ERROR
messages in sequence sync, but the extra LOCATION: line disrupted this
pattern. To make the test robust across different verbosity settings, it
now only checks for the presence of the WARNING message after the test,
which is sufficient to validate the intended behavior.
Failure to sync sequences with quoted names:
The previous implementation did not correctly quote sequence names when
querying remote information, leading to failures when quoted sequence
names were used. This fix ensures that sequence names are properly quoted
during remote queries, allowing sequences with quoted identifiers to be
synced correctly.
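For illustration only, a minimal sketch of the kind of quoting involved
when building the remote query; the helper name and the selected columns
are made up, not the actual sync code:

    #include "postgres.h"

    #include "lib/stringinfo.h"
    #include "utils/ruleutils.h"

    /* Hypothetical helper: quote both schema and sequence names. */
    static char *
    build_remote_sequence_query(const char *nspname, const char *seqname)
    {
        StringInfoData cmd;

        initStringInfo(&cmd);
        appendStringInfo(&cmd,
                         "SELECT last_value, log_cnt, is_called FROM %s",
                         quote_qualified_identifier(nspname, seqname));
        return cmd.data;
    }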
Author: Vignesh C <vignesh21@gmail.com>
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CALDaNm0WcdSCoNPiE-5ek4J2dMJ5o111GPTzKCYj9G5i=ONYtQ@mail.gmail.com
Discussion: https://postgr.es/m/CAOzEurQOSN=Zcp9uVnatNbAy=2WgMTJn_DYszYjv0KUeQX_e_A@mail.gmail.com
03d40e4b5 added code to provide better row estimates for when a UNION
query ended up with only a single child due to the other children being
found to be dummy rels. In that case it would ordinarily be OK to call
estimate_num_groups() on the targetlist of the only child path; however,
that's not safe to do if the UNION child is itself the result of some
other set operation, since we generate targetlists containing Vars with
varno==0 for those, which estimate_num_groups() can't handle. This could
lead to:
ERROR: XX000: no relation entry for relid 0
Fix this by skipping the estimate when the only child is the result of
another set operation. In that case we fall back on the
assume-all-rows-are-unique method.
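A hedged sketch of the kind of guard involved; the helper and variable
names are invented and this is not the committed planner code:

    #include "postgres.h"

    #include "nodes/pathnodes.h"
    #include "utils/selfuncs.h"

    /* Hypothetical: only consult estimate_num_groups() for a plain child. */
    static double
    estimate_setop_child_groups(PlannerInfo *root, List *child_tlist_exprs,
                                double child_rows, bool child_is_setop)
    {
        if (child_is_setop)
            return child_rows;    /* assume all rows are unique */

        return estimate_num_groups(root, child_tlist_exprs, child_rows,
                                   NULL, NULL);
    }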
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/cfbc99e5-9d44-4806-ba3c-f36b57a85e21@gmail.com
Various places that were using StringInfo but didn't need that
StringInfo to exist beyond the scope of the function were using
makeStringInfo(), which allocates both a StringInfoData and the buffer it
uses as two separate allocations. It's more efficient for these cases to
use a StringInfoData on the stack and initialize it with initStringInfo(),
which only allocates the string buffer. This also simplifies the
cleanup in a few cases.
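As a made-up example of the pattern the callers now use (one heap
allocation for the buffer, none for the StringInfoData itself):

    #include "postgres.h"

    #include "lib/stringinfo.h"

    static void
    report_name(const char *name)
    {
        StringInfoData buf;        /* lives on the stack */

        initStringInfo(&buf);      /* allocates only buf.data */
        appendStringInfo(&buf, "processing \"%s\"", name);
        elog(DEBUG1, "%s", buf.data);
        pfree(buf.data);           /* nothing else to free */
    }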
Author: Mats Kindahl <mats.kindahl@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/4379aac8-26f1-42f2-a356-ff0e886228d3@gmail.com
generate_orderedappend_paths() does not handle the case where the paths
in total_subpaths and fractional_subpaths are identical. This situation
is not uncommon, and as a result the function may generate two exactly
identical ordered append paths.
Fix by checking whether total_subpaths and fractional_subpaths contain
the same paths, and skipping creation of the ordered append path for
the fractional case when they are identical.
Given the lack of field complaints about this, I'm a bit hesitant to
back-patch, but let's clean it up in HEAD.
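A sketch of the kind of pointer-wise comparison this implies; the helper
name is invented and this is not the committed code:

    #include "postgres.h"

    #include "nodes/pg_list.h"

    /* Hypothetical: do two subpath lists contain exactly the same paths? */
    static bool
    subpath_lists_identical(List *a, List *b)
    {
        ListCell   *la,
                   *lb;

        if (list_length(a) != list_length(b))
            return false;

        forboth(la, a, lb, b)
        {
            if (lfirst(la) != lfirst(lb))
                return false;
        }
        return true;
    }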
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Discussion: https://postgr.es/m/CAMbWs4-OYsgA75tGGiBARt87G0y_z_GBTSLrzudcJxAzndYkYw@mail.gmail.com
In generate_orderedappend_paths(), there is an assumption that a child
relation's row estimate is always greater than zero. There is an
Assert verifying this assumption, and the estimate is also used to
convert an absolute tuple count into a fraction.
However, this assumption is not always valid -- for example, upper
relations can have their row estimates unset, resulting in a value of
zero. This can cause an assertion failure in debug builds or lead to
the tuple fraction being computed as infinity in production builds.
To fix, use the row estimate from the cheapest_total path to compute
the tuple fraction. The row estimate in this path should already have
been forced to a valid value.
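A hedged sketch of the idea (not the committed code): derive the fraction
from the cheapest-total path's row estimate, which has already been
clamped to a sane, nonzero value.

    #include "postgres.h"

    #include "nodes/pathnodes.h"

    static double
    child_tuple_fraction(PlannerInfo *root, Path *cheapest_total)
    {
        double      path_fraction = root->tuple_fraction;

        /* Convert an absolute tuple count into a fraction of the child. */
        if (path_fraction >= 1.0)
            path_fraction /= cheapest_total->rows;

        return path_fraction;
    }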
In passing, update the comment for generate_orderedappend_paths() to
note that the function also considers the cheapest-fractional case
when not all tuples need to be retrieved. That is, it collects all
the cheapest fractional paths and builds an ordered append path for
each interesting ordering.
Backpatch to v18, where this issue was introduced.
Bug: #19102
Reported-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Kuntal Ghosh <kuntalghosh.2007@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/19102-93480667e1200169@postgresql.org
Backpatch-through: 18
This patch introduces sequence synchronization. Sequences that are synced
will have 2 states:
- INIT (needs [re]synchronizing)
- READY (is already synchronized)
A new sequencesync worker is launched as needed to synchronize sequences.
A single sequencesync worker is responsible for synchronizing all
sequences. It begins by retrieving the list of sequences that are flagged
for synchronization, i.e., those in the INIT state. These sequences are
then processed in batches, allowing multiple entries to be synchronized
within a single transaction. The worker fetches the current sequence
values and page LSNs from the remote publisher, updates the corresponding
sequences on the local subscriber, and finally marks each sequence as
READY upon successful synchronization.
Sequence synchronization occurs in 3 places:
1) CREATE SUBSCRIPTION
- The command syntax remains unchanged.
- The subscriber retrieves sequences associated with publications.
- Published sequences are added to pg_subscription_rel with INIT
state.
- Initiate the sequencesync worker to synchronize all sequences.
2) ALTER SUBSCRIPTION ... REFRESH PUBLICATION
- The command syntax remains unchanged.
- Dropped published sequences are removed from pg_subscription_rel.
- Newly published sequences are added to pg_subscription_rel with INIT
state.
- Initiate the sequencesync worker to synchronize only newly added
sequences.
3) ALTER SUBSCRIPTION ... REFRESH SEQUENCES
- A new command introduced for PG19 by f0b3573c3a.
- All sequences in pg_subscription_rel are reset to INIT state.
- Initiate the sequencesync worker to synchronize all sequences.
- Unlike the "ALTER SUBSCRIPTION ... REFRESH PUBLICATION" command,
  sequences are not added to or removed from pg_subscription_rel in
  this case.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
Previously, unnamed portals were kept until the next Bind message or the
end of the transaction. This could cause temporary files to persist
longer than expected and make logging not reflect the actual SQL
responsible for the temporary file.
This patch changes exec_execute_message() to drop the unnamed portal
immediately once it has been executed to completion at the end of an
Execute message, making its removal more aggressive. As a result,
temporary file cleanup happens at the same time as the completion of
the portal execution, and statement logging correctly reflects which
statements these temporary files were attached to (see the diffs in the
TAP test updated by this commit for an idea).
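A hedged sketch of the idea only; the helper name and placement are
illustrative, not the actual exec_execute_message() diff:

    #include "postgres.h"

    #include "utils/portal.h"

    static void
    maybe_drop_unnamed_portal(Portal portal, bool completed)
    {
        /* Unnamed portals have an empty name. */
        if (completed && portal->name[0] == '\0')
            PortalDrop(portal, false);
    }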
The documentation is updated to describe the lifetime of unnamed
portals, and test cases are updated to verify temporary file removal and
proper statement logging after unnamed portal execution. This changes
how unnamed portals are handled in the protocol, hence no backpatch is
done.
Author: Frédéric Yhuel <frederic.yhuel@dalibo.com>
Co-Authored-by: Sami Imseih <samimseih@gmail.com>
Co-Authored-by: Mircea Cadariu <cadariu.mircea@gmail.com>
Discussion: https://postgr.es/m/CAA5RZ0tTrTUoEr3kDXCuKsvqYGq8OOHiBwoD-dyJocq95uEOTQ%40mail.gmail.com
The comment for ChangeVarNodes() refers to a parameter named
change_RangeTblRef, which does not exist in the code.
The comment for ChangeVarNodesExtended() contains an extra space,
while the comment for replace_relid_callback() has an awkward line
break and a typo.
This patch fixes these issues and revises some sentences for smoother
wording.
Oversights in commits ab42d643c and fc069a3a6.
Author: Richard Guo <guofenglinux@gmail.com>
Discussion: https://postgr.es/m/CAMbWs480j16HC1JtjKCgj5WshivT8ZJYkOfTyZAM0POjFomJkg@mail.gmail.com
Backpatch-through: 18
These assertions may prove useful to make sure that no process other
than the startup process calls the routines where these checks are
added, as we expect these routines not to interfere with a WAL receiver
switched to a "stopping" state by the startup process.
The assumption that only the startup process can use this code has
existed for many years, without a check enforcing it.
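The shape of the added checks, roughly (the routine shown here is only a
placeholder):

    #include "postgres.h"

    #include "miscadmin.h"

    static void
    some_wal_source_routine(void)
    {
        /* Only the startup process is expected to get here. */
        Assert(AmStartupProcess());

        /* ... existing logic ... */
    }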
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/aQmGeVLYl51y1m_0@paquier.xyz
First, the assertions in assign_io_method() were the wrong way round. Second,
the lengthof() assertion checked the length of io_method_options, which is the
wrong array to check and is always longer than pgaio_method_ops_table.
While at it, add a static assert to ensure pgaio_method_ops_table and
io_method_options stay in sync.
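Roughly the shape of that compile-time check, placed next to the array
definitions in the aio code; the exact off-by-one relationship
(io_method_options carrying one extra terminating entry) is an assumption
here, not a statement about the committed condition:

    StaticAssertDecl(lengthof(io_method_options) ==
                     lengthof(pgaio_method_ops_table) + 1,
                     "io_method_options out of sync with pgaio_method_ops_table");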
Per Coverity and Tom Lane.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 18
In 2a0faed9d7, which added JIT compilation support for expressions, I
accidentally used sizeof(LLVMBasicBlockRef *) instead of
sizeof(LLVMBasicBlockRef) as part of computing the size of an allocation. That
turns out to have no real negative consequences due to LLVMBasicBlockRef being
a pointer itself (and thus having the same size). It still is wrong and
confusing, so fix it.
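The corrected pattern, shown as a standalone illustration (the helper is
made up): size the allocation by the element type, not by a pointer to it.

    #include "postgres.h"

    #include <llvm-c/Core.h>

    static LLVMBasicBlockRef *
    alloc_block_array(int nblocks)
    {
        /* sizeof(LLVMBasicBlockRef), not sizeof(LLVMBasicBlockRef *) */
        return palloc(sizeof(LLVMBasicBlockRef) * nblocks);
    }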
Reported by Coverity.
Backpatch-through: 13
Allow pg_newlocale_from_collation(C_COLLATION_OID) to work even if
there's no catalog access, which some extensions expect.
Not known to be a bug without extensions involved, but backport to 18.
Also corrects an issue in master with dummy_c_locale (introduced in
commit 5a38104b36) where deterministic was not set. That wasn't a bug,
but could have been if that structure was used more widely.
Reported-by: Alexander Kukushkin <cyberdemn@gmail.com>
Reviewed-by: Alexander Kukushkin <cyberdemn@gmail.com>
Discussion: https://postgr.es/m/CAFh8B=nj966ECv5vi_u3RYij12v0j-7NPZCXLYzNwOQp9AcPWQ@mail.gmail.com
Backpatch-through: 18
This commit adds CHECK_FOR_INTERRUPTS to the shared buffer iteration
loops in EvictRelUnpinnedBuffers and EvictAllUnpinnedBuffers. These
functions, used by pg_buffercache's pg_buffercache_evict_relation and
pg_buffercache_evict_all, can now be interrupted during long-running
operations.
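The pattern, in a made-up walker (not the pg_buffercache code itself):

    #include "postgres.h"

    #include "miscadmin.h"

    static void
    walk_shared_buffers(int nbuffers)
    {
        for (int buf = 1; buf <= nbuffers; buf++)
        {
            /* allow query cancel / termination during a long scan */
            CHECK_FOR_INTERRUPTS();

            /* ... inspect or evict buffer "buf" ... */
        }
    }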
Backpatch to version 18, where these functions and their corresponding
pg_buffercache functions were introduced.
Author: Yuhang Qiu <iamqyh@gmail.com>
Discussion: https://postgr.es/m/8DC280D4-94A2-4E7B-BAB9-C345891D0B78%40gmail.com
Backpatch-through: 18
03d40e4b5 allowed dummy UNION [ALL] children to be removed from the plan
by checking for is_dummy_rel(). That commit neglected to still account
for the relids from the dummy rel, which are needed so that the correct
UPPERREL_SETOP RelOptInfo can be found and have the Paths added to it.
Not doing this could result in processing of subsequent UNIONs using the
same RelOptInfo as a previously processed UNION, which could lead to
add_path() freeing old Paths that are still needed by the previous UNION.
The same fix was independently submitted (2 mins later) by Richard Guo.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/bee34aec-659c-46f1-9ab7-7bbae0b7616c@gmail.com
Commit a95e3d84c0 added ActiveSnapshot push+pop when processing
work-items (BRIN autosummarization), but forgot to handle the case of
a transaction failing during the run, which releases the snapshot
earlier than expected.
Fix by making the pop conditional on a snapshot actually being there.
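In other words (sketch only, with made-up framing):

    #include "postgres.h"

    #include "utils/snapmgr.h"

    static void
    pop_snapshot_if_any(void)
    {
        /* an error during the work-item may already have released it */
        if (ActiveSnapshotSet())
            PopActiveSnapshot();
    }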
Author: Álvaro Herrera <alvherre@kurilemu.de>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511041648.nofajnuddmwk@alvherre.pgsql
The parallel GIN builds perform "freezing" of TID lists when merging
chunks built earlier. This means determining what part of the list can
no longer change, depending on the last received chunk. The frozen part
can be evicted from memory and written out.
The code attempted to freeze items right before merging the old and new
TID list, after already attempting to trim the current buffer. That
means part of the data may get frozen based on the new TID list, but
will be trimmed later (on next loop). This increases memory usage.
This inverts the order, so that we freeze data first (before trimming).
The benefits are likely relatively small, but it's also virtually free
with no other downsides.
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
When building intermediate TID lists during parallel GIN builds, split
the sorted lists into smaller chunks, to limit the amount of memory
needed when merging the chunks later.
The leader may need to keep in memory up to one chunk per worker, and
possibly one extra chunk (before evicting some of the data). The code
processing item pointers uses regular palloc/repalloc calls, which means
it's subject to the MaxAllocSize (1GB) limit.
We could fix this by allowing huge allocations, but that'd require
changes in many places without much benefit. Larger chunks do not
actually improve performance, so the memory usage would be wasted.
Fixed by limiting the chunk size to not hit MaxAllocSize. Each worker
gets a fair share.
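One plausible reading of the arithmetic, shown as a hypothetical helper
(not the committed formula):

    #include "postgres.h"

    #include "utils/memutils.h"    /* MaxAllocSize */

    /*
     * Keep each chunk small enough that "one chunk per worker, plus one
     * extra" still fits in a single palloc-able allocation.
     */
    static Size
    max_chunk_size(int nworkers)
    {
        return MaxAllocSize / (nworkers + 1);
    }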
This requires remembering the number of participating workers, in a
place that can be accessed from the callback. Luckily, the bs_worker_id
field in GinBuildState was unused, so repurpose that.
Report by Greg Smith, investigation and fix by me. Backpatched to 18,
where parallel GIN builds were introduced.
Reported-by: Gregory Smith <gregsmithpgsql@gmail.com>
Discussion: https://postgr.es/m/CAHLJuCWDwn-PE2BMZE4Kux7x5wWt_6RoWtA0mUQffEDLeZ6sfA@mail.gmail.com
Backpatch-through: 18
We have never had a SET syntax that allows setting a GUC_LIST_INPUT
parameter to be an empty list. A locution such as
SET search_path = '';
doesn't mean that; it means setting the GUC to contain a single item
that is an empty string. (For search_path the net effect is much the
same, because search_path ignores invalid schema names and '' must be
invalid.) This is confusing, not least because configuration-file
entries and the set_config() function can easily produce empty-list
values.
We considered making the empty-string syntax do this, but that would
foreclose ever allowing empty-string items to be valid in list GUCs.
While there isn't any obvious use-case for that today, it feels like
the kind of restriction that might hurt someday. Instead, let's
accept the forbidden-up-to-now value NULL and treat that as meaning an
empty list. (An objection to this could be "what if we someday want
to allow NULL as a GUC value?". That seems unlikely though, and even
if we did allow it for scalar GUCs, we could continue to treat it as
meaning an empty list for list GUCs.)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andrei Klychkov <andrew.a.klychkov@gmail.com>
Reviewed-by: Jim Jones <jim.jones@uni-muenster.de>
Discussion: https://postgr.es/m/CA+mfrmwsBmYsJayWjc8bJmicxc3phZcHHY=yW5aYe=P-1d_4bg@mail.gmail.com
It's possible to define BRIN indexes on functions that require a
snapshot to run, but the autosummarization feature introduced by commit
7526e10224 fails to provide one. This causes autovacuum to leave a
BRIN placeholder tuple behind after a failed work-item execution, making
such indexes less efficient. Repair by obtaining a snapshot prior to
running the task, and add a test to verify this behavior.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Reported-by: Giovanni Fabris <giovanni.fabris@icon.it>
Reported-by: Arthur Nascimento <tureba@gmail.com>
Backpatch-through: 13
Discussion: https://postgr.es/m/202511031106.h4fwyuyui6fz@alvherre.pgsql
When the startup process switches from streaming to archive as WAL
source, we avoid calling ShutdownWalRcv() if the WAL receiver is not
streaming, based on WalRcvStreaming(). WALRCV_STOPPING is a state set
by ShutdownWalRcv(), called only by the startup process, meaning that it
should not be possible to reach this state while in
WaitForWALToBecomeAvailable().
This commit adds an assertion to make sure that a WAL receiver is never
in a WALRCV_STOPPING state should the startup process attempt to reset
InstallXLogFileSegmentActive.
Idea suggested by Noah Misch.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Commit b4f584f9d2 (affecting v15~, later backpatched down to 13 as of
3635a0a35a) introduced an unconditional WAL receiver shutdown when
switching from streaming to archive WAL sources. This causes problems
during a timeline switch, when a WAL receiver enters WALRCV_WAITING
state but remains alive, waiting for instructions.
The unconditional shutdown can break some monitoring scenarios as the
WAL receiver gets repeatedly terminated and re-spawned, causing
pg_stat_wal_receiver.status to show a "streaming" instead of "waiting"
status, masking the fact that the WAL receiver is waiting for a new TLI
and a new LSN to be able to continue streaming.
This commit changes the WAL receiver behavior so that the shutdown
becomes conditional, with InstallXLogFileSegmentActive always being
reset to prevent the regression fixed by b4f584f9d2: only terminate the
WAL
receiver when it is actively streaming (WALRCV_STREAMING,
WALRCV_STARTING, or WALRCV_RESTARTING). When in WALRCV_WAITING state,
just reset the InstallXLogFileSegmentActive flag to allow archive
restoration without killing the process. WALRCV_STOPPED and
WALRCV_STOPPING are not reachable states in this code path. For the
latter, the startup process is the one in charge of setting
WALRCV_STOPPING via ShutdownWalRcv(), waiting for the WAL receiver to
reach a WALRCV_STOPPED state after switching walRcvState, so
WaitForWALToBecomeAvailable() cannot be reached while a WAL receiver is
in a WALRCV_STOPPING state.
A regression test is added to check that a WAL receiver is not stopped
on timeline jump; it fails when the fix in this commit is reverted.
Reported-by: Ryan Bird <ryanzxg@gmail.com>
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/19093-c4fff49a608f82a0@postgresql.org
Backpatch-through: 13
The order in this list was previously pretty random and had grown
organically over time. This made it unnecessarily cumbersome to
maintain these lists, as there was no clear guidelines about where to
put new entries. Also, after the merger of the type-specific GUC
structs, the list still reflected the previous type-specific
super-order.
By using alphabetical order, the place for new entries becomes clear,
and often related entries will be listed close together.
This patch reorders the existing entries in guc_parameters.dat, and it
also augments the generation script to error if an entry is found at
the wrong place.
Note: The order is actually checked after lower-casing, to handle the
likes of "DateStyle".
Reviewed-by: John Naylor <johncnaylorls@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
We've been nibbling away at removing uses of "long" for a long time,
since its width is platform-dependent. Here's one more: change the
remaining "long" fields in Plan nodes to Cardinality, since the three
surviving examples all represent group-count estimates. The upstream
planner code was converted to Cardinality some time ago; for example
the corresponding fields in Path nodes are type Cardinality, as are
the arguments of the make_foo_path functions. Downstream in the
executor, it turns out that these all feed to the table-size argument
of BuildTupleHashTable. Change that to "double" as well, and fix it
so that it safely clamps out-of-range values to the uint32 limit of
simplehash.h, as was not being done before.
Essentially, this is removing all the artificial datatype-dependent
limitations on these values from upstream processing, and applying
just one clamp at the moment where we're forced to do so by the
datatype choices of simplehash.h.
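The clamp amounts to something like this hypothetical helper (the name is
invented; the real clamp sits where the double estimate is handed to
simplehash.h):

    #include "postgres.h"

    /* Reduce a planner "double" estimate to simplehash.h's uint32 domain. */
    static uint32
    clamp_hashtable_size(double nentries)
    {
        if (nentries <= 1.0)
            return 1;
        if (nentries >= (double) PG_UINT32_MAX)
            return PG_UINT32_MAX;
        return (uint32) nentries;
    }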
Also, remove BuildTupleHashTable's misguided attempt to enforce
work_mem/hash_mem_limit. It doesn't have enough information
(particularly not the expected tuple width) to do that accurately,
and it has no real business second-guessing the caller's choice.
For all these plan types, it's really the planner's responsibility
to not choose a hashed implementation if the hashtable is expected
to exceed hash_mem_limit. The previous patch improved the
accuracy of those estimates, and even if BuildTupleHashTable had
more information it should arrive at the same conclusions.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
For several types of plan nodes that use TupleHashTables, the
planner estimated the expected size of the table as basically
numEntries * (MAXALIGN(dataWidth) + MAXALIGN(SizeofHeapTupleHeader)).
This is pretty far off, especially for small data widths, because
it doesn't account for the overhead of the simplehash.h hash table
nor for any per-tuple "additional space" the plan node may request.
Jeff Janes noted a case where the estimate was off by about a factor
of three, even though the obvious hazards such as inaccurate estimates
of numEntries or dataWidth didn't apply.
To improve matters, create functions provided by the relevant executor
modules that can estimate the required sizes with reasonable accuracy.
(We're still not accounting for effects like allocator padding, but
this at least gets the first-order effects correct.)
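As a rough illustration of the accounting involved (the function is
hypothetical and omits details such as allocator padding and the exact
simplehash overhead):

    #include "postgres.h"

    #include "access/htup_details.h"    /* SizeofHeapTupleHeader */

    static double
    guess_hashtable_size(double nentries, int tupleWidth,
                         Size additionalsize, Size bucketOverhead)
    {
        double      per_entry;

        /* hashed tuple + caller's extra space + simplehash bucket */
        per_entry = MAXALIGN(tupleWidth) + MAXALIGN(SizeofHeapTupleHeader) +
            MAXALIGN(additionalsize) + bucketOverhead;

        return nentries * per_entry;
    }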
I added functions that can estimate the tuple table sizes for
nodeSetOp and nodeSubplan; these rely on an estimator for
TupleHashTables in general, and that in turn relies on one for
simplehash.h hash tables. That feels like kind of a lot of mechanism,
but if we take any short-cuts we're violating modularity boundaries.
The other places that use TupleHashTables are nodeAgg, which took
pains to get its numbers right already, and nodeRecursiveunion.
I did not try to improve the situation for nodeRecursiveunion because
there's nothing to improve: we are not making an estimate of the hash
table size, and it wouldn't help us to do so because we have no
non-hashed alternative implementation. On top of that, our estimate
of the number of entries to be hashed in that module is so suspect
that we'd likely often choose the wrong implementation if we did have
two ways to do it.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Add comments explaining when and where it is safe for nbtree to treat
row compare keys as if they were simple scalar inequality keys on the
row's most significant column. This is particularly important within
_bt_advance_array_keys, which deals with required inequality keys in a
general and uniform way, without any special handling for row compares.
Also spell out the implications of _bt_check_rowcompare's approach of
_conditionally_ evaluating lower-order row compare subkeys, particularly
when one of its lower-order subkeys might see NULL index tuple values
(these may or may not affect whether the qual as a whole is satisfied).
The behavior in this area isn't particularly intuitive, so these issues
seem worth going into.
In passing, add a few more defensive/documenting row comparison related
assertions to _bt_first and _bt_check_rowcompare.
Follow-up to commits bd3f59fd and ec986020.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Victor Yegorov <vyegorov@gmail.com>
Reviewed-By: Chao Li <li.evan.chao@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wznwkak_K7pcAdv9uH8ZfNo8QO7+tHXOaCUddMeTfaCCFw@mail.gmail.com
Backpatch-through: 18
_bt_first reliably uses the same equality key (on each index column) for
initial positioning purposes as the one that _bt_checkkeys can use to
end the scan following commit f09816a0. _bt_first no longer applies its
own independent rules to determine which initial positioning key to use
on each column (for equality and inequality keys alike). Preprocessing
is now fully in control of determining which keys start and end each
scan, ensuring that _bt_first and _bt_checkkeys have symmetric behavior.
Remove obsolete comments that described why _bt_first was expected to
use at least one of the available required equality keys for initial
positioning purposes. The rules in this area are now maximally strict
and uniform, so there's no reason to draw attention to equality keys.
Any column with a required equality key cannot have a redundant required
inequality key (nor can it have a redundant required equality key).
Oversight in commit f09816a0, which removed similar comments from
_bt_first, but missed these comments.
Author: Peter Geoghegan <pg@bowt.ie>
Backpatch-through: 18
Several functions in the codebase accept "Datum *" parameters but do
not modify the pointed-to data. These have been updated to take
"const Datum *" instead, improving type safety and making the
interfaces clearer about their intent. This change helps the compiler
catch accidental modifications and better documents immutability of
arguments.
Most of "Datum *" parameters have a pairing "bool *isnull" parameter,
they are constified as well.
No functional behavior is changed by this patch.
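A trivial made-up example of the signature style this moves to:

    #include "postgres.h"

    /* Reads the arrays, never writes them, so the pointers can be const. */
    static Datum
    sum_first_two(const Datum *values, const bool *isnull)
    {
        if (isnull[0] || isnull[1])
            return Int32GetDatum(0);

        return Int32GetDatum(DatumGetInt32(values[0]) +
                             DatumGetInt32(values[1]));
    }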
Author: Chao Li <lic@highgo.com>
Discussion: https://www.postgresql.org/message-id/flat/CAEoWx2msfT0knvzUa72ZBwu9LR_RLY4on85w2a9YpE-o2By5HQ@mail.gmail.com
For all extant uses of TupleHashTables, execGrouping.c itself does
nothing with the "tablecxt" except to allocate new hash entries in it,
and the callers do nothing with it except to reset the whole context.
So this is an ideal use-case for a BumpContext, and the hash tables
are frequently big enough for the savings to be significant.
(Commit cc721c459 already taught nodeAgg.c this idea, but neglected
the other callers of BuildTupleHashTable.)
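Roughly what the callers now do (sketch only; the context name here is
illustrative):

    #include "postgres.h"

    #include "utils/memutils.h"

    static MemoryContext
    make_tuples_context(MemoryContext parent)
    {
        /* holds only the hashed tuples plus any "additional" space */
        return BumpContextCreate(parent,
                                 "SetOp hashed tuples",
                                 ALLOCSET_DEFAULT_SIZES);
    }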
While at it, let's clean up some ill-advised leftovers from rebasing
TupleHashTables on simplehash.h:
* Many comments and variable names were based on the idea that the
tablecxt holds the whole TupleHashTable, whereas now it only holds the
hashed tuples (plus any caller-defined "additional storage"). Rename
to names like tuplescxt and tuplesContext, and adjust the comments.
Also adjust the memory context names to be like "<Foo> hashed tuples".
* Make ResetTupleHashTable() reset the tuplescxt rather than relying
on the caller to do so; that was fairly bizarre and seems like a
recipe for leaks. This is less efficient in the case where nodeAgg.c
uses the same tuplescxt for several different hashtables, but only
microscopically so because mcxt.c will short-circuit the extra resets
via its isReset flag. I judge the extra safety and intellectual
cleanliness well worth those few cycles.
* Remove the long-obsolete "allow_jit" check added by ac88807f9;
instead, just Assert that metacxt and tuplescxt are different.
We need that anyway for this definition of ResetTupleHashTable() to
be safe.
There is a side issue of the extent to which this change invalidates
the planner's estimates of hashtable memory consumption. However,
those estimates are already pretty bad, so improving them seems like
it can be a separate project. This change is useful to do first to
establish consistent executor behavior that the planner can expect.
A loose end not addressed here is that the "entrysize" calculation
in BuildTupleHashTable seems wrong: "sizeof(TupleHashEntryData) +
additionalsize" corresponds neither to the size of the simplehash
entries nor to the total space needed per tuple. It's questionable
why BuildTupleHashTable is second-guessing its caller's nbuckets
choice at all, since the original source of the number should have had
more information. But that all seems wrapped up with the planner's
estimation logic, so let's leave it for the planned followup patch.
Reported-by: Jeff Janes <jeff.janes@gmail.com>
Reported-by: David Rowley <dgrowleyml@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAMkU=1zia0JfW_QR8L5xA2vpa0oqVuiapm78h=WpNsHH13_9uw@mail.gmail.com
Discussion: https://postgr.es/m/2268409.1761512111@sss.pgh.pa.us
The original is pretty baroque for no apparent reason; arguably, commit
2f9661311b should have done this. Noted while reviewing related code
for bug #18984. This is cosmetic (though I'm surprised that my compiler
generates shorter assembly this way), so no backpatch.
Discussion: https://postgr.es/m/18984-0f4778a6599ac3ae@postgresql.org