This commit reverts 818fefd8fd, that has been introduced to address a
an instability in some of the TAP tests due to the presence of random
standby snapshot WAL records, when slots are invalidated by
InvalidatePossiblyObsoleteSlot().
Anyway, this commit had also the consequence of introducing a behavior
regression. After 818fefd8fd, the code may determine that a slot needs
to be invalidated while it may not require one: the slot may have moved
from a conflicting state to a non-conflicting state between the moment
when the mutex is released and the moment when we recheck the slot, in
InvalidatePossiblyObsoleteSlot(). Hence, the invalidations may be more
aggressive than they actually have to.
105b2cb336 has tackled the test instability in a way that should be
hopefully sufficient for the buildfarm, even for slow members:
- In v18, the test relies on an injection point that bypasses the
creation of the random records generated for standby snapshots,
eliminating the random factor that impacted the test. This option was
not available when 818fefd8fd was discussed.
- In v16 and v17, the problem was bypassed by disallowing a slot to
become active in some of the scenarios tested.
While on it, this commit adds a comment to document that it is fine for
a recheck to use xmin and LSN values stored in the slot, without storing
and reusing them across multiple checks.
Reported-by: "suyu.cmj" <mengjuan.cmj@alibaba-inc.com>
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/f492465f-657e-49af-8317-987460cb68b0.mengjuan.cmj@alibaba-inc.com
Backpatch-through: 16
RIGHT_SEMI joins rely on the HEAP_TUPLE_HAS_MATCH flag to guarantee
that only the first match for each inner tuple is considered.
However, in a parallel hash join, the inner relation is stored in a
shared global hash table that can be probed by multiple workers
concurrently. This allows different workers to inspect and set the
match flags of the same inner tuples at the same time.
If two workers probe the same inner tuple concurrently, both may see
the match flag as unset and emit the same tuple, leading to duplicate
output rows and violating RIGHT_SEMI join semantics.
For now, we disable parallel plans for RIGHT_SEMI joins. In the long
term, it may be possible to support parallel execution by performing
atomic operations on the match flag, for example using a CAS or
similar mechanism.
Backpatch to v18, where RIGHT_SEMI join was introduced.
Bug: #19094
Reported-by: Lori Corbani <Lori.Corbani@jax.org>
Diagnosed-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Richard Guo <guofenglinux@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/19094-6ed410eb5b256abd@postgresql.org
Backpatch-through: 18
Because long is 32-bit on 64-bit Windows, it isn't a good datatype to
store the difference between 2 pointers. The under-sized type could
overflow and lead to scary warnings in MEMORY_CONTEXT_CHECKING builds,
such as:
WARNING: problem in alloc set ExecutorState: bad single-chunk %p in block %p
However, the problem lies only in the code running the check, not from
an actual memory accounting bug.
Fix by using "Size" instead of "long". This means using an unsigned
type rather than the previous signed type. If the block's freeptr was
corrupted, we'd still catch that if the unsigned type wrapped. Unsigned
allows us to avoid further needless complexities around comparing signed
and unsigned types.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Backpatch-through: 13
Discussion: https://postgr.es/m/CAApHDvo-RmiT4s33J=aC9C_-wPZjOXQ232V-EZFgKftSsNRi4w@mail.gmail.com
Previously, we'd fill all fields except ccbin, and only later obtain and
detoast ccbin, with hypothetical failures being possible. If ccbin is
null (rare catalog corruption I have never witnessed) or its a corrupted
toast entry, we leak a tiny bit of memory in CacheMemoryContext from
having strdup'd the constraint name. Repair these by only attempting to
fill the struct once ccbin has been detoasted.
Author: Ranier Vilela <ranier.vf@gmail.com>
Discussion: https://postgr.es/m/CAEudQAr=i3_Z4GvmediX900+sSySTeMkvuytYShhQqEwoGyvhA@mail.gmail.com
Instead of having five separate GUC structs, one for each type, with
the generic part contained in each of them, flip it around and have
one common struct, with the type-specific part has a subfield.
The very original GUC design had type-specific structs and
type-specific lists, and the membership in one of the lists defined
the type. But now the structs themselves know the type (from the
.vartype field), and they are all loaded into a common hash table at
run time, and so this original separation no longer makes sense. It
creates a bunch of inconsistencies in the code about whether the
type-specific or the generic struct is the primary struct, and a lot
of casting in between, which makes certain assumptions about the
struct layouts.
After the change, all these casts are gone and all the data is
accessed via normal field references. Also, various code is
simplified because only one kind of struct needs to be processed.
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://www.postgresql.org/message-id/flat/8fdfb91e-60fb-44fa-8df6-f5dea47353c9@eisentraut.org
XLogRecordAssemble() may be called multiple times before inserting a
record in XLogInsertRecord(), and the amount of FPIs generated inside
a record whose insertion is attempted multiple times may vary.
The logic added in f9a09aa295 touched directly pgWalUsage in
XLogRecordAssemble(), meaning that it could be possible for pgWalUsage
to be incremented multiple times for a single record. This commit
changes the code to use the same logic as the number of FPIs added to a
record, where XLogRecordAssemble() returns this information and feeds it
to XLogInsertRecord(), updating pgWalUsage only when a record is
inserted.
Reported-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAOzEurSiSr+rusd0GzVy8Bt30QwLTK=ugVMnF6=5WhsSrukvvw@mail.gmail.com
I have never seen this be a problem in practice, but it came up when
purposely corrupting catalog contents to study the fix for a nearby bug:
we'd try to decrement relchecks, but since it's zero we error out and
fail to drop the constraint. The fix is to downgrade the error to
warning, skip decrementing the counter, and otherwise proceed normally.
Given lack of field complaints, no backpatch.
Author: Álvaro Herrera <alvherre@kurilemu.de>
Discussion: https://postgr.es/m/202508291058.q2zscdcs64fj@alvherre.pgsql
Some recent changes were made to remove the explicit dependency on
btree indexes in some parts of the code. One of these changes was
made in commit 9ef1851685, which allows non-btree indexes to be used
in get_actual_variable_range(). A follow-up commit ee1ae8b99f fixes
the cases where an index doesn’t have a sortopfamily as this is a
prerequisite to be used in get_actual_variable_range().
However, it was found that indexes that have amcanorder = true but do
not allow index-only-scans (amcanreturn returns false or is NULL) will
pass all of the conditions, while they should be rejected since
get_actual_variable_range() uses the index-only-scan machinery in
get_actual_variable_endpoint(). Such an index might cause errors like
ERROR: no data returned for index-only scan
during query planning.
The fix is to add a check in get_actual_variable_range() to reject
indexes that do not allow index-only scans.
Author: Maxime Schoemans <maxime.schoemans@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/20ED852A-C2D9-41EB-8671-8C8B9D418BE9%40enterprisedb.com
This new counter, called "wal_fpi_bytes", tracks the total amount in
bytes of full page images (FPIs) generated in WAL. This data becomes
available globally via pg_stat_wal, and for backend statistics via
pg_stat_get_backend_wal().
Previously, this information could only be retrieved with pg_waldump or
pg_walinspect, which may not be available depending on the environment,
and are expensive to execute. It offers hints about how much FPIs
impact the WAL generated, which could be a large percentage for some
workloads, as well as the effects of wal_compression or page holes.
Bump catalog version.
Bump PGSTAT_FILE_FORMAT_ID, due to the addition of wal_fpi_bytes in
PgStat_WalCounters.
Author: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOzEurQtZEAfg6P0kU3Wa-f9BWQOi0RzJEMPN56wNTOmJLmfaQ@mail.gmail.com
This type is just char * underneath, it provides no real value, no
type safety, and just makes the code one level more mysterious. It is
more idiomatic to refer to blobs of memory by a combination of void *
and size_t, so change it to that.
Also, since this type hides the pointerness, we can't apply qualifiers
to what is pointed to, which requires some unconstify nonsense. This
change allows fixing that.
Extension code that uses the Item type can change its code to use
void * to be backward compatible.
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/c75cccf5-5709-407b-a36a-2ae6570be766@eisentraut.org
Previously, the check_hook for synchronized_standby_slots attempted to
validate that each specified slot existed and was physical. However, these
checks were not performed during server startup. As a result, if users
configured non-existent slots before startup, the misconfiguration would
go undetected initially. This could later cause parallel query failures,
as newly launched workers would detect the issue and raise an ERROR.
This patch improves the check_hook by validating the syntax and format of
slot names. Validation of slot existence and type is deferred to the WAL
sender process, aligning with the behavior of the check_hook for
primary_slot_name.
Reported-by: Fabrice Chapuis <fabrice636861@gmail.com>
Author: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Ashutosh Sharma <ashu.coek88@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Backpatch-through: 17, where it was introduced
Discussion: https://postgr.es/m/CAA5-nLCeO4MQzWipCXH58qf0arruiw0OeUc1+Q=Z=4GM+=v1NQ@mail.gmail.com
When dealing with ResultRelInfos for partitions, there are cases where
there are mixed requirements for the ri_RootResultRelInfo. There are
cases when the partition itself requires a NULL ri_RootResultRelInfo and
in the same query, the same partition may require a ResultRelInfo with
its parent set in ri_RootResultRelInfo. This could cause the column
mapping between the partitioned table and the partition not to be done
which could result in crashes if the column attnums didn't match
exactly.
The fix is simple. We now check that the ri_RootResultRelInfo matches
what the caller passed to ExecGetTriggerResultRel() and only return a
cached ResultRelInfo when the ri_RootResultRelInfo matches what the
caller wants, otherwise we'll make a new one.
Author: David Rowley <dgrowleyml@gmail.com>
Author: Amit Langote <amitlangote09@gmail.com>
Reported-by: Dmitry Fomin <fomin.list@gmail.com>
Discussion: https://postgr.es/m/7DCE78D7-0520-4207-822B-92F60AEA14B4@gmail.com
Backpatch-through: 15
These two functions expect there to be room to insert another item
in the FreePageBtree's array, but their assertions were too weak
to guarantee that. This has little practical effect granting that
the callers are not buggy, but it seems to be misleading late-model
Coverity into complaining about possible array overrun.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/799984.1761150474@sss.pgh.pa.us
Backpatch-through: 13
This patch adds support for a new SQL command:
ALTER SUBSCRIPTION ... REFRESH SEQUENCES
This command updates the sequence entries present in the
pg_subscription_rel catalog table with the INIT state to trigger
resynchronization.
In addition to the new command, the following subscription commands have
been enhanced to automatically refresh sequence mappings:
ALTER SUBSCRIPTION ... REFRESH PUBLICATION
ALTER SUBSCRIPTION ... ADD PUBLICATION
ALTER SUBSCRIPTION ... DROP PUBLICATION
ALTER SUBSCRIPTION ... SET PUBLICATION
These commands will perform the following actions:
Add newly published sequences that are not yet part of the subscription.
Remove sequences that are no longer included in the publication.
This ensures that sequence replication remains aligned with the current
state of the publication on the publisher side.
Note that the actual synchronization of sequence data/values will be
handled in a subsequent patch that introduces a dedicated sequence sync
worker.
Author: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: shveta malik <shveta.malik@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Dilip Kumar <dilipbalaut@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Nisha Moond <nisha.moond412@gmail.com>
Reviewed-by: Shlok Kyal <shlok.kyal.oss@gmail.com>
Reviewed-by: Chao Li <li.evan.chao@gmail.com>
Reviewed-by: Hou Zhijie <houzj.fnst@fujitsu.com>
Discussion: https://postgr.es/m/CAA4eK1LC+KJiAkSrpE_NwvNdidw9F2os7GERUeSxSKv71gXysQ@mail.gmail.com
Commit 883a95646a introduced overflow entries in the replication lag tracker
to fix an issue where lag columns in pg_stat_replication could stall when
the replay LSN stopped advancing.
This commit adds comments clarifying the purpose and behavior of overflow
entries to improve code readability and understanding.
Since commit 883a95646a was recently applied and backpatched to all
supported branches, this follow-up commit is also backpatched accordingly.
Author: Xuneng Zhou <xunengzhou@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CABPTF7VxqQA_DePxyZ7Y8V+ErYyXkmwJ1P6NC+YC+cvxMipWKw@mail.gmail.com
Backpatch-through: 13
When JIT deformed tuples (controlled via the jit_tuple_deforming GUC),
types narrower than sizeof(Datum) would be zero-extended up to Datum
width. This wasn't the same as what fetch_att() does in the standard
tuple deforming code. Logically the values are the same when fetching
via the DatumGet*() marcos, but negative numbers are not the same in
binary form.
In the report, the problem was manifesting itself with:
ERROR: could not find memoization table entry
in a query which had a "Cache Mode: binary" Memoize node. However, it's
currently unclear what else is affected. Anything that uses
datum_image_eq() or datum_image_hash() on a Datum from a tuple deformed by
JIT could be affected, but it may not be limited to that.
The fix for this is simple: use signed extension instead of zero
extension.
Many thanks to Emmanuel Touzery for reporting this issue and providing
steps and backup which allowed the problem to easily be recreated.
Reported-by: Emmanuel Touzery <emmanuel.touzery@plandela.si>
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/DB8P194MB08532256D5BAF894F241C06393F3A@DB8P194MB0853.EURP194.PROD.OUTLOOK.COM
Backpatch-through: 13
We had several places that used cast-to-unsigned-int as a substitute
for properly checking for overflow. Coverity has started objecting
to that practice as likely introducing Y2038 bugs. An extra
comparison is surely not much compared to the cost of time(NULL), nor
is this coding practice particularly readable. Let's do it honestly,
with explicit logic covering the cases of first-time-through and
clock-went-backwards.
I don't feel a need to back-patch though: our released versions
will be out of support long before 2038, and besides which I think
the code would accidentally work anyway for another 70 years or so.
This small function is only used in one place, and it fails to
handle quoted table names (although the table name portion of the
input should never be quoted in current usage). In addition to
removing make_temptable_name_n() in favor of open-coding it where
needed, this commit ensures the "diff" table name is properly
quoted in order to future-proof this area a bit.
Author: Aleksander Alekseev <aleksander@tigerdata.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Discussion: https://postgr.es/m/CAJ7c6TO3a5q2NKRsjdJ6sLf8isVe4aMaaX1-Hj2TdHdhFw8zRA%40mail.gmail.com
Previously, if primary_slot_name was set to an invalid slot name and
the configuration file was reloaded, both the postmaster and all other
backend processes reported a WARNING. With many processes running,
this could produce a flood of duplicate messages. The problem was that
the GUC check hook for primary_slot_name reported errors at WARNING
level via ereport().
This commit changes the check hook to use GUC_check_errdetail() and
GUC_check_errhint() for error reporting. As with other GUC parameters,
this causes non-postmaster processes to log the message at DEBUG3,
so by default, only the postmaster's message appears in the log file.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@kurilemu.de>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Discussion: https://postgr.es/m/CAHGQGwFud-cvthCTfusBfKHBS6Jj6kdAPTdLWKvP2qjUX6L_wA@mail.gmail.com
Backpatch-through: 13
Previously it was mistakenly assumed that there's only one window
function argument which needs to be processed by WinGetFuncArgInFrame
or WinGetFuncArgInPartition when IGNORE NULLS option is specified. To
eliminate the limitation, WindowObject->notnull_info is modified from
"uint8 *" to "uint8 **" so that WindowObject->notnull_info could store
pointers to "uint8 *" which holds NOT NULL info corresponding to each
window function argument. Moreover, WindowObject->num_notnull_info is
changed from "int" to "int64 *" so that WindowObject->num_notnull_info
could store the number of NOT NULL info corresponding to each function
argument. Memories for these data structures will be allocated when
WinGetFuncArgInFrame or WinGetFuncArgInPartition is called. Thus no
memory except the pointers is allocated for function arguments which
do not call these functions
Also fix the set mark position logic in WinGetFuncArgInPartition to
not raise a "cannot fetch row before WindowObject's mark position"
error in IGNORE NULLS case.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Author: Tatsuo Ishii <ishii@postgresql.org>
Discussion: https://postgr.es/m/2952409.1760023154%40sss.pgh.pa.us
Previously, when the replay LSN reported in feedback messages from a standby
stopped advancing, for example, due to a recovery conflict, the write_lag and
flush_lag columns in pg_stat_replication would initially update but then stop
progressing. This prevented users from correctly monitoring replication lag.
The problem occurred because when any LSN stopped updating, the lag tracker's
cyclic buffer became full (the write head reached the slowest read head).
In that state, the lag tracker could no longer compute round-trip lag values
correctly.
This commit fixes the issue by handling the slowest read entry (the one
causing the buffer to fill) as a separate overflow entry and freeing space
so the write and other read heads can continue advancing in the buffer.
As a result, write_lag and flush_lag now continue updating even if the reported
replay LSN remains stalled.
Backpatch to all supported versions.
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Chao Li <lic@highgo.com>
Reviewed-by: Shinya Kato <shinya11.kato@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/CAHGQGwGdGQ=1-X-71Caee-LREBUXSzyohkoQJd4yZZCMt24C0g@mail.gmail.com
Backpatch-through: 13
5983a4cff added CompactAttribute for storing commonly used fields from
FormData_pg_attribute. 5983a4cff didn't go to the trouble of adjusting
every location where we can use CompactAttribute rather than
FormData_pg_attribute, so here we change the remaining ones.
There are some locations where I've left the code using
FormData_pg_attribute. These are mostly in the ALTER TABLE code. Using
CompactAttribute here seems more risky as often the TupleDesc is being
changed and those changes may not have been flushed to the
CompactAttribute yet.
I've also left record_recv(), record_send(), record_cmp(), record_eq()
and record_image_eq() alone as it's not clear to me that accessing the
CompactAttribute is a win here due to the FormData_pg_attribute still
having to be accessed for most cases. Switching the relevant parts to
use CompactAttribute would result in having to access both for common
cases. Careful benchmarking may reveal that something can be done to
make this better, but in absence of that, the safer option is to leave
these alone.
In ReorderBufferToastReplace(), there was a check to skip attnums < 0
while looping over the TupleDesc. Doing this is redundant since
TupleDescs don't store < 0 attnums. Removing that code allows us to
move to using CompactAttribute.
The change in validateDomainCheckConstraint() just moves fetching the
FormData_pg_attribute into the ERROR path, which is cold due to calling
errstart_cold() and results in code being moved out of the common path.
Author: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAApHDvrMy90o1Lgkt31F82tcSuwRFHq3vyGewSRN=-QuSEEvyQ@mail.gmail.com
Previously, tsearch used the database's CTYPE setting, which only
matches the database default collation if the locale provider is libc.
Note that tsearch types (tsvector and tsquery) are not collatable
types. The locale affects parsing the original text, which is a lossy
process, so a COLLATE clause on the already-parsed value would not
make sense.
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
A BlockNumber (32-bit) might not be large enough to add bo_pagesPerRange
to when the table contains close to 2^32 pages. At worst, this could
result in a cancellable infinite loop during the BRIN index scan with
power-of-2 pagesPerRange, and slow (inefficient) BRIN index scans and
scanning of unneeded heap blocks for non power-of-2 pagesPerRange.
Backpatch to all supported versions.
Author: sunil s <sunilfeb26@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CAOG6S4-tGksTQhVzJM19NzLYAHusXsK2HmADPZzGQcfZABsvpA@mail.gmail.com
Backpatch-through: 13
67a54b9e8 taught the planner to push down HAVING clauses even when
grouping sets are present, as long as the clause does not reference
any columns that are nullable by the grouping sets. However, there
was an oversight: if any empty grouping sets are present, the
aggregation node can produce a row that did not come from the input,
and pushing down a HAVING clause in this case may cause us to fail to
filter out that row.
Currently, non-degenerate HAVING clauses are not pushed down when
empty grouping sets are present, since the empty grouping sets would
nullify the vars they reference. However, degenerate (variable-free)
HAVING clauses are not subject to this restriction and may be
incorrectly pushed down.
To fix, explicitly check for the presence of empty grouping sets and
retain degenerate clauses in HAVING when they are present. This
ensures that we don't emit a bogus aggregated row. A copy of each
such clause is also put in WHERE so that query_planner() can use it in
a gating Result node.
To facilitate this check, this patch expands the groupingSets tree of
the query to a flat list of grouping sets before applying the HAVING
pushdown optimization. This does not add any additional planning
overhead, since we need to do this expansion anyway.
In passing, make a small tweak to preprocess_grouping_sets() by
reordering its initial operations a bit.
Backpatch to v18, where this issue was introduced.
Reported-by: Yuhang Qiu <iamqyh@gmail.com>
Author: Richard Guo <guofenglinux@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/0879D9C9-7FE2-4A20-9593-B23F7A0B5290@gmail.com
Backpatch-through: 18