Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called. In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.
Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan. so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.
This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e. The test case in question involves the use
of mark and restore. An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin. A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).
The same problem exists on earlier releases, though the issue is far
more subtle there. Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage. This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.
As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches). This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID). This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.
A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set. All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e. That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
Traditionally, libpq's pqPutMsgEnd has rounded down the amount-to-send
to be a multiple of 8K when it is eagerly writing some data. This
still seems like a good idea when sending through a Unix socket, as
pipes typically have a buffer size of 8K or some fraction/multiple of
that. But there's not much argument for it on a TCP connection, since
(a) standard MTU values are not commensurate with that, and (b) the
kernel typically applies its own packet splitting/merging logic.
Worse, our SSL and GSSAPI code paths both have API stipulations that
if they fail to send all the data that was offered in the previous
write attempt, we mustn't offer less data in the next attempt; else
we may get "SSL error: bad length" or "GSSAPI caller failed to
retransmit all data needing to be retried". The previous write
attempt might've been pqFlush attempting to send everything in the
buffer, so pqPutMsgEnd can't safely write less than the full buffer
contents. (Well, we could add some more state to track exactly how
much the previous write attempt was, but there's little value evident
in such extra complication.) Hence, apply the round-down only on
AF_UNIX sockets, where we never use SSL or GSSAPI.
Interestingly, we had a very closely related bug report before,
which I attempted to fix in commit d053a879b. But the test case
we had then seemingly didn't trigger this pqFlush-then-pqPutMsgEnd
scenario, or at least we failed to recognize this variant of the bug.
Bug: #18907
Reported-by: Dorjpalam Batbaatar <htgn.dbat.95@gmail.com>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/18907-d41b9bcf6f29edda@postgresql.org
Backpatch-through: 13
On macOS, when building with the make system, the exported symbols
list $(SHLIB_EXPORTS) was ignored. This was probably not intentional,
it was probably just forgotten, since that combination has never
actually been used until now (for libpq-oauth).
The meson build system handles this correctly. Also, other platforms
have been doing this correctly.
This fixes it. It also does a bit of refactoring to make the code
match the layout for other platforms.
Reviewed-by: Jacob Champion <jacob.champion@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/c70ca32e-b109-460d-9810-6e23ebb4473f%40eisentraut.org
Delay calling BufferGetLSNAtomic() until we finish reading a page that
actually contains items that btgettuple will return to the executor.
This reduces the number of calls during plain index scans (we'll only
call BufferGetLSNAtomic() when _bt_readpage returns true), and totally
eliminates calls during index-only scans, bitmap index scans, and plain
index scans of an unlogged relation.
Currently, when checksums (or wal_log_hints) are enabled, acquiring a
page's LSN in BufferGetLSNAtomic() involves locking the buffer header
(which involves the use of spinlocks). Testing has shown that enabling
page-level checksums causes large regressions with certain workloads,
especially on larger multi-socket systems.
The regression isn't tied to any Postgres 18 commit. However, Postgres
18 commit 04bec894 made initdb use checksums by default, so it seems
prudent to address the problem now.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me
Discussion: https://postgr.es/m/CAH2-Wzk-Dg5XWs_jDuiHt4_7ryrSY+n=vxmHY51EVqPDFsKXmg@mail.gmail.com
Use of a RowCompare key makes nbtree index scans ineligible to use
pstate.forcenonrequired following recent bugfix commit 5f4d98d4.
There's no longer any need for _bt_check_rowcompare to accept a
forcenonrequired argument, so remove it.
Similar to commit cc733ed164: when an unenforced foreign key that
references a partitioned table is altered to be enforced, we scan
the constrained table based on each partition on the referenced
partitioned table. This is bogus and likely to cause the ALTER TABLE to
fail: we must only scan the constrained table as pointing to the
top-level partitioned table. Oversight in commit eec0040c4b. Fix by
eliding those scans.
Author: Amul Sul <sulamul@gmail.com>
Reported-by: jian he <jian.universality@gmail.com>
Discussion: https://postgr.es/m/CACJufxF1e_gPOLtsDoaE4VCgQPC8KZW_kPAjPR5Rvv4Ew=fb2A@mail.gmail.com
Validating an unvalidated foreign key that references a partitioned
table would try to queue validations for each individual partition of
the referenced table, but this is wrong: each individual partition would
not necessarily have all the referenced rows, so errors would be raised.
Avoid doing that. The pg_constraint rows that cause this to happen are
only there to support the action triggers that implement the DELETE/
UPDATE actions of the FK, so no validating scan is necessary.
This was an oversight in commit b663b9436e.
An equivalent oversight exists for NOT ENFORCED constraints, which is
not fixed in this commit.
Author: Amul Sul <sulamul@gmail.com>
Reported-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: jian he <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/26983.1748418675@localhost
This commit replaces the formula used for "TotalProcs" with a call to
pgaio_uring_procs() in pgaio_uring_shmem_init() for the shared memory
initialization, which is exactly the same, removing a duplication.
pgaio_uring_procs() is used for shared memory sizing and a sanity check,
and it has some documentation explaining some reasoning behind the
formula.
Author: Japin Li <japinli@hotmail.com>
Discussion: https://postgr.es/m/ME0P300MB044521067A1EDDA9EDEC3793B66DA@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM
The documentation for the pg_authid system catalog and the
pg_shadow system view indicates that passwords might be stored in
cleartext, but that hasn't been possible for some time.
Oversight in commit eb61136dc7.
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/aD2yKkZro4nbl5ol%40nathan
Backpatch-through: 13
Commit 4f7f7b0375 implemented the extension_control_path GUC, and to
make it work it was decided that we should strip the $libdir/ on
module_pathname from .control files, so that extensions don't need to
worry about this change.
This strip logic was implemented on expand_dynamic_library_name()
which works fine when executing the SQL functions from extensions, but
this function is also called when the LOAD command is executed, and
since the user may explicitly pass the $libdir prefix on LOAD
parameter, we should not strip in this case.
This commit fixes this issue by moving the strip logic from
expand_dynamic_library_name() to load_external_function() that is
called when the running the SQL script from extensions.
Reported-by: Evan Si <evsi@amazon.com>
Author: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Rahila Syed <rahilasyed90@gmail.com>
Bug: #18920
Discussion: https://www.postgresql.org/message-id/flat/18920-b350b1c0a30af006%40postgresql.org
When the backend reads COPY data, it ignores all sync messages, as per
c01641f8ae. With psql pipelines, it is possible to manually send sync
messages with \sendpipeline which leaves the frontend in an
unrecoverable state as the backend will not send the necessary
ReadyForQuery message that is expected to feed psql result consumption
logic.
It could be possible to artificially reduce the piped_syncs and
requested_results, however libpq's state would still have queued sync
messages in its command queue, and the only way to consume those without
directly calling pqCommandQueueAdvance() is to process ReadyForQuery
messages that won't be sent since the backend ignores these. Perhaps
this could be improved in the future, but I am not really excited about
introducing this amount of complications in libpq to manipulate the
message queues without a better use case to support it.
Hence, this patch aborts the connection if we detect excessive sync
messages after a COPY in a pipeline to avoid staying in an inconsistent
protocol state, which is the best thing we can do with pipelines in
psql for now. Note that this change does not prevent wrapping a set
of queries inside a block made of \startpipeline and \endpipeline, only
the use of \syncpipeline for a COPY.
Reported-by: Nikita Kalinin <n.kalinin@postgrespro.ru>
Author: Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com>
Discussion: https://postgr.es/m/18944-8a926c30f68387dd@postgresql.org
POSIX allows such platforms. Given the lack of complaints, we may not
currently test on such a platform. This is new in v18 (commit
7d5c83b4e9), so no back-patch.
We store values for these options as array elements with the syntax
"name=value", hence a name containing "=" confuses matters when
it's time to read the array back in. Since validation of the
options is often done (long) after this conversion to array format,
that leads to confusing and off-point error messages. We can
improve matters by rejecting names containing "=" up-front.
(Probably a better design would have involved pairs of array
elements, but it's too late now --- and anyway, there's no
evident use-case for option names like this. We already
reject such names in some other contexts such as GUCs.)
Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chapman Flack <jcflack@acm.org>
Discussion: https://postgr.es/m/6830EB30.8090904@acm.org
Backpatch-through: 13
052026c9b9 mistakenly reordered setup steps in heap_vacuum_rel(),
incorrectly moving RelationGetNumberOfBlocks() before
vacuum_get_cutoffs().
OldestXmin must be determined before RelationGetNumberOfBlocks()
calculates the number of blocks in the relation that will be vacuumed.
Otherwise tuples older than OldestXmin may be inserted into the end of
the relation into blocks that are not vacuumed. If additional tuples
newer than those inserted into unscanned blocks but older than
OldestXmin are inserted into free space earlier in the relation, the
result could be advancing pg_class.relfrozenxid to a newer value than an
unfrozen XID in one of the unscanned heap pages.
Assigning an incorrect relfrozenxid can lead to data loss, so it is
imperative that it correctly reflect the oldest unfrozen xid.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzntqvVEdbbpqG5JqSZGuLWmy4PBfUO-OswfivKchr2gvw%40mail.gmail.com
Commit 7406ab623f added a gist support function that we internally
refer to by the symbol GIST_STRATNUM_PROC. This translated from
"well-known" strategy numbers to opfamily-specific strategy numbers.
However, we later (commit 630f9a43ce) changed this to fit into
index-AM-level compare type mapping, so this function actually now
maps from compare type to opfamily-specific strategy numbers. So this
name is no longer fitting.
Moreover, the index AM level also supports the opposite, a function to
map from strategy number to compare type. This is currently not
supported in gist, but one might wonder what this function is supposed
to be called when it is added.
This patch changes the naming of the gist-level functionality to be
more in line with the index-AM-level functionality. This makes sense
because these are essentially the same thing on different levels.
This also changes the names of the externally visible functions that
are provided for use as such a support function.
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/37ebb1d9-9036-485f-a215-e55435689917%40eisentraut.org
A cascading WAL sender doing logical decoding (as known as doing its
work on a standby) has been using as flush LSN the value returned by
GetStandbyFlushRecPtr() (last position safely flushed to disk). This is
incorrect as such processes are only able to decode changes up to the
LSN that has been replayed by the startup process.
This commit changes cascading logical WAL senders to use the replay LSN,
as returned by GetXLogReplayRecPtr(). This distinction is important
particularly during shutdown, when WAL senders need to send any
remaining available data to their clients, switching WAL senders to a
caught-up state. Using the latest flush LSN rather than the replay LSN
could cause the WAL senders to be stuck in an infinite loop preventing
them to shut down, as the startup process does not run when WAL senders
attempt to catch up, so they could keep waiting for work that would
never happen.
Backpatch down to v16, where logical decoding on standbys has been
introduced.
Author: Alexey Makhmutov <a.makhmutov@postgrespro.ru>
Reviewed-by: Ajin Cherian <itsajin@gmail.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/52138028-7246-421c-9161-4fa108b88070@postgrespro.ru
Backpatch-through: 16
PLy_elog_impl and its subroutine PLy_traceback intended to avoid
leaking any PyObject reference counts, but their coverage of the
matter was sadly incomplete. In particular, out-of-memory errors
in most of the string-construction subroutines could lead to
reference count leaks, because those calls were outside the
PG_TRY blocks responsible for dropping reference counts.
Fix by (a) adjusting the scopes of the PG_TRY blocks, and
(b) moving the responsibility for releasing the reference counts
of the traceback-stack objects to PLy_elog_impl. This requires
some additional "volatile" markers, but not too many.
In passing, fix an ancient thinko: use of the "e_module_o" PyObject
was guarded by "if (e_type_s)", where surely "if (e_module_o)"
was meant. This would only have visible consequences if the
"__name__" attribute were present but the "__module__" attribute
wasn't, which apparently never happens; but someday it might.
Rearranging the PG_TRY blocks requires indenting a fair amount
of code one more tab stop, which I'll do separately for clarity.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2954090.1748723636@sss.pgh.pa.us
Backpatch-through: 13
Previously, postgres_fdw always 1) opened a remote transaction in READ
WRITE mode even when the local transaction was READ ONLY, causing a READ
ONLY transaction using it that references a foreign table mapped to a
remote view executing a volatile function to write in the remote side,
and 2) opened the remote transaction in NOT DEFERRABLE mode even when
the local transaction was DEFERRABLE, causing a SERIALIZABLE READ ONLY
DEFERRABLE transaction using it to abort due to a serialization failure
in the remote side.
To avoid these, modify postgres_fdw to open a remote transaction in the
same access/deferrable modes as the local transaction. This commit also
modifies it to open a remote subtransaction in the same access mode as
the local subtransaction.
Although these issues exist since the introduction of postgres_fdw,
there have been no reports from the field. So it seems fine to just fix
them in master only.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
When a MERGE's target table is the parent of an inheritance tree, any
INSERT actions insert into the parent table using ModifyTableState's
rootResultRelInfo. However, there are two bugs in the way is
initialized:
1. ExecInitMerge() incorrectly uses a different ResultRelInfo entry
from ModifyTableState's resultRelInfo array to build the insert
projection, which may not be compatible with rootResultRelInfo.
2. ExecInitModifyTable() does not fully initialize rootResultRelInfo.
Specifically, ri_WithCheckOptions, ri_WithCheckOptionExprs,
ri_returningList, and ri_projectReturning are not initialized.
This can lead to crashes, or incorrect query results due to failing to
check WCO's or process the RETURNING list for INSERT actions.
Fix both these bugs in ExecInitMerge(), noting that it is only
necessary to fully initialize rootResultRelInfo if the MERGE has
INSERT actions and the target table is a plain inheritance parent.
Backpatch to v15, where MERGE was introduced.
Reported-by: Andres Freund <andres@anarazel.de>
Author: Dean Rasheed <dean.a.rasheed@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://postgr.es/m/4rlmjfniiyffp6b3kv4pfy4jw3pciy6mq72rdgnedsnbsx7qe5@j5hlpiwdguvc
Backpatch-through: 15
uint64 was chosen to be consistent with the type used by the query ID,
but the conclusion of a recent discussion for the query ID is that int64
is a better fit as the signed form is shown to the user, for PGSS or
EXPLAIN outputs.
This commit changes the plan ID to use int64, following c3eda50b06
that has done the same for the query ID.
The plan ID is new to v18, introduced in 2a0cd38da5.
Author: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Discussion: https://postgr.es/m/aCvzJNwetyEI3Sgo@paquier.xyz
A few places that access system catalogs don't set up an active
snapshot before potentially accessing their TOAST tables. To fix,
push an active snapshot just before each section of code that might
require accessing one of these TOAST tables, and pop it shortly
afterwards. While at it, this commit adds some rather strict
assertions in an attempt to prevent such issues in the future.
Commit 16bf24e0e4 recently removed pg_replication_origin's TOAST
table in order to fix the same problem for that catalog. On the
back-branches, those bugs are left in place. We cannot easily
remove a catalog's TOAST table on released major versions, and only
replication origins with extremely long names are affected. Given
the low severity of the issue, fixing older versions doesn't seem
worth the trouble of significantly modifying the patch.
Also, on v13 and v14, the aforementioned strict assertions have
been omitted because commit 2776922201, which added
HaveRegisteredOrActiveSnapshot(), was not back-patched. While we
could probably back-patch it now, I've opted against it because it
seems unlikely that new TOAST snapshot issues will be introduced in
the oldest supported versions.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org
Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org
Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan
Backpatch-through: 13
postgres_fdw tries to use PG_TRY blocks to ensure that it will
eventually free the PGresult created by the remote modify command.
However, it's fundamentally impossible for this scheme to work
reliably when there's RETURNING data, because the query could fail
in between invocations of postgres_fdw's DirectModify methods.
There is at least one instance of exactly this situation in the
regression tests, and the ensuing session-lifespan leak is visible
under Valgrind.
We can improve matters by using a memory context reset callback
attached to the ExecutorState context. That ensures that the
PGresult will be freed when the ExecutorState context is torn
down, even if control never reaches postgresEndDirectModify.
I have little faith that there aren't other potential PGresult
leakages in the backend modules that use libpq. So I think it'd
be a good idea to apply this concept universally by creating
infrastructure that attaches a reset callback to every PGresult
generated in the backend. However, that seems too invasive for
v18 at this point, let alone the back branches. So for the
moment, apply this narrow fix that just makes DirectModify safe.
I have a patch in the queue for the more general idea, but it
will have to wait for v19.
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Discussion: https://postgr.es/m/2976982.1748049023@sss.pgh.pa.us
Backpatch-through: 13