The current definition is dangerous. No bugs exist in our code at
present, but back-patch to 11 nonetheless, where it was introduced.
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
explain_get_index_name() applied quote_identifier() to the index name.
This is fine for text output, but the non-text output formats all have
their own quoting conventions and would much rather start from the
actual index name. For example in JSON you'd get something like
"Index Name": "\"My Index\"",
which is surely not desirable, especially when the same does not
happen for table names. Hence, move the responsibility for applying
quoting out to the callers, where it can go into already-existing
special code paths for text format.
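As an illustrative sketch (not the exact patch), explain_get_index_name
now returns the bare name, and only the text-format code path quotes it:

    const char *
    explain_get_index_name(Oid indexId)
    {
        const char *result = NULL;

        if (explain_get_index_name_hook)
            result = (*explain_get_index_name_hook) (indexId);
        if (result == NULL)
            result = get_rel_name(indexId);     /* bare name, no quoting */
        return result;
    }

    /* ... while the text-format caller alone applies quoting: */
    if (es->format == EXPLAIN_FORMAT_TEXT)
        appendStringInfo(es->str, " using %s",
                         quote_identifier(indexname));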
This changes the API spec for users of explain_get_index_name_hook:
before, they were supposed to apply quote_identifier() if necessary,
now they should not. Research suggests that the only publicly
available user of the hook is hypopg, and it actually forgot to
apply quoting anyway, so it's fine. (In any case, there's no
behavioral change for the output of a hook as seen in non-text
EXPLAIN formats, so this won't break any case that programs should
be relying on.)
Digging in the commit logs, it appears that quoting was included in
explain_get_index_name's duties when commit 604ffd280 invented it;
and that was fine at the time because we only had text output format.
This should have been rethought when non-text formats were invented,
but it wasn't.
This is a fairly clear bug for users of non-text EXPLAIN formats,
so back-patch to all supported branches.
Per bug #16502 from Maciek Sakrejda. Patch by me (based on
investigation by Euler Taveira); thanks to Julien Rouhaud for review.
Discussion: https://postgr.es/m/16502-57bd1c9f913ed1d1@postgresql.org
spg_mask() didn't take into account that pd_lower equal to
SizeOfPageHeaderData is still a valid value. This commit fixes that.
Back-patch to 11, where the pd_lower check in spg_mask() was introduced.
Reported-by: Michael Paquier
Discussion: https://postgr.es/m/20200615131405.GM52676%40paquier.xyz
Backpatch-through: 11
It was possible for deduplication's single value strategy to mistakenly
believe that a very small duplicate tuple counts as one of the six large
tuples that it aims to leave behind after the page finally splits. This
could cause slightly suboptimal space utilization with very low
cardinality indexes, though only under fairly narrow conditions.
To fix, be particular about what kind of tuple counts as a
maxpostingsize-capped tuple. This avoids confusion in the event of a
small tuple that gets "wedged" between two large tuples, where all
tuples on the page are duplicates of the same value.
Discussion: https://postgr.es/m/CAH2-Wz=Y+sgSFc-O3LpiZX-POx2bC+okec2KafERHuzdVa7-rQ@mail.gmail.com
Backpatch: 13-, where deduplication was introduced (by commit 0d861bbb)
This commit fixes the following issues.
1. There is a case where a slot is dropped while we are trying to
invalidate it. InvalidateObsoleteReplicationSlots() did not handle this
case, which could cause a checkpoint to fail.
2. InvalidateObsoleteReplicationSlots() could emit the same log message
multiple times unnecessarily. It should be logged only once.
3. When marking a slot as in use, we always searched for the target slot
across all the replication slots even if we had already found it,
wasting cycles.
Back-patch to v13 where these issues were added as a part of
max_slot_wal_keep_size code.
Author: Fujii Masao
Reviewed-by: Kyotaro Horiguchi, Alvaro Herrera
Discussion: https://postgr.es/m/66c05b67-3396-042c-1b41-bfa6c3ddcf82@oss.nttdata.com
Since 1f39bce02, HashAgg nodes have had the ability to spill to disk when
memory consumption exceeds work_mem. That commit added new properties to
EXPLAIN ANALYZE to show the maximum memory usage and disk usage; however,
it didn't quite go as far as showing that information for parallel
workers. Since workers may have experienced something very different from
the main process, we should show this information per worker, as is done
in Sort.
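For illustration, the text-format output now looks roughly like this
(figures invented):

    ->  HashAggregate
          Peak Memory Usage: 1577kB  Disk Usage: 2432kB
          Worker 0:  Peak Memory Usage: 1541kB  Disk Usage: 2432kB
          Worker 1:  Peak Memory Usage: 1541kB  Disk Usage: 2432kB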
Reviewed-by: Justin Pryzby
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/CAApHDvpEKbfZa18mM1TD7qV6PG+w97pwCWq5tVD0dX7e11gRJw@mail.gmail.com
Backpatch-through: 13, where the hashagg spilling code was added.
This was a danger only for --disable-spinlocks in combination with
atomic operations unsupported by the current platform.
While atomics.c was careful to signal that a separate semaphore ought
to be used when spinlock emulation is active, spin.c didn't actually
implement that mechanism. That's my (Andres') fault, it seems to have
gotten lost during the development of the atomic operations support.
Fix that issue and add test for nesting atomic operations inside a
spinlock.
Author: Andres Freund
Discussion: https://postgr.es/m/20200605023302.g6v3ydozy5txifji@alap3.anarazel.de
Backpatch: 9.5-
Advancing a replication slot did not recompute the oldest xmin and LSN
values across replication slots, preventing resources from being removed,
e.g. WAL segments not getting recycled at checkpoint time. The original
commit that introduced slot advancing, 9c7d06d, never updated those
oldest values, and b0afdca removed this code.
This commit adds a TAP test to check segment recycling with advancing
for physical slots, enforcing an extra segment switch before advancing
to check if the segment gets correctly recycled after a checkpoint.
Reported-by: Andres Freund
Reviewed-by: Alexey Kondratov, Kyotaro Horiguchi
Discussion: https://postgr.es/m/20200609171904.kpltxxvjzislidks@alap3.anarazel.de
Backpatch-through: 11
On platforms without support for 64bit atomic operations, where we also
cannot rely on 64bit reads to have single-copy atomicity, such atomics
are implemented using a spinlock based fallback. That means it's not
safe to even read such atomics from within a signal handler (since the
signal handler might run when the spinlock already is held).
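Schematically, the self-deadlock looks like this (illustrative
pseudo-code; the fallback's field names here are hypothetical):

    /* mainline code, inside the emulated 64bit atomic: */
    SpinLockAcquire(&val->sema);        /* spinlock taken...              */
                                        /* <-- signal arrives here        */
                                        /* handler reads the same atomic, */
                                        /* tries to take the same lock,   */
                                        /* and spins forever              */
    u64 = val->value;
    SpinLockRelease(&val->sema);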
To avoid this issue, defer global barrier processing out of the signal
handler. Instead of checking the local / shared barrier generation to
determine whether to set ProcSignalBarrierPending, introduce
PROCSIGNAL_BARRIER and always set ProcSignalBarrierPending when
receiving such a signal. Additionally, avoid redundant work in
ProcessProcSignalBarrier if ProcSignalBarrierPending is set unnecessarily.
Also do a small amount of other polishing.
Author: Andres Freund
Reviewed-By: Robert Haas
Discussion: https://postgr.es/m/20200609193723.eu5ilsjxwdpyxhgz@alap3.anarazel.de
Backpatch: 13-, where the code was introduced.
Convert buffile.c error handling to use ereport. This fixes cases where
I/O errors were indistinguishable from EOF or not reported. Also remove
"%m" from error messages where errno would be bogus. While we're
modifying those strings, add block numbers and short read byte counts
where appropriate.
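A hedged sketch of the new style (wording approximate); note that the
short-read branch deliberately omits %m, since errno is meaningless there:

    if (nread < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read block %ld of temporary file: %m",
                        blknum)));
    else if (nread != BLCKSZ)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read block %ld of temporary file: read only %d of %d bytes",
                        blknum, nread, (int) BLCKSZ)));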
Back-patch to all supported releases.
Reported-by: Amit Khandekar <amitdkhan.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Ibrar Ahmed <ibrar.ahmad@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/CA%2BhUKGJE04G%3D8TLK0DLypT_27D9dR8F1RQgNp0jK6qR0tZGWOw%40mail.gmail.com
When there is just one non-null input value, and it is infinity or NaN,
aggregates such as stddev_pop and covar_pop should produce a NaN
result, because the calculation is not well-defined. They used to do
so, but since we adopted Youngs-Cramer aggregation in commit e954a727f,
they produced zero instead. That's an oversight, so fix it. Add tests
exercising these edge cases.
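A minimal sketch of the fix in float8_accum()'s first-input path (details
approximate):

    N = 1.0;
    Sx = newval;
    /* Sxx (sum of squared deviations) is not well-defined for a single
     * Inf/NaN input, so force it to NaN rather than leaving it zero. */
    if (isnan(newval) || isinf(newval))
        Sxx = get_float8_nan();
    else
        Sxx = 0.0;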
Affected aggregates are
var_pop(double precision)
stddev_pop(double precision)
var_pop(real)
stddev_pop(real)
regr_sxx(double precision,double precision)
regr_syy(double precision,double precision)
regr_sxy(double precision,double precision)
regr_r2(double precision,double precision)
regr_slope(double precision,double precision)
regr_intercept(double precision,double precision)
covar_pop(double precision,double precision)
corr(double precision,double precision)
Back-patch to v12 where the behavior change was accidentally introduced.
Report and patch by me; thanks to Dean Rasheed for review.
Discussion: https://postgr.es/m/353062.1591898766@sss.pgh.pa.us
In passing, also remove a few surplus empty lines from pg_ltoa and
pg_ulltoa_n in numutils.c.
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87y2ou3xuh.fsf@news-spur.riddles.org.uk
Backpatch-through: 13, where these changes were introduced
Rewrite the documentation of these functions, in light of recent bug fix
commit 5940ffb2.
Back-patch to 13 where the check-for-conflict-out code was split up into
AM-specific and generic parts, and new documentation was added that now
looked wrong.
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io
When merging two NumericAggStates, the code missed adding the new
state's NaNcount unless its N was also nonzero; since those counts
are independent, this is wrong.
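Schematically, the corrected merge is (simplified; field names per
NumericAggState):

    state1->NaNcount += state2->NaNcount;   /* add even when state2->N == 0 */
    if (state2->N > 0)
    {
        state1->N += state2->N;
        accum_sum_combine(&state1->sumX, &state2->sumX);
    }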
This would only have visible effect if some partial aggregate scans
found only NaNs while earlier ones found only non-NaNs; then we could
end up falsely deciding that there were no NaNs and fail to return a
NaN final result as expected. That's pretty improbable, so it's no
surprise this hasn't been reported from the field. Still, it's a bug.
I didn't try to produce a regression test that would show the bug,
but I did notice that these functions weren't being reached at all
in our regression tests, so I improved the tests to at least
exercise them. With these additions, I see pretty complete code
coverage on the aggregation-related functions in numeric.c.
Back-patch to 9.6 where this code was introduced. (I only added
the improved test case as far back as v10, though, since the
relevant part of aggregates.sql isn't there at all in 9.6.)
Eliminate enable_groupingsets_hash_disk, which was primarily useful
for testing grouping sets that use HashAgg and spill. Instead, hack
the table stats to convince the planner to choose hashed aggregation
for grouping sets that will spill to disk. Suggested by Melanie
Plageman.
Rename enable_hashagg_disk to hashagg_avoid_disk_plan, and invert the
meaning of on/off. The new name indicates more strongly that it only
affects the planner. Also, the word "avoid" is less definite, which
should avoid surprises when HashAgg still needs to use the
disk. Change suggested by Justin Pryzby, though I chose a different
GUC name.
Discussion: https://postgr.es/m/CAAKRu_aisiENMsPM2gC4oUY1hHG3yrCwY-fXUg22C6_MJUwQdA%40mail.gmail.com
Discussion: https://postgr.es/m/20200610021544.GA14879@telsasoft.com
Backpatch-through: 13
SSI's HeapCheckForSerializableConflictOut() test failed to correctly
handle conditions involving a concurrently inserted tuple which is later
concurrently updated by a separate transaction. A SELECT statement
that called HeapCheckForSerializableConflictOut() could end up using the
same XID (updater's XID) for both the original tuple, and the successor
tuple, missing the XID of the xact that created the original tuple
entirely. This only happened when neither tuple from the chain was
visible to the transaction's MVCC snapshot.
The observable symptoms of this bug were subtle. A pair of transactions
could commit, with the later transaction failing to observe the effects
of the earlier transaction (because of the confusion created by the
update to the non-visible row). This bug dates all the way back to
commit dafaa3ef, which added SSI.
To fix, make sure that we check the xmin of concurrently inserted tuples
that happen to also have been updated concurrently.
Author: Peter Geoghegan
Reported-By: Kyle Kingsbury
Reviewed-By: Thomas Munro
Discussion: https://postgr.es/m/db7b729d-0226-d162-a126-8a8ab2dc4443@jepsen.io
Backpatch: All supported versions
The previous description string still described the pre-PostgreSQL 10
(pre eb61136dc75a76caef8460fa939244d8593100f2) behavior of selecting
between encrypted and unencrypted passwords, but the parameter now
chooses between encryption algorithms.
Commit cec2edfa78 introduced logical_decoding_work_mem to limit
ReorderBuffer memory usage. We spill the changes once the memory occupied
by changes exceeds logical_decoding_work_mem. There was an assumption
in the code that by evicting the largest (sub)transaction we will come
under the memory limit as the selected transaction will be at least as
large as the most recent change (which caused us to go over the memory
limit). However, that is not true because a user can reduce the
logical_decoding_work_mem to a smaller value before the most recent
change.
We fix it by evicting transactions until we come under the memory
limit.
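Schematically, the check now loops (simplified; cf.
ReorderBufferCheckMemoryLimit):

    while (rb->size >= logical_decoding_work_mem * 1024L)
    {
        /* pick the largest (sub)transaction and spill it to disk */
        ReorderBufferTXN *txn = ReorderBufferLargestTXN(rb);

        ReorderBufferSerializeTXN(rb, txn);
    }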
Reported-by: Fujii Masao
Author: Amit Kapila
Reviewed-by: Fujii Masao
Backpatch-through: 13, where it was introduced
Discussion: https://postgr.es/m/2b7ba291-22e0-a187-d167-9e5309a3458d@oss.nttdata.com
The redo routines for XLOG_CHECKPOINT_{ONLINE,SHUTDOWN} must acquire
ControlFileLock before modifying ControlFile->checkPointCopy, or the
checkpointer could write out a control file with a bad checksum.
Likewise, XLogReportParameters() must acquire ControlFileLock before
modifying ControlFile and calling UpdateControlFile().
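A minimal sketch of the required pattern (illustrative; the real updates
are field-by-field):

    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    ControlFile->checkPointCopy = checkPoint;
    LWLockRelease(ControlFileLock);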
Back-patch to all supported releases.
Author: Nathan Bossart <bossartn@amazon.com>
Author: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/70BF24D6-DC51-443F-B55A-95735803842A%40amazon.com
Since database connections can be used with WAL senders as of 9.4, it is
possible to use physical replication over such a connection. This commit
fixes a crash when starting physical replication with a WAL sender using
a database connection, caused by the refactoring done in 850196b.
There have been discussions about forbidding the use of physical
replication in a database connection, but this is left for later,
taking care only of the crash new to 13.
While on it, add a test to check for a failure when attempting logical
replication if the WAL sender does not have a database connection. This
part is extracted from a larger patch by Kyotaro Horiguchi.
Reported-by: Vladimir Sitnikov
Author: Michael Paquier, Kyotaro Horiguchi
Reviewed-by: Kyotaro Horiguchi, Álvaro Herrera
Discussion: https://postgr.es/m/CAB=Je-GOWMj1PTPkeUhjqQp-4W3=nW-pXe2Hjax6rJFffB5_Aw@mail.gmail.com
Backpatch-through: 13
Commit 24d85952 made a change that indirectly caused a performance
regression by triggering a change in the way GCC optimizes memcpy() on
some platforms.
The behavior seemed to contradict a GCC document, so I filed a report:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556
This patch implements a narrow workaround which eliminates the
regression I observed. The workaround is benign enough that it seems
unlikely to cause a different regression on another platform.
Discussion: https://postgr.es/m/99b2eab335c1592c925d8143979c8e9e81e1575f.camel@j-davis.com
ineq_histogram_selectivity() can be invoked in situations where the
ordering we care about is not that of the column's histogram. We could
be considering some other collation, or even more drastically, the
query operator might not agree at all with what was used to construct
the histogram. (We'll get here for anything using scalarineqsel-based
estimators, so that's quite likely to happen for extension operators.)
Up to now we just ignored this issue and assumed we were dealing with
an operator/collation whose sort order exactly matches the histogram,
possibly resulting in junk estimates if the binary search gets confused.
It's past time to improve that, since the use of nondefault collations
is increasing. What we can do is verify that the given operator and
collation match what's recorded in pg_statistic, and use the existing
code only if so. When they don't match, instead execute the operator
against each histogram entry, and take the fraction of successes as our
selectivity estimate. This gives an estimate that is probably good to
about 1/histogram_size, with no assumptions about ordering. (The quality
of the estimate is likely to degrade near the ends of the value range,
since the two orderings probably don't agree on what is an extremal value;
but this is surely going to be more reliable than what we did before.)
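Schematically, the fallback estimate is computed as (simplified):

    int     nmatches = 0;

    for (int i = 0; i < sslot.nvalues; i++)
    {
        if (DatumGetBool(FunctionCall2Coll(opproc, collation,
                                           sslot.values[i], constval)))
            nmatches++;
    }
    selec = (double) nmatches / (double) sslot.nvalues;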
At some point we might further improve matters by storing more than one
histogram calculated according to different orderings. But this code
would still be good fallback logic when no matches exist, so that is
not an argument for not doing this.
While here, also improve get_variable_range() to deal more honestly
with non-default collations.
This isn't back-patchable, because it requires adding another argument
to ineq_histogram_selectivity, and because it might have significant
impact on the estimation results for extension operators relying on
scalarineqsel --- mostly for the better, one hopes, but in any case
destabilizing plan choices in back branches is best avoided.
Per investigation of a report from James Lucas.
Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
Commit 5e0928005 changed the planner so that, instead of blindly using
DEFAULT_COLLATION_OID when invoking operators for selectivity estimation,
it would use the collation of the column whose statistics we're
considering. This was recognized as still being not quite the right
thing, but it seemed like a good incremental improvement. However,
shortly thereafter we introduced nondeterministic collations, and that
creates cases where operators can fail if they're passed the wrong
collation. We don't want planning to fail in cases where the query itself
would work, so this means that we *must* use the query's collation when
invoking operators for estimation purposes.
The only real problem this creates is in ineq_histogram_selectivity, where
the binary search might produce a garbage answer if we perform comparisons
using a different collation than the column's histogram is ordered with.
However, when the query's collation is significantly different from the
column's default collation, the estimate we previously generated would be
pretty irrelevant anyway; so it's not clear that this will result in
noticeably worse estimates in practice. (A follow-on patch will improve
this situation in HEAD, but it seems too invasive for back-patch.)
The patch requires changing the signatures of mcv_selectivity and allied
functions, which are exported and very possibly are used by extensions.
In HEAD, I just did that, but an API/ABI break of this sort isn't
acceptable in stable branches. Therefore, in v12 the patch introduces
"mcv_selectivity_ext" and so on, with signatures matching HEAD, and makes
the old functions into wrappers that assume DEFAULT_COLLATION_OID should
be used. That does not match the prior behavior, but it should avoid risk
of failure in most cases. (In practice, I think most extension datatypes
aren't collation-aware, so the change probably doesn't matter to them.)
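In v12, the compatibility shim looks roughly like this (signature
approximate):

    double
    mcv_selectivity(VariableStatData *vardata, FmgrInfo *opproc,
                    Datum constval, bool varonleft, double *sumcommonp)
    {
        return mcv_selectivity_ext(vardata, opproc, DEFAULT_COLLATION_OID,
                                   constval, varonleft, sumcommonp);
    }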
Per report from James Lucas. Back-patch to v12 where the problem was
introduced.
Discussion: https://postgr.es/m/CAAFmbbOvfi=wMM=3qRsPunBSLb8BFREno2oOzSBS=mzfLPKABw@mail.gmail.com
If the flag value is lost, logical decoding would work the same way as
REPLICA IDENTITY NOTHING, meaning that no old tuple values would be
included anymore in the changes produced by logical decoding.
Author: Michael Paquier
Reviewed-by: Euler Taveira
Discussion: https://postgr.es/m/20200603065340.GK89559@paquier.xyz
Backpatch-through: 12
It's intentional that we don't allow values greater than 24 hours,
while we do allow "24:00:00" as well as "23:59:60" as inputs.
However, the range check was miscoded in such a way that it would
accept "23:59:60.nnn" with a nonzero fraction. For time or timetz,
the stored result would then be greater than "24:00:00" which would
fail dump/reload, not to mention possibly confusing other operations.
Fix by explicitly calculating the result and making sure it does not
exceed 24 hours. (This calculation is redundant with what will happen
later in tm2time or tm2timetz. Maybe someday somebody will find that
annoying enough to justify refactoring to avoid the duplication; but
that seems too invasive for a back-patched bug fix, and the cost is
probably unmeasurable anyway.)
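The check boils down to computing the whole value in microseconds and
comparing against one day (sketch, using constants from datetime.h):

    if ((((((int64) tm->tm_hour * MINS_PER_HOUR + tm->tm_min)
           * SECS_PER_MINUTE + tm->tm_sec) * USECS_PER_SEC) + fsec)
        > USECS_PER_DAY)
        return DTERR_FIELD_OVERFLOW;    /* exceeds 24:00:00 */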
Note that this change also rejects such input as the time portion
of a timestamp(tz) value.
Back-patch to v10. The bug is far older, but to change this pre-v10
we'd need to ensure that the logic behaves sanely with float timestamps,
which is possibly nontrivial due to roundoff considerations.
Doesn't really seem worth troubling with.
Per report from Christoph Berg.
Discussion: https://postgr.es/m/20200520125807.GB296739@msg.df7cb.de
Fix some more violations of the "only straight-line code inside a
spinlock" rule. These are hazardous not only because they risk
holding the lock for an excessively long time, but because it's
possible for palloc to throw elog(ERROR), leaving a stuck spinlock
behind.
copy_replication_slot() had two separate places that did pallocs
while holding a spinlock. We can make the code simpler and safer
by copying the whole ReplicationSlot struct into a local variable
while holding the spinlock, and then referencing that copy.
(While that's arguably more cycles than we really need to spend
holding the lock, the struct isn't all that big, and this way seems
far more maintainable than copying fields piecemeal. Anyway this
is surely much cheaper than a palloc.) That bug goes back to v12.
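The pattern, roughly (cf. copy_replication_slot):

    ReplicationSlot first_slot_contents;

    SpinLockAcquire(&src->mutex);
    first_slot_contents = *src;     /* plain struct copy; no palloc here */
    SpinLockRelease(&src->mutex);

    /* subsequent code examines first_slot_contents, not *src */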
InvalidateObsoleteReplicationSlots() not only did a palloc while
holding a spinlock, but for extra sloppiness then leaked the memory
--- probably for the lifetime of the checkpointer process, though
I didn't try to verify that. Fortunately that silliness is new
in HEAD.
pg_get_replication_slots() had a cosmetic violation of the rule,
in that it assumed it's safe to call namecpy() while holding
a spinlock. Still, that's a hazard waiting to bite somebody, and
there were some other cosmetic coding-rule violations in the same
function, so clean it up. I back-patched this as far as v10; the
code exists before that but it looks different, and this didn't
seem important enough to adapt the patch further back.
Discussion: https://postgr.es/m/20200602.161518.1399689010416646074.horikyota.ntt@gmail.com
Previously UpdateSpillStats() called elog(DEBUG2) while holding
the spinlock, even though the local variables that the elog() accesses
don't need to be protected by the lock. Since spinlocks are intended
for very short-term locks, they should not be held while calling
elog(DEBUG2). So this commit moves that elog() out of the section
protected by the spinlock.
Author: Kyotaro Horiguchi
Reviewed-by: Amit Kapila and Fujii Masao
Discussion: https://postgr.es/m/20200602.161518.1399689010416646074.horikyota.ntt@gmail.com
This issue has been present since the introduction of this code as of
a3519a2 from 2002, and has been found by buildfarm member prion, which
uses RELCACHE_FORCE_RELEASE, via the tests introduced recently in
e786be5.
Discussion: https://postgr.es/m/20200601022055.GB4121@paquier.xyz
Backpatch-through: 9.5
A relation that has no storage initializes rd_tableam to NULL, which
caused those two functions to crash because of a NULL pointer dereference.
Note that in 11 and older versions, this has always failed with a
confusing error "could not open file".
These two functions are used by the Postgres ODBC driver, which requires
them only when connecting to a backend strictly older than 8.1. When
connected to 8.2 or a newer version, the driver uses a RETURNING clause
instead, whose support was added in 8.2, so it should be possible to
just remove both functions in the future. This is left as an issue to
address later.
While on it, add more regression tests for those functions as we never
really had coverage for them, and for aggregates of TIDs.
Reported-by: Jaime Casanova, via sqlsmith
Author: Michael Paquier
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/CAJGNTeO93u-5APMga6WH41eTZ3Uee9f3s8dCpA-GSSqNs1b=Ug@mail.gmail.com
Backpatch-through: 12
Commit 1f39bce021 added disk-based hash aggregation, which may spill
incoming tuples to disk. It however did not request projection to make
the tuples as narrow as possible, which may mean having to spill much
more data than necessary (increasing I/O, pushing other stuff from page
cache, etc.).
This adds CP_SMALL_TLIST in places that may use hash aggregation - we do
that only for AGG_HASHED. It's unnecessary for AGG_SORTED, because that
either uses explicit Sort (which already does projection) or pre-sorted
input (which does not need spilling to disk).
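Schematically, in create_agg_plan() and friends (simplified):

    int     flags = CP_LABEL_TLIST;

    if (best_path->aggstrategy == AGG_HASHED)
        flags |= CP_SMALL_TLIST;    /* keep spilled tuples narrow */
    subplan = create_plan_recurse(root, best_path->subpath, flags);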
Author: Tomas Vondra
Reviewed-by: Jeff Davis
Discussion: https://postgr.es/m/20200519151202.u2p2gpiawoaznsv2%40development
The repeat() function loops for potentially a long time without
ever checking for interrupts. This prevents, for example, a query
cancel from interrupting until the work is all done. Fix by
inserting a CHECK_FOR_INTERRUPTS() into the loop.
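The fix is the standard pattern (sketch):

    for (i = 0; i < count; i++)
    {
        CHECK_FOR_INTERRUPTS();
        memcpy(cp, sp, slen);
        cp += slen;
    }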
Backpatch to all supported versions.
Discussion: https://www.postgresql.org/message-id/flat/8692553c-7fe8-17d9-cbc1-7cddb758f4c6%40joeconway.com
ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE was used in an ereport with the
same message but different errdetail a few lines earlier, so use that
here as well.
Backpatch-through: 11
Disk-based HashAgg relies on writing to multiple tapes
concurrently. Avoid fragmentation of the tapes' blocks by
preallocating many blocks for a tape at once. No file operations are
performed during preallocation; only the block numbers are reserved.
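A toy sketch of the idea (names hypothetical; the real code is in
logtape.c):

    #define PREALLOC_BATCH 128

    typedef struct Tape
    {
        int     nprealloc;                  /* reserved numbers remaining */
        long    prealloc[PREALLOC_BATCH];   /* reserved block numbers */
    } Tape;

    static long next_free_block = 0;

    /* Hand out block numbers for one tape from a reserved batch, so each
     * tape's blocks cluster on disk; reserving numbers does no file I/O. */
    static long
    tape_next_block(Tape *tape)
    {
        if (tape->nprealloc == 0)
        {
            tape->nprealloc = PREALLOC_BATCH;
            /* fill in reverse so numbers are handed out in ascending order */
            for (int i = tape->nprealloc - 1; i >= 0; i--)
                tape->prealloc[i] = next_free_block++;
        }
        return tape->prealloc[--tape->nprealloc];
    }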
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/20200519151202.u2p2gpiawoaznsv2%40development
These were missed when they were added to pg_hba.conf in PG 12; this
updates the docs and pg_hba.conf.sample.
Reported-by: Arthur Nascimento
Bug: 16380
Discussion: https://postgr.es/m/20200421182736.GG19613@momjian.us
Backpatch-through: 12