According to the README, we acquire predicate locks on entry tree leaves
and posting tree roots. However, when ginFindLeafPage() is going to lock
a leaf in exclusive mode, it checks the root for conflicts regardless of
whether it's an entry or a posting tree. Assuming that we never place a
predicate lock on an entry tree root (excluding the corner case when the
root is a leaf), this check is redundant. This commit removes the check.
Now, root conflict checking is controlled by a separate argument of
ginFindLeafPage().
Discussion: https://postgr.es/m/CAPpHfdv7rrDyy%3DMgsaK-L9kk0AH7az0B-mdC3w3p0FSb9uoyEg%40mail.gmail.com
Author: Alexander Korotkov
Backpatch-through: 11
Inheritance trees can include temporary tables if the parent is
permanent, which makes possible the presence of multiple temporary
children from different sessions. Trying to issue a TRUNCATE on the
parent in this scenario causes a failure, so, as with any other query,
just ignore such children, which makes TRUNCATE work transparently.
This makes truncation behave like any other DML query working on the
parent table, where the operation needs to be issued on the children as
well. A set of isolation tests is added to cover basic cases.
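For instance, a minimal sketch of the scenario (hypothetical names):

    -- session 1
    CREATE TABLE parent (a int);
    -- session 2
    CREATE TEMP TABLE child () INHERITS (parent);
    -- session 1: this failed before the fix, and now silently skips
    -- the other session's temporary child
    TRUNCATE parent;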
Reported-by: Zhou Digoal
Author: Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/15565-ce67a48d0244436a@postgresql.org
Backpatch-through: 9.4
At the end of recovery for the post-promotion process, a new history
file is created, followed by the last partial segment of the previous
timeline. Based on that timing, the archiver would first try to archive
the last partial segment and then the history file. This can delay the
detection that a new timeline has been taken, particularly depending on
the time it takes to transfer the last partial segment, as that delays
the moment the history file of the new timeline gets archived. If
multiple instances look at archives in the same location, this can
cause a promoted standby to use a timeline that has already been taken.
This commit changes the order of archiving so that history files are
archived with priority over other file types, which reduces the
likelihood of the same timeline being taken twice (though it does not
reduce the window to zero), and it makes the archiver behave more
consistently with the startup process doing its post-promotion
business.
Author: David Steele
Reviewed-by: Michael Paquier, Kyotaro Horiguchi
Discussion: https://postgr.es/m/929068cf-69e1-bba2-9dc0-e05986aed471@pgmasters.net
Backpatch-through: 9.5
COPY can skip writing WAL when loading data into a table which has been
created in the same transaction as the one loading the data. However,
this cannot work for views or foreign tables, as it would result in
trying to flush relation files which do not exist. So disable the
optimization, so that the commands work the same way with any
configuration of wal_level.
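A minimal sketch of the view case, assuming wal_level = minimal and
hypothetical object names:

    BEGIN;
    CREATE TABLE t (a int);
    CREATE VIEW v AS SELECT * FROM t;
    CREATE FUNCTION v_ins() RETURNS trigger LANGUAGE plpgsql AS
      $$ BEGIN INSERT INTO t VALUES (NEW.a); RETURN NEW; END $$;
    CREATE TRIGGER v_ins INSTEAD OF INSERT ON v
      FOR EACH ROW EXECUTE PROCEDURE v_ins();
    COPY v FROM STDIN;  -- used to trip the WAL-skip optimization
    1
    \.
    COMMIT;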
Tests are added to cover the different cases; they need wal_level set
to minimal for the problem to show up, which is not the default
configuration.
Reported-by: Luis M. Carril, Etsuro Fujita
Author: Amit Langote, Michael Paquier
Reviewed-by: Etsuro Fujita
Discussion: https://postgr.es/m/15552-c64aa14c5c22f63c@postgresql.org
Backpatch-through: 10, where support for COPY on views has been added,
while v11 has added support for COPY on foreign tables.
013ebc0a7b implements so-called GiST microvacuum. That is,
gistgettuple() marks index tuples as dead when kill_prior_tuple is set.
Later, when a new tuple insertion claims page space, those dead index
tuples are physically deleted from the page. When this deletion is
replayed on a standby, it might conflict with read-only queries. But
013ebc0a7b doesn't handle this, which may lead to the disappearance of
some tuples from read-only snapshots on the standby.
This commit implements resolution of conflicts between replay of GiST
microvacuum and standby queries. On the master we introduce a new WAL
record type, XLOG_GIST_DELETE, which carries the necessary information.
On stable releases we have to be careful to keep WAL compatibility: the
information required for conflict processing is just appended to the
data of the XLOG_GIST_PAGE_UPDATE record, so a PostgreSQL version that
doesn't know about conflict processing will simply ignore it.
Reported-by: Andres Freund
Diagnosed-by: Andres Freund
Discussion: https://postgr.es/m/20181212224524.scafnlyjindmrbe6%40alap3.anarazel.de
Author: Alexander Korotkov
Backpatch-through: 9.6
For probably bogus reasons, we acquire only AccessShareLock on the
partition when we try to detach it from its parent partitioned table.
This can cause ugly things to happen if another transaction is doing
any sort of DDL to the partition concurrently.
Upgrade that lock to ShareUpdateExclusiveLock, which per discussion
seems to be the minimum needed.
Reported by Robert Haas.
Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com
When a partition is detached from its parent, we acquire locks on all
attached indexes to also detach them ... but we release those locks
immediately. This is a violation of the policy of keeping locks on user
objects to the end of the transaction. Bug introduced in 8b08f7d482.
It's unclear whether any ill effects are actually possible, but it's
clearly wrong nonetheless. Bad behavior likely *is* possible, but
mostly because the relation that the index is for is only locked with
AccessShareLock, which is an older bug that shall be fixed separately.
While touching that line of code, close the index opened with
index_open() using index_close() instead of relation_close().
No difference in practice, but let's be consistent.
Unearthed by Robert Haas.
Discussion: https://postgr.es/m/CA+TgmoYruJQ+2qnFLtF1xQtr71pdwgfxy3Ziy-TxV28M6pEmyA@mail.gmail.com
The flag for IF NOT EXISTS was only being passed down in the normal
recursing case. It's been this way since the feature was originally
added to 9.6 in commit 2cd40adb85, so backpatch to 9.6.
Commit 40dae7ec53, which made the handling of interrupted nbtree page
splits more robust, removed an nbtree-specific end-of-recovery cleanup
step. This meant that it was no longer possible to complete an
interrupted page split during recovery. However, a reference to
recovery as a reason for using a NULL stack while inserting into a
parent page was missed. Remove the reference.
Remove a similar obsolete reference to recovery that was introduced much
more recently, as part of the btree fastpath optimization enhancement
that made it into Postgres 11 (commit 2b272734, and follow-up commits).
Backpatch: 11-, where the fastpath optimization was introduced.
"rescanratio" was computed as 1 + rescanned-tuples / total-inner-tuples,
which is sensible if it's to be multiplied by total-inner-tuples or a cost
value corresponding to scanning all the inner tuples. But in reality it
was (mostly) multiplied by inner_rows or a related cost, numbers that take
into account the possibility of stopping short of scanning the whole inner
relation thanks to a limited key range in the outer relation. This'd
still make sense if we could expect that stopping short would result in a
proportional decrease in the number of tuples that have to be rescanned.
It does not, however. The argument that establishes the validity of our
estimate for that number is independent of whether we scan all of the inner
relation or stop short, and experimentation also shows that stopping short
doesn't reduce the number of rescanned tuples. So the correct calculation
is 1 + rescanned-tuples / inner_rows, and we should be sure to multiply
that by inner_rows or a corresponding cost value.
Most of the time this doesn't make much difference, but if we have
both a high rescan rate (due to lots of duplicate values) and an outer
key range much smaller than the inner key range, then the error can
be significant, leading to a large underestimate of the cost associated
with rescanning.
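To illustrate with made-up numbers: suppose 900 of 1000 total inner
tuples get rescanned, but the outer key range stops the scan at
inner_rows = 100. Then:

    old:     (1 + 900/1000) * 100 =  190 tuples charged
    correct: (1 + 900/100)  * 100 = 1000 tuples charged
                                    (100 scanned + 900 rescanned)

a five-fold underestimate of the rescan-related work.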
Per report from Vijaykumar Jain. This thinko appears to go all the way
back to the introduction of the rescan estimation logic in commit
70fba7043, so back-patch to all supported branches.
Discussion: https://postgr.es/m/CAE7uO5hMb_TZYJcZmLAgO6iD68AkEK6qCe7i=vZUkCpoKns+EQ@mail.gmail.com
This is an oversight from recent commit b13fd344. While on it, tweak
the previous test with a better name for the renamed primary key.
Detected by buildfarm member prion which forces relation cache release
with -DRELCACHE_FORCE_RELEASE. Back-patch down to 9.4 as the previous
commit.
When a constraint gets renamed, it may have associated with it a target
relation (for example domain constraints don't have one). Not
invalidating the target relation cache when issuing the renaming can
result in issues with subsequent commands that refer to the old
constraint name using the relation cache, causing various failures. One
pattern spotted was using CREATE TABLE LIKE after a constraint
renaming.
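A minimal sketch of that pattern (hypothetical names):

    CREATE TABLE t1 (a int CONSTRAINT t1_chk CHECK (a > 0));
    ALTER TABLE t1 RENAME CONSTRAINT t1_chk TO t1_check;
    -- with a stale relation cache entry, this could fail before
    -- the fix:
    CREATE TABLE t2 (LIKE t1 INCLUDING CONSTRAINTS);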
Reported-by: Stuart <sfbarbee@gmail.com>
Author: Amit Langote
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/2047094.V130LYfLq4@station53.ousa.org
Commit ffa4cbd62 added logic to detect SIGPIPE failure of a COPY child
process, but it only worked correctly if the SIGPIPE occurred in the
immediate child process. Depending on the shell in use and the
complexity of the shell command string, we might instead get back
an exit code of 128 + SIGPIPE, representing a shell error exit
reporting SIGPIPE in the child process.
We could just hack up ClosePipeToProgram() to add the extra case,
but it seems like this is a fairly general issue deserving a more
general and better-documented solution. I chose to add a couple
of functions in src/common/wait_error.c, which is a natural place
to know about wait-result encodings, that will test for either a
specific child-process signal type or any child-process signal failure.
Then, adjust other places that were doing ad-hoc tests of this type
to use the common functions.
In RestoreArchivedFile, this fixes a race condition affecting whether
the process will report an error or just silently proc_exit(1): before,
that depended on whether the intermediate shell got SIGTERM'd itself
or reported a child process failing on SIGTERM.
Like the previous patch, back-patch to v10; we could go further
but there seems no real need to.
Per report from Erik Rijkers.
Discussion: https://postgr.es/m/f3683f87ab1701bea5d86a7742b22432@xs4all.nl
The idea here is to not call recordDependencyOn for the default collation,
since we know that's pinned. But what the code actually did was to record
the partition key's dependency on the opclass twice, instead.
Evidently introduced by sloppy coding in commit 2186b608b. Back-patch
to v10 where that came in.
When GIN vacuum deletes a posting tree page, it assumes that no
concurrent searchers can access it, thanks to ginStepRight() locking
two pages at once. However, since 9.4 searches can skip parts of
posting trees while descending from the root. That leads to the risk
that a page is deleted and reclaimed before a concurrent search can
access it.
This commit prevents that risk by waiting for every transaction that
might still reference the page to finish. Due to binary compatibility
we can't change GinPageOpaqueData to store the corresponding
transaction id. Instead we reuse the page header's pd_prune_xid field,
which is unused in index pages.
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Andrey Borodin, Alexander Korotkov
Reviewed-by: Alexander Korotkov
Backpatch-through: 9.4
On a standby, ginRedoDeletePage() can run concurrently with read-only
queries. Those queries can traverse the posting tree in two ways.
1) Using rightlinks via ginStepRight(), which locks the next page
before unlocking its left sibling.
2) Using downlinks via ginFindLeafPage(), which locks at most one page
at a time.
The original lock order was: page, parent, left sibling. That lock
order can deadlock with ginStepRight(). To prevent the deadlock, this
commit changes the lock order to: left sibling, page, parent. Note
that the position of the parent in the locking order seems
insignificant, because we only lock one page at a time while traversing
downlinks.
Reported-by: Chen Huajun
Diagnosed-by: Chen Huajun, Peter Geoghegan, Andrey Borodin
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Alexander Korotkov
Backpatch-through: 9.4
Before 218f51584d, if a posting tree page was about to be deleted, the
whole posting tree was locked by LockBufferForCleanup() on the root,
preventing all concurrent inserts. 218f51584d reduced the locking to
the subtree containing the page to be deleted. However, due to
concurrent parent splits, an inserter doesn't always hold pins on all
the pages constituting the path from the root to the target leaf page.
That could cause a deadlock between the GIN vacuum process and a GIN
inserter, and we didn't find a non-invasive way to fix this.
This commit reverts the VACUUM behavior to locking the whole posting
tree before deleting any page. However, we keep another useful change
from 218f51584d: the tree is locked only if there are pages to be
deleted.
Reported-by: Chen Huajun
Diagnosed-by: Chen Huajun, Andrey Borodin, Peter Geoghegan
Discussion: https://postgr.es/m/31a702a.14dd.166c1366ac1.Coremail.chjischj%40163.com
Author: Alexander Korotkov, based on ideas from Andrey Borodin and Peter Geoghegan
Reviewed-by: Andrey Borodin
Backpatch-through: 10
postgres_fdw's postgresGetForeignPlan() assumes without checking that the
outer_plan it's given for a join relation must have a NestLoop, MergeJoin,
or HashJoin node at the top. That's been wrong at least since commit
4bbf6edfb (which could cause insertion of a Sort node on top) and it seems
like a pretty unsafe thing to Just Assume even without that.
Through blind good fortune, this doesn't seem to have any worse
consequences today than strange EXPLAIN output, but it's clearly trouble
waiting to happen.
To fix, test the node type explicitly before touching Join-specific
fields, and avoid jamming the new tlist into a node type that can't
do projection. Export a new support function from createplan.c
to avoid building low-level knowledge about the latter into FDWs.
Back-patch to 9.6 where the faulty coding was added. Note that the
associated regression test cases don't show any changes before v11,
apparently because the tests back-patched with 4bbf6edfb don't actually
exercise the problem case before then (there's no top-level Sort
in those plans).
Discussion: https://postgr.es/m/8946.1544644803@sss.pgh.pa.us
Our support for multiple-set-clauses in UPDATE assumes that the Params
referencing a MULTIEXPR_SUBLINK SubPlan will appear before that SubPlan
in the targetlist of the plan node that calculates the updated row.
(Yeah, it's a hack...) In some PG branches it's possible that a Result
node gets inserted between the primary calculation of the update tlist
and the ModifyTable node. setrefs.c did the wrong thing in this case
and left the upper-level Params as Params, causing a crash at runtime.
What it should do is replace them with "outer" Vars referencing the child
plan node's output. That's a result of careless ordering of operations
in fix_upper_expr_mutator, so we can fix it just by reordering the code.
Fix fix_join_expr_mutator similarly for consistency, even though join
nodes could never appear in such a context. (In general, it seems
likely to be a bit cheaper to use Vars than Params in such situations
anyway, so this patch might offer a tiny performance improvement.)
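A sketch of the query shape involved (hypothetical tables):

    CREATE TABLE t (id int, a int, b int);
    CREATE TABLE other (id int, x int, y int);
    -- a multiple-set-clause UPDATE; the Params referencing the
    -- MULTIEXPR_SUBLINK SubPlan crashed at runtime when a Result node
    -- intervened below the ModifyTable node
    UPDATE t SET (a, b) = (SELECT x, y FROM other o WHERE o.id = t.id);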
The hazard extends back to 9.5 where the MULTIEXPR_SUBLINK stuff
was introduced, so back-patch that far. However, this may be a live
bug only in 9.6.x and 10.x, as the other branches don't seem to want
to calculate the final tlist below the Result node. (That plan shape
change between branches might be a mini-bug in itself, but I'm not
really interested in digging into the reasons for that right now.
Still, add a regression test memorializing what we expect there,
so we'll notice if it changes again.)
Per bug report from Eduards Bezverhijs.
Discussion: https://postgr.es/m/b6cd572a-3e44-8785-75e9-c512a5a17a73@tieto.com
Although copyfuncs.c has a check_stack_depth call in its recursion,
equalfuncs.c, outfuncs.c, and readfuncs.c lacked one. This seems
unwise.
Likewise fix planstate_tree_walker(), in branches where that exists.
Discussion: https://postgr.es/m/30253.1544286631@sss.pgh.pa.us
Previously, it would just pass back a partially-uninitialized tupdesc,
which doesn't seem like a safe or useful behavior.
Backpatch to v10 where this code came in.
Discussion: https://postgr.es/m/30830.1544384975@sss.pgh.pa.us
The stanza of ExecuteTruncate[Guts] that truncates a target table's toast
relation re-used the loop local variable "rel" to reference the toast rel.
This was safe enough when written, but commit d42358efb added code below
that that supposed "rel" still pointed to the parent table. Therefore,
the stats counter update was applied to the wrong relcache entry (the
toast rel not the user rel); and if we were unlucky and that relcache
entry had been flushed during reindex_relation, very bad things could
ensue.
(I'm surprised that CLOBBER_CACHE_ALWAYS testing hasn't found this.
I'm even more surprised that the problem wasn't detected during the
development of d42358efb; it must not have been tested in any case
with a toast table, as the incorrect stats counts are very obvious.)
To fix, replace use of "rel" in that code branch with a more local
variable. Adjust test cases added by d42358efb so that some of them
use tables with toast tables.
Per bug #15540 from Pan Bian. Back-patch to 9.5 where d42358efb came in.
Discussion: https://postgr.es/m/15540-01078812338195c0@postgresql.org
Remove dead code (which would be incorrect if it weren't dead),
per report from Pan Bian. Add a CHECK_FOR_INTERRUPTS in the
inner loop over child relations, because there's little point
in having one in the outer loop if there's not one here too.
Minor stylistic adjustments and comment improvements.
Seems to be aboriginal to this code (cf commit 665d1fad9).
Back-patch to v10 where that came in, not because any of this
is significant, but just to keep the branches looking similar.
Discussion: https://postgr.es/m/15539-06d00ef6b1e2e1bb@postgresql.org
When an index is created on a partitioned table using ONLY (don't
recurse to partitions), it gets marked invalid until index partitions
are attached for each table partition. But there's no reason to do this
if there are no partitions ... and moreover, there's no way to get the
index to become valid afterwards, because all partitions that get
created/attached get their own index partition already attached to the
parent index, so there's no chance to do ALTER INDEX ... ATTACH PARTITION
that would make the parent index valid.
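For illustration (hypothetical names):

    CREATE TABLE parted (a int) PARTITION BY LIST (a);
    CREATE INDEX parted_idx ON ONLY parted (a);
    -- with no partitions, the index used to be marked invalid with no
    -- way to ever validate it; it is now valid from the start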
Fix by not marking the index as invalid to begin with.
This is very similar to 9139aa1942, but the pg_dump aspect does not
appear to be relevant until we add FKs that can point to PKs on
partitioned tables. (I tried to cause the pg_upgrade test to break by
leaving some of these bogus tables around, but wasn't able to.)
Making this change means that an index that was supposed to be invalid
in the insert_conflict regression test is no longer invalid; reorder the
DDL so that the test continues to verify the behavior we want it to.
Author: Álvaro Herrera
Reviewed-by: Amit Langote
Discussion: https://postgr.es/m/20181203225019.2vvdef2ybnkxt364@alvherre.pgsql
Three issues are fixed in this patch:
- Base backups forgot to ignore files specific to EXEC_BACKEND, leading
to spurious warnings when checksums are enabled, per analysis from me.
- pg_verify_checksums forgot about files specific to EXEC_BACKEND,
leading to failures of the tool on any such build, particularly Windows.
This error was originally found by newly-introduced TAP tests in various
buildfarm members using EXEC_BACKEND.
- pg_verify_checksums forgot to account for temporary files and
temporary paths, which can look like valid relation files but have no
checksums, per report from Andres Freund. More tests are added to
cover this case.
A new test case which emulates corruption for a file in a different
tablespace is added, coming from Michael Banck, while I have coded the
main code and refactored the test code.
Author: Michael Banck, Michael Paquier
Reviewed-by: Stephen Frost, David Steele
Discussion: https://postgr.es/m/20181021134206.GA14282@paquier.xyz
This fixes an oversight from c6c3334, which forgot that if a subset of
WAL senders are stopping and in a sync state, other WAL senders could
still be waiting for a WAL position to be synced while committing a
transaction. However, the subset of stopping senders would not release
waiters, potentially breaking synchronous replication guarantees. This
commit makes sure that even stopping WAL senders are able to release
waiters and are tracked properly.
On 9.4, this can also trigger an assertion failure when setting, for
example, max_wal_senders to 1, as a WAL sender is then not able to find
itself in the synchronous state when the instance stops.
Reported-by: Paul Guo
Author: Paul Guo, Michael Paquier
Discussion: https://postgr.es/m/CAEET0ZEv8VFqT3C-cQm6byOB4r4VYWcef1J21dOX-gcVhCSpmA@mail.gmail.com
Backpatch-through: 9.4
Move the responsibility for checking for and reporting a failure from
the only current BufFileSize() caller, logtape.c, to BufFileSize()
itself. Code within buffile.c is generally responsible for interfacing
with fd.c to report irrecoverable failures. This seems like a
convention that's worth sticking to.
Reorganizing things this way makes it easy to make the error message
raised in the event of BufFileSize() failure descriptive of the
underlying problem. We're now clear on the distinction between
temporary file name and BufFile name, and can show errno, confident that
its value actually relates to the error being reported. In passing, an
existing, similar buffile.c ereport() + errcode_for_file_access() site
is changed to follow the same conventions.
The API of the function BufFileSize() is changed by this commit, despite
already being in a stable release (Postgres 11). This seems acceptable,
since the BufFileSize() ABI was changed by commit aa55183042, which
hasn't made it into a point release yet. Besides, it's difficult to
imagine a third party BufFileSize() caller not just raising an error
anyway, since BufFile state should be considered corrupt when
BufFileSize() fails.
Per complaint from Tom Lane.
Discussion: https://postgr.es/m/26974.1540826748@sss.pgh.pa.us
Backpatch: 11-, where shared BufFiles were introduced.
Since commit 2f1d2b7a we have set PAM_RHOST to "[local]" for Unix
sockets. This caused Linux PAM's libaudit integration to make DNS
requests for that name. It's not exactly clear what value PAM_RHOST
should have in that case, but it seems clear that we shouldn't set it
to an unresolvable name, so don't do that.
Back-patch to 9.6. Bug #15520.
Author: Thomas Munro
Reviewed-by: Peter Eisentraut
Reported-by: Albert Schabhuetl
Discussion: https://postgr.es/m/15520-4c266f986998e1c5%40postgresql.org
During table rewrites (VACUUM FULL and CLUSTER), the main heap is logged
using XLOG / FPI records, and thus (correctly) ignored in decoding.
But the associated TOAST table is WAL-logged as plain INSERT records,
and so was logically decoded and passed to reorder buffer.
That has severe consequences with TOAST tables of non-trivial size.
Firstly, reorder buffer has to keep all those changes, possibly spilling
them to a file, incurring I/O costs and disk space.
Secondly, ReorderBufferCommit() was stashing all those TOAST chunks into
a hash table, which got discarded only after processing the row from the
main heap. But as the main heap is not decoded for rewrites, this never
happened, so all the TOAST data accumulated in memory, resulting either
in excessive memory consumption or OOM.
The fix is simple, as commit e9edc1ba already introduced infrastructure
(namely HEAP_INSERT_NO_LOGICAL flag) to skip logical decoding of TOAST
tables, but it only applied it to system tables. So simply use it for
all TOAST data in raw_heap_insert().
That would however solve only the memory consumption issue - the TOAST
changes would still be decoded and added to the reorder buffer, and
spilled to disk (although without TOAST tuple data, so much smaller).
But we can solve that by tweaking DecodeInsert() to just ignore such
INSERT records altogether, using XLH_INSERT_CONTAINS_NEW_TUPLE flag,
instead of skipping them later in ReorderBufferCommit().
Review: Masahiko Sawada
Discussion: https://www.postgresql.org/message-id/flat/1a17c643-e9af-3dba-486b-fbe31bc1823a%402ndquadrant.com
Backpatch: 9.4-, where logical decoding was introduced
The function generated to perform JIT compiled tuple deforming failed
when HeapTupleHeader's t_hoff was bigger than a signed int8 can
represent. I'd failed to realize that LLVM's getelementptr would treat
an int8 index argument as signed, rather than unsigned. That means
that a hoff larger than 127 would result in a negative offset being
applied. Fix that by widening the index to 32bit.
Add a testcase with a wide table. Don't drop it, as it seems useful to
verify other tools deal properly with wide tables.
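A sketch of such a testcase, using hypothetical names (any table whose
header plus null bitmap pushes t_hoff past 127 bytes will do):

    DO $$
    BEGIN
      EXECUTE 'CREATE TABLE wide_tbl (' ||
        (SELECT string_agg('c' || i || ' int', ', ')
           FROM generate_series(1, 1000) i) || ')';
    END $$;
    -- 999 leading NULLs make the null bitmap grow t_hoff beyond 127
    INSERT INTO wide_tbl (c1000) VALUES (1);
    SELECT c1000 FROM wide_tbl;  -- was misread under JIT before the fix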
Thanks to Justin Pryzby for both reporting a bug and then reducing it
to a reproducible testcase!
Reported-By: Justin Pryzby
Author: Andres Freund
Discussion: https://postgr.es/m/20181115223959.GB10913@telsasoft.com
Backpatch: 11, where JIT compilation was introduced
ftoi4 and its sibling coercion functions did their overflow checks in
a way that looked superficially plausible, but actually depended on an
assumption that the MIN and MAX comparison constants can be represented
exactly in the float4 or float8 domain. That fails in ftoi4, ftoi8,
and dtoi8, resulting in a possibility that values near the MAX limit will
be wrongly converted (to negative values) when they need to be rejected.
Also, because we compared before rounding off the fractional part,
the other three functions threw errors for values that really ought
to get rounded to the min or max integer value.
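A hedged illustration of the first hazard:

    -- float4 rounds 2147483647 up to 2^31, which is out of int4 range;
    -- this must raise "integer out of range" rather than silently
    -- yielding a negative value
    SELECT 2147483647::float4::int4;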
Fix by doing rint() first (requiring an assumption that it handles
NaN and Inf correctly; but dtoi8 and ftoi8 were assuming that already),
and by comparing to values that should coerce to float exactly, namely
INTxx_MIN and -INTxx_MIN. Also remove some random cosmetic discrepancies
between these six functions.
This back-patches commits cbdb8b4c0 and 452b637d4. In the 9.4 branch,
also back-patch the portion of 62e2a8dc2 that added PG_INTnn_MIN and
related constants to c.h, so that these functions can rely on them.
Per bug #15519 from Victor Petrovykh.
Patch by me; thanks to Andrew Gierth for analysis and discussion.
Discussion: https://postgr.es/m/15519-4fc785b483201ff1@postgresql.org
populate_recordset_worker() failed to consider the possibility that the
supplied JSON data contains no rows, so that update_cached_tupdesc never
got called. This led to a null-pointer dereference since commit 9a5e8ed28;
before that it led to a bogus "set-valued function called in context that
cannot accept a set" error. Fix by forcing the update to happen.
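A minimal reproduction sketch (hypothetical type name):

    CREATE TYPE two_ints AS (a int, b int);
    -- JSON data containing no rows crashed before this fix
    SELECT * FROM json_populate_recordset(NULL::two_ints, '[]');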
Per bug #15514. Back-patch to v11 as 9a5e8ed28 was. (If we were excited
about the bogus error, we could perhaps go back further, but it'd take more
work to figure out how to fix it in older branches. Given the lack of
field complaints about that aspect, I'm not excited.)
Discussion: https://postgr.es/m/15514-59d5b4c4065b178b@postgresql.org
Per POSIX, WIFSIGNALED and related macros are provided by <sys/wait.h>.
Apparently on Linux they're also pulled in by some other inclusions,
but BSD-ish systems are pickier. Fixes portability issue in ffa4cbd62.
Per buildfarm.
Previously, any program launched by COPY TO/FROM PROGRAM inherited the
server's setting of SIGPIPE handling, i.e. SIG_IGN. Hence, if we were
doing COPY FROM PROGRAM and closed the pipe early, the child process
would see EPIPE on its output file and typically would treat that as
a fatal error, in turn causing the COPY to report error. Similarly,
one could get a failure report from a query that didn't read all of
the output from a contrib/file_fdw foreign table that uses file_fdw's
PROGRAM option.
To fix, ensure that child programs inherit SIG_DFL not SIG_IGN processing
of SIGPIPE. This seems like an all-around better situation since if
the called program wants some non-default treatment of SIGPIPE, it would
expect to have to set that up for itself. Then in COPY, if it's COPY
FROM PROGRAM and we stop reading short of detecting EOF, treat a SIGPIPE
exit from the called program as a non-error condition. This still allows
us to report an error for any case where the called program gets SIGPIPE
on some other file descriptor.
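A hedged file_fdw illustration (hypothetical names):

    CREATE EXTENSION file_fdw;
    CREATE SERVER prog_srv FOREIGN DATA WRAPPER file_fdw;
    CREATE FOREIGN TABLE nums (n text) SERVER prog_srv
      OPTIONS (program 'seq 1 1000000');
    -- stops reading early, so the child gets SIGPIPE on its stdout;
    -- that is no longer reported as an error
    SELECT * FROM nums LIMIT 1;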
As coded, we won't report a SIGPIPE if we stop reading as a result of
seeing an in-band EOF marker (e.g. COPY BINARY EOF marker). It's
somewhat debatable whether we should complain if the called program
continues to transmit data after an EOF marker. However, it seems like
we should avoid throwing error in any questionable cases, especially in a
back-patched fix, and anyway it would take additional code to make such
an error get reported consistently.
Back-patch to v10. We could go further back, since COPY FROM PROGRAM
has been around awhile, but AFAICS the only way to reach this situation
using core or contrib is via file_fdw, which has only supported PROGRAM
sources since v10. The COPY statement per se has no feature whereby
it'd stop reading without having hit EOF or an error already. Therefore,
I don't see any upside to back-patching further that'd outweigh the
risk of complaints about behavioral change.
Per bug #15449 from Eric Cyr.
Patch by me, review by Etsuro Fujita and Kyotaro Horiguchi
Discussion: https://postgr.es/m/15449-1cf737dd5929450e@postgresql.org
This didn't actually work: COPY would fail to flush the right files, and
instead would try to flush a non-existing file, causing the whole
transaction to fail.
Cope by raising an error as soon as the command is sent instead, to
avoid a nasty later surprise. Of course, it would be much better to
make it work, but we don't have a patch for that yet, and we don't know
if we'll want to backpatch one when we do.
Reported-by: Tomas Vondra
Author: David Rowley
Reviewed-by: Amit Langote, Steve Singer, Tomas Vondra
On some operating systems, it doesn't make sense to retry fsync(),
because dirty data cached by the kernel may have been dropped on
write-back failure. In that case the only remaining copy of the
data is in the WAL. A subsequent fsync() could appear to succeed,
but not have flushed the data. That means that a future checkpoint
could apparently complete successfully but have lost data.
Therefore, violently prevent any future checkpoint attempts by
panicking on the first fsync() failure. Note that we already
did the same for WAL data; this change extends that behavior to
non-temporary data files.
Provide a GUC data_sync_retry to control this new behavior, for
users of operating systems that don't eject dirty data, and possibly
forensic/testing uses. If it is set to on and the write-back error
was transient, a later checkpoint might genuinely succeed (on a
system that does not throw away buffers on failure); if the error is
permanent, later checkpoints will continue to fail. The GUC defaults
to off, meaning that we panic.
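A sketch of opting back in, only sensible on kernels that keep dirty
buffers around after a write-back failure:

    ALTER SYSTEM SET data_sync_retry = on;  -- takes effect on restart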
Back-patch to all supported releases.
There is still a narrow window for error-loss on some operating
systems: if the file is closed and later reopened and a write-back
error occurs in the intervening time, but the inode has the bad
luck to be evicted due to memory pressure before we reopen, we could
miss the error. A later patch will address that with a scheme
for keeping files with dirty data open at all times, but we judge
that to be too complicated to back-patch.
Author: Craig Ringer, with some adjustments by Thomas Munro
Reported-by: Craig Ringer
Reviewed-by: Robert Haas, Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
If fsync() fails, md.c must keep the request in its bitmap, so that
future attempts will try again.
Back-patch to all supported releases.
Author: Thomas Munro
Reviewed-by: Amit Kapila
Reported-by: Andrew Gierth
Discussion: https://postgr.es/m/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
With run-time partition pruning, there is no longer necessarily an
executor node for each corresponding plan node.
Author: David Rowley <david.rowley@2ndquadrant.com>
BufFileSize() can't use off_t, because it's only 32 bits wide on
some systems. BufFile objects can have many 1GB segments so the
total size can exceed 2^31. The only known client of the function
is parallel CREATE INDEX, which was reported to fail when building
large indexes on Windows.
Though this is technically an ABI break on platforms with a 32 bit
off_t and we might normally avoid back-patching it, the function is
brand new and thus unlikely to have been discovered by extension
authors yet, and it's fairly thoroughly broken on those platforms
anyway, so just fix it.
Defect in 9da0cc35. Bug #15460. Back-patch to 11, where this
function landed.
Author: Thomas Munro
Reported-by: Paul van der Linden, Pavel Oskin
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/15460-b6db80de822fa0ad%40postgresql.org
Discussion: https://postgr.es/m/CAHDGBJP_GsESbTt4P3FZA8kMUKuYxjg57XHF7NRBoKnR%3DCAR-g%40mail.gmail.com
If a failure happens when a transaction is starting between the moment
the transaction status is changed from TRANS_DEFAULT to TRANS_START and
the moment the current user ID and security context flags are fetched
via GetUserIdAndSecContext(), or before initializing its basic fields,
then those may get reset to incorrect values when the transaction
aborts, leaving the session in an inconsistent state.
One problem reported is that failing a starting transaction at the first
query of a session could cause several kinds of system crashes on the
follow-up queries.
In order to solve that, move the initialization of the transaction state
fields and the call of GetUserIdAndSecContext() in charge of fetching
the current user ID close to the point where the transaction status is
switched to TRANS_START, where there cannot be any error triggered
in-between, per an idea of Tom Lane. This properly ensures that the
current user ID, the security context flags, and the basic fields of
TransactionState remain consistent even if the transaction fails while
starting.
Reported-by: Richard Guo
Diagnosed-By: Richard Guo
Author: Michael Paquier
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAN_9JTxECSb=pEPcb0a8d+6J+bDcOZ4=DgRo_B7Y5gRHJUM=Rw@mail.gmail.com
Backpatch-through: 9.4
Commits 0e141c0fbb and baaf272ac9 introduced initialization of atomic
variables in InitProcess which means that it's not safe to look at those
for backends that aren't currently in use. Fix that by initializing them
during postmaster startup.
Reported-by: Andres Freund
Author: Amit Kapila
Backpatch-through: 9.6
Discussion: https://postgr.es/m/20181027104138.qmbbelopvy7cw2qv@alap3.anarazel.de
Commit 35c0754f failed to handle space-separated lists of alternative
hostnames in ldapserver, when building a URI for ldap_initialize()
(OpenLDAP). Such lists need to be expanded to space-separated URIs.
Repair. Back-patch to 11, to fix bug report #15495.
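For instance, a pg_hba.conf line of the affected form (hypothetical
hosts):

    host all all 0.0.0.0/0 ldap ldapserver="ldap1.example.com ldap2.example.com" ldapbasedn="dc=example,dc=com"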
Author: Thomas Munro
Reported-by: Renaud Navarro
Discussion: https://postgr.es/m/15495-2c39fc196c95cd72%40postgresql.org
Coverity reports a possible buffer overrun in the code that populates the
pg_hba_file_rules view. It may not be a live bug due to restrictions
on options that can be used together, but let's increase MAX_HBA_OPTIONS
and correct a nearby misleading comment.
Back-patch to 10 where this code arrived.
Reported-by: Julian Hsiao
Discussion: https://postgr.es/m/CADnGQpzbkWdKS2YHNifwAvX5VEsJ5gW49U4o-7UL5pzyTv4vTg%40mail.gmail.com
classify_index_clause_usage() is O(N^2) in the number of distinct index
qual clauses it considers, because of its use of a simple search list to
store them. For nearly all queries, that's fine because only a few clauses
will be considered. But Alexander Kuzmenkov reported a machine-generated
query with 80000 (!) index qual clauses, which caused this code to take
forever. Somewhat remarkably, this is the only O(N^2) behavior we now
have for such a query, so let's fix it.
We can get rid of the O(N^2) runtime for cases like this without much
damage to the functionality of choose_bitmap_and() by separating out
paths with "too many" qual or pred clauses, and deeming them to always
be nonredundant with other paths. Then their clauses needn't go into
the search list, so it doesn't get too long, but we don't lose the
ability to consider bitmap AND plans altogether. I set the threshold
for "too many" to be 100 clauses per path, which should be plenty to
ensure no change in planning behavior for normal queries.
There are other things we could do to make this go faster, but it's not
clear that it's worth any additional effort. 80000 qual clauses require
a whole lot of work in many other places, too.
The code's been like this for a long time, so back-patch to all supported
branches. The troublesome query only works back to 9.5 (in 9.4 it fails
with stack overflow in the parser); so I'm not sure that fixing this in
9.4 has any real-world benefit, but perhaps it does.
Discussion: https://postgr.es/m/90c5bdfa-d633-dabe-9889-3cf3e1acd443@postgrespro.ru
This patch fixes several related cases in which pg_shdepend entries were
never made, or were lost, for references to roles appearing in the ACLs of
schemas and/or types. While that did no immediate harm, if a referenced
role were later dropped, the drop would be allowed and would leave a
dangling reference in the object's ACL. That still wasn't a big problem
for normal database usage, but it would cause obscure failures in
subsequent dump/reload or pg_upgrade attempts, taking the form of
attempts to grant privileges to all-numeric role names. (I think I've
seen field reports matching that symptom, but can't find any right now.)
Several cases are fixed here:
1. ALTER DOMAIN SET/DROP DEFAULT would lose the dependencies for any
existing ACL entries for the domain (see the sketch after this list).
This case is ancient, dating back as far as we've had pg_shdepend
tracking at all.
2. If a default type privilege applies, CREATE TYPE recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for types in 9.2.
3. If a default schema privilege applies, CREATE SCHEMA recorded the
ACL properly but forgot to install dependency entries for it.
This dates to the addition of default privileges for schemas in v10
(commit ab89e465c).
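For instance, a minimal sketch of case 1 (hypothetical names):

    CREATE ROLE acl_holder;
    CREATE DOMAIN d AS int;
    GRANT USAGE ON DOMAIN d TO acl_holder;
    ALTER DOMAIN d SET DEFAULT 0;  -- used to lose the ACL's
                                   -- pg_shdepend entries
    DROP ROLE acl_holder;          -- then wrongly succeeded, leaving
                                   -- a dangling reference in d's ACL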
Another somewhat-related problem is that when creating a relation
rowtype or implicit array type, TypeCreate would apply any available
default type privileges to that type, which we don't really want
since such an object isn't supposed to have privileges of its own.
(You can't, for example, drop such privileges once they've been added
to an array type.)
ab89e465c is also to blame for a race condition in the regression tests:
privileges.sql transiently installed globally-applicable default
privileges on schemas, which sometimes got absorbed into the ACLs of
schemas created by concurrent test scripts. This should have resulted
in failures when privileges.sql tried to drop the role holding such
privileges; but thanks to the bug fixed here, it instead led to dangling
ACLs in the final state of the regression database. We'd managed not to
notice that, but it became obvious in the wake of commit da906766c, which
allowed the race condition to occur in pg_upgrade tests.
To fix, add a function recordDependencyOnNewAcl to encapsulate what
callers of get_user_default_acl need to do; while the original call
sites got that right via ad-hoc code, none of the later-added ones
have. Also change GenerateTypeDependencies to generate these
dependencies, which requires adding the typacl to its parameter list.
(That might be annoying if there are any extensions calling that
function directly; but if there are, they're most likely buggy in the
same way as the core callers were, so they need work anyway.) While
I was at it, I changed GenerateTypeDependencies to accept most of its
parameters in the form of a Form_pg_type pointer, making its parameter
list a bit less unwieldy and mistake-prone.
The test race condition is fixed just by wrapping the addition and
removal of default privileges into a single transaction, so that that
state is never visible externally. We might eventually prefer to
separate out tests of default privileges into a script that runs by
itself, but that would be a bigger change and would make the tests
run slower overall.
Back-patch relevant parts to all supported branches.
Discussion: https://postgr.es/m/15719.1541725287@sss.pgh.pa.us
This commit fixes a set of issues with ON COMMIT actions when used on
partitioned tables and tables with inheritance children:
- Applying ON COMMIT DROP on a partitioned table with partitions, or on
a table with inheritance children, caused a failure at commit time,
with complaints about the children being already dropped, as all
relations are dropped one at a time.
- Applying ON COMMIT DELETE ROWS on a partition relying on a
partitioned table which uses ON COMMIT DROP would cause the partition
truncation to fail, as the parent is removed first.
The solution to the first problem is to handle the removal of all the
dependencies in one go instead of dropping relations one-by-one, based
on a suggestion from Álvaro Herrera. So instead all the relation OIDs
to remove are gathered and then processed in one round of multiple
deletions.
The solution to the second problem is to reorder the actions, with
truncation happening first and relation drop done after. Even if it
means that a partition could be first truncated, then immediately
dropped if its partitioned table is dropped, this has the merit to keep
the code simple as there is no need to do existence checks on the
relations to drop.
Contrary to a manual TRUNCATE on a partitioned table, ON COMMIT DELETE
ROWS does not cascade to the partitions: the ON COMMIT action defined
on each partition takes precedence.
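A sketch of the second problem (hypothetical names):

    BEGIN;
    CREATE TEMP TABLE parent (a int) PARTITION BY LIST (a)
      ON COMMIT DROP;
    CREATE TEMP TABLE child PARTITION OF parent
      FOR VALUES IN (1) ON COMMIT DELETE ROWS;
    COMMIT;  -- the partition truncation used to fail because the
             -- parent had already been dropped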
Author: Michael Paquier
Reviewed-by: Amit Langote, Álvaro Herrera, Robert Haas
Discussion: https://postgr.es/m/68f17907-ec98-1192-f99f-8011400517f5@lab.ntt.co.jp
Backpatch-through: 10