mirror of https://github.com/postgres/postgres.git synced 2025-05-29 16:21:20 +03:00

31046 Commits

Author SHA1 Message Date
Amit Kapila
3f02b5150a Match the buffer usage tracking for leader and worker backends.
In the leader backend, we don't track buffer usage for the ExecutorStart
phase, whereas in worker backends we track it for the ExecutorStart phase
as well.  This leads to different buffer usage stats for parallel and
non-parallel queries.  Change the code so that worker backends also start
tracking buffer usage only after ExecutorStart.
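
For illustration, a hedged sketch of the intended worker-side ordering,
assuming the instrumentation entry points from instrument.c; exact call
sites and variable names in ParallelQueryMain() are simplified:

    ExecutorStart(queryDesc, 0);        /* excluded from buffer usage, as in the leader */
    InstrStartParallelQuery();          /* begin tracking only after ExecutorStart */
    /* ... ExecutorRun() and ExecutorFinish() happen here ... */
    InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber]);    /* shm array, illustrative */
    ExecutorEnd(queryDesc);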

Author: Amit Kapila and Robert Haas
Reviewed-by: Robert Haas and Andres Freund
Backpatch-through: 9.6 where this code was introduced
Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
2018-08-03 09:50:24 +05:30
Tom Lane
71e3b28901 Fix libpq's code for searching .pgpass; rationalize empty-list-item cases.
Before v10, we always searched ~/.pgpass using the host parameter,
and nothing else, to match to the "hostname" field of ~/.pgpass.
(However, null host or host matching DEFAULT_PGSOCKET_DIR was replaced by
"localhost".)  In v10, this got broken by commit 274bb2b38, repaired by
commit bdac9836d, and broken again by commit 7b02ba62e; in the code
actually shipped, we'd search with hostaddr if both that and host were
specified --- though oddly, *not* if only hostaddr were specified.
Since this is directly contrary to the documentation, and not
backwards-compatible, it's clearly a bug.

However, the change wasn't totally without justification, even though it
wasn't done quite right, because the pre-v10 behavior has arguably been
buggy since we added hostaddr.  If hostaddr is specified and host isn't,
the pre-v10 code will search ~/.pgpass for "localhost", and ship that
password off to a server that most likely isn't local at all.  That's
unhelpful at best, and could be a security breach at worst.

Therefore, rather than just revert to that old behavior, let's define
the behavior as "search with host if provided, else with hostaddr if
provided, else search for localhost".  (As before, a host name matching
DEFAULT_PGSOCKET_DIR is replaced by localhost.)  This matches the
behavior of the actual connection code, so that we don't pick up an
inappropriate password; and it allows useful searches to happen when
only hostaddr is given.
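
A minimal sketch of that key-selection rule, using hypothetical helper and
parameter names (the real logic lives in libpq's connection-options code):

    /* Pick the hostname to search ~/.pgpass with (names illustrative). */
    static const char *
    pgpass_search_key(const char *host, const char *hostaddr,
                      const char *default_socket_dir)
    {
        const char *key;

        if (host && host[0] != '\0')
            key = host;                 /* prefer host if provided */
        else if (hostaddr && hostaddr[0] != '\0')
            key = hostaddr;             /* else fall back to hostaddr */
        else
            key = "localhost";          /* else search for localhost */

        /* a host naming the default socket directory also maps to localhost */
        if (strcmp(key, default_socket_dir) == 0)
            key = "localhost";

        return key;
    }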

While we're messing around here, ensure that empty elements within a
host or hostaddr list select the same behavior as a totally-empty
field would; for instance "host=a,,b" is equivalent to "host=a,/tmp,b"
if DEFAULT_PGSOCKET_DIR is /tmp.  Things worked that way in some cases
already, but not consistently so, which contributed to the confusion
about what key ~/.pgpass would get searched with.

Update documentation accordingly, and also clarify some nearby text.

Back-patch to v10 where the host/hostaddr list functionality was
introduced.

Discussion: https://postgr.es/m/30805.1532749137@sss.pgh.pa.us
2018-08-01 12:30:36 -04:00
Bruce Momjian
d8dd8d221a pg_upgrade: fix --check for live source server checks
Fix for commit 244142d32afd02e7408a2ef1f249b00393983822.

Backpatch-through: 9.3
2018-07-31 18:10:06 -04:00
Tom Lane
31b29b1b30 Further fixes for quoted-list GUC values in pg_dump and ruleutils.c.
Commits 742869946 et al turn out to be a couple bricks shy of a load.
We were dumping the stored values of GUC_LIST_QUOTE variables as they
appear in proconfig or setconfig catalog columns.  However, although that
quoting rule looks a lot like SQL-identifier double quotes, there are two
critical differences: empty strings ("") are legal, and depending on which
variable you're considering, values longer than NAMEDATALEN might be valid
too.  So the current technique fails altogether on empty-string list
entries (as reported by Steven Winfield in bug #15248) and it also risks
truncating file pathnames during dump/reload of GUC values that are lists
of pathnames.

To fix, split the stored value without any downcasing or truncation,
and then emit each element as a SQL string literal.
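
A hedged sketch of that emission step (helper name hypothetical; the real
code uses the splitting and quoting routines in varlena.c and dumputils.c),
which preserves empty and over-length elements because no identifier rules
are applied:

    static void
    append_guc_list_as_literals(StringInfo buf, const char *name,
                                char **elems, int nelems)
    {
        int         i;

        appendStringInfo(buf, "SET %s = ", name);   /* name assumed already quoted */
        for (i = 0; i < nelems; i++)
        {
            const char *p;

            if (i > 0)
                appendStringInfoString(buf, ", ");
            appendStringInfoChar(buf, '\'');
            for (p = elems[i]; *p; p++)
            {
                if (*p == '\'')
                    appendStringInfoChar(buf, '\'');    /* double embedded quotes */
                appendStringInfoChar(buf, *p);
            }
            appendStringInfoChar(buf, '\'');
        }
    }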

This is a tad annoying, because we now have three copies of the
comma-separated-string splitting logic in varlena.c as well as a fourth
one in dumputils.c.  (Not to mention the randomly-different-from-those
splitting logic in libpq...)  I looked at unifying these, but it would
be rather a mess unless we're willing to tweak the API definitions of
SplitIdentifierString, SplitDirectoriesString, or both.  That might be
worth doing in future; but it seems pretty unsafe for a back-patched
bug fix, so for now accept the duplication.

Back-patch to all supported branches, as the previous fix was.

Discussion: https://postgr.es/m/7585.1529435872@sss.pgh.pa.us
2018-07-31 13:00:08 -04:00
Alvaro Herrera
2c4d0f32e0 Set ActiveSnapshot when logically replaying inserts
Input functions for the inserted tuples may require a snapshot, when
they are replayed by native logical replication.  An example is a domain
with a constraint using a SQL-language function, which prior to this
commit failed to apply on the subscriber side.
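
The shape of the fix can be sketched as wrapping the tuple-building step in
an active snapshot; the surrounding call is from the logical replication
apply worker and the inner function name is illustrative:

    /* Input functions (e.g. a domain check using a SQL function) may need a snapshot. */
    PushActiveSnapshot(GetTransactionSnapshot());
    slot_store_cstrings(remoteslot, rel, newtup.values);    /* illustrative call */
    PopActiveSnapshot();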

Reported-by: Mai Peng <maily.peng@webedia-group.com>
Co-authored-by: Minh-Quan TRAN <qtran@itscaro.me>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/4EB4BD78-BFC3-4D04-B8DA-D53DF7160354@webedia-group.com
Discussion: https://postgr.es/m/153211336163.1404.11721804383024050689@wrigleys.postgresql.org
2018-07-30 16:30:07 -04:00
Tom Lane
96b1d984fe Fix pg_dump's failure to dump REPLICA IDENTITY for constraint indexes.
pg_dump knew about printing ALTER TABLE ... REPLICA IDENTITY USING INDEX
for indexes declared as indexes, but it failed to print that for indexes
declared as unique or primary-key constraints.  Per report from Achilleas
Mantzios.

This has been broken since the feature was introduced, AFAICS.
Back-patch to 9.4.

Discussion: https://postgr.es/m/1e6cc5ad-b84a-7c07-8c08-a4d0c3cdc938@matrix.gatewaynet.com
2018-07-30 12:35:49 -04:00
Noah Misch
ca4a6de174 Document security implications of qualified names.
Commit 5770172cb0c9df9e6ce27c507b449557e5b45124 documented secure schema
usage, and that advice suffices for using unqualified names securely.
Document, in typeconv-func primarily, the additional issues that arise
with qualified names.  Back-patch to 9.3 (all supported versions).

Reviewed by Jonathan S. Katz.

Discussion: https://postgr.es/m/20180721012446.GA1840594@rfd.leadboat.com
2018-07-28 20:08:21 -07:00
Bruce Momjian
d524a11a72 pgtest: run clean, build, and check stages separately
This allows for cleaner error reporting.

Backpatch-through: 9.5
2018-07-28 15:34:06 -04:00
Bruce Momjian
9a13e7f0f0 pg_upgrade: check for clean server shutdowns
Previously pg_upgrade checked for the pid file and started/stopped the
server to force a clean shutdown.  However, "pg_ctl -m immediate"
removes the pid file but doesn't do a clean shutdown, so check
pg_controldata for a clean shutdown too.

Diagnosed-by: Vimalraj A

Discussion: https://postgr.es/m/CAFKBAK5e4Q-oTUuPPJ56EU_d2Rzodq6GWKS3ncAk3xo7hAsOZg@mail.gmail.com

Backpatch-through: 9.3
2018-07-28 15:01:55 -04:00
Bruce Momjian
fe25526e9d pgtest: grab possible warnings from install.log
Since PG 9.5, 'make check' records the build output in install.log, so
look in there for warnings too.

Backpatch-through: 9.5
2018-07-28 11:35:52 -04:00
Amit Kapila
ff8ce0b790 Fix the buffer release order for parallel index scans.
During parallel index scans, if the current page to be read is deleted, we
skip it and try to get the next page for a scan without releasing the buffer
lock on the current page.  To get the next page, sometimes it needs to wait
for another process to complete its scan and advance it to the next page.
Now, it is quite possible that the master backend has errored out before
advancing the scan and issued a termination signal for all workers.  The
workers failed to notice the termination request during wait because the
interrupts are held due to the buffer lock on the previous page.  This led
to all workers being stuck.

The fix is to release the buffer lock on the current page before trying to
get the next page.  We were already doing the same in backward scans, but
missed it for forward scans.

Reported-by: Victor Yegorov
Bug: 15290
Diagnosed-by: Thomas Munro and Amit Kapila
Author: Amit Kapila
Reviewed-by: Thomas Munro
Tested-By: Thomas Munro and Victor Yegorov
Backpatch-through: 10 where parallel index scans were introduced
Discussion: https://postgr.es/m/153228422922.1395.1746424054206154747@wrigleys.postgresql.org
2018-07-27 11:05:06 +05:30
Thomas Munro
46201d603f Pad semaphores to avoid false sharing.
In a USE_UNNAMED_POSIX_SEMAPHORES build, the default on Linux and FreeBSD
since commit ecb0d20a, we have an array of sem_t objects.  This
turned out to reduce performance compared to the previous default,
USE_SYSV_SEMAPHORES, on an 8-socket system.  Testing showed that the
lost performance could be regained by padding the array elements so
that they have their own cache lines.  This matches what we do for
similar hot arrays (see LWLockPadded, WALInsertLockPadded).
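
A minimal sketch of the padding approach, modeled on LWLockPadded; the
union layout and padding size are illustrative:

    #include <semaphore.h>

    /*
     * Pad each sem_t to a full cache line (assumed 128 bytes here, as in
     * pg_config_manual.h), so adjacent semaphores never share a cache line.
     */
    typedef union SemTPadded
    {
        sem_t       pgsem;
        char        pad[128];
    } SemTPadded;

    static SemTPadded *sharedSemas;     /* array of padded semaphores in shared memory */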

Back-patch to 10, where unnamed semaphores were adopted as the default
semaphore interface on those operating systems.

Author: Thomas Munro
Reviewed-by: Andres Freund
Reported-by: Mithun Cy
Tested-by: Mithun Cy, Tom Lane, Thomas Munro
Discussion: https://postgr.es/m/CAD__OugYDM3O%2BdyZnnZSbJprSfsGFJcQ1R%3De59T3hcLmDug4_w%40mail.gmail.com
2018-07-25 11:00:53 +12:00
Tom Lane
dfbad3f454 Further portability hacking in pg_upgrade's test script.
I blew the dust off a Bourne shell (file date 1996, yea verily) and
tried to run test.sh with it.  It mostly worked, but I found that the
temp-directory creation code introduced by commit be76a6d39 was not
compatible, for a couple of reasons: this shell thinks "set -e" should
force an exit if a command within backticks fails, and it also thinks code
within braces should be executed by a sub-shell, meaning that variable
settings don't propagate back up to the parent shell.  In view of Victor
Wagner's report that Solaris is still using pre-POSIX shells, seems like
we oughta make this case work.  It's not like the code is any less
idiomatic this way; the prior coding technique appeared nowhere else.

(There is a remaining bash-ism here, which is that $RANDOM doesn't do
what the code hopes in non-bash shells.  But the use of $$ elsewhere in
that path should be enough to ensure uniqueness and some amount of
randomness, so I think it's okay as-is.)

Back-patch to all supported branches, as the previous commit was.

Discussion: https://postgr.es/m/20180720153820.69e9ae6c@fafnir.local.vm
2018-07-21 15:40:51 -04:00
Dean Rasheed
821200405c Guard against rare RAND_bytes() failures in pg_strong_random().
When built using OpenSSL, pg_strong_random() uses RAND_bytes() to
generate the random number. On very rare occasions that can fail, if
its PRNG has not been seeded with enough data. Additionally, once it
does fail, all subsequent calls will also fail until more seed data is
added. Since this is required during backend startup, this can result
in all new backends failing to start until a postmaster restart.

Guard against that by checking the state of OpenSSL's PRNG using
RAND_status(), and if necessary (very rarely), seeding it using
RAND_poll().
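
The guard can be sketched with plain OpenSSL calls; this is a simplified,
self-contained version of the idea, not the exact pg_strong_random() code:

    #include <stdbool.h>
    #include <openssl/rand.h>

    /* Fill buf with len strong random bytes; seed the PRNG first if needed. */
    static bool
    strong_random_sketch(void *buf, int len)
    {
        /* RAND_status() returns 1 once the PRNG has been seeded with enough data. */
        if (RAND_status() != 1)
        {
            /* Very rarely needed: ask OpenSSL to gather more seed material. */
            if (RAND_poll() != 1)
                return false;
        }
        return RAND_bytes((unsigned char *) buf, len) == 1;
    }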

Back-patch to v10, where pg_strong_random() was introduced.

Dean Rasheed and Michael Paquier.

Discussion: https://postgr.es/m/CAEZATCXMtxbzSAvyKKk5uCRf9pNt4UV%2BF_5v%3DgLfJUuPxU4Ytg%40mail.gmail.com
2018-07-20 08:58:37 +01:00
Tom Lane
2131d4501f Remove undocumented restriction against duplicate partition key columns.
transformPartitionSpec rejected duplicate simple partition columns
(e.g., "PARTITION BY RANGE (x,x)") but paid no attention to expression
columns, resulting in inconsistent behavior.  Worse, cases like
"PARTITION BY RANGE (x,(x))" were accepted but would then result in
dump/reload failures, since the expression (x) would get simplified
to a plain column later.

There seems no better reason for this restriction than there was for
the one against duplicate included index columns (cf commit 701fd0bbc),
so let's just remove it.

Back-patch to v10 where this code was added.

Report and patch by Yugo Nagata.

Discussion: https://postgr.es/m/20180712165939.36b12aff.nagata@sraoss.co.jp
2018-07-19 15:41:46 -04:00
Alexander Korotkov
0d26812a43 Fix handling of empty uncompressed posting list pages in GIN
PostgreSQL 9.4 introduced posting list compression in GIN.  This feature
supports online upgrade, so that after pg_upgrade uncompressed posting
lists are compressed on-the-fly.  The underlying code appears to always
expect at least one item on an uncompressed posting list page.  But there
can be completely empty pages, because VACUUM never deletes the leftmost
and rightmost pages from posting trees.  This commit fixes that.

Reported-by: Sivasubramanian Ramasubramanian
Discussion: https://postgr.es/m/1531867212836.63354%40amazon.com
Author: Sivasubramanian Ramasubramanian, Alexander Korotkov
Backpatch-through: 9.4
2018-07-19 21:12:43 +03:00
Heikki Linnakangas
ff4fb4cc1b Fix error message when a hostaddr cannot be parsed.
We were incorrectly passing hostname, not hostaddr, in the error message,
and because of that, you got:

$ psql 'hostaddr=foo'
psql: could not parse network address "(null)": Name or service not known

Backpatch to v10, where this was broken (by commit 7b02ba62e9).

Report and fix by Robert Haas.

Discussion: https://www.postgresql.org/message-id/CA+TgmoapFQA30NomGKEaZCu3iN7mF7fux8fbbk9SouVOT2JP7w@mail.gmail.com
2018-07-19 20:25:05 +03:00
Heikki Linnakangas
a1dc4ea020 Rephrase a few comments for clarity.
I was confused by what "intended to be parallel serially" meant, until
Robert Haas and David G. Johnston explained it. Rephrase the comment to
make it more clear, using David's suggested wording.

Discussion: https://www.postgresql.org/message-id/1fec9022-41e8-e484-70ce-2179b08c2092%40iki.fi
2018-07-19 16:08:49 +03:00
Michael Paquier
49d506dd21 Fix print of Path nodes when using OPTIMIZER_DEBUG
GatherMergePath (introduced in 10) and CustomPath (introduced in 9.5)
were missing from the debug output.  The order of the Path nodes was
inconsistent with
what is listed in nodes.h, so make the order consistent at the same time
to ease future checks and additions.

Author: Sawada Masahiko
Reviewed-by: Michael Paquier
Discussion: https://postgr.es/m/CAD21AoBQMLoc=ohH-oocuAPsELrmk8_EsRJjOyR8FQLZkbE0wA@mail.gmail.com
2018-07-19 09:55:15 +09:00
Tom Lane
5c513db12c Remove race-prone hot_standby_feedback test cases in 001_stream_rep.pl.
This script supposed that if it turned hot_standby_feedback on and then
shut down the standby server, at least one feedback message would be
guaranteed to be sent before the standby stops.  But there is no such
guarantee, if the standby's walreceiver process is slow enough --- and
we've seen multiple failures in the buildfarm showing that that does
happen in practice.  While we could rearrange the walreceiver logic to
make it less likely, it seems probably impossible to create a really
bulletproof guarantee of that sort; and if we tried, we might create
situations where the walreceiver wouldn't react in a timely manner to
shutdown commands.  It seems better instead to remove the script's
assumption that feedback will occur before shutdown.

But once we do that, these last few tests seem quite redundant with
the earlier tests in the script.  So let's just drop them altogether
and save some buildfarm cycles.

Backpatch to v10 where these tests were added.

Discussion: https://postgr.es/m/1922.1531592205@sss.pgh.pa.us
2018-07-18 17:39:27 -04:00
Heikki Linnakangas
ed529faf7e Fix misc typos, mostly in comments.
A collection of typos I happened to spot while reading code, as well as
grepping for common mistakes.

Backpatch to all supported versions, as applicable, to avoid conflicts
when backporting other commits in the future.
2018-07-18 16:18:27 +03:00
Robert Haas
4beb25c632 Add subtransaction handling for table synchronization workers.
Since the old logic was completely unaware of subtransactions, a
change made in a subsequently-aborted subtransaction would still cause
workers to be stopped at toplevel transaction commit.  Fix that by
managing a stack of worker lists rather than just one.

Amit Khandekar and Robert Haas

Discussion: http://postgr.es/m/CAJ3gD9eaG_mWqiOTA2LfAug-VRNn1hrhf50Xi1YroxL37QkZNg@mail.gmail.com
2018-07-16 17:55:13 -04:00
Tom Lane
0bb28ca36e Fix hashjoin costing mistake introduced with inner_unique optimization.
In final_cost_hashjoin(), commit 9c7f5229a allowed inner_unique cases
to follow a code path previously used only for SEMI/ANTI joins; but it
neglected to fix an if-test within that path that assumed SEMI and ANTI
were the only possible cases.  This resulted in a wrong value for
hashjointuples, and an ensuing bad cost estimate, for inner_unique normal
joins.  Fortunately, for inner_unique normal joins we can assume the number
of joined tuples is the same as for a SEMI join; so there's no need for
more code, we just have to invert the test to check for ANTI not SEMI.
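
In effect the fix inverts the join-type test on that path; a hedged sketch,
simplified from final_cost_hashjoin() with approximate field names:

    /*
     * Only a true anti join keeps the unmatched outer rows; every other case
     * on this path (SEMI, or inner_unique normal joins) produces one joined
     * tuple per matched outer row.
     */
    if (path->jpath.jointype == JOIN_ANTI)
        hashjointuples = outer_path_rows - outer_matched_rows;
    else                        /* SEMI join, or inner_unique normal join */
        hashjointuples = outer_matched_rows;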

It turns out that in two contrib tests in which commit 9c7f5229a
changed the plan expected for a query, the change was actually wrong
and induced by this estimation error, not by any real improvement.
Hence this patch also reverts those changes.

Per report from RK Korlapati.  Backpatch to v10 where the error was
introduced.

David Rowley

Discussion: https://postgr.es/m/CA+SNy03bhq0fodsfOkeWDCreNjJVjsdHwUsb7AG=jpe0PtZc_g@mail.gmail.com
2018-07-14 11:59:12 -04:00
Tom Lane
f1963a1e79 Fix inadequate buffer locking in FSM and VM page re-initialization.
When reading an existing FSM or VM page that was found to be corrupt by the
buffer manager, the code applied PageInit() to reinitialize the page, but
did so without any locking.  There is thus a hazard that two backends might
concurrently do PageInit, which in itself would still be OK, but the slower
one might then zero over subsequent data changes applied by the faster one.
Even that is unlikely to be fatal; but it's not desirable, so add locking
to prevent it.
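
A hedged sketch of the re-initialization pattern with the added locking
(the buffer-manager calls are real; the surrounding logic is simplified):

    /* Re-initialize a corrupt-looking FSM/VM page under an exclusive buffer lock. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buf);
    if (PageIsNew(page))            /* recheck: another backend may have done it */
        PageInit(page, BLCKSZ, 0);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);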

This does not add any locking overhead in the normal code path where the
page is OK.  It's not immediately obvious that that's safe, but I believe
it is, for reasons explained in the added comments.

Problem noted by R P Asim.  It's been like this for a long time, so
back-patch to all supported branches.

Discussion: https://postgr.es/m/CANXE4Te4G0TGq6cr0-TvwP0H4BNiK_-hB5gHe8mF+nz0mcYfMQ@mail.gmail.com
2018-07-13 11:53:12 -04:00
Michael Paquier
11abea37d5 Make logical WAL sender report streaming state appropriately
WAL senders sending logically-decoded data fail to properly report being
in the "streaming" state when starting up; hence, as long as one extra
record is not replayed, such WAL senders would remain in the "catchup"
state, which is inconsistent with their physical counterparts.

This can be easily reproduced by for example using pg_recvlogical and
restarting the upstream server.  The TAP tests have been slightly
modified to detect the failure and strengthened so that future tests also
make sure that a node is in the streaming state when waiting for its
catchup.

Backpatch down to 9.4 where this code has been introduced.

Reported-by: Sawada Masahiko
Author: Simon Riggs, Sawada Masahiko
Reviewed-by: Petr Jelinek, Michael Paquier, Vaishnavi Prabakaran
Discussion: https://postgr.es/m/CAD21AoB2ZbCCqOx=bgKMcLrAvs1V0ZMqzs7wBTuDySezTGtMZA@mail.gmail.com
2018-07-12 10:20:08 +09:00
Tom Lane
c350320270 Fix create_scan_plan's handling of sortgrouprefs for physical tlists.
We should only run apply_pathtarget_labeling_to_tlist if CP_LABEL_TLIST
was specified, because only in that case has use_physical_tlist checked
that the labeling will succeed; otherwise we may get an "ORDER/GROUP BY
expression not found in targetlist" error.  (This subsumes the previous
test about gating_clauses, because we reset "flags" to zero earlier
if there are gating clauses to apply.)
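
The shape of the fix can be sketched as gating the labeling call on the
flag (simplified from create_scan_plan(); variable names approximate):

    /*
     * Only attempt sortgroupref labeling when the caller asked for it, since
     * only then has use_physical_tlist() verified that it can succeed.
     */
    if (flags & CP_LABEL_TLIST)
        apply_pathtarget_labeling_to_tlist(tlist, best_path->pathtarget);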

The only known case in which a failure can occur is with a ProjectSet
path directly atop a table scan path, although it seems likely that there
are other cases or will be such in future.  This means that the failure
is currently only visible in the v10 branch: 9.6 didn't have ProjectSet,
while in v11 and HEAD, apply_scanjoin_target_to_paths for some weird
reason is using create_projection_path not apply_projection_to_path,
masking the problem because there's a ProjectionPath in between.

Nonetheless this code is clearly wrong on its own terms, so back-patch
to 9.6 where this logic was introduced.

Per report from Regina Obe.

Discussion: https://postgr.es/m/001501d40f88$75186950$5f493bf0$@pcorp.us
2018-07-11 15:25:29 -04:00
Alvaro Herrera
7c644b7d3f Better handle pseudotypes as partition keys
We fail to handle polymorphic types properly when they are used as
partition keys: we were unnecessarily adding a RelabelType node on top,
which confuses code examining the nodes.  In particular, this prevents
predtest.c-based partition pruning from working, and causes ruleutils.c to
emit expressions that are uglier than needed.  Fix it by not adding the
RelabelType when it is not needed.

In master/11 the new pruning code is separate so it doesn't suffer from
this problem, since we already fixed it (in essentially the same way) in
e5dcbb88a15d, which also added a few tests; back-patch those tests to
pg10 also.  But since UPDATE/DELETE still uses predtest.c in pg11, this
change improves partitioning for those cases too.  Add tests for this.
The ruleutils.c behavior change is relevant in pg11/master too.

Co-authored-by: Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/54745d13-7ed4-54ac-97d8-ea1eec95ae25@lab.ntt.co.jp
2018-07-10 15:07:28 -04:00
Peter Eisentraut
7f0911f33e Fix typos 2018-07-10 11:15:59 +02:00
Tom Lane
59b2dcbf45 Avoid emitting a bogus WAL record when recycling an all-zero btree page.
Commit fafa374f2 caused _bt_getbuf() to possibly emit a WAL record for
a page that it was about to recycle.  However, it failed to distinguish
all-zero pages from dead pages, which is important because only the
latter have valid btpo.xact values, or indeed any special space at all.
Recycling an all-zero page with XLogStandbyInfoActive() enabled therefore
led to an Assert failure, or to emission of a WAL record containing a
bogus cutoff XID, which might lead to unnecessary query cancellations
on hot standby servers.
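
A hedged sketch of the distinction the code now has to make before logging
a page-reuse record (simplified; the real test lives in _bt_getbuf(), and
the helper names are as I recall them):

    /*
     * An all-zero page has no special space and thus no btpo.xact, so only
     * genuinely deleted pages may feed a cutoff XID into the WAL record.
     */
    if (XLogStandbyInfoActive() && !PageIsNew(page))
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        _bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
    }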

Per reports from Antonin Houska and 自己.  Amit Kapila was first to
propose this fix, and Robert Haas, myself, and Kyotaro Horiguchi
reviewed it at various times.

This is an old bug, so back-patch to all supported branches.

Discussion: https://postgr.es/m/2628.1474272158@localhost
Discussion: https://postgr.es/m/48875502.f4a0.1635f0c27b0.Coremail.zoulx1982@163.com
2018-07-09 19:26:19 -04:00
Tom Lane
c74f48a4ec Prevent accidental linking of system-supplied copies of libpq.so etc.
Back-patch commit dddfc4cb2, which broke LDFLAGS and related Makefile
variables into two parts, one for within-build-tree library references and
one for external libraries, to ensure that the order of -L flags has all
of the former before all of the latter.  This turns out to fix a problem
recently noted on buildfarm member peripatus, that we attempted to
incorporate code from libpgport.a into a shared library.  That will fail on
platforms that are sticky about putting non-PIC code into shared libraries.
(It's quite surprising we hadn't seen such failures before, since the code
in question has been like that for a long time.)

I think that peripatus' problem could have been fixed with just a subset
of this patch; but since the previous issue of accidentally linking to the
wrong copy of a Postgres shlib seems likely to bite people in the field,
let's just back-patch the whole change.  Now that commit dddfc4cb2 has
survived some beta testing, I'm less afraid to back-patch it than I was
at the time.

This also fixes undesired inclusion of "-DFRONTEND" in pg_config's CPPFLAGS
output (in 9.6 and up) and undesired inclusion of "-L../../src/common" in
its LDFLAGS output (in all supported branches).

Back-patch to v10 and older branches; this is already in v11.

Discussion: https://postgr.es/m/20180704234304.bq2dxispefl65odz@ler-imac.local
2018-07-09 17:23:31 -04:00
Michael Paquier
c030db3495 Rework order of end-of-recovery actions to delay timeline history write
A critical failure in some of the end-of-recovery actions before the
end-of-recovery record is written can cause PostgreSQL to react
inconsistently with the rest of the cluster in the event of a crash
before the final record is written.  Two such failures are for example
an error while processing a two-phase state file or when operating on
recovery.conf.  With this commit, the failures are still considered
FATAL, but the write of the timeline history file is delayed as much as
possible so that the window between the moment the file is written and the
moment the end-of-recovery record is generated is minimized.  This way, in
the event of a crash or a failure, the new timeline decided at promotion
will not appear to be taken by other nodes in the cluster.  It is not
really possible to reduce this window to zero, hence one could still see
failures if a crash happens between the history file write and the
end-of-recovery record, so any future code should be careful when adding
new end-of-recovery actions.  The original report from Magnus Hagander
described a renamed recovery.conf as the end-of-recovery failure: this
caused a timeline to be seen as taken, but the subsequent processing of
the now-missing recovery.conf made the startup process stop on FATAL,
which at the follow-up startup left the system inconsistent because of
on-disk changes that had already happened.

Processing of two-phase state files still needs some work, as corrupted
entries are simply ignored now.  This is left as a future item; this
commit fixes the original complaint.

Reported-by: Magnus Hagander
Author: Heikki Linnakangas
Reviewed-by: Alexander Korotkov, Michael Paquier, David Steele
Discussion: https://postgr.es/m/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq=XxmXLmF6ipNFecbShQ@mail.gmail.com
2018-07-09 10:26:18 +09:00
Alvaro Herrera
a1f680d962 Allow replication slots to be dropped in single-user mode
Starting with commit 9915de6c1cb2, replication slot drop uses a
condition variable sleep to wait until the current user of the slot goes
away.  This is more user friendly than the previous behavior of erroring
out if the slot is in use, but it fails with a not-for-user-consumption
error message in single-user mode; plus, if you're using single-user
mode because you don't want to start the server in the regular mode
(say, disk is full and WAL won't recycle because of the slot), it's
inconvenient.

Fix by skipping the cond variable sleep in single-user mode, since
there can't be anybody to wait for anyway.
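
The fix can be sketched as skipping the condition-variable wait when there
is no postmaster (simplified from the slot-drop path; the wait-event name
is as I recall it and may differ):

    /*
     * In single-user mode there is no other backend that could hold the
     * slot, so don't wait on the condition variable at all.
     */
    if (IsUnderPostmaster)
        ConditionVariableSleep(&slot->active_cv,
                               WAIT_EVENT_REPLICATION_SLOT_DROP);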

Reported-by: tushar <tushar.ahuja@enterprisedb.com>
Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/3b2f809f-326c-38dd-7a9e-897f957a4eb1@enterprisedb.com
2018-07-06 16:38:29 -04:00
Alvaro Herrera
bba8c61211 logical decoding: beware of an unset specinsert change
Coverity complains that there is no protection in the code (at least in
non-assertion-enabled builds) against speculative insertion failing to
follow the expected protocol.  Add an elog(ERROR) for the case.
2018-07-05 17:42:37 -04:00
Michael Paquier
6716f2f962 Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first
post-recovery checkpoint, then the minimal recovery point which marks
the LSN position where the cluster is able to reach consistency may be
set to a position older than the first end-of-recovery checkpoint while
all the WAL available should be replayed.  This leads to the instance
thinking that it contains inconsistent pages, causing a PANIC and a hard
instance crash for certain sets of replayed records, even though not all
the available WAL has been replayed yet.  When in crash recovery,
minRecoveryPoint is expected to always be set to InvalidXLogRecPtr,
which forces the recovery to replay all the WAL available, so this
commit makes sure that the local copy of minRecoveryPoint from the
control file is initialized properly and stays as it is while crash
recovery is performed.  Once recovery switches to archive recovery, or
once crash recovery finishes, the local copy of minRecoveryPoint can be
safely updated.
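
A hedged sketch of the intended rule for the local copy (variable handling
simplified; the real logic is spread through xlog.c):

    /*
     * While in crash recovery, keep the local minRecoveryPoint invalid so
     * that all available WAL is replayed; only track the control file's
     * value once archive recovery has been reached.
     */
    if (InArchiveRecovery)
    {
        minRecoveryPoint = ControlFile->minRecoveryPoint;
        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
    }
    else
    {
        minRecoveryPoint = InvalidXLogRecPtr;
        minRecoveryPointTLI = 0;
    }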

Pavan Deolasee has reported and diagnosed the failure in the first
place, and the base fix idea to rely on the local copy of
minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded
into a full-fledged patch by me.  The test included in this commit has
been written by Álvaro Herrera and Pavan Deolasee, which I have modified
to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3,
which is where the end-of-recovery checkpoint is no longer run by the
startup process.  The test is only easily supported down to 10; still, the
fix has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro
Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
2018-07-05 10:47:01 +09:00
Andres Freund
0095809890 Check for interrupts inside the nbtree page deletion code.
When deleting pages the nbtree code has to walk through siblings of a
tree node. When those sibling links are corrupted that can lead to
endless loops - which are currently not interruptible.  This is
especially problematic if autovacuum is repeatedly blocked on such
indexes, as it can be hard to get out of that situation without
resorting to single user mode.

Thus add interrupt checks to appropriate places in such
loops. Unfortunately in one of the cases it's not easy to do so.
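
The added checks follow the usual pattern; a minimal sketch of a
sibling-walk loop with an interrupt check (structure simplified, loop
condition illustrative):

    /* Walk right through siblings; corrupted links must not make this unkillable. */
    for (;;)
    {
        CHECK_FOR_INTERRUPTS();

        if (P_RIGHTMOST(opaque) || opaque->btpo_next == target)
            break;
        blkno = opaque->btpo_next;
        _bt_relbuf(rel, buf);               /* release before moving right */
        buf = _bt_getbuf(rel, blkno, BT_READ);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    }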

Between 9.3 and 9.4 the page deletion (and page split) code changed
significantly. Before it was significantly less robust against
interruptions. Therefore don't backpatch to 9.3.

Author: Andres Freund
Discussion: https://postgr.es/m/20180627191629.wkunw2qbibnvlz53@alap3.anarazel.de
Backpatch: 9.4-
2018-07-04 14:58:39 -07:00
Fujii Masao
8463be060d Improve the performance of relation deletes during recovery.
When multiple relations are deleted in the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which needs to scan the whole of shared_buffers only once.  OTOH,
previously, during recovery, smgrdounlink() (not smgrdounlinkall()) was
called for each file to delete, which meant scanning shared_buffers
multiple times.  Obviously this could greatly increase WAL replay
time, especially when shared_buffers is huge.

To alleviate this situation, this commit changes the recovery code so
that it also calls smgrdounlinkall() only once to delete multiple
relation files.
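
A hedged sketch of the recovery-side change, assuming smgrdounlinkall()'s
historical signature (rels array, count, isRedo flag); the delrels array
and counter names are illustrative:

    /* Collect all relations dropped by the record, then unlink them in one pass. */
    SMgrRelation *srels = palloc(ndelrels * sizeof(SMgrRelation));
    int          i;

    for (i = 0; i < ndelrels; i++)
        srels[i] = smgropen(delrels[i], InvalidBackendId);

    /* One scan of shared_buffers instead of one scan per relation. */
    smgrdounlinkall(srels, ndelrels, true);

    for (i = 0; i < ndelrels; i++)
        smgrclose(srels[i]);
    pfree(srels);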

This is just a fix for an oversight in commit 279628a0a7, not a new
feature.  So, per discussion on pgsql-hackers, we concluded that it should
be back-patched to all supported versions.

Author: Fujii Masao
Reviewed-by: Michael Paquier, Andres Freund, Thomas Munro, Kyotaro Horiguchi, Takayuki Tsunakawa
Discussion: https://postgr.es/m/CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com
2018-07-05 02:26:22 +09:00
Peter Eisentraut
c2c69d43ad Fix libpq example programs
When these programs call pg_catalog.set_config, they need to check for
PGRES_TUPLES_OK instead of PGRES_COMMAND_OK.  Fix for
5770172cb0c9df9e6ce27c507b449557e5b45124.
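
The distinction matters because set_config() is invoked via SELECT and
therefore returns a row; a minimal self-contained sketch of the corrected
check (function name illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <libpq-fe.h>

    static void
    secure_search_path(PGconn *conn)
    {
        PGresult   *res;

        res = PQexec(conn, "SELECT pg_catalog.set_config('search_path', '', false)");
        /* SELECT returns tuples, so PGRES_TUPLES_OK (not PGRES_COMMAND_OK) is expected. */
        if (PQresultStatus(res) != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "SET failed: %s", PQerrorMessage(conn));
            PQclear(res);
            PQfinish(conn);
            exit(1);
        }
        PQclear(res);
    }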

Reported-by: Ideriha, Takeshi <ideriha.takeshi@jp.fujitsu.com>
2018-07-01 14:09:03 +02:00
Amit Kapila
5521e3bd5f Fix thinko in comments.
A slot cannot be stored in a tuple; it's the other way around: a tuple can
be stored in a slot.

Reported-by: Ashutosh Bapat
Author: Ashutosh Bapat
Reviewed-by: Amit Kapila
Discussion: https://postgr.es/m/CAFjFpRcHhNhXdegyJv3KKDWrwO1_NB_KYZM_ZSDeMOZaL1A5jQ@mail.gmail.com
2018-06-27 18:29:42 +05:30
Alvaro Herrera
b767b3f2e5 Fix "base" snapshot handling in logical decoding
Two closely related bugs are fixed.  First, xmin of logical slots was
advanced too early.  During xl_running_xacts processing, xmin of the
slot was set to the oldest running xid in the record, but that's wrong:
actually, snapshots which will be used for not-yet-replayed transactions
might consider older txns as running too, so we need to keep xmin back
for them.  The problem wasn't noticed earlier because DDL which allows
to delete tuple (set xmax) while some another not-yet-committed
transaction looks at it is pretty rare, if not unique: e.g. all forms of
ALTER TABLE which change schema acquire ACCESS EXCLUSIVE lock
conflicting with any inserts. The included test case (test_decoding's
oldest_xmin) uses ALTER of a composite type, which doesn't have such
interlocking.

To deal with this, we must be able to quickly retrieve oldest xmin
(oldest running xid among all assigned snapshots) from ReorderBuffer. To
fix, add another list of ReorderBufferTXNs to the reorderbuffer, where
transactions are sorted by base-snapshot-LSN.  This is slightly
different from the existing (sorted by first-LSN) list, because a
transaction can have an earlier LSN but a later Xmin, if its first
record does not obtain an xmin (eg. xl_xact_assignment).  Note this new
list doesn't fully replace the existing txn list: we still need that one
to prevent WAL recycling.

The second issue concerns SnapBuilder snapshots and subtransactions.
SnapBuildDistributeNewCatalogSnapshot never assigned a snapshot to a
transaction that is known to be a subtxn, which is good in the common
case that the top-level transaction already has one (no point in doing
so), but a bug otherwise.  To fix, arrange to transfer the snapshot from
the subtxn to its top-level txn as soon as the kinship gets known.
test_decoding's snapshot_transfer verifies this.

Also, fix a minor memory leak: refcount of toplevel's old base snapshot
was not decremented when the snapshot is transferred from child.

Liberally sprinkle code comments, and rewrite a few existing ones.  This
part is my (Álvaro's) contribution to this commit, as I had to write all
those comments in order to understand the existing code and Arseny's
patch.

Reported-by: Arseny Sher <a.sher@postgrespro.ru>
Diagnosed-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Arseny Sher <a.sher@postgrespro.ru>
Co-authored-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Antonin Houska <ah@cybertec.at>
Discussion: https://postgr.es/m/87lgdyz1wj.fsf@ars-thinkpad
2018-06-26 16:38:34 -04:00
Thomas Munro
3566873f21 Add PGTYPESchar_free() to avoid cross-module problems on Windows.
On Windows, it is sometimes important for corresponding malloc() and
free() calls to be made from the same DLL, since some build options can
result in multiple allocators being active at the same time.  For that
reason we already provided PQfreemem().  This commit adds a similar
function for freeing string results allocated by the pgtypes library.
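
A hedged sketch of what such a wrapper amounts to (the exported name is as
in the commit; the body is the obvious one):

    #include <stdlib.h>

    /*
     * Free a string that was allocated inside the pgtypes DLL.  On Windows,
     * calling free() from the application could hit a different allocator,
     * so the library exports its own wrapper, analogous to PQfreemem().
     */
    void
    PGTYPESchar_free(char *ptr)
    {
        free(ptr);
    }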

Author: Takayuki Tsunakawa
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8AD5D6%40G01JPEXMBYT05
2018-06-26 19:49:52 +12:00
Thomas Munro
88554c0911 Move RecoveryLockList into a hash table.
Standbys frequently need to release all locks held by a given xid.
Instead of searching one big list linearly, let's create one list
per xid and put them in a hash table, so we can find what we need
in O(1) time.
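
A hedged sketch of the per-xid hash table using the backend's dynahash API
(struct and table names illustrative):

    /* One hash entry per xid, holding the list of standby locks it owns. */
    typedef struct RecoveryLockListsEntry
    {
        TransactionId xid;          /* hash key */
        List       *locks;          /* locks held by this xid */
    } RecoveryLockListsEntry;

    static HTAB *RecoveryLockLists;

    static void
    init_recovery_lock_lists(void)
    {
        HASHCTL     info;

        memset(&info, 0, sizeof(info));
        info.keysize = sizeof(TransactionId);
        info.entrysize = sizeof(RecoveryLockListsEntry);

        RecoveryLockLists = hash_create("RecoveryLockLists", 64, &info,
                                        HASH_ELEM | HASH_BLOBS);
    }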

Earlier analysis and a prototype were done by David Rowley, though
this isn't his patch.

Back-patch all the way.

Author: Thomas Munro
Diagnosed-by: David Rowley, Andres Freund
Reviewed-by: Andres Freund, Tom Lane, Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D1mL0KiQ2KJ4yuPpLGX94a4Ns_W6TL4EGRouxWibu56pA%40mail.gmail.com
Discussion: https://postgr.es/m/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com
2018-06-26 17:17:27 +12:00
Michael Paquier
324076a626 Correct handling of fsync failures with tar mode of walmethods.c
This file has been missing the fact that it needs to report a proper
failure back to callers on fsync calls.  I spotted the one in
tar_finish() while Kuntal spotted the one in tar_close().
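
The missing checks amount to propagating fsync()'s return value; a hedged
sketch of the pattern (simplified from the tar method's close paths, field
name illustrative):

    /* Report fsync failure to the caller instead of silently succeeding. */
    if (fsync(tar_data->fd) != 0)
        return false;       /* caller inspects errno and aborts the base backup */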

Backpatch down to 10 where this code has been introduced.

Reported by: Michael Paquier, Kuntal Ghosh
Author: Michael Paquier
Reviewed-by: Kuntal Ghosh, Magnus Hagander
Discussion: https://postgr.es/m/20180625024356.GD1146@paquier.xyz
2018-06-26 09:56:55 +09:00
Alvaro Herrera
99fb44357c Update obsolete comments
Commit 9fab40ad32ef removed some pre-allocating logic in
reorderbuffer.c, but left outdated comments in place.  Repair.

Author: Álvaro Herrera
2018-06-25 15:43:05 -04:00
Michael Paquier
6eec6724ff Address set of issues with errno handling
System calls mixed up in error code paths are causing two issues which
several code paths have not correctly handled:
1) For write() calls, the system may sometimes return fewer bytes than
were requested, without errno being set.  Some code paths were careful
enough to consider that case and assumed that errno should then be set to
ENOSPC; other callers missed that.
2) errno generated by a system call can be overwritten by later system
calls that succeed once an error code path is taken, causing what is
reported to the user to be incorrect.
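
Both patterns have well-known idioms; a hedged sketch of the two fixes in
one write path (file name and variables illustrative):

    /* Careful write-error handling. */
    errno = 0;
    rc = write(fd, buf, len);
    if (rc != len)
    {
        int         save_errno = errno;

        /* (1) a short write may leave errno unset; treat it as out of space */
        if (save_errno == 0)
            save_errno = ENOSPC;

        /* (2) cleanup calls may clobber errno, so restore it before reporting */
        close(fd);
        errno = save_errno;
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
    }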

This patch uses the brute-force approach of correcting all those code
paths.  Some refactoring could happen in the future, but this is left as
future work, which is not targeted at back-branches anyway.

Author: Michael Paquier
Reviewed-by: Ashutosh Sharma
Discussion: https://postgr.es/m/20180622061535.GD5215@paquier.xyz
2018-06-25 11:20:19 +09:00
Tom Lane
b8a1d03020 Fix partial aggregation for variance(int4) and related aggregates.
A typo in numeric_poly_combine caused bogus results for queries using
it, but of course would only manifest if parallel aggregation is
performed.  Reported by Rajkumar Raghuwanshi.

David Rowley did the diagnosis and the fix; I editorialized rather
heavily on his regression test additions.

Back-patch to v10 where the breakage was introduced (by 9cca11c91).

Discussion: https://postgr.es/m/CAKcux6nU4E2x8nkSBpLOT2DPvQ5LviJ3SGyAN6Sz7qDH4G4+Pw@mail.gmail.com
2018-06-21 16:18:34 -04:00
Tom Lane
a4c95b0b80 Fix mishandling of sortgroupref labels while splitting SRF targetlists.
split_pathtarget_at_srfs() neglected to worry about sortgroupref labels
in the intermediate PathTargets it constructs.  I think we'd supposed
that their labeling didn't matter, but it does at least for the case that
GroupAggregate/GatherMerge nodes appear immediately under the ProjectSet
step(s).  This results in "ERROR: ORDER/GROUP BY expression not found in
targetlist" during create_plan(), as reported by Rajkumar Raghuwanshi.

To fix, make this logic track the sortgroupref labeling of expressions,
not just their contents.  This also restores the pre-v10 behavior that
separate GROUP BY expressions will be kept distinct even if they are
textually equal().

Discussion: https://postgr.es/m/CAKcux6=1_Ye9kx8YLBPmJs_xE72PPc6vNi5q2AOHowMaCWjJ2w@mail.gmail.com
2018-06-21 10:58:42 -04:00
Alvaro Herrera
04ab840b84 Update expected XML output with disabled XML, too 2018-06-20 13:02:46 -04:00
Alvaro Herrera
e10bc161f9 Accept TEXT and CDATA nodes in XMLTABLE's column_expression.
Column expressions that match TEXT or CDATA nodes must return the
contents of the nodes themselves, not the content of non-existing
children (i.e. the empty string).

Author: Markus Winand
Reported-by: Markus Winand
Reviewed-by: Álvaro Herrera
Discussion: https://postgr.es/m/0684A598-002C-42A2-AE12-F024A324EAE4@winand.at
2018-06-20 12:58:12 -04:00
Michael Paquier
5862174ec7 Clarify use of temporary tables within partition trees
Since their introduction, partition trees have been a bit lossy
regarding temporary relations.  Inheritance trees respect the following
patterns:
1) a child relation can be temporary if the parent is permanent.
2) a child relation can be temporary if the parent is temporary.
3) a child relation cannot be permanent if the parent is temporary.
4) The use of temporary relations also implies that both parent and
child need to be from the same session.

Partitions share many similar patterns with inheritance, however the
handling of the partition bounds makes the situation a bit tricky for
case 1), as the partition code bases a lot of its lookups upon
PartitionDesc, which does not really look after relpersistence.  This
causes, for example, a temporary partition created by session A to be
visible to another session B, preventing session B from creating an
extra partition which overlaps with the temporary one created by A, with
a non-intuitive error message.  There could be use-cases where mixing
permanent partitioned tables with temporary partitions makes sense, but
that would be a new feature.  Partitions respect 2), 3) and 4) already.

It is a bit depressing to see those error checks happening in
MergeAttributes() whose purpose is different, but that's left as future
refactoring work.

Back-patch down to 10, which is where partitioning has been introduced,
except that default partitions do not apply there.  Documentation also
includes limitations related to the use of temporary tables with
partition trees.

Reported-by: David Rowley
Author: Amit Langote, Michael Paquier
Reviewed-by: Ashutosh Bapat, Amit Langote, Michael Paquier
Discussion: https://postgr.es/m/CAKJS1f94Ojk0og9GMkRHGt8wHTW=ijq5KzJKuoBoqWLwSVwGmw@mail.gmail.com
2018-06-20 10:48:28 +09:00
Michael Paquier
fb28104a4b Prevent hard failures of standbys caused by recycled WAL segments
When a standby's WAL receiver stops reading WAL from a WAL stream, it
writes data to the current WAL segment without having previously zeroed
the page currently being written to, which can cause the WAL reader to
read junk data from a past recycled segment and then try to get a record
from it.  While the sanity checks in place provide most of the protection
needed, in some rare circumstances, with chances increasing when a record
header crosses a page boundary, the startup process could fail violently
on an allocation failure, as follows:
FATAL:  invalid memory alloc request size XXX

This is confusing for the user and also unhelpful, as in the worst case
it requires a manual restart of the instance, potentially impacting the
availability of the cluster, and it also makes the WAL data look as if it
were corrupted.

The chances of seeing failures are higher if the connection between the
standby and its root node is unstable, causing WAL pages to be written
only partway.  A couple of approaches have been discussed, like zeroing
new WAL pages within the WAL receiver itself, but this has the
disadvantage of impacting the performance of existing instances, as it
breaks the sequential writes done by the WAL receiver.  This commit
deals with the problem with a simpler approach, which has no performance
impact and does not reduce detection of the problem: if a record is found
with a length higher than 1GB in a backend, then do not attempt any
allocation and report a soft failure, which forces the standby to retry
reading WAL.  It is possible that the allocation call passes and that an
unnecessary amount of memory is allocated; however, follow-up checks on
records would just fail, making this allocation short-lived anyway.
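
A hedged sketch of the added guard in the record-length validation
(constant and error reporting simplified; the real check sits in the WAL
reader's allocation path):

    /*
     * For a streaming standby, a length beyond the 1GB palloc limit is
     * almost certainly junk from a recycled segment, so fail softly and let
     * the startup process retry reading WAL instead of PANICking on the
     * allocation.
     */
    if (total_len > MaxAllocSize)
    {
        report_invalid_record(state,
                              "record length %u at %X/%X too long",
                              total_len,
                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
        return NULL;
    }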

This patch owes a great deal to Tsunakawa Takayuki for reporting the
failure first, and then discussing a couple of potential approaches to
the problem.

Backpatch down to 9.5, which is where palloc_extended has been
introduced.

Reported-by: Tsunakawa Takayuki
Reviewed-by: Tsunakawa Takayuki
Author: Michael Paquier
Discussion: https://postgr.es/m/0A3221C70F24FB45833433255569204D1F8B57AD@G01JPEXMBYT05
2018-06-18 10:43:42 +09:00