When due to publication configuration, a TRUNCATE change ends up with
zero tables to be published, don't send the message out, just skip it.
It's not wrong, but obviously useless overhead.
Teach both base backups and pg_verify_checksums that if a page is new,
it does not have a checksum yet, so it shouldn't be verified.
Noted by Tomas Vondra, review by David Steele.
In commit 9c0a0de4c, I'd failed to notice that catalog/catalog.h
should also be considered a frontend-unsafe header, because it includes
(and needs) the full form of pg_class.h, not to mention relcache.h.
However, various frontend code was depending on it to get
TABLESPACE_VERSION_DIRECTORY, so refactoring of some sort is called for.
The cleanest answer seems to be to move TABLESPACE_VERSION_DIRECTORY,
as well as the OIDCHARS symbol, to common/relpath.h. Do that, and mop up
inclusions as necessary. (I found that quite a few current users of
catalog/catalog.h don't seem to need it at all anymore, apparently as a
result of the refactorings that created common/relpath.[hc]. And
initdb.c needed it only as a route to pg_class_d.h.)
Discussion: https://postgr.es/m/6629.1523294509@sss.pgh.pa.us
This reverts the backend sides of commit 1fde38beaa.
I have, at least for now, left the pg_verify_checksums tool in place, as
this tool can be very valuable without the rest of the patch as well,
and since it's a read-only tool that only runs when the cluster is down
it should be a lot safer.
Consolidate directory and file create permissions for tools which work
with the PG data directory by adding a new module (common/file_perm.c)
that contains variables (pg_file_create_mode, pg_dir_create_mode) and
constants to initialize them (0600 for files and 0700 for directories).
Convert mkdir() calls in the backend to MakePGDirectory() if the
original call used default permissions (always the case for regular PG
directories).
Add tests to make sure permissions in PGDATA are set correctly by the
tools which modify the PG data directory.
Authors: David Steele <david@pgmasters.net>,
Adam Brightwell <adam.brightwell@crunchydata.com>
Reviewed-By: Michael Paquier, with discussion amongst many others.
Discussion: https://postgr.es/m/ad346fe6-b23e-59f1-ecb7-0e08390ad629%40pgmasters.net
Update the built-in logical replication system to make use of the
previously added logical decoding for TRUNCATE support. Add the
required truncate callback to pgoutput and a new logical replication
protocol message.
Publications get a new attribute to determine whether to replicate
truncate actions. When updating a publication via pg_dump from an older
version, this is not set, thus preserving the previous behavior.
Author: Simon Riggs <simon@2ndquadrant.com>
Author: Marco Nenciarini <marco.nenciarini@2ndquadrant.it>
Author: Peter Eisentraut <peter.eisentraut@2ndquadrant.com>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
We don't actually need the insert-or-update logic, so it's clearer to
have separate functions for the inserting and updating.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
In case the subscription is removed before the worker is fully started,
give a specific error message instead of the generic "cache lookup"
error.
Author: Petr Jelinek <petr.jelinek@2ndquadrant.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
This makes it possible to turn checksums on in a live cluster, without
the previous need for dump/reload or logical replication (and to turn it
off).
Enabling checkusm starts a background process in the form of a
launcher/worker combination that goes through the entire database and
recalculates checksums on each and every page. Only when all pages have
been checksummed are they fully enabled in the cluster. Any failure of
the process will revert to checksums off and the process has to be
started.
This adds a new WAL record that indicates the state of checksums, so
the process works across replicated clusters.
Authors: Magnus Hagander and Daniel Gustafsson
Review: Tomas Vondra, Michael Banck, Heikki Linnakangas, Andrey Borodin
THis adds a "flags" field to the BackgroundWorkerInitializeConnection()
and BackgroundWorkerInitializeConnectionByOid(). For now only one flag,
BGWORKER_BYPASS_ALLOWCONN, is defined, which allows the worker to ignore
datallowconn.
We were being careless in some places about the order of -L switches in
link command lines, such that -L switches referring to external directories
could come before those referring to directories within the build tree.
This made it possible to accidentally link a system-supplied library, for
example /usr/lib/libpq.so, in place of the one built in the build tree.
Hilarity ensued, the more so the older the system-supplied library is.
To fix, break LDFLAGS into two parts, a sub-variable LDFLAGS_INTERNAL
and the main LDFLAGS variable, both of which are "recursively expanded"
so that they can be incrementally adjusted by different makefiles.
Establish a policy that -L switches for directories in the build tree
must always be added to LDFLAGS_INTERNAL, while -L switches for external
directories must always be added to LDFLAGS. This is sufficient to
ensure a safe search order. For simplicity, we typically also put -l
switches for the respective libraries into those same variables.
(Traditional make usage would have us put -l switches into LIBS, but
cleaning that up is a project for another day, as there's no clear
need for it.)
This turns out to also require separating SHLIB_LINK into two variables,
SHLIB_LINK and SHLIB_LINK_INTERNAL, with a similar rule about which
switches go into which variable. And likewise for PG_LIBS.
Although this change might appear to affect external users of pgxs.mk,
I think it doesn't; they shouldn't have any need to touch the _INTERNAL
variables.
In passing, tweak src/common/Makefile so that the value of CPPFLAGS
recorded in pg_config lacks "-DFRONTEND" and the recorded value of
LDFLAGS lacks "-L../../../src/common". Both of those things are
mistakes, apparently introduced during prior code rearrangements,
as old versions of pg_config don't print them. In general we don't
want anything that's specific to the src/common subdirectory to
appear in those outputs.
This is certainly a bug fix, but in view of the lack of field
complaints, I'm unsure whether it's worth the risk of back-patching.
In any case it seems wise to see what the buildfarm makes of it first.
Discussion: https://postgr.es/m/25214.1522604295@sss.pgh.pa.us
When base backups are run over the replication protocol (for example
using pg_basebackup), verify the checksums of all data blocks if
checksums are enabled. If checksum failures are encountered, log them
as warnings but don't abort the backup.
This becomes the default behaviour in pg_basebackup (provided checksums
are enabled on the server), so add a switch (-k) to disable the checks
if necessary.
Author: Michael Banck
Reviewed-By: Magnus Hagander, David Steele
Discussion: https://postgr.es/m/20180228180856.GE13784@nighthawk.caipicrew.dd-dns.de
Previously there was no way in the standby side to find out the host and port
of the sender server that the walreceiver was currently connected to when
multiple hosts and ports were specified in primary_conninfo. For that purpose,
this patch adds sender_host and sender_port columns into pg_stat_wal_receiver
view. They report the host and port that the active replication connection
currently uses.
Bump catalog version.
Author: Haribabu Kommi
Reviewed-by: Michael Paquier and me
Discussion: https://postgr.es/m/CAJrrPGcV_aq8=cdqkFhVDJKEnDQ70yRTTdY9RODzMnXNrCz2Ow@mail.gmail.com
The target cluster that was rewound needs to perform recovery from
the checkpoint created at failover, which leads it to remove or recreate
some files and directories that may have been copied from the source
cluster. So pg_rewind can skip synchronizing such files and directories,
and which reduces the amount of data transferred during a rewind
without changing the usefulness of the operation.
Author: Michael Paquier
Reviewed-by: Anastasia Lubennikova, Stephen Frost and me
Discussion: https://postgr.es/m/20180205071022.GA17337@paquier.xyz
Originally, we treated memory context names as potentially variable in
all cases, and therefore always copied them into the context header.
Commit 9fa6f00b1 rethought this a little bit and invented a distinction
between fixed and variable names, skipping the copy step for the former.
But we can make things both simpler and more useful by instead allowing
there to be two parts to a context's identification, a fixed "name" and
an optional, variable "ident". The name supplied in the context create
call is now required to be a compile-time-constant string in all cases,
as it is never copied but just pointed to. The "ident" string, if
wanted, is supplied later. This is needed because typically we want
the ident to be stored inside the context so that it's cleaned up
automatically on context deletion; that means it has to be copied into
the context before we can set the pointer.
The cost of this approach is basically just an additional pointer field
in struct MemoryContextData, which isn't much overhead, and is bought
back entirely in the AllocSet case by not needing a headerSize field
anymore, since we no longer have to cope with variable header length.
In addition, we can simplify the internal interfaces for memory context
creation still further, saving a few cycles there. And it's no longer
true that a custom identifier disqualifies a context from participating
in aset.c's freelist scheme, so possibly there's some win on that end.
All the places that were using non-compile-time-constant context names
are adjusted to put the variable info into the "ident" instead. This
allows more effective identification of those contexts in many cases;
for example, subsidary contexts of relcache entries are now identified
by both type (e.g. "index info") and relname, where before you got only
one or the other. Contexts associated with PL function cache entries
are now identified more fully and uniformly, too.
I also arranged for plancache contexts to use the query source string
as their identifier. This is basically free for CachedPlanSources, as
they contained a copy of that string already. We pay an extra pstrdup
to do it for CachedPlans. That could perhaps be avoided, but it would
make things more fragile (since the CachedPlanSource is sometimes
destroyed first). I suspect future improvements in error reporting will
require CachedPlans to have a copy of that string anyway, so it's not
clear that it's worth moving mountains to avoid it now.
This also changes the APIs for context statistics routines so that the
context-specific routines no longer assume that output goes straight
to stderr, nor do they know all details of the output format. This
is useful immediately to reduce code duplication, and it also allows
for external code to do something with stats output that's different
from printing to stderr.
The reason for pushing this now rather than waiting for v12 is that
it rethinks some of the API changes made by commit 9fa6f00b1. Seems
better for extension authors to endure just one round of API changes
not two.
Discussion: https://postgr.es/m/CAB=Je-FdtmFZ9y9REHD7VsSrnCkiBhsA4mdsLKSPauwXtQBeNA@mail.gmail.com
Commit 8694cc96b did this randomly differently from other callers of
parse_filename_for_nontemp_relation(). Perhaps unsurprisingly,
the randomly different way is wrong; it fails to ensure the
extracted string is null-terminated. Per buildfarm member skink.
Discussion: https://postgr.es/m/14453.1522001792@sss.pgh.pa.us
Per the project style guide, details and hints should have leading
capitalization and end with a period. On the other hand, errcontext should
not be capitalized and should not end with a period. To support well
formatted error contexts in dblink, extend dblink_res_error() to take a
format+arguments rather than a hardcoded string.
Daniel Gustafsson
Discussion: https://postgr.es/m/B3C002C8-21A0-4F53-A06E-8CAB29FCF295@yesql.se
Logical decoding should not publish anything about tables created as
part of a heap rewrite during DDL. Those tables don't exist externally,
so consumers of logical decoding cannot do anything sensible with that
information. In ab28feae2b, we worked
around this for built-in logical replication, but that was hack.
This is a more proper fix: We mark such transient heaps using the new
field pg_class.relwrite, linking to the original relation OID. By
default, we ignore them in logical decoding before they get to the
output plugin. Optionally, a plugin can register their interest in
getting such changes, if they handle DDL specially, in which case the
new field will help them get information about the actual table.
Reviewed-by: Craig Ringer <craig@2ndquadrant.com>
We call this thing a "transaction block" everywhere except in a few
functions, where it is mysteriously called a "transaction chain". In
the SQL standard, a transaction chain is something different. So rename
these functions to match the common terminology.
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
The logical replication type map seems to have been misused by its only
caller -- it would try to use the remote OID as input for local type
routines, which unsurprisingly could result in bogus "cache lookup
failed for type XYZ" errors, or random other type names being picked up
if they happened to use the right OID. Fix that, changing
Oid logicalrep_typmap_getid(Oid remoteid) to
char *logicalrep_typmap_gettypname(Oid remoteid)
which is more useful. If the remote type is not part of the typmap,
this simply prints "unrecognized type" instead of choking trying to
figure out -- a pointless exercise (because the only input for that
comes from replication messages, which are not under the local node's
control) and dangerous to boot, when called from within an error context
callback.
Once that is done, it comes to light that the local OID in the typmap
entry was not being used for anything; the type/schema names are what we
need, so remove local type OID from that struct.
Once you do that, it becomes pointless to attach a callback to regular
syscache invalidation. So remove that also.
Reported-by: Dang Minh Huong
Author: Masahiko Sawada
Reviewed-by: Álvaro Herrera, Petr Jelínek, Dang Minh Huong, Atsushi Torikoshi
Discussion: https://postgr.es/m/75DB81BEEA95B445AE6D576A0A5C9E936A6BE964@BPXM05GP.gisp.nec.co.jp
Discussion: https://postgr.es/m/75DB81BEEA95B445AE6D576A0A5C9E936A6C4B0A@BPXM05GP.gisp.nec.co.jp
If a walsender exits leaving data in reorderbuffers, the next walsender
that tries to decode the same transaction would append its decoded data
in the same spill files without truncating it first, which effectively
duplicate the data. Avoid that by removing any leftover reorderbuffer
spill files when a walsender starts.
Backpatch to 9.4; this bug has been there from the very beginning of
logical decoding.
Author: Craig Ringer, revised by me
Reviewed by: Álvaro Herrera, Petr Jelínek, Masahiko Sawada
In the pgoutput plugin, skip changes for relations that are not
publishable, per is_publishable_class(). This concerns in particular
materialized views and information_schema tables. While those relations
cannot be part of a publication, per existing checks, they will be
considered by a FOR ALL TABLES publication. A subscription would not
actually apply changes for those relations, again per existing checks,
but trying to match incoming changes to local tables on the subscriber
would lead to errors if no matching local table exists. Skipping those
changes on the publisher avoids sending useless changes and eliminates
the error.
Bug: #15044
Reported-by: Chad Trabant <chad@iris.washington.edu>
Reviewed-by: Petr Jelinek <petr.jelinek@2ndquadrant.com>
The reason for doing so is that it will allow expression evaluation to
optimize based on the underlying tupledesc. In particular it will
allow to JIT tuple deforming together with the expression itself.
For that expression initialization needs to be moved after the
relevant slots are initialized - mostly unproblematic, except in the
case of nodeWorktablescan.c.
After doing so there's no need for ExecAssignResultType() and
ExecAssignResultTypeFromTL() anymore, as all former callers have been
converted to create a slot with a fixed descriptor.
When creating a slot with a fixed descriptor, tts_values/isnull can be
allocated together with the main slot, reducing allocation overhead
and increasing cache density a bit.
Author: Andres Freund
Discussion: https://postgr.es/m/20171206093717.vqdxe5icqttpxs3p@alap3.anarazel.de
Other header files should never #include postgres.h (nor postgres_fe.h,
nor c.h), per project policy. Also, there's no need for any backend .c
file to explicitly include elog.h or palloc.h, because postgres.h pulls
those in already.
Extracted from a larger patch by Kyotaro Horiguchi. The rest of the
removals he suggests require more study, but these are no-brainers.
Discussion: https://postgr.es/m/20180215.200447.209320006.horiguchi.kyotaro@lab.ntt.co.jp
Ability to advance both physical and logical replication slots using a
new user function pg_replication_slot_advance().
For logical advance that means records are consumed as fast as possible
and changes are not given to output plugin for sending. Makes 2nd phase
(after we reached SNAPBUILD_FULL_SNAPSHOT) of replication slot creation
faster, especially when there are big transactions as the reorder buffer
does not have to deal with data changes and does not have to spill to
disk.
Author: Petr Jelinek
Reviewed-by: Simon Riggs
replorigin_drop() misunderstood the API for condition variables: it
had ConditionVariablePrepareToSleep and ConditionVariableCancelSleep
inside its test-and-sleep loop, rather than outside the loop as
intended. The net effect is a narrow race-condition window wherein,
if the process using a replication slot releases it immediately after
replorigin_drop() releases the ReplicationOriginLock, replorigin_drop()
would get into the condition variable's wait list too late and then
wait indefinitely for a signal that won't come.
Because there's a different CV for each replication slot, we can't
just move the ConditionVariablePrepareToSleep call to above the
test-and-sleep loop. What we can do, in the wake of commit 13db3b936,
is drop the ConditionVariablePrepareToSleep call entirely. This fix
depends on that commit because (at least in principle) the slot matching
the target replication origin might move around, so that once in a blue
moon successive loop iterations might involve different CVs. We can now
cope with such a scenario, at the cost of an extra trip through the
retry loop.
(There are ways we could fix this bug without depending on that commit,
but they're all a lot more complicated than this way.)
While at it, upgrade the rather skimpy comments in this function.
Back-patch to v10 where this code came in.
Discussion: https://postgr.es/m/19947.1515455433@sss.pgh.pa.us
25fff40798 introduced
default monitoring roles. Apply these corrections:
* Allow access to pg_stat_get_wal_senders()
by role pg_read_all_stats
* Correct comment in pg_stat_get_wal_receiver()
to show it is no longer superuser-only.
Author: Feike Steenbergen
Reviewed-by: Michael Paquier
Apply to HEAD, then later backpatch to 10
Logical decoding's reorderbuffer.c may spill transaction files to disk
when transactions are large. These are supposed to be removed when they
become "too old" by xid; but file removal requires the boundary LSNs of
the transaction to be known. The final_lsn is only set when we see the
commit or abort record for the transaction, but nothing sets the value
for transactions that crash, so the removal code misbehaves -- in
assertion-enabled builds, it crashes by a failed assertion.
To fix, modify the final_lsn of transactions that don't have a value
set, to the LSN of the very latest change in the transaction. This
causes the spilled files to be removed appropriately.
Author: Atsushi Torikoshi
Reviewed-by: Kyotaro HORIGUCHI, Craig Ringer, Masahiko Sawada
Discussion: https://postgr.es/m/54e4e488-186b-a056-6628-50628e4e4ebc@lab.ntt.co.jp
When walsenders were included in pg_stat_activity, only the ones
actually streaming WAL were listed as active when they were active. In
particular, the connections sending base backups were listed as being
idle. Which means that a regular pg_basebackup would show up with one
active and one idle connection, when both were active.
This patch updates to set all walsenders to active when they are
(including those doing very fast things like IDENTIFY_SYSTEM), and then
back to idle. Details about exactly what they are doing is available in
pg_stat_replication.
Patch by me, review by Michael Paquier and David Steele.
A momentary window exists when synchronous_standby_names
changes that allows commands issued after the change to
continue to act as async until the change becomes visible.
Remove the race by using more appropriate test in syncrep.c
Author: Asim Rama Praveen and Ashwin Agrawal
Reported-by: Xin Zhang, Ashwin Agrawal, and Asim Rama Praveen
Reviewed-by: Michael Paquier, Masahiko Sawada
Previously an assertion failure occurred when pg_stop_backup() for
non-exclusive backup was aborted while it's waiting for WAL files to
be archived. This assertion failure happened in do_pg_abort_backup()
which was called when a non-exclusive backup was canceled.
do_pg_abort_backup() assumes that there is at least one non-exclusive
backup running when it's called. But pg_stop_backup() can be canceled
even after it marks the end of non-exclusive backup (e.g.,
during waiting for WAL archiving). This broke the assumption that
do_pg_abort_backup() relies on, and which caused an assertion failure.
This commit changes do_pg_abort_backup() so that it does nothing
when non-exclusive backup has been already marked as completed.
That is, the asssumption is also changed, and do_pg_abort_backup()
now can handle even the case where it's called when there is
no running backup.
Backpatch to 9.6 where SQL-callable non-exclusive backup was added.
Author: Masahiko Sawada and Michael Paquier
Reviewed-By: Robert Haas and Fujii Masao
Discussion: https://www.postgresql.org/message-id/CAD21AoD2L1Fu2c==gnVASMyFAAaq3y-AQ2uEVj-zTCGFFjvmDg@mail.gmail.com
The logical slots have a fast code path for sending data so as not to
impose too high a per message overhead. The fast path skips checks for
interrupts and timeouts. However, the existing coding failed to consider
the fact that a transaction with a large number of changes may take a
very long time to be processed and sent to the client. This causes the
walsender to ignore interrupts for potentially a long time and more
importantly it will result in the walsender being killed due to
timeout at the end of such a transaction.
This commit changes the fast path to also check for interrupts and only
allows calling the fast path when the last keepalive check happened less
than half the walsender timeout ago. Otherwise the slower code path will
be taken.
Backpatched to 9.4
Petr Jelinek, reviewed by Kyotaro HORIGUCHI, Yura Sokolov, Craig
Ringer and Robert Haas.
Discussion: https://postgr.es/m/e082a56a-fd95-a250-3bae-0fff93832510@2ndquadrant.com
This patch makes a number of interrelated changes to reduce the overhead
involved in creating/deleting memory contexts. The key ideas are:
* Include the AllocSetContext header of an aset.c context in its first
malloc request, rather than allocating it separately in TopMemoryContext.
This means that we now always create an initial or "keeper" block in an
aset, even if it never receives any allocation requests.
* Create freelists in which we can save and recycle recently-destroyed
asets (this idea is due to Robert Haas).
* In the common case where the name of a context is a constant string,
just store a pointer to it in the context header, rather than copying
the string.
The first change eliminates a palloc/pfree cycle per context, and
also avoids bloat in TopMemoryContext, at the price that creating
a context now involves a malloc/free cycle even if the context never
receives any allocations. That would be a loser for some common
usage patterns, but recycling short-lived contexts via the freelist
eliminates that pain.
Avoiding copying constant strings not only saves strlen() and strcpy()
overhead, but is an essential part of the freelist optimization because
it makes the context header size constant. Currently we make no
attempt to use the freelist for contexts with non-constant names.
(Perhaps someday we'll need to think harder about that, but in current
usage, most contexts with custom names are long-lived anyway.)
The freelist management in this initial commit is pretty simplistic,
and we might want to refine it later --- but in common workloads that
will never matter because the freelists will never get full anyway.
To create a context with a non-constant name, one is now required to
call AllocSetContextCreateExtended and specify the MEMCONTEXT_COPY_NAME
option. AllocSetContextCreate becomes a wrapper macro, and it includes
a test that will complain about non-string-literal context name
parameters on gcc and similar compilers.
An unfortunate side effect of making AllocSetContextCreate a macro is
that one is now *required* to use the size parameter abstraction macros
(ALLOCSET_DEFAULT_SIZES and friends) with it; the pre-9.6 habit of
writing out individual size parameters no longer works unless you
switch to AllocSetContextCreateExtended.
Internally to the memory-context-related modules, the context creation
APIs are simplified, removing the rather baroque original design whereby
a context-type module called mcxt.c which then called back into the
context-type module. That saved a bit of code duplication, but not much,
and it prevented context-type modules from exercising control over the
allocation of context headers.
In passing, I converted the test-and-elog validation of aset size
parameters into Asserts to save a few more cycles. The original thought
was that callers might compute size parameters on the fly, but in practice
nobody does that, so it's useless to expend cycles on checking those
numbers in production builds.
Also, mark the memory context method-pointer structs "const",
just for cleanliness.
Discussion: https://postgr.es/m/2264.1512870796@sss.pgh.pa.us
do_pg_start_backup() expects its callers to pass in an open DIR pointer
for the pg_tblspc directory, but there's no apparent advantage in that.
It complicates the callers without adding any flexibility, and there's no
robustness advantage, since we surely have to be prepared for errors during
the scan of pg_tblspc anyway. In fact, by holding an extra kernel resource
during operations like the preliminary checkpoint, we might be making
things a fraction more failure-prone not less. Hence, remove that argument
and open the directory just for the duration of the actual scan.
Discussion: https://postgr.es/m/28752.1512413887@sss.pgh.pa.us