synchronous_standby_names cannot be reloaded safely by backends, and the
checkpointer is in charge of updating a state in shared memory if the
GUC is enabled in WalSndCtl, to let the backends know if they should
wait or not for a given LSN. This provides a strict control on the
timing of the waiting queues if the GUC is enabled or disabled, then
reloaded. The checkpointer is also in charge of waking up the backends
that could be waiting for a LSN when the GUC is disabled.
This logic had a race condition at startup, where it would be possible
for backends to not wait for a LSN even if synchronous_standby_names is
enabled. This would cause visibility issues with transactions that we
should be waiting for but they were not. The problem lasts until the
checkpointer does its initial update of the shared memory state when it
loads synchronous_standby_names.
In order to take care of this problem, the shared memory state in
WalSndCtl is extended to detect if it has been initialized by the
checkpointer, and not only check if synchronous_standby_names is
defined. In WalSndCtlData, sync_standbys_defined is renamed to
sync_standbys_status, a bits8 able to know about two states:
- If the shared memory state has been initialized. This flag is set by
the checkpointer at startup once, and never removed.
- If synchronous_standby_names is known as defined in the shared memory
state. This is the same as the previous sync_standbys_defined in
WalSndCtl.
This method gives a way for backends to decide what they should do until
the shared memory area is initialized, and they now ultimately fall back
to a check on the GUC value in this case, which is the best thing that
can be done.
Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by
the checkpointer when this process starts, so the window is very narrow.
It is possible to enlarge the problematic window by making the
checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined()
with a hardcoded sleep for example, and doing so has showed that a 2PC
visibility test is indeed failing. On machines slow enough, this bug
would cause spurious failures.
In 17~, we have looked at the possibility of adding an injection point
to have a reproducible test, but as the problematic window happens at
early startup, we would need to invent a way to make an injection point
optionally persistent across restarts when attached, something that
would be fine for this case as it would involve the checkpointer. This
issue is quite old, and can be reproduced on all the stable branches.
Author: Melnikov Maksim <m.melnikov@postgrespro.ru>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru
Backpatch-through: 13
WAL senders do not flush their statistics until they exit, limiting the
monitoring possible for live processes. This is penalizing when WAL
senders are running for a long time, like in streaming or logical
replication setups, because it is not possible to know the amount of IO
they generate while running.
This commit makes WAL senders more aggressive with their statistics
flush, using an internal of 1 second, with the flush timing calculated
based on the existing GetCurrentTimestamp() done before the sleeps done
to wait for some activity. Note that the sleep done for logical and
physical WAL senders happens in two different code paths, so the stats
flushes need to happen in these two places.
One test is added for the physical WAL sender case, and one for the
logical WAL sender case. This can be done in a stable fashion by
relying on the WAL generated by the TAP tests in combination with a
stats reset while a server is running, but only on HEAD as WAL data has
been added to pg_stat_io in a051e71e28.
This issue exists since a9c70b46db and the introduction of pg_stat_io,
so backpatch down to v16.
Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Reviewed-by: Xuneng Zhou <xunengzhou@gmail.com>
Discussion: https://postgr.es/m/Z73IsKBceoVd4t55@ip-10-97-1-34.eu-west-3.compute.internal
Backpatch-through: 16
This commit just does the minimal wiring up of the AIO subsystem, added in the
next commit, to the rest of the system. The next commit contains more details
about motivation and architecture.
This commit is kept separate to make it easier to review, separating the
changes across the tree, from the implementation of the new subsystem.
We discussed squashing this commit with the main commit before merging AIO,
but there has been a mild preference for keeping it separate.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
We want to support a "noreturn" decoration on more compilers besides
just GCC-compatible ones, but for that we need to move the decoration
in front of the function declaration instead of either behind it or
wherever, which is the current style afforded by GCC-style attributes.
Also rename the macro to "pg_noreturn" to be similar to the C11
standard "noreturn".
pg_noreturn is now supported on all compilers that support C11 (using
_Noreturn), as well as GCC-compatible ones (using __attribute__, as
before), as well as MSVC (using __declspec). (When PostgreSQL
requires C11, the latter two variants can be dropped.)
Now, all supported compilers effectively support pg_noreturn, so the
extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped.
This also fixes a possible problem if third-party code includes
stdnoreturn.h, because then the current definition of
#define pg_attribute_noreturn() __attribute__((noreturn))
would cause an error.
Note that the C standard does not support a noreturn attribute on
function pointer types. So we have to drop these here. There are
only two instances at this time, so it's not a big loss. In one case,
we can make up for it by adding the pg_noreturn to a wrapper function
and adding a pg_unreachable(), in the other case, the latter was
already done before.
Reviewed-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw
Once a replication slot is invalidated, it cannot be altered or used to
fetch changes. However, a process could still acquire an invalid slot and
fail later.
For example, if a process acquires a logical slot that was invalidated due
to wal_removed, it will eventually fail in CreateDecodingContext() when
attempting to access the removed WAL. Similarly, for physical replication
slots, even if the slot is invalidated and invalidation_reason is set to
wal_removed, the walsender does not currently check for invalidation when
starting physical replication. Instead, replication starts, and an error
is only reported later while trying to access WAL. Similarly, we prohibit
modifying slot properties for invalid slots but give the error for the
same after acquiring the slot.
This patch improves error handling by detecting invalid slots earlier at
the time of slot acquisition which is the first step. This also helped in
unifying different ERROR messages at different places and gave a
consistent message for invalid slots. This means that the message for
invalid slots will change to a generic message.
This will also be helpful for future patches where we are planning to
invalidate slots due to more reasons like idle_timeout because we don't
have to modify multiple places in such cases and avoid the chances of
missing out on a particular place.
Author: Nisha Moond <nisha.moond412@gmail.com>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Vignesh C <vignesh21@gmail.com>
Reviewed-by: Peter Smith <smithpb2250@gmail.com>
Reviewed-by: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Discussion: https://postgr.es/m/CABdArM6pBL5hPnSQ+5nEVMANcF4FCH7LQmgskXyiLY75TMnKpw@mail.gmail.com
Use the flex %option reentrant and the bison option %pure-parser to
make the generated scanner and parser pure, reentrant, and
thread-safe.
Make the generated scanner use palloc() etc. instead of malloc() etc.
Previously, we only used palloc() for the buffer, but flex would still
use malloc() for its internal structures. As a result, there could be
some small memory leaks in case of uncaught errors. Now, all the
memory is under palloc() control, so there are no more such issues.
Simplify flex scan buffer management: Instead of constructing the
buffer from pieces and then using yy_scan_buffer(), we can just use
yy_scan_string(), which does the same thing internally.
The previous code was necessary because we allocated the buffer with
palloc() and the rest of the state was handled by malloc(). But this
is no longer the case; everything is under palloc() now.
Use flex yyextra to handle context information, instead of global
variables. This complements the other changes to make the scanner
reentrant.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Co-authored-by: Andreas Karlsson <andreas@proxel.se>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/eb6faeac-2a8a-4b69-9189-c33c520e5b7b@eisentraut.org
The two_phase option is controlled by both the publisher (as a slot
option) and the subscriber (as a subscription option), so the slot option
must also be modified.
Changing the 'two_phase' option for a subscription from 'true' to 'false'
is permitted only when there are no pending prepared transactions
corresponding to that subscription. Otherwise, the changes of already
prepared transactions can be replicated again along with their corresponding
commit leading to duplicate data or errors.
To avoid data loss, the 'two_phase' option for a subscription can only be
changed from 'false' to 'true' once the initial data synchronization is
completed. Therefore this is performed later by the logical replication worker.
Author: Hayato Kuroda, Ajin Cherian, Amit Kapila
Reviewed-by: Peter Smith, Hou Zhijie, Amit Kapila, Vitaly Davydov, Vignesh C
Discussion: https://postgr.es/m/8fab8-65d74c80-1-2f28e880@39088166
Commit f4b54e1ed9, which introduced macros for protocol characters,
missed updating a few places. It also did not introduce macros for
messages sent from parallel workers to their leader processes.
This commit adds a new section in protocol.h for those.
Author: Aleksander Alekseev
Discussion: https://postgr.es/m/CAJ7c6TNTd09AZq8tGaHS3LDyH_CCnpv0oOz2wN1dGe8zekxrdQ%40mail.gmail.com
Backpatch-through: 17
Commit 0fdab27ad6 changed the code to wait for WAL to be available before
determining the timeline but forgot to move the failure check.
This change is to make the related code easier to understand and enhance
otherwise there is no bug in the current code.
In the passing, improve the nearby comments to explain why we determine
am_cascading_walsender after waiting for the required WAL.
Author: Peter Smith
Reviewed-by: Bertrand Drouvot, Amit Kapila
Discussion: https://postgr.es/m/CAHut+PvqX49fusLyXspV1Mmd_EekPtXG0oT146vZjcb9XDvNgw@mail.gmail.com
Up to now, committing a transaction has caused CurrentMemoryContext to
get set to TopMemoryContext. Most callers did not pay any particular
heed to this, which is problematic because TopMemoryContext is a
long-lived context that never gets reset. If the caller assumes it
can leak memory because it's running in a limited-lifespan context,
that behavior translates into a session-lifespan memory leak.
The first-reported instance of this involved ProcessIncomingNotify,
which is called from the main processing loop that normally runs in
MessageContext. That outer-loop code assumes that whatever it
allocates will be cleaned up when we're done processing the current
client message --- but if we service a notify interrupt, then whatever
gets allocated before the next switch to MessageContext will be
permanently leaked in TopMemoryContext. sinval catchup interrupts
have a similar problem, and I strongly suspect that some places in
logical replication do too.
To fix this in a generic way, let's redefine the behavior as
"CommitTransactionCommand restores the memory context that was current
at entry to StartTransactionCommand". This clearly fixes the issue
for the notify and sinval cases, and it seems to match the mental
model that's in use in the logical replication code, to the extent
that anybody thought about it there at all.
For consistency, likewise make subtransaction exit restore the context
that was current at subtransaction start (rather than always selecting
the CurTransactionContext of the parent transaction level). This case
has less risk of resulting in a permanent leak than the outer-level
behavior has, but it would not meet the principle of least surprise
for some CommitTransactionCommand calls to restore the previous
context while others don't.
While we're here, also change xact.c so that we reset
TopTransactionContext at transaction exit and then re-use it in later
transactions, rather than dropping and recreating it in each cycle.
This probably doesn't save a lot given the context recycling mechanism
in aset.c, but it should save a little bit. Per suggestion from David
Rowley. (Parenthetically, the text in src/backend/utils/mmgr/README
implies that this is how I'd planned to implement it as far back as
commit 1aebc3618 --- but the code actually added in that commit just
drops and recreates it each time.)
This commit also cleans up a few places outside xact.c that were
needlessly making TopMemoryContext current, and thus risking more
leaks of the same kind. I don't think any of them represent
significant leak risks today, but let's deal with them while the
issue is top-of-mind.
Per bug #18512 from wizardbrony. Commit to HEAD only, as this change
seems to have some risk of breaking things for some callers. We'll
apply a narrower fix for the known-broken cases in the back branches.
Discussion: https://postgr.es/m/3478884.1718656625@sss.pgh.pa.us
The standby_slot_names GUC allows the specification of physical standby
slots that must be synchronized before the logical walsenders associated
with logical failover slots. However, for this purpose, the GUC name is
too generic.
Author: Hou Zhijie
Reviewed-by: Bertrand Drouvot, Masahiko Sawada
Backpatch-through: 17
Discussion: https://postgr.es/m/ZnWeUgdHong93fQN@momjian.us
Allow pg_sync_replication_slots() to error out during promotion of standby.
This makes the behavior of the SQL function consistent with the slot sync
worker. We also ensured that pg_sync_replication_slots() cannot be
executed if sync_replication_slots is enabled and the slotsync worker is
already running to perform the synchronization of slots. Previously, it
would have succeeded in cases when the worker is idle and failed when it
is performing sync which could confuse users.
This patch fixes another issue in the slot sync worker where
SignalHandlerForShutdownRequest() needs to be registered *before* setting
SlotSyncCtx->pid, otherwise, the slotsync worker could miss handling
SIGINT sent by the startup process(ShutDownSlotSync) if it is sent before
worker could register SignalHandlerForShutdownRequest(). To be consistent,
all signal handlers' registration is moved to a prior location before we
set the worker's pid.
Ensure that we clean up synced temp slots at the end of
pg_sync_replication_slots() to avoid such slots being left over after
promotion.
Ensure that ShutDownSlotSync() captures SlotSyncCtx->pid under spinlock to
avoid accessing invalid value as it can be reset by concurrent slot sync
exit due to an error.
Author: Shveta Malik
Reviewed-by: Hou Zhijie, Bertrand Drouvot, Amit Kapila, Masahiko Sawada
Discussion: https://postgr.es/m/CAJpy0uBefXUS_TSz=oxmYKHdg-fhxUT0qfjASW3nmqnzVC3p6A@mail.gmail.com
In WalSndWaitForWal(), we fetch a recent flush pointer both outside the
loop and inside the loop. But we start using RecentFlushPtr only after we
fetch it inside the loop. So we can remove one outside the loop.
Author: Shveta Malik
Reviewed-by: Bertrand Drouvot, Matthias van de Meent, Amit Kapila
Discussion: https://postgr.es/m/CAJpy0uBSCQz1yMD-WiEthzEe23dti2-Kr_pitVb7vAPFbFKm=A@mail.gmail.com
This patch provides a way to ensure that physical standbys that are
potential failover candidates have received and flushed changes before
the primary server making them visible to subscribers. Doing so guarantees
that the promoted standby server is not lagging behind the subscribers
when a failover is necessary.
The logical walsender now guarantees that all local changes are sent and
flushed to the standby servers corresponding to the replication slots
specified in 'standby_slot_names' before sending those changes to the
subscriber.
Additionally, the SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to ensure that they process changes for failover slots only after physical
slots specified in 'standby_slot_names' have confirmed WAL receipt for those.
Author: Hou Zhijie and Shveta Malik
Reviewed-by: Masahiko Sawada, Peter Smith, Bertrand Drouvot, Ajin Cherian, Nisha Moond, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
as determined by include-what-you-use (IWYU)
While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that. In some cases, a more
specific #include replaces another less specific one.
Some manual adjustments of the automatic result:
- IWYU currently doesn't know about includes that provide global
variable declarations (like -Wmissing-variable-declarations), so
those includes are being kept manually.
- All includes for port(ability) headers are being kept for now, to
play it safe.
- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
patch from exploding in size.
Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.
As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.
Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
Now that BackendId was just another index into the proc array, it was
redundant with the 0-based proc numbers used in other places. Replace
all usage of backend IDs with proc numbers.
The only place where the term "backend id" remains is in a few pgstat
functions that expose backend IDs at the SQL level. Those IDs are now
in fact 0-based ProcNumbers too, but the documentation still calls
them "backend ids". That term still seems appropriate to describe what
the numbers are, so I let it be.
One user-visible effect is that pg_temp_0 is now a valid temp schema
name, for backend with ProcNumber 0.
Reviewed-by: Andres Freund
Discussion: https://www.postgresql.org/message-id/8171f1aa-496f-46a6-afc3-c46fe7a9b407@iki.fi
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.
The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.
A new parameter sync_replication_slots enables or disables this new
process.
On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.
A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.
Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Previously, some callers requested XLOG_BLCKSZ bytes
unconditionally. While this did not cause a problem, because the extra
bytes are ignored, it's confusing and makes it harder to add safety
checks. Additionally, the comment about zero padding was incorrect.
With this commit, all callers request the number of bytes they
actually need.
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi
Discussion: https://postgr.es/m/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=AEEZPJmqhpEOBGExg@mail.gmail.com
Presently, we rely on each individual signal handler to save the
initial value of errno and then restore it before returning if
needed. This is easily forgotten and, if missed, often goes
undetected for a long time.
In commit 3b00fdba9f, we introduced a wrapper signal handler
function that checks whether MyProcPid matches getpid(). This
commit moves the aforementioned errno restoration code from the
individual signal handlers to the new wrapper handler so that we no
longer need to worry about missing it.
Reviewed-by: Andres Freund, Noah Misch
Discussion: https://postgr.es/m/20231121212008.GA3742740%40nathanxps13
This commit introduces a new SQL function pg_sync_replication_slots()
which is used to synchronize the logical replication slots from the
primary server to the physical standby so that logical replication can be
resumed after a failover or planned switchover.
A new 'synced' flag is introduced in pg_replication_slots view, indicating
whether the slot has been synchronized from the primary server. On a
standby, synced slots cannot be dropped or consumed, and any attempt to
perform logical decoding on them will result in an error.
The logical replication slots on the primary can be synchronized to the
hot standby by using the 'failover' parameter of
pg-create-logical-replication-slot(), or by using the 'failover' option of
CREATE SUBSCRIPTION during slot creation, and then calling
pg_sync_replication_slots() on standby. For the synchronization to work,
it is mandatory to have a physical replication slot between the primary
and the standby aka 'primary_slot_name' should be configured on the
standby, and 'hot_standby_feedback' must be enabled on the standby. It is
also necessary to specify a valid 'dbname' in the 'primary_conninfo'.
If a logical slot is invalidated on the primary, then that slot on the
standby is also invalidated.
If a logical slot on the primary is valid but is invalidated on the
standby, then that slot is dropped but will be recreated on the standby in
the next pg_sync_replication_slots() call provided the slot still exists
on the primary server. It is okay to recreate such slots as long as these
are not consumable on standby (which is the case currently). This
situation may occur due to the following reasons:
- The 'max_slot_wal_keep_size' on the standby is insufficient to retain
WAL records from the restart_lsn of the slot.
- 'primary_slot_name' is temporarily reset to null and the physical slot
is removed.
The slot synchronization status on the standby can be monitored using the
'synced' column of pg_replication_slots view.
A functionality to automatically synchronize slots by a background worker
and allow logical walsenders to wait for the physical will be done in
subsequent commits.
Author: Hou Zhijie, Shveta Malik, Ajin Cherian based on an earlier version by Peter Eisentraut
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
If available, read directly from WAL buffers, avoiding the need to go
through the filesystem. Only for physical replication for now, but can
be expanded to other callers.
In preparation for replicating unflushed WAL data.
Author: Bharath Rupireddy
Discussion: https://postgr.es/m/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
Reviewed-by: Andres Freund, Alvaro Herrera, Nathan Bossart, Dilip Kumar, Nitin Jadhav, Melih Mutlu, Kyotaro Horiguchi
This commit implements a new replication command called
ALTER_REPLICATION_SLOT and a corresponding walreceiver API function named
walrcv_alter_slot. Additionally, the CREATE_REPLICATION_SLOT command has
been extended to support the failover option.
These new additions allow the modification of the failover property of a
replication slot on the publisher. A subsequent commit will make use of
these commands in subscription commands and will add the tests as well to
cover the functionality added/changed by this commit.
Author: Hou Zhijie, Shveta Malik
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda, Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
This commit adds the failover property to the replication slot. The
failover property indicates whether the slot will be synced to the standby
servers, enabling the resumption of corresponding logical replication
after failover. But note that this commit does not yet include the
capability to sync the replication slot; the subsequent commits will add
that capability.
A new optional parameter 'failover' is added to the
pg_create_logical_replication_slot() function. We will also enable to set
'failover' option for slots via the subscription commands in the
subsequent commits.
The value of the 'failover' flag is displayed as part of
pg_replication_slots view.
Author: Hou Zhijie, Shveta Malik, Ajin Cherian
Reviewed-by: Peter Smith, Bertrand Drouvot, Dilip Kumar, Masahiko Sawada, Nisha Moond, Kuroda, Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Patch by me. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub
Wartak, Peter Eisentraut, and Álvaro Herrera. Thanks especially to
Jakub for incredibly helpful and extensive testing.
Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
This refactoring reduces the code in charge of creating replication
slots from two "if" block to a single one, making it slightly cleaner.
This change is possible since 1d04a59be3, that has removed the
intermediate code that existed between the two "if" blocks in charge of
initializing the output message buffer.
Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+PtnJzqKT41Zt8pChRzba=QgCqjtfYvcf84NMj3VFJoKfw@mail.gmail.com
The event names use the same case-insensitive characters, hence applying
lower() or upper() to the monitoring queries allows the detection of the
same events as before this change. It is possible to cross-check the
data with the system view pg_wait_events, for instance, with a query
like that showing no differences:
SELECT lower(type), lower(name), description
FROM pg_wait_events ORDER BY 1, 2;
This will help in the introduction of more simplifications in the format
of wait_event_names. Some of the enum values in the code had to be
renamed a bit to follow the same convention naming across the board.
Reviewed-by: Bertrand Drouvot
Discussion: https://postgr.es/m/ZOxVHQwEC/9X/p/z@paquier.xyz
The code is able to compile already without warnings under
-Wshadow=compatible-local, which is itself already enabled in the tree,
and the ones fixed here showed up with the more restrictive -Wshadow.
There are more of these that we may want to look at, and the ones fixed
here made the code confusing.
Author: Peter Smith
Discussion: https://postgr.es/m/CAHut+PuR0y4ofNOxi691VTVWmBfScHV9AaBMGSpeh8+DKp81Nw@mail.gmail.com
This commit introduces descriptively-named macros for the
identifiers used in wire protocol messages. These new macros are
placed in a new header file so that they can be easily used by
third-party code.
Author: Dave Cramer
Reviewed-by: Alvaro Herrera, Tatsuo Ishii, Peter Smith, Robert Haas, Tom Lane, Peter Eisentraut, Michael Paquier
Discussion: https://postgr.es/m/CADK3HHKbBmK-PKf1bPNFoMC%2BoBt%2BpD9PH8h5nvmBQskEHm-Ehw%40mail.gmail.com
Most, but not all, of our "/*----" style comments have an end-guard
line with dashes at the end of the comment. However, pgindent doesn't
care about the end-guards, so they mostly just waste screen
space. Going forward, let's not require end-guards.
Remove a broken end-guard in a comment in walsender.c that led me to
think about this. Remove the end guard from another comment in
walsender.c for consistency, so that we use the same style in all
comments in the file.
However, we have thousands of existing "/*----" comments the repository,
so it's not worth the code churn to change them all.
Discussion: https://www.postgresql.org/message-id/fb083c91-d490-3b65-25f3-05e9118b6b0d%40iki.fi
WalSndWakeup() currently loops through all the walsenders slots, with a
spinlock acquisition and release for every iteration, to wake up waiting
walsenders.
This commonly was not a problem before e101dfac3a. But, to allow logical
decoding on standbys, we need to wake up logical walsenders after every WAL
record is applied on the standby, rather just when flushing WAL or switching
timelines. This causes a performance regression for workloads replaying a lot
of WAL records.
To solve this, we use condition variable (CV) to efficiently wake up
walsenders in WalSndWakeup().
Every walsender prepares to sleep on a shared memory CV. Note that it just
prepares to sleep on the CV (i.e., adds itself to the CV's waitlist), but does
not actually wait on the CV (IOW, it never calls ConditionVariableSleep()). It
still uses WaitEventSetWait() for waiting, because CV infrastructure doesn't
handle FeBe socket events currently. The processes (startup process,
walreceiver etc.) wanting to wake up walsenders use
ConditionVariableBroadcast(), which in turn calls SetLatch(), helping
walsenders come out of WaitEventSetWait().
We use separate shared memory CVs for physical and logical walsenders for
selective wake ups, see WalSndWakeup() for more details.
This approach is simple and reasonably efficient. But not very elegant. But
for 16 it seems to be a better path than a larger redesign of the CV
mechanism. A desirable future improvement would be to add support for CVs
into WaitEventSetWait().
This still leaves us with a small regression in very extreme workloads (due to
the spinlock acquisition in ConditionVariableBroadcast() when there are no
waiters) - but that seems acceptable.
Reported-by: Andres Freund <andres@anarazel.de>
Suggested-by: Andres Freund <andres@anarazel.de>
Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Zhijie Hou <houzj.fnst@fujitsu.com>
Discussion: https://www.postgresql.org/message-id/20230509190247.3rrplhdgem6su6cg%40awork3.anarazel.de
Unsurprisingly, this requires wal_level = logical to be set on the primary and
standby. The infrastructure added in 26669757b6 ensures that slots are
invalidated if the primary's wal_level is lowered.
Creating a slot on a standby waits for a xl_running_xact record to be
processed. If the primary is idle (and thus not emitting xl_running_xact
records), that can take a while. To make that faster, this commit also
introduces the pg_log_standby_snapshot() function. By executing it on the
primary, completion of slot creation on the standby can be accelerated.
Note that logical decoding on a standby does not itself enforce that required
catalog rows are not removed. The user has to use physical replication slots +
hot_standby_feedback or other measures to prevent that. If catalog rows
required for a slot are removed, the slot is invalidated.
See 6af1793954 for an overall design of logical decoding on a standby.
Bumps catversion, for the addition of the pg_log_standby_snapshot() function.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de> (in an older version)
Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version)
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: FabrÌzio de Royes Mello <fabriziomello@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.
Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.
Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
Previously we had checks for this in multiple places. Support for logical
decoding on standbys will add other forms of invalidation, making it worth
while to centralize the checks.
This slightly changes the error message for both the walsender and SQL
interface. Particularly the SQL interface error was inaccurate, as the "This
slot has never previously reserved WAL" portion was unreachable.
Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
As per one of the CI reports, there is an assertion failure which
indicates that we were trying to use an unenforced xmin horizon for
decoding snapshots. Though, we couldn't figure out the reason for
assertion failure these checks would help us in finding the reason if the
problem happens again in the future.
Author: Amit Kapila based on suggestions by Andres Freund
Reviewd by: Andres Freund
Discussion: https://postgr.es/m/CAA4eK1L8wYcyTPxNzPGkhuO52WBGoOZbT0A73Le=ZUWYAYmdfw@mail.gmail.com
This adjusts a few things for GUCs related to logical replication,
replication slots and WAL senders, in the shape of incorrect comments
and values inconsistent with their initial default value.
Author: Peter Smith
Reviewed-by: Nathan Bossart, Tom Lane, Justin Pryzby
Discussion: https://postgr.es/m/CAHut+PtHE0XSfjjRQ6D4v7+dqzCw=d+1a64ujra4EX8aoc_Z+w@mail.gmail.com
Per discussion, the existing routine name able to initialize a SRF
function with materialize mode is unpopular, so rename it. Equally, the
flags of this function are renamed, as of:
- SRF_SINGLE_USE_EXPECTED -> MAT_SRF_USE_EXPECTED_DESC
- SRF_SINGLE_BLESS -> MAT_SRF_BLESS
The previous function and flags introduced in 9e98583 are kept around
for compatibility purposes, so as any extension code already compiled
with v15 continues to work as-is. The declarations introduced here for
compatibility will be removed from HEAD in a follow-up commit.
The new names have been suggested by Andres Freund and Melanie
Plageman.
Discussion: https://postgr.es/m/20221013194820.ciktb2sbbpw7cljm@awork3.anarazel.de
Backpatch-through: 15
This replaces all MemSet() calls with struct initialization where that
is easily and obviously possible. (For example, some cases have to
worry about padding bits, so I left those.)
(The same could be done with appropriate memset() calls, but this
patch is part of an effort to phase out MemSet(), so it doesn't touch
memset() calls.)
Reviewed-by: Ranier Vilela <ranier.vf@gmail.com>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://www.postgresql.org/message-id/9847b13c-b785-f4e2-75c3-12ec77a3b05c@enterprisedb.com