All the errors triggered in the code paths patched here would cause the
backend to issue an internal_error errcode, which is a state that should
be used only for "can't happen" situations. However, these code paths
are reachable by the regression tests, and could be seen by users in
valid cases. Some regression tests expect internal errcodes as they
manipulate the backend state to cause corruption (like checksums), or
use elog() because it is more convenient (like injection points), these
have no need to change.
This reduces the number of internal failures triggered in a check-world
by more than half, while providing correct errcodes for these valid
cases.
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/Zic_GNgos5sMxKoa@paquier.xyz
After further review, we want to move in the direction of always
quoting GUC names in error messages, rather than the previous (PG16)
wildly mixed practice or the intermittent (mid-PG17) idea of doing
this depending on how possibly confusing the GUC name is.
This commit applies appropriate quotes to (almost?) all mentions of
GUC names in error messages. It partially supersedes a243569bf6 and
8d9978a717, which had moved things a bit in the opposite direction
but which then were abandoned in a partial state.
Author: Peter Smith <smithpb2250@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com
06c418e163 introduced pg_wal_replay_wait() procedure allowing to wait for
the particular LSN to be replayed on standby. The waiters were stored in
the flat array. Even though scanning small arrays is fast, that might be a
problem at scale (a lot of waiting processes).
This commit replaces the flat shared memory array with the pairing heap,
which holds the waiter with the least LSN at the top. This gives us O(log N)
complexity for both inserting and removing waiters.
Reported-by: Alvaro Herrera
Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
pg_wal_replay_wait() is to be used on standby and specifies waiting for
the specific WAL location to be replayed before starting the transaction.
This option is useful when the user makes some data changes on primary and
needs a guarantee to see these changes on standby.
The queue of waiters is stored in the shared memory array sorted by LSN.
During replay of WAL waiters whose LSNs are already replayed are deleted from
the shared memory array and woken up by setting of their latches.
pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise,
the snapshot could prevent the replay of WAL records implying a kind of
self-deadlock. This is why it is only possible to implement
pg_wal_replay_wait() as a procedure working in a non-atomic context,
not a function.
Catversion is bumped.
Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru
Author: Kartyshov Ivan, Alexander Korotkov
Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila
Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
Two FATALs and one PANIC gain details about the LSNs they fail at:
- When restoring from a backup_label, the FATAL log generated when not
finding the checkpoint record now reports its LSN.
- When restoring from a backup_label, the FATAL log generated when not
finding the redo record referenced by a checkpoint record now shows both
the redo and checkpoint record LSNs.
- When not restoring from a backup_label, the PANIC error generated when
not finding the checkpoint record now reports its LSN.
This information is useful when debugging corruption issues, and these
LSNs may not show up in the logs depending on the level of logging
configured in the backend.
Author: David Steele
Discussion: https://postgr.es/m/0e90da89-77ca-4ccf-872c-9626d755e288@pgmasters.net
as determined by include-what-you-use (IWYU)
While IWYU also suggests to *add* a bunch of #include's (which is its
main purpose), this patch does not do that. In some cases, a more
specific #include replaces another less specific one.
Some manual adjustments of the automatic result:
- IWYU currently doesn't know about includes that provide global
variable declarations (like -Wmissing-variable-declarations), so
those includes are being kept manually.
- All includes for port(ability) headers are being kept for now, to
play it safe.
- No changes of catalog/pg_foo.h to catalog/pg_foo_d.h, to keep the
patch from exploding in size.
Note that this patch touches just *.c files, so nothing declared in
header files changes in hidden ways.
As a small example, in src/backend/access/transam/rmgr.c, some IWYU
pragma annotations are added to handle a special case there.
Discussion: https://www.postgresql.org/message-id/flat/af837490-6b2f-46df-ba05-37ea6a6653fc%40eisentraut.org
By enabling slot synchronization, all the failover logical replication
slots on the primary (assuming configurations are appropriate) are
automatically created on the physical standbys and are synced
periodically. The slot sync worker on the standby server pings the primary
server at regular intervals to get the necessary failover logical slots
information and create/update the slots locally. The slots that no longer
require synchronization are automatically dropped by the worker.
The nap time of the worker is tuned according to the activity on the
primary. The slot sync worker waits for some time before the next
synchronization, with the duration varying based on whether any slots were
updated during the last cycle.
A new parameter sync_replication_slots enables or disables this new
process.
On promotion, the slot sync worker is shut down by the startup process to
drop any temporary slots acquired by the slot sync worker and to prevent
the worker from trying to fetch the failover slots.
A functionality to allow logical walsenders to wait for the physical will
be done in a subsequent commit.
Author: Shveta Malik, Hou Zhijie based on design inputs by Masahiko Sawada and Amit Kapila
Reviewed-by: Masahiko Sawada, Bertrand Drouvot, Peter Smith, Dilip Kumar, Ajin Cherian, Nisha Moond, Kuroda Hayato, Amit Kapila
Discussion: https://postgr.es/m/514f6f2f-6833-4539-39f1-96cd1e011f23@enterprisedb.com
Three LOG messages are added in the recovery code paths, providing
information that can be useful to track corruption issues depending on
the state of the cluster, telling that:
- Recovery has started from a backup_label.
- Recovery is restarting from a backup start LSN, without a
backup_label.
- Recovery has completed from a backup.
Author: Andres Freund
Reviewed-by: David Steele, Laurenz Albe, Michael Paquier
Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp2pp@awork3.anarazel.de
To take an incremental backup, you use the new replication command
UPLOAD_MANIFEST to upload the manifest for the prior backup. This
prior backup could either be a full backup or another incremental
backup. You then use BASE_BACKUP with the INCREMENTAL option to take
the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST
option to trigger this behavior.
An incremental backup is like a regular full backup except that
some relation files are replaced with files with names like
INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains
additional lines identifying it as an incremental backup. The new
pg_combinebackup tool can be used to reconstruct a data directory
from a full backup and a series of incremental backups.
Patch by me. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub
Wartak, Peter Eisentraut, and Álvaro Herrera. Thanks especially to
Jakub for incredibly helpful and extensive testing.
Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
This GUC was intended as a debugging help in the 9.0 area when hot
standby and streaming replication were being developped, able to offer
more information at LOG level rather than DEBUGn. There are more tools
available these days that are able to offer rather equivalent
information, like pg_waldump introduced in 9.3. It is not obvious how
this facility is useful these days, so let's remove it.
Author: Bharath Rupireddy
Discussion: https://postgr.es/m/ZXEXEAUVFrvpquSd@paquier.xyz
When beginning recovery, a LOG is displayed by the startup process to
show which recovery mode will be used depending on the .signal file(s)
set in the data folder, like "standby mode", recovery up to a given
target type and value, or archive recovery.
A different patch is under discussion to simplify the startup code by
requiring the presence of recovery.signal and/or standby.signal when a
backup_label file is read. Delaying a bit this LOG ensures that the
correct recovery mode would be reported, and putting it at this position
does not make it lose its value.
While on it, this commit adds a few comments documenting a bit more the
initial recovery steps and their dependencies, and fixes an incorrect
comment format. This introduces no behavior changes.
Extracted from a larger patch by me.
Reviewed-by: David Steele, Bowen Shi
Discussion: https://postgr.es/m/ZArVOMifjzE7f8W7@paquier.xyz
When beginning recovery from a base backup by reading a backup_label
file, it may be possible that no checkpoint record is available
depending on the method used when the case backup was taken, which would
prevent recovery from beginning. In this case, the FATAL messages
issued, initially added by c900c15269, mentioned recovery.signal as
an option to do recovery but not standby.signal. Let's add it as an
available option, for clarity.
Per suggestion from Bowen Shi, extracted from a larger patch by me.
Author: Michael Paquier
Discussion: https://postgr.es/m/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
Since C99, there can be a trailing comma after the last value in an
enum definition. A lot of new code has been introducing this style on
the fly. Some new patches are now taking an inconsistent approach to
this. Some add the last comma on the fly if they add a new last
value, some are trying to preserve the existing style in each place,
some are even dropping the last comma if there was one. We could
nudge this all in a consistent direction if we just add the trailing
commas everywhere once.
I omitted a few places where there was a fixed "last" value that will
always stay last. I also skipped the header files of libpq and ecpg,
in case people want to use those with older compilers. There were
also a small number of cases where the enum type wasn't used anywhere
(but the enum values were), which ended up confusing pgindent a bit,
so I left those alone.
Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
This shouldn't change behavior except in the unusual case where
there are file in the tablespace directory that have entirely
numeric names but are nevertheless not possible names for a
tablespace directory, either because their names have leading zeroes
that shouldn't be there, or the value is actually zero, or because
the value is too large to represent as an OID.
In those cases, the directory would previously have made it into
the list of tablespaceinfo objects and no longer will. Thus, base
backups will now ignore such directories, instead of treating them
as legitimate tablespace directories. Similarly, if entries for
such tablespaces occur in a tablespace_map file, they will now
be rejected as erroneous, instead of being honored.
This is infrastructure for future work that wants to be able to
know the tablespace of each relation that is part of a backup
*as an OID*. By strengthening the up-front validation, we don't
have to worry about weird cases later, and can more easily avoid
repeated string->integer conversions.
Patch by me, reviewed by David Steele.
Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream. There are other possible applications as
well; see the discussion links below.
Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.
While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.
Bump XLOG_PAGE_MAGIC.
Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.
Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de
Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
Run pgindent, pgperltidy, and reformat-dat-files.
This set of diffs is a bit larger than typical. We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop). We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up. Going
forward, that should make for fewer random-seeming changes to existing
code.
Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
Physical walsenders can't send data until it's been flushed; logical
walsenders can't decode and send data until it's been applied. On the
standby, the WAL is flushed first, which will only wake up physical
walsenders; and then applied, which will only wake up logical
walsenders.
Previously, all walsenders were awakened when the WAL was flushed. That
was fine for logical walsenders on the primary; but on the standby the
flushed WAL would have been not applied yet, so logical walsenders were
awakened too early.
Per idea from Jeff Davis and Amit Kapila.
Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>
Reviewed-By: Jeff Davis <pgsql@j-davis.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAA4eK1+zO5LUeisabX10c81LU-fWMKO4M9Wyg1cdkbW7Hqh6vQ@mail.gmail.com
In standby mode, we don't actually report progress of recovery,
but up until now, startup_progress_timeout_handler() nevertheless
got called every log_startup_progress_interval seconds. That's
an unnecessary expense, so avoid it.
Report by Thomas Munro. Patch by Bharath Rupireddy, reviewed by
Simon Riggs, Thomas Munro, and me. Back-patch to v15, where
the problem was introduced.
Discussion: https://www.postgresql.org/message-id/CA%2BhUKGKCHSffAj8zZJKJvNX7ygnQFxVD6wm1d-2j3fVw%2BMafPQ%40mail.gmail.com
7fcbf6a and 2ff6555 changed the function signature of XLogPageRead()
but did not update the comment.
XLogReaderRoutine contains up to date information about the API, so no
need to repeat all that at XLogPageRead(), but fix the mentions of the
no longer existing function arguments.
When ending recovery based on recovery_target_xid matching with
recovery_target_inclusive = off, we printed an incorrect timestamp
(always 2000-01-01) in the "recovery stopping before ... transaction"
log message. This is a consequence of sloppy refactoring in
c945af80c: the code to fetch recordXtime out of the commit/abort
record used to be executed unconditionally, but it was changed
to get called only in the RECOVERY_TARGET_TIME case. We need only
flip the order of operations to restore the intended behavior.
Per report from Torsten Förtsch. Back-patch to all supported
branches.
Discussion: https://postgr.es/m/CAKkG4_kUevPqbmyOfLajx7opAQk6Cvwkvx0HRcFjSPfRPTXanA@mail.gmail.com
Pay down some ancient technical debt (dating to commit 022fd9966):
fix a couple of places in datetime parsing that were throwing
ereport's immediately instead of returning a DTERR code that could be
interpreted by DateTimeParseError. The reason for that was that there
was no mechanism for passing any auxiliary data (such as a zone name)
to DateTimeParseError, and these errors seemed to really need it.
Up to now it didn't matter that much just where the error got thrown,
but now we'd like to have a hard policy that datetime parse errors
get thrown from just the one place.
Hence, invent a "DateTimeErrorExtra" struct that can be used to
carry any extra values needed for specific DTERR codes. Perhaps
in the future somebody will be motivated to use this to improve
the specificity of other DateTimeParseError messages, but for now
just deal with the timezone-error cases.
This is on the way to making the datetime input functions report
parse errors softly; but it's really an independent change, so
commit separately.
Discussion: https://postgr.es/m/3bbbb0df-7382-bf87-9737-340ba096e034@postgrespro.ru
The error messages reported during any failures while reading or
validating the header of a WAL currently includes only the offset of the
page but not the compiled LSN referring to the page, requiring an extra
step to compile it if looking at the surroundings with pg_waldump or
similar. Adding this information costs a bit in translation, but also
eases debugging.
Author: Bharath Rupireddy
Reviewed-by: Álvaro Herrera, Kyotaro Horiguchi, Maxim Orlov, Michael
Paquier
Discussion: https://postgr.es/m/CALj2ACWV=FCddsxcGbVOA=cvPyMr75YCFbSQT6g4KDj=gcJK4g@mail.gmail.com
Previously, we'd compress only when the active range of array entries
reached Max(4 * PROCARRAY_MAXPROCS, 2 * pArray->numKnownAssignedXids).
If max_connections is large, the first term could result in not
compressing for a long time, resulting in much wastage of cycles in
hot-standby backends scanning the array to take snapshots. Get rid
of that term, and just bound it to 2 * pArray->numKnownAssignedXids.
That however creates the opposite risk, that we might spend too much
effort compressing. Hence, consider compressing only once every 128
commit records. (This frequency was chosen by benchmarking. While
we only tried one benchmark scenario, the results seem stable over
a fairly wide range of frequencies.)
Also, force compression when processing RecoveryInfo WAL records
(which should be infrequent); the old code could perform compression
then, but would do so only after the same array-range check as for
the transaction-commit path.
Also, opportunistically run compression if the startup process is about
to wait for WAL, though not oftener than once a second. This should
prevent cases where we waste lots of time by leaving the array
not-compressed for long intervals due to low WAL traffic.
Lastly, add a simple check to keep us from uselessly compressing
when the array storage is already compact.
Back-patch, as the performance problem is worse in pre-v14 branches
than in HEAD.
Simon Riggs and Michail Nikolaev, with help from Tom Lane and
Andres Freund.
Discussion: https://postgr.es/m/CALdSSPgahNUD_=pB_j=1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw@mail.gmail.com
Commits cf112c12 and a0dc8271 were a little too hasty in getting rid of
the pg_ prefixes where we use pread(), pwrite() and vectored variants.
We dropped support for ancient Unixes where we needed to use lseek() to
implement replacements for those, but it turns out that Windows also
changes the current position even when you pass in an offset to
ReadFile() and WriteFile() if the file handle is synchronous, despite
its documentation saying otherwise.
Switching to asynchronous file handles would fix that, but have other
complications. For now let's just put back the pg_ prefix and add some
comments to highlight the non-standard side-effect, which we can now
describe as Windows-only.
Reported-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Discussion: https://postgr.es/m/20220923202439.GA1156054%40nathanxps13
There are still some alignment-related failures in the buildfarm,
which might or might not be able to be fixed quickly, but I've also
just realized that it increased the size of many WAL records by 4 bytes
because a block reference contains a RelFileLocator. The effect of that
hasn't been studied or discussed, so revert for now.
RelFileNumbers are now assigned using a separate counter, instead of
being assigned from the OID counter. This counter never wraps around:
if all 2^56 possible RelFileNumbers are used, an internal error
occurs. As the cluster is limited to 2^64 total bytes of WAL, this
limitation should not cause a problem in practice.
If the counter were 64 bits wide rather than 56 bits wide, we would
need to increase the width of the BufferTag, which might adversely
impact buffer lookup performance. Also, this lets us use bigint for
pg_class.relfilenode and other places where these values are exposed
at the SQL level without worrying about overflow.
This should remove the need to keep "tombstone" files around until
the next checkpoint when relations are removed. We do that to keep
RelFileNumbers from being recycled, but now that won't happen
anyway. However, this patch doesn't actually change anything in
this area; it just makes it possible for a future patch to do so.
Dilip Kumar, based on an idea from Andres Freund, who also reviewed
some earlier versions of the patch. Further review and some
wordsmithing by me. Also reviewed at various points by Ashutosh
Sharma, Vignesh C, Amul Sul, Álvaro Herrera, and Tom Lane.
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
After commit cc2c7d65fc added this flag,
failure to reset it caused assertion failures. In non-assert builds, it
made the system fail to achieve the objectives listed in that commit;
chiefly, we might emit a spurious log message. Back-patch to v15, where
that commit first appeared.
Bharath Rupireddy and Kyotaro Horiguchi. Reviewed by Dilip Kumar,
Nathan Bossart and Michael Paquier. Reported by Dilip Kumar.
Discussion: https://postgr.es/m/CAFiTN-sE3ry=ycMPVtC+Djw4Fd7gbUGVv_qqw6qfzp=JLvqT3g@mail.gmail.com
Referring to the WAL as just "log" invites confusion with the
postmaster log, so avoid doing that in docs and error messages.
Also shorten "WAL segment file" to just "WAL file" in various
places.
Bharath Rupireddy, reviewed by Nathan Bossart and Kyotaro Horiguchi
Discussion: https://postgr.es/m/CALj2ACUeXa8tDPaiTLexBDMZ7hgvaN+RTb957-cn5qwv9zf-MQ@mail.gmail.com
guc.c has grown to be one of our largest .c files, making it
a bottleneck for compilation. It's also acquired a bunch of
knowledge that'd be better kept elsewhere, because of our not
very good habit of putting variable-specific check hooks here.
Hence, split it up along these lines:
* guc.c itself retains just the core GUC housekeeping mechanisms.
* New file guc_funcs.c contains the SET/SHOW interfaces and some
SQL-accessible functions for GUC manipulation.
* New file guc_tables.c contains the data arrays that define the
built-in GUC variables, along with some already-exported constant
tables.
* GUC check/assign/show hook functions are moved to the variable's
home module, whenever that's clearly identifiable. A few hard-
to-classify hooks ended up in commands/variable.c, which was
already a home for miscellaneous GUC hook functions.
To avoid cluttering a lot more header files with #include "guc.h",
I also invented a new header file utils/guc_hooks.h and put all
the GUC hook functions' declarations there, regardless of their
originating module. That allowed removal of #include "guc.h"
from some existing headers. The fallout from that (hopefully
all caught here) demonstrates clearly why such inclusions are
best minimized: there are a lot of files that, for example,
were getting array.h at two or more levels of remove, despite
not having any connection at all to GUCs in themselves.
There is some very minor code beautification here, such as
renaming a couple of inconsistently-named hook functions
and improving some comments. But mostly this just moves
code from point A to point B and deals with the ensuing
needs for #include adjustments and exporting a few functions
that previously weren't exported.
Patch by me, per a suggestion from Andres Freund; thanks also
to Michael Paquier for the idea to invent guc_funcs.c.
Discussion: https://postgr.es/m/587607.1662836699@sss.pgh.pa.us
On failure in restoring a block image, no details were provided, while
it is possible to see failure with an inconsistent record state, a
failure in processing decompression or a failure in decompression
because a build does not support this option.
RestoreBlockImage() is used in two code paths in the backend code,
during recovery and when checking a page consistency after applying
masking, and both places are changed to consume the error message
produced by the internal routine when it returns a false status. All
the error messages are reported under ERRCODE_INTERNAL_ERROR, that gets
used also when attempting to access a page compressed by a method
not supported by the build attempting the decompression. This is
something that can happen in core when doing physical replication with
primary and standby using inconsistent build options, for example.
This routine is available since 2c03216d and it has never provided any
context about the error happening when it failed. This change is
justified even more after 57aa5b2, that introduced compression of FPWs
in WAL.
Reported-by: Justin Prysby
Author: Michael Paquier
Discussion: https://postgr.es/m/20220905002320.GD31833@telsasoft.com
Backpatch-through: 15
XLogPageRead() can retry internally after a pread() system call has
succeeded, in the case of short reads, and page validation failures
while in standby mode (see commit 0668719801). Due to an oversight in
commit 3f1ce973, these cases could leave stale data in the internal
cache of xlogreader.c without marking it invalid. The main defense
against stale cached data on failure to read a page was in the error
handling path of the calling function ReadPageInternal(), but that
wasn't quite enough for errors handled internally by XLogPageRead()'s
retry loop if we then exited with XLREAD_WOULDBLOCK.
1. ReadPageInternal() now marks the cache invalid before calling the
page_read callback, by setting state->readLen to 0. It'll be set to
a non-zero value only after a successful read. It'll stay valid as
long as the caller requests data in the cached range.
2. XLogPageRead() no long performs internal retries while reading
ahead. While such retries should work, the general philosophy is
that we should give up prefetching if anything unusual happens so we
can handle it when recovery catches up, to reduce the complexity of
the system. Let's do that here too.
3. While here, a new function XLogReaderResetError() improves the
separation between xlogrecovery.c and xlogreader.c, where the former
previously clobbered the latter's internal error buffer directly.
The new function makes this more explicit, and also clears a related
flag, without which a standby would needlessly retry in the outer
function.
Thanks to Noah Misch for tracking down the conditions required for a
rare build farm failure in src/bin/pg_ctl/t/003_promote.pl, and
providing a reproducer.
Back-patch to 15.
Reported-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
When a PostgreSQL instance performing archive recovery but not using
standby mode is promoted, and the last WAL segment that it attempted
to read ended in a partial record, the previous code would create
invalid WAL on the new timeline. The WAL from the previously timeline
would be copied to the new timeline up until the end of the last valid
record, but instead of beginning to write WAL at immediately
afterwards, the promoted server would write an overwrite contrecord at
the beginning of the next segment. The end of the previous segment
would be left as all-zeroes, resulting in failures if anything tried
to read WAL from that file.
The root of the issue is that ReadRecord() decides whether to set
abortedRecPtr and missingContrecPtr based on the value of StandbyMode,
but ReadRecord() switches to a new timeline based on the value of
ArchiveRecoveryRequested. We shouldn't try to write an overwrite
contrecord if we're switching to a new timeline, so change the test in
ReadRecod() to check ArchiveRecoveryRequested instead.
Code fix by Dilip Kumar. Comments by me incorporating suggested
language from Álvaro Herrera. Further review from Kyotaro Horiguchi
and Sami Imseih.
Discussion: http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com
Discussion: http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
pread() and pwrite() are in SUSv2, and all targeted Unix systems have
them.
Previously, we defined pg_pread and pg_pwrite to emulate these function
with lseek() on old Unixen. The names with a pg_ prefix were a reminder
of a portability hazard: they might change the current file position.
That hazard is gone, so we can drop the prefixes.
Since the remaining replacement code is Windows-only, move it into
src/port/win32p{read,write}.c, and move the declarations into
src/include/port/win32_port.h.
No need for vestigial HAVE_PREAD, HAVE_PWRITE macros as they were only
used for declarations in port.h which have now moved into win32_port.h.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Greg Stark <stark@mit.edu>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
Crash recovery on standby may encounter missing directories
when replaying database-creation WAL records. Prior to this
patch, the standby would fail to recover in such a case;
however, the directories could be legitimately missing.
Consider the following sequence of commands:
CREATE DATABASE
DROP DATABASE
DROP TABLESPACE
If, after replaying the last WAL record and removing the
tablespace directory, the standby crashes and has to replay the
create database record again, crash recovery must be able to continue.
A fix for this problem was already attempted in 49d9cfc68b, but it
was reverted because of design issues. This new version is based
on Robert Haas' proposal: any missing tablespaces are created
during recovery before reaching consistency. Tablespaces
are created as real directories, and should be deleted
by later replay. CheckRecoveryConsistency ensures
they have disappeared.
The problems detected by this new code are reported as PANIC,
except when allow_in_place_tablespaces is set to ON, in which
case they are WARNING. Apart from making tests possible, this
gives users an escape hatch in case things don't go as planned.
Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Author: Asim R Praveen <apraveen@pivotal.io>
Author: Paul Guo <paulguo@gmail.com>
Reviewed-by: Anastasia Lubennikova <lubennikovaav@gmail.com> (older versions)
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> (older versions)
Reviewed-by: Michaël Paquier <michael@paquier.xyz>
Diagnosed-by: Paul Guo <paulguo@gmail.com>
Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
This commit removes two arguments "report" and "whichChkpt"
in ReadCheckpointRecord().
"report" is obviously useless because it's always true, i.e., there are
two callers of the function and they always specify true as "report".
Commit 1d919de5eb removed the only call with "report" = false.
"whichChkpt" indicated where the specified checkpoint location
came from, pg_control or backup_label. This information was used
to report different error messages depending on where the invalid
checkpoint record came from, when it was found.
But ReadCheckpointRecord() doesn't need to do that because
its callers already do that and users can still identify where
the invalid checkpoint record came from, by reading such log messages.
Also when "whichChkpt" was 0, the word "primary checkpoint" was used
in the log message and could confuse users because the concept of
primary and secondary checkpoints was already removed before.
These are why this commit removes "whichChkpt" argument.
Author: Fujii Masao
Reviewed-by: Bharath Rupireddy, Kyotaro Horiguchi
Discussion: https://postgr.es/m/fa2e12eb-81c3-0717-0272-755f8a81c8f2@oss.nttdata.com
We have been using the term RelFileNode to refer to either (1) the
integer that is used to name the sequence of files for a certain relation
within the directory set aside for that tablespace/database combination;
or (2) that value plus the OIDs of the tablespace and database; or
occasionally (3) the whole series of files created for a relation
based on those values. Using the same name for more than one thing is
confusing.
Replace RelFileNode with RelFileNumber when we're talking about just the
single number, i.e. (1) from above, and with RelFileLocator when we're
talking about all the things that are needed to locate a relation's files
on disk, i.e. (2) from above. In the places where we refer to (3) as
a relfilenode, instead refer to "relation storage".
Since there is a ton of SQL code in the world that knows about
pg_class.relfilenode, don't change the name of that column, or of other
SQL-facing things that derive their name from it.
On the other hand, do adjust closely-related internal terminology. For
example, the structure member names dbNode and spcNode appear to be
derived from the fact that the structure itself was called RelFileNode,
so change those to dbOid and spcOid. Likewise, various variables with
names like rnode and relnode get renamed appropriately, according to
how they're being used in context.
Hopefully, this is clearer than before. It is also preparation for
future patches that intend to widen the relfilenumber fields from its
current width of 32 bits. Variables that store a relfilenumber are now
declared as type RelFileNumber rather than type Oid; right now, these
are the same, but that can now more easily be changed.
Dilip Kumar, per an idea from me. Reviewed also by Andres Freund.
I fixed some whitespace issues, changed a couple of words in a
comment, and made one other minor correction.
Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com
Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com
Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
pg_start_backup() and pg_stop_backup() have been respectively renamed to
pg_backup_start() and pg_backup_stop() as of 39969e2, but a few comments
did not get the call.
Reviewed-by: Kyotaro Horiguchi, David Steele
Discussion: https://postgr.es/m/YrqGlj1+4DF3dbZ/@paquier.xyz
All but a few existing callers assume without checking that this
function succeeds. While it probably will, that's a poor excuse for
not checking. Let's make it return void and instead throw an error
if it doesn't find the block reference. Callers that actually need
to handle the no-such-block case must now use the underlying function
XLogRecGetBlockTagExtended.
In addition to being a bit less error-prone, this should also serve
to suppress some Coverity complaints about XLogRecGetBlockRefInfo.
While at it, clean up some inconsistency about use of the
XLogRecHasBlockRef macro: make XLogRecGetBlockTagExtended use
that instead of open-coding the same condition, and avoid calling
XLogRecHasBlockRef twice in relevant code paths. (That is,
calling XLogRecHasBlockRef followed by XLogRecGetBlockTag is now
deprecated: use XLogRecGetBlockTagExtended instead.)
Patch HEAD only; this doesn't seem to have enough value to consider
a back-branch API break.
Discussion: https://postgr.es/m/425039.1649701221@sss.pgh.pa.us