postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-07 19:06:32 +03:00

Author	SHA1	Message	Date
Alvaro Herrera	94232c909d	Fix typos tablesapce -> tablespace there -> their These were introduced in `72d422a52`, so no need to backpatch.	2015-06-08 15:37:42 -03:00
Fujii Masao	7abc685974	Refactor WAL segment copying code. * Remove unused argument "dstfname" and related code from XLogFileCopy(). * Previously XLogFileCopy() returned a pstrdup'd string so that InstallXLogFileSegment() used it later. Since the pstrdup'd string was never free'd, there could be a risk of memory leak. It was almost harmless because the startup process exited just after calling XLogFileCopy(), it existed. This commit changes XLogFileCopy() so that it directly calls InstallXLogFileSegment() and doesn't call pstrdup() at all. Which fixes that memory leak problem. * Extend InstallXLogFileSegment() so that the caller can specify the log level. Which allows us to emit an error when InstallXLogFileSegment() fails a disk file access like link() and rename(). Previously it was always logged with LOG level and additionally needed to be logged with ERROR when we wanted to treat it as an error. Michael Paquier	2015-06-09 03:03:24 +09:00
Andres Freund	d1b958218a	Allow HotStandbyActiveInReplay() to be called in single user mode. HotStandbyActiveInReplay, introduced in `061b079f`, only allowed WAL replay to happen in the startup process, missing the single user case. This buglet is fairly harmless as it only causes problems when single user mode in an assertion enabled build is used to replay a btree vacuum record. Backpatch to 9.2. `061b079f` was backpatched further, but the assertion was not.	2015-06-08 14:09:27 +02:00
Tom Lane	d8179b001a	Fix fsync-at-startup code to not treat errors as fatal. Commit `2ce439f337` introduced a rather serious regression, namely that if its scan of the data directory came across any un-fsync-able files, it would fail and thereby prevent database startup. Worse yet, symlinks to such files also caused the problem, which meant that crash restart was guaranteed to fail on certain common installations such as older Debian. After discussion, we agreed that (1) failure to start is worse than any consequence of not fsync'ing is likely to be, therefore treat all errors in this code as nonfatal; (2) we should not chase symlinks other than those that are expected to exist, namely pg_xlog/ and tablespace links under pg_tblspc/. The latter restriction avoids possibly fsync'ing a much larger part of the filesystem than intended, if the user has left random symlinks hanging about in the data directory. This commit takes care of that and also does some code beautification, mainly moving the relevant code into fd.c, which seems a much better place for it than xlog.c, and making sure that the conditional compilation for the pre_sync_fname pass has something to do with whether pg_flush_data works. I also relocated the call site in xlog.c down a few lines; it seems a bit silly to be doing this before ValidateXLOGDirectoryStructure(). The similar logic in initdb.c ought to be made to match this, but that change is noncritical and will be dealt with separately. Back-patch to all active branches, like the prior commit. Abhijit Menon-Sen and Tom Lane	2015-05-28 17:33:03 -04:00
Bruce Momjian	807b9e0dff	pgindent run for 9.5	2015-05-23 21:35:49 -04:00
Tom Lane	72809480d6	Fix incorrect snprintf() limit. Typo in commit `7cbee7c0a`. No practical effect since the buffer should never actually be overrun, but various compilers and static analyzers will whine about it. Petr Jelinek	2015-05-23 16:05:52 -04:00
Heikki Linnakangas	7cbee7c0a1	At promotion, don't leave behind a partial segment on the old timeline. With commit `de768844`, a copy of the partial segment was archived with the .partial suffix, but the original file was still left in pg_xlog, so it didn't actually solve the problems with archiving the partial segment that it was supposed to solve. With this patch, the partial segment is renamed rather than copied, so we only archive it with the .partial suffix. Also be more robust in detecting if the last segment is already being archived. Previously I used XLogArchiveIsBusy() for that, but that's not quite right. With archive_mode='always', there might be a .ready file for it, and we don't want to rename it to .partial in that case. The old segment is needed until we're fully committed to the new timeline, i.e. until we've written the end-of-recovery WAL record and updated the min recovery point and timeline in the control file. So move the renaming later in the startup sequence, after all that's been done.	2015-05-22 11:04:33 +03:00
Fujii Masao	85d0e661aa	Make recovery_target_action = pause work. Previously even if recovery_target_action was set to pause and the recovery target was reached, the recovery could never be paused. Because the setting of pause was always overridden with that of shutdown unexpectedly. This override is valid and intentional if hot_standby is not enabled because there is no way to resume the paused recovery in this case and the setting of pause is completely useless. But not if hot_standby is enabled. This patch changes the code so that the setting of pause is overridden with that of shutdown only when hot_standby is not enabled. Bug reported by Andres Freund	2015-05-21 13:56:17 +09:00
Heikki Linnakangas	4fc72cc7bb	Collection of typo fixes. Use "a" and "an" correctly, mostly in comments. Two error messages were also fixed (they were just elogs, so no translation work required). Two function comments in pg_proc.h were also fixed. Etsuro Fujita reported one of these, but I found a lot more with grep. Also fix a few other typos spotted while grepping for the a/an typos. For example, "consists out of ..." -> "consists of ...". Plus a "though"/ "through" mixup reported by Euler Taveira. Many of these typos were in old code, which would be nice to backpatch to make future backpatching easier. But much of the code was new, and I didn't feel like crafting separate patches for each branch. So no backpatching.	2015-05-20 16:56:22 +03:00
Simon Riggs	f6a54fefc2	Fix spelling in comment	2015-05-19 18:37:46 -04:00
Heikki Linnakangas	ffd37740ee	Add archive_mode='always' option. In 'always' mode, the standby independently archives all files it receives from the primary. Original patch by Fujii Masao, docs and review by me.	2015-05-15 18:55:24 +03:00
Andrew Dunstan	72d422a522	Map basebackup tablespaces using a tablespace_map file Windows can't reliably restore symbolic links from a tar format, so instead during backup start we create a tablespace_map file, which is used by the restoring postgres to create the correct links in pg_tblspc. The backup protocol also now has an option to request this file to be included in the backup stream, and this is used by pg_basebackup when operating in tar mode. This is done on all platforms, not just Windows. This means that pg_basebackup will not not work in tar mode against 9.4 and older servers, as this protocol option isn't implemented there. Amit Kapila, reviewed by Dilip Kumar, with a little editing from me.	2015-05-12 09:29:10 -04:00
Heikki Linnakangas	de7688442f	At promotion, archive last segment from old timeline with .partial suffix. Previously, we would archive the possible-incomplete WAL segment with its normal filename, but that causes trouble if the server owning that timeline is still running, and tries to archive the same segment later. It's not nice for the standby to trip up the master's archival like that. And it's pretty confusing, anyway, to have an incomplete segment in the archive that's indistinguishable from a normal, complete segment. To avoid such confusion, add a .partial suffix to the file. Or to be more precise, make a copy of the old segment under the .partial suffix, and archive that instead of the original file. pg_receivexlog also uses the .partial suffix for the same purpose, to tell apart incompletely streamed files from complete ones. There is no automatic mechanism to use the .partial files at recovery, so they will go unused, unless the administrator manually copies to them to the pg_xlog directory (and removes the .partial suffix). Recovery won't normally need the WAL - when recovering to the new timeline, it will find the same WAL on the first segment on the new timeline instead - but it nevertheless feels better to archive the file with the .partial suffix, for debugging purposes if nothing else.	2015-05-08 21:59:01 +03:00
Heikki Linnakangas	179cdd0981	Add macros to check if a filename is a WAL segment or other such file. We had many instances of the strlen + strspn combination to check for that. This makes the code a bit easier to read.	2015-05-08 21:58:57 +03:00
Robert Haas	2ce439f337	Recursively fsync() the data directory after a crash. Otherwise, if there's another crash, some writes from after the first crash might make it to disk while writes from before the crash fail to make it to disk. This could lead to data corruption. Back-patch to all supported versions. Abhijit Menon-Sen, reviewed by Andres Freund and slightly revised by me.	2015-05-04 14:13:53 -04:00
Robert Haas	924bcf4f16	Create an infrastructure for parallel computation in PostgreSQL. This does four basic things. First, it provides convenience routines to coordinate the startup and shutdown of parallel workers. Second, it synchronizes various pieces of state (e.g. GUCs, combo CID mappings, transaction snapshot) from the parallel group leader to the worker processes. Third, it prohibits various operations that would result in unsafe changes to that state while parallelism is active. Finally, it propagates events that would result in an ErrorResponse, NoticeResponse, or NotifyResponse message being sent to the client from the parallel workers back to the master, from which they can then be sent on to the client. Robert Haas, Amit Kapila, Noah Misch, Rushabh Lathia, Jeevan Chalke. Suggestions and review from Andres Freund, Heikki Linnakangas, Noah Misch, Simon Riggs, Euler Taveira, and Jim Nasby.	2015-04-30 15:02:14 -04:00
Andres Freund	5aa2350426	Introduce replication progress tracking infrastructure. When implementing a replication solution ontop of logical decoding, two related problems exist: * How to safely keep track of replication progress * How to change replication behavior, based on the origin of a row; e.g. to avoid loops in bi-directional replication setups The solution to these problems, as implemented here, consist out of three parts: 1) 'replication origins', which identify nodes in a replication setup. 2) 'replication progress tracking', which remembers, for each replication origin, how far replay has progressed in a efficient and crash safe manner. 3) The ability to filter out changes performed on the behest of a replication origin during logical decoding; this allows complex replication topologies. E.g. by filtering all replayed changes out. Most of this could also be implemented in "userspace", e.g. by inserting additional rows contain origin information, but that ends up being much less efficient and more complicated. We don't want to require various replication solutions to reimplement logic for this independently. The infrastructure is intended to be generic enough to be reusable. This infrastructure also replaces the 'nodeid' infrastructure of commit timestamps. It is intended to provide all the former capabilities, except that there's only 2^16 different origins; but now they integrate with logical decoding. Additionally more functionality is accessible via SQL. Since the commit timestamp infrastructure has also been introduced in 9.5 (commit `73c986add`) changing the API is not a problem. For now the number of origins for which the replication progress can be tracked simultaneously is determined by the max_replication_slots GUC. That GUC is not a perfect match to configure this, but there doesn't seem to be sufficient reason to introduce a separate new one. Bumps both catversion and wal page magic. Author: Andres Freund, with contributions from Petr Jelinek and Craig Ringer Reviewed-By: Heikki Linnakangas, Petr Jelinek, Robert Haas, Steve Singer Discussion: 20150216002155.GI15326@awork2.anarazel.de, 20140923182422.GA15776@alap3.anarazel.de, 20131114172632.GE7522@alap2.anarazel.de	2015-04-29 19:30:53 +02:00
Heikki Linnakangas	3d80a1e0e3	Fix logic to skip checkpoint if no records have been inserted. After the WAL format changes, the calculation of the size of a checkpoint record became incorrect. Instead of trying to fix the math, check that the previous record, i.e. the xl_prev value that we'd write for the next record, matches the last checkpoint's redo pointer. That way it's not dependent on the size of the checkpoint record at all. The old logic was actually slightly wrong all along: if the previous checkpoint record crossed a page boundary, the page headers threw off the record size calculation, and the checkpoint was not skipped. The new checkpoint would not cross a page boundary, so this only resulted in at most one extra checkpoint after the system became idle. The new logic fixes that. (It's not worth fixing in backbranches). However, it makes some sense to try to keep the latest checkpoint contained fully in a page, or at least in a single WAL segment, just on general robustness grounds. If something goes awfully wrong, it's more likely that you can recover the latest WAL segment, than the last two WAL segments. So I added an extra check that the checkpoint is not skipped if the previous checkpoint crossed a WAL segment. Reported by Jeff Janes.	2015-04-15 17:21:04 +03:00
Heikki Linnakangas	4f700bcd20	Reorganize our CRC source files again. Now that we use CRC-32C in WAL and the control file, the "traditional" and "legacy" CRC-32 variants are not used in any frontend programs anymore. Move the code for those back from src/common to src/backend/utils/hash. Also move the slicing-by-8 implementation (back) to src/port. This is in preparation for next patch that will add another implementation that uses Intel SSE 4.2 instructions to calculate CRC-32C, where available.	2015-04-14 17:03:42 +03:00
Heikki Linnakangas	b2a5545bd6	Don't archive bogus recycled or preallocated files after timeline switch. After a timeline switch, we would leave behind recycled WAL segments that are in the future, but on the old timeline. After promotion, and after they become old enough to be recycled again, we would notice that they don't have a .ready or .done file, create a .ready file for them, and archive them. That's bogus, because the files contain garbage, recycled from an older timeline (or prealloced as zeros). We shouldn't archive such files. This could happen when we're following a timeline switch during replay, or when we switch to new timeline at end-of-recovery. To fix, whenever we switch to a new timeline, scan the data directory for WAL segments on the old timeline, but with a higher segment number, and remove them. Those don't belong to our timeline history, and are most likely bogus recycled or preallocated files. They could also be valid files that we streamed from the primary ahead of time, but in any case, they're not needed to recover to the new timeline.	2015-04-13 16:53:49 +03:00
Fujii Masao	6e4bf4ecd3	Fix error handling of XLogReaderAllocate in case of OOM Similarly to previous fix `9b8d478`, commit `2c03216` has switched XLogReaderAllocate() to use a set of palloc calls instead of malloc, causing any callers of this function to fail with an error instead of receiving a NULL pointer in case of out-of-memory error. Fix this by using palloc_extended with MCXT_ALLOC_NO_OOM that will safely return NULL in case of an OOM. Michael Paquier, slightly modified by me.	2015-04-03 21:55:37 +09:00
Andres Freund	62e2a8dc2c	Define integer limits independently from the system definitions. In `83ff1618` we defined integer limits iff they're not provided by the system. That turns out not to be the greatest idea because there's different ways some datatypes can be represented. E.g. on OSX PG's 64bit datatype will be a 'long int', but OSX unconditionally uses 'long long'. That disparity then can lead to warnings, e.g. around printf formats. One way to fix that would be to back int64 using stdint.h's int64_t. While a good idea it's not that easy to implement. We would e.g. need to include stdint.h in our external headers, which we don't today. Also computing the correct int64 printf formats in that case is nontrivial. Instead simply prefix the integer limits with PG_ and define them unconditionally. I've adjusted all the references to them in code, but not the ones in comments; the latter seems unnecessary to me. Discussion: 20150331141423.GK4878@alap3.anarazel.de	2015-04-02 17:43:35 +02:00
Andres Freund	83ff1618bc	Centralize definition of integer limits. Several submitted and even committed patches have run into the problem that C89, our baseline, does not provide minimum/maximum values for various integer datatypes. C99's stdint.h does, but we can't rely on it. Several parts of the code defined limits locally, so instead centralize the definitions to c.h. This patch also changes the more obvious usages of literal limit values; there's more places that could be changed, but it's less clear whether it's beneficial to change those. Author: Andrew Gierth Discussion: 87619tc5wc.fsf@news-spur.riddles.org.uk	2015-03-25 22:39:42 +01:00
Andres Freund	87cec51d3a	Don't delay replication for less than recovery_min_apply_delay's resolution. Recovery delays are implemented by waiting on a latch, and latches take milliseconds as a parameter. The required amount of waiting was computed using microsecond resolution though and the wait loop's abort condition was checking the delay in microseconds as well. This could lead to short spurts of busy looping when the overall wait time was below a millisecond, but above 0 microseconds. Instead just formulate the wait loop's abort condition in millisecond granularity as well. Given that that's recovery_min_apply_delay resolution, it seems harmless to not wait for less than a millisecond. Backpatch to 9.4 where recovery_min_apply_delay was introduced. Discussion: 20150323141819.GH26995@alap3.anarazel.de	2015-03-23 16:51:11 +01:00
Andres Freund	a1105c3dd4	Fix copy & paste error in `4f1b890b13`. Due to the bug delayed standbys would not delay when applying prepared transactions. Discussion: CAB7nPqT6BO1cCn+sAyDByBxA4EKZNAiPi2mFJ=ANeZmnmewRyg@mail.gmail.com Michael Paquier via Coverity.	2015-03-23 15:53:40 +01:00
Andres Freund	4f1b890b13	Merge the various forms of transaction commit & abort records. Since `465883b0a` two versions of commit records have existed. A compact version that was used when no cache invalidations, smgr unlinks and similar were needed, and a full version that could deal with all that. Additionally the full version was embedded into twophase commit records. That resulted in a measurable reduction in the size of the logged WAL in some workloads. But more recently additions like logical decoding, which e.g. needs information about the database something was executed on, made it applicable in fewer situations. The static split generally made it hard to expand the commit record, because concerns over the size made it hard to add anything to the compact version. Additionally it's not particularly pretty to have twophase.c insert RM_XACT records. Rejigger things so that the commit and abort records only have one form each, including the twophase equivalents. The presence of the various optional (in the sense of not being in every record) pieces is indicated by a bits in the 'xinfo' flag. That flag previously was not included in compact commit records. To prevent an increase in size due to its presence, it's only included if necessary; signalled by a bit in the xl_info bits available for xact.c, similar to heapam.c's XLOG_HEAP_OPMASK/XLOG_HEAP_INIT_PAGE. Twophase commit/aborts are now the same as their normal counterparts. The original transaction's xid is included in an optional data field. This means that commit records generally are smaller, except in the case of a transaction with subtransactions, but no other special cases; the increase there is four bytes, which seems acceptable given that the more common case of not having subtransactions shrank. The savings are especially measurable for twophase commits, which previously always used the full version; but will in practice only infrequently have required that. The motivation for this work are not the space savings and and deduplication though; it's that it makes it easier to extend commit records with additional information. That's just a few lines of code now; without impacting the common case where that information is not needed. Discussion: 20150220152150.GD4149@awork2.anarazel.de, 235610.92468.qm%40web29004.mail.ird.yahoo.com Reviewed-By: Heikki Linnakangas, Simon Riggs	2015-03-15 17:37:07 +01:00
Andres Freund	a0f5954af1	Increase max_wal_size's default from 128MB to 1GB. The introduction of min_wal_size & max_wal_size in `88e9823026` makes it feasible to increase the default upper bound in checkpoint size. Previously raising the default would lead to a increased disk footprint, even if more segments weren't beneficial. The low default of checkpoint size is one of common performance problem users have thus increasing the default makes sense. Setups where the increase in maximum disk usage is a problem will very likely have to run with a modified configuration anyway. Discussion: 54F4EFB8.40202@agliodbs.com, CA+TgmoZEAgX5oMGJOHVj8L7XOkAe05Gnf45rP40m-K3FhZRVKg@mail.gmail.com Author: Josh Berkus, after a discussion involving lots of people.	2015-03-15 17:37:07 +01:00
Andres Freund	51c11a7025	Remove pause_at_recovery_target recovery.conf setting. The new recovery_target_action (introduced in aedccb1f6/b8e33a85d4) replaces it's functionality. Having both seems likely to cause more confusion than it saves worry due to the incompatibility. Discussion: 5484FC53.2060903@2ndquadrant.com Author: Petr Jelinek	2015-03-15 17:37:07 +01:00
Fujii Masao	57aa5b2bb1	Add GUC to enable compression of full page images stored in WAL. When newly-added GUC parameter, wal_compression, is on, the PostgreSQL server compresses a full page image written to WAL when full_page_writes is on or during a base backup. A compressed page image will be decompressed during WAL replay. Turning this parameter on can reduce the WAL volume without increasing the risk of unrecoverable data corruption, but at the cost of some extra CPU spent on the compression during WAL logging and on the decompression during WAL replay. This commit changes the WAL format (so bumping WAL version number) so that the one-byte flag indicating whether a full page image is compressed or not is included in its header information. This means that the commit increases the WAL volume one-byte per a full page image even if WAL compression is not used at all. We can save that one-byte by borrowing one-bit from the existing field like hole_offset in the header and using it as the flag, for example. But which would reduce the code readability and the extensibility of the feature. Per discussion, it's not worth paying those prices to save only one-byte, so we decided to add the one-byte flag to the header. This commit doesn't introduce any new compression algorithm like lz4. Currently a full page image is compressed using the existing PGLZ algorithm. Per discussion, we decided to use it at least in the first version of the feature because there were no performance reports showing that its compression ratio is unacceptably lower than that of other algorithm. Of course, in the future, it's worth considering the support of other compression algorithm for the better compression. Rahila Syed and Michael Paquier, reviewed in various versions by myself, Andres Freund, Robert Haas, Abhijit Menon-Sen and many others.	2015-03-11 15:52:24 +09:00
Alvaro Herrera	4f3924d9cd	Keep CommitTs module in sync in standby and master We allow this module to be turned off on restarts, so a restart time check is enough to activate or deactivate the module; however, if there is a standby replaying WAL emitted from a master which is restarted, but the standby isn't, the state in the standby becomes inconsistent and can easily be crashed. Fix by activating and deactivating the module during WAL replay on parameter change as well as on system start. Problem reported by Fujii Masao in http://www.postgresql.org/message-id/CAHGQGwFhJ3CnHo1CELEfay18yg_RA-XZT-7D8NuWUoYSZ90r4Q@mail.gmail.com Author: Petr Jelínek	2015-03-09 17:44:00 -03:00
Heikki Linnakangas	88e9823026	Replace checkpoint_segments with min_wal_size and max_wal_size. Instead of having a single knob (checkpoint_segments) that both triggers checkpoints, and determines how many checkpoints to recycle, they are now separate concerns. There is still an internal variable called CheckpointSegments, which triggers checkpoints. But it no longer determines how many segments to recycle at a checkpoint. That is now auto-tuned by keeping a moving average of the distance between checkpoints (in bytes), and trying to keep that many segments in reserve. The advantage of this is that you can set max_wal_size very high, but the system won't actually consume that much space if there isn't any need for it. The min_wal_size sets a floor for that; you can effectively disable the auto-tuning behavior by setting min_wal_size equal to max_wal_size. The max_wal_size setting is now the actual target size of WAL at which a new checkpoint is triggered, instead of the distance between checkpoints. Previously, you could calculate the actual WAL usage with the formula "(2 + checkpoint_completion_target) * checkpoint_segments + 1". With this patch, you set the desired WAL usage with max_wal_size, and the system calculates the appropriate CheckpointSegments with the reverse of that formula. That's a lot more intuitive for administrators to set. Reviewed by Amit Kapila and Venkata Balaji N.	2015-02-23 18:53:02 +02:00
Fujii Masao	5d2b45e3f7	Add GUC to control the time to wait before retrieving WAL after failed attempt. Previously when the standby server failed to retrieve WAL files from any sources (i.e., streaming replication, local pg_xlog directory or WAL archive), it always waited for five seconds (hard-coded) before the next attempt. For example, this is problematic in warm-standby because restore_command can fail every five seconds even while new WAL file is expected to be unavailable for a long time and flood the log files with its error messages. This commit adds new parameter, wal_retrieve_retry_interval, to control that wait time. Alexey Vasiliev and Michael Paquier, reviewed by Andres Freund and me.	2015-02-23 20:55:17 +09:00
Heikki Linnakangas	49b04188f8	Fix thinko in re-setting wal_log_hints flag from a parameter-change record. The flag is supposed to be copied from the record. Same issue with track_commit_timestamps, but that's master-only. Report and fix by Petr Jalinek. Backpatch to 9.4, where wal_log_hints was added.	2015-01-15 20:52:41 +02:00
Heikki Linnakangas	1e78d81e88	Don't open a WAL segment for writing at end of recovery. Since commit `ba94518a`, we used XLogFileOpen to open the next segment for writing, but if the end-of-recovery happens exactly at a segment boundary, the new segment might not exist yet. (Before `ba94518a`, XLogFileOpen was correct, because we would open the previous segment if the switch happened at the boundary.) Instead of trying to create it if necessary, it's simpler to not bother opening the segment at all. XLogWrite() will open or create it soon anyway, after writing the checkpoint or end-of-recovery record. Reported by Andres Freund.	2015-01-07 16:20:20 +02:00
Bruce Momjian	4baaf863ec	Update copyright for 2015 Backpatch certain files through 9.0	2015-01-06 11:43:47 -05:00
Tom Lane	d6657d2a10	Treat negative values of recovery_min_apply_delay as having no effect. At one point in the development of this feature, it was claimed that allowing negative values would be useful to compensate for timezone differences between master and slave servers. That was based on a mistaken assumption that commit timestamps are recorded in local time; but of course they're in UTC. Nor is a negative apply delay likely to be a sane way of coping with server clock skew. However, the committed patch still treated negative delays as doing something, and the timezone misapprehension survived in the user documentation as well. If recovery_min_apply_delay were a proper GUC we'd just set the minimum allowed value to be zero; but for the moment it seems better to treat negative settings as if they were zero. In passing do some extra wordsmithing on the parameter's documentation, including correcting a second misstatement that the parameter affects processing of Restore Point records. Issue noted by Michael Paquier, who also provided the code patch; doc changes by me. Back-patch to 9.4 where the feature was introduced.	2015-01-03 13:14:03 -05:00
Heikki Linnakangas	2ef6c66a2b	Fix file descriptor leak at end of recovery. XLogFileInit() returns a file descriptor, which needs to be closed. The leak was short-lived, since the startup process exits shortly afterwards, but it was clearly a bug, nevertheless. Per Coverity report.	2014-12-21 21:51:59 +02:00
Heikki Linnakangas	5c805d0a81	Fix timestamp in end-of-recovery WAL records. We used time(null) to set a TimestampTz field, which gave bogus results. Noticed while looking at pg_xlogdump output. Backpatch to 9.3 and above, where the fast promotion was introduced.	2014-12-19 17:04:20 +02:00
Heikki Linnakangas	ba94518aad	Change how first WAL segment on new timeline after promotion is created. Two changes: 1. When copying a WAL segment from old timeline to create the first segment on the new timeline, only copy up to the point where the timeline switch happens, and zero-fill the rest. This avoids corner cases where we might think that the copied WAL from the previous timeline belong to the new timeline. 2. If the timeline switch happens at a segment boundary, don't copy the whole old segment to the new timeline. It's pointless, because it's 100% identical to the old segment.	2014-12-18 20:23:03 +02:00
Andres Freund	c303e9e7e5	Fix (re-)starting from a basebackup taken off a standby after a failure. When starting up from a basebackup taken off a standby extra logic has to be applied to compute the point where the data directory is consistent. Normal base backups use a WAL record for that purpose, but that isn't possible on a standby. That logic had a error check ensuring that the cluster's control file indicates being in recovery. Unfortunately that check was too strict, disregarding the fact that the control file could also indicate that the cluster was shut down while in recovery. That's possible when the a cluster starting from a basebackup is shut down before the backup label has been removed. When everything goes well that's a short window, but when either restore_command or primary_conninfo isn't configured correctly the window can get much wider. That's because inbetween reading and unlinking the label we restore the last checkpoint from WAL which can need additional WAL. To fix simply also allow starting when the control file indicates "shutdown in recovery". There's nicer fixes imaginable, but they'd be more invasive. Backpatch to 9.2 where support for taking basebackups from standbys was added.	2014-12-18 08:47:27 +01:00
Simon Riggs	b8e33a85d4	Tweaks for recovery_target_action Rename parameter action_at_recovery_target to recovery_target_action suggested by Christoph Berg. Place into recovery.conf suggested by Fujii Masao, replacing (deprecating) earlier parameters, per Michael Paquier.	2014-12-07 21:55:29 +09:00
Alvaro Herrera	73c986adde	Keep track of transaction commit timestamps Transactions can now set their commit timestamp directly as they commit, or an external transaction commit timestamp can be fed from an outside system using the new function TransactionTreeSetCommitTsData(). This data is crash-safe, and truncated at Xid freeze point, same as pg_clog. This module is disabled by default because it causes a performance hit, but can be enabled in postgresql.conf requiring only a server restart. A new test in src/test/modules is included. Catalog version bumped due to the new subdirectory within PGDATA and a couple of new SQL functions. Authors: Álvaro Herrera and Petr Jelínek Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven Singer, Peter Eisentraut	2014-12-03 11:53:02 -03:00
Heikki Linnakangas	afeacd2748	Fix assertion failure at end of PITR. InitXLogInsert() cannot be called in a critical section, because it allocates memory. But CreateCheckPoint() did that, when called for the end-of-recovery checkpoint by the startup process. In the passing, fix the scratch space allocation in InitXLogInsert to go to the right memory context. Also update the comment at InitXLOGAccess, which hasn't been totally accurate since hot standby was introduced (in a hot standby backend, InitXLOGAccess isn't called at backend startup). Reported by Michael Paquier	2014-11-28 09:31:53 +02:00
Simon Riggs	aedccb1f6f	action_at_recovery_target recovery config option action_at_recovery_target = pause \| promote \| shutdown Petr Jelinek Reviewed by Muhammad Asif Naeem, Fujji Masao and Simon Riggs	2014-11-25 20:13:30 +00:00
Heikki Linnakangas	0bd624d63b	Distinguish XLOG_FPI records generated for hint-bit updates. Add a new XLOG_FPI_FOR_HINT record type, and use that for full-page images generated for hint bit updates, when checksums are enabled. The new record type is replayed exactly the same as XLOG_FPI, but allows them to be tallied separately e.g. in pg_xlogdump.	2014-11-24 11:09:08 +02:00
Heikki Linnakangas	2c03216d83	Revamp the WAL record format. Each WAL record now carries information about the modified relation and block(s) in a standardized format. That makes it easier to write tools that need that information, like pg_rewind, prefetching the blocks to speed up recovery, etc. There's a whole new API for building WAL records, replacing the XLogRecData chains used previously. The new API consists of XLogRegister* functions, which are called for each buffer and chunk of data that is added to the record. The new API also gives more control over when a full-page image is written, by passing flags to the XLogRegisterBuffer function. This also simplifies the XLogReadBufferForRedo() calls. The function can dig the relation and block number from the WAL record, so they no longer need to be passed as arguments. For the convenience of redo routines, XLogReader now disects each WAL record after reading it, copying the main data part and the per-block data into MAXALIGNed buffers. The data chunks are not aligned within the WAL record, but the redo routines can assume that the pointers returned by XLogRecGet* functions are. Redo routines are now passed the XLogReaderState, which contains the record in the already-disected format, instead of the plain XLogRecord. The new record format also makes the fixed size XLogRecord header smaller, by removing the xl_len field. The length of the "main data" portion is now stored at the end of the WAL record, and there's a separate header after XLogRecord for it. The alignment padding at the end of XLogRecord is also removed. This compansates for the fact that the new format would otherwise be more bulky than the old format. Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera, Fujii Masao.	2014-11-20 18:46:41 +02:00
Andres Freund	d3586fc8aa	Ensure unlogged tables are reset even if crash recovery errors out. Unlogged relations are reset at the end of crash recovery as they're only synced to disk during a proper shutdown. Unfortunately that and later steps can fail, e.g. due to running out of space. This reset was, up to now performed after marking the database as having finished crash recovery successfully. As out of space errors trigger a crash restart that could lead to the situation that not all unlogged relations are reset. Once that happend usage of unlogged relations could yield errors like "could not open file "...": No such file or directory". Luckily clusters that show the problem can be fixed by performing a immediate shutdown, and starting the database again. To fix, just call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier, before marking the database as having successfully recovered. Discussion: 20140912112246.GA4984@alap3.anarazel.de Backpatch to 9.1 where unlogged tables were introduced. Abhijit Menon-Sen and Andres Freund	2014-11-15 01:19:26 +01:00
Heikki Linnakangas	7250d8535b	Fix building with WAL_DEBUG. Now that the backup blocks are appended to the WAL record in xloginsert.c, XLogInsert doesn't see them anymore and cannot remove them from the version reconstructed for xlog_outdesc. This makes running with wal_debug=on more expensive, as we now make (unnecessary) temporary copies of the backup blocks, but it doesn't seem worth convoluting the code to keep that optimization. Reported by Alvaro Herrera.	2014-11-07 23:09:31 +02:00
Heikki Linnakangas	2076db2aea	Move the backup-block logic from XLogInsert to a new file, xloginsert.c. xlog.c is huge, this makes it a little bit smaller, which is nice. Functions related to putting together the WAL record are in xloginsert.c, and the lower level stuff for managing WAL buffers and such are in xlog.c. Also move the definition of XLogRecord to a separate header file. This causes churn in the #includes of all the files that write WAL records, and redo routines, but it avoids pulling in xlog.h into most places. Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.	2014-11-06 13:55:36 +02:00
Heikki Linnakangas	5028f22f6e	Switch to CRC-32C in WAL and other places. The old algorithm was found to not be the usual CRC-32 algorithm, used by Ethernet et al. We were using a non-reflected lookup table with code meant for a reflected lookup table. That's a strange combination that AFAICS does not correspond to any bit-wise CRC calculation, which makes it difficult to reason about its properties. Although it has worked well in practice, seems safer to use a well-known algorithm. Since we're changing the algorithm anyway, we might as well choose a different polynomial. The Castagnoli polynomial has better error-correcting properties than the traditional CRC-32 polynomial, even if we had implemented it correctly. Another reason for picking that is that some new CPUs have hardware support for calculating CRC-32C, but not CRC-32, let alone our strange variant of it. This patch doesn't add any support for such hardware, but a future patch could now do that. The old algorithm is kept around for tsquery and pg_trgm, which use the values in indexes that need to remain compatible so that pg_upgrade works. While we're at it, share the old lookup table for CRC-32 calculation between hstore, ltree and core. They all use the same table, so might as well.	2014-11-04 11:39:48 +02:00

1 2 3 4 5 ...

783 Commits