postgres

mirror of https://github.com/postgres/postgres.git synced 2025-06-01 14:21:49 +03:00

Author	SHA1	Message	Date
Robert Haas	84e3712677	Create VXID locks "lazily" in the main lock table. Instead of entering them on transaction startup, we materialize them only when someone wants to wait, which will occur only during CREATE INDEX CONCURRENTLY. In Hot Standby mode, the startup process must also be able to probe for conflicting VXID locks, but the lock need never be fully materialized, because the startup process does not use the normal lock wait mechanism. Since most VXID locks never need to touch the lock manager partition locks, this can significantly reduce blocking contention on read-heavy workloads. Patch by me. Review by Jeff Davis.	2011-08-04 12:38:33 -04:00
Tom Lane	ac36e6f71f	Move CheckRecoveryConflictDeadlock() call to a safer place. This kluge was inserted in a spot apparently chosen at random: the lock manager's state is not yet fully set up for the wait, and in particular LockWaitCancel hasn't been armed by setting lockAwaited, so the ProcLock will not get cleaned up if the ereport is thrown. This seems to not cause any observable problem in trivial test cases, because LockReleaseAll will silently clean up the debris; but I was able to cause failures with tests involving subtransactions. Fixes breakage induced by commit c85c941470efc44494fd7a5f426ee85fc65c268c. Back-patch to all affected branches.	2011-08-02 15:16:29 -04:00
Robert Haas	85b436f7b1	Minor stylistic corrections.	2011-08-01 08:24:45 -04:00
Robert Haas	b4fbe392f8	Reduce sinval synchronization overhead. Testing shows that the overhead of acquiring and releasing SInvalReadLock and msgNumLock on high-core count boxes can waste a lot of CPU time and hurt performance. This patch adds a per-backend flag that allows us to skip all that locking in most cases. Further testing shows that this improves performance even when sinval traffic is very high. Patch by me. Review and testing by Noah Misch.	2011-07-29 16:46:13 -04:00
Robert Haas	4240e429d0	Try to acquire relation locks in RangeVarGetRelid. In the previous coding, we would look up a relation in RangeVarGetRelid, lock the resulting OID, and then AcceptInvalidationMessages(). While this was sufficient to ensure that we noticed any changes to the relation definition before building the relcache entry, it didn't handle the possibility that the name we looked up no longer referenced the same OID. This was particularly problematic in the case where a table had been dropped and recreated: we'd latch on to the entry for the old relation and fail later on. Now, we acquire the relation lock inside RangeVarGetRelid, and retry the name lookup if we notice that invalidation messages have been processed meanwhile. Many operations that would previously have failed with an error in the presence of concurrent DDL will now succeed. There is a good deal of work remaining to be done here: many callers of RangeVarGetRelid still pass NoLock for one reason or another. In addition, nothing in this patch guards against the possibility that the meaning of an unqualified name might change due to the creation of a relation in a schema earlier in the user's search path than the one where it was previously found. Furthermore, there's nothing at all here to guard against similar race conditions for non-relations. For all that, it's a start. Noah Misch and Robert Haas	2011-07-08 22:19:30 -04:00
Heikki Linnakangas	89fd72cbf2	Introduce a pipe between postmaster and each backend, which can be used to detect postmaster death. Postmaster keeps the write-end of the pipe open, so when it dies, children get EOF in the read-end. That can conveniently be waited for in select(), which allows eliminating some of the polling loops that check for postmaster death. This patch doesn't yet change all the loops to use the new mechanism, expect a follow-on patch to do that. This changes the interface to WaitLatch, so that it takes as argument a bitmask of events that it waits for. Possible events are latch set, timeout, postmaster death, and socket becoming readable or writeable. The pipe method behaves slightly differently from the kill() method previously used in PostmasterIsAlive() in the case that postmaster has died, but its parent has not yet read its exit code with waitpid(). The pipe returns EOF as soon as the process dies, but kill() continues to return true until waitpid() has been called (IOW while the process is a zombie). Because of that, change PostmasterIsAlive() to use the pipe too, otherwise WaitLatch() would return immediately with WL_POSTMASTER_DEATH, while PostmasterIsAlive() would claim it's still alive. That could easily lead to busy-waiting while postmaster is in zombie state. Peter Geoghegan with further changes by me, reviewed by Fujii Masao and Florian Pflug.	2011-07-08 18:44:07 +03:00
Peter Eisentraut	8a8fbe7e79	Capitalization fixes	2011-06-19 00:37:30 +03:00
Peter Eisentraut	5caa3479c2	Clean up most -Wunused-but-set-variable warnings from gcc 4.6 This warning is new in gcc 4.6 and part of -Wall. This patch cleans up most of the noise, but there are some still warnings that are trickier to remove.	2011-04-11 22:28:45 +03:00
Bruce Momjian	bf50caf105	pgindent run before PG 9.1 beta 1.	2011-04-10 11:42:00 -04:00
Simon Riggs	b98ac467f5	Prevent intermittent hang in recovery from bgwriter interaction. Startup process waited for cleanup lock but when hot_standby = off the pid was not registered, so that the bgwriter would not wake the waiting process as intended.	2011-03-23 13:30:05 +00:00
Robert Haas	6436098795	Minor sync rep corrections. Fujii Masao, with a bit of additional wordsmithing by me.	2011-03-10 14:57:02 -05:00
Heikki Linnakangas	93d888232e	Don't throw a warning if vacuum sees PD_ALL_VISIBLE flag set on a page that contains newly-inserted tuples that according to our OldestXmin are not yet visible to everyone. The value returned by GetOldestXmin() is conservative, and it can move backwards on repeated calls, so if we see that contradiction between the PD_ALL_VISIBLE flag and status of tuples on the page, we have to assume it's because an earlier vacuum calculated a higher OldestXmin value, and all the tuples really are visible to everyone. We have received several reports of this bug, with the "PD_ALL_VISIBLE flag was incorrectly set in relation ..." warning appearing in logs. We were finally able to hunt it down with David Gould's help to run extra diagnostics in an environment where this happened frequently. Also reword the warning, per Robert Haas' suggestion, to not imply that the PD_ALL_VISIBLE flag is necessarily at fault, as it might also be a symptom of corruption on a tuple header. Backpatch to 8.4, where the PD_ALL_VISIBLE flag was introduced.	2011-03-08 20:30:53 +02:00
Simon Riggs	a8a8a3e096	Efficient transaction-controlled synchronous replication. If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to takeover in case of standby failure. This version has very strict behaviour; more relaxed options may be added at a later date. Simon Riggs and Fujii Masao, with reviews by Yeb Havinga, Jaime Casanova, Heikki Linnakangas and Robert Haas, plus the assistance of many other design reviewers.	2011-03-06 22:49:16 +00:00
Simon Riggs	bca8b7f16a	Hot Standby feedback for avoidance of cleanup conflicts on standby. Standby optionally sends back information about oldestXmin of queries which is then checked and applied to the WALSender's proc->xmin. GetOldestXmin() is modified slightly to agree with GetSnapshotData(), so that all backends on primary include WALSender within their snapshots. Note this does nothing to change the snapshot xmin on either master or standby. Feedback piggybacks on the standby reply message. vacuum_defer_cleanup_age is no longer used on standby, though parameter still exists on primary, since some use cases still exist. Simon Riggs, review comments from Fujii Masao, Heikki Linnakangas, Robert Haas	2011-02-16 19:29:37 +00:00
Heikki Linnakangas	dafaa3efb7	Implement genuine serializable isolation level. Until now, our Serializable mode has in fact been what's called Snapshot Isolation, which allows some anomalies that could not occur in any serialized ordering of the transactions. This patch fixes that using a method called Serializable Snapshot Isolation, based on research papers by Michael J. Cahill (see README-SSI for full references). In Serializable Snapshot Isolation, transactions run like they do in Snapshot Isolation, but a predicate lock manager observes the reads and writes performed and aborts transactions if it detects that an anomaly might occur. This method produces some false positives, ie. it sometimes aborts transactions even though there is no anomaly. To track reads we implement predicate locking, see storage/lmgr/predicate.c. Whenever a tuple is read, a predicate lock is acquired on the tuple. Shared memory is finite, so when a transaction takes many tuple-level locks on a page, the locks are promoted to a single page-level lock, and further to a single relation level lock if necessary. To lock key values with no matching tuple, a sequential scan always takes a relation-level lock, and an index scan acquires a page-level lock that covers the search key, whether or not there are any matching keys at the moment. A predicate lock doesn't conflict with any regular locks or with another predicate locks in the normal sense. They're only used by the predicate lock manager to detect the danger of anomalies. Only serializable transactions participate in predicate locking, so there should be no extra overhead for for other transactions. Predicate locks can't be released at commit, but must be remembered until all the transactions that overlapped with it have completed. That means that we need to remember an unbounded amount of predicate locks, so we apply a lossy but conservative method of tracking locks for committed transactions. If we run short of shared memory, we overflow to a new "pg_serial" SLRU pool. We don't currently allow Serializable transactions in Hot Standby mode. That would be hard, because even read-only transactions can cause anomalies that wouldn't otherwise occur. Serializable isolation mode now means the new fully serializable level. Repeatable Read gives you the old Snapshot Isolation level that we have always had. Kevin Grittner and Dan Ports, reviewed by Jeff Davis, Heikki Linnakangas and Anssi Kääriäinen	2011-02-08 00:09:08 +02:00
Simon Riggs	8585ad3625	Fix error code for canceling statement due to conflict with recovery. All retryable conflict errors now have an error code that indicates that a retry is possible, correcting my incomplete fix of 2010/05/12 Tatsuo Ishii and Simon Riggs, input from Robert Haas and Florian Pflug	2011-01-31 19:20:23 +00:00
Heikki Linnakangas	b1dc45c11d	Fix thinko in comment. Spotted by Jim Nasby.	2011-01-18 10:46:13 +02:00
Heikki Linnakangas	8f5d65e916	Treat a WAL sender process that hasn't started streaming yet as a regular backend, as far as the postmaster shutdown logic is concerned. That means, fast shutdown will wait for WAL sender processes to exit before signaling bgwriter to finish. This avoids race conditions between a base backup stopping or starting, and bgwriter writing the shutdown checkpoint WAL record. We don't want e.g the end-of-backup WAL record to be written after the shutdown checkpoint.	2011-01-15 16:38:21 +02:00
Bruce Momjian	5d950e3b0c	Stamp copyrights for year 2011.	2011-01-01 13:18:15 -05:00
Robert Haas	24ecde7742	Work around unfortunate getppid() behavior on BSD-ish systems. On MacOS X, and apparently also on other BSD-derived systems, attaching a debugger causes getppid() to return the pid of the debugging process rather than the actual parent PID. As a result, debugging the autovacuum launcher, startup process, or WAL sender on such systems causes it to exit, because the previous coding of PostmasterIsAlive() detects postmaster death by testing whether getppid() == PostmasterPid. Work around that behavior by checking the return value of getppid() more carefully. If it's PostmasterPid, the postmaster must be alive; if it's 1, assume the postmaster is dead. If it's any other value, assume we've been debugged and fall through to the less-reliable kill() test. Review by Tom Lane.	2010-12-21 06:30:32 -05:00
Robert Haas	8bd4b89e24	Try to save a kernel call in ResolveRecoveryConflictWithVirtualXIDs. If there's no work to be done, just exit quickly, before initialization.	2010-12-17 11:32:02 -05:00
Robert Haas	611fed3712	Reset 'ps' display just once when resolving VXID conflicts. This prevents the word "waiting" from briefly disappearing from the ps status line when ResolveRecoveryConflictWithVirtualXIDs begins a new iteration of the outer loop. Along the way, remove some useless pgstat_report_waiting() calls; the startup process doesn't appear in pg_stat_activity. Fujii Masao	2010-12-17 08:30:57 -05:00
Tom Lane	04f4e10cfc	Use symbolic names not octal constants for file permission flags. Purely cosmetic patch to make our coding standards more consistent --- we were doing symbolic some places and octal other places. This patch fixes all C-coded uses of mkdir, chmod, and umask. There might be some other calls I missed. Inconsistency noted while researching tablespace directory permissions issue.	2010-12-10 17:35:33 -05:00
Simon Riggs	e620ee35b2	Optimize commit_siblings in two ways to improve group commit. First, avoid scanning the whole ProcArray once we know there are at least commit_siblings active; second, skip the check altogether if commit_siblings = 0. Greg Smith	2010-12-08 18:48:03 +00:00
Heikki Linnakangas	5a031a5556	Fix bugs in the hot standby known-assigned-xids tracking logic. If there's an old transaction running in the master, and a lot of transactions have started and finished since, and a WAL-record is written in the gap between the creating the running-xacts snapshot and WAL-logging it, recovery will fail with "too many KnownAssignedXids" error. This bug was reported by Joachim Wieland on Nov 19th. In the same scenario, when fewer transactions have started so that all the xids fit in KnownAssignedXids despite the first bug, a more serious bug arises. We incorrectly initialize the clog code with the oldest still running transaction, and when we see the WAL record belonging to a transaction with an XID larger than one that committed already before the checkpoint we're recovering from, we zero the clog page containing the already committed transaction, leading to data loss. In hindsight, trying to track xids in the known-assigned-xids array before seeing the running-xacts record was too complicated. To fix that, hold XidGenLock while the running-xacts snapshot is taken and WAL-logged. That ensures that no transaction can begin or end in that gap, so that in recvoery we know that the snapshot contains all transactions running at that point in WAL.	2010-12-07 09:23:30 +01:00
Simon Riggs	ed78384acd	Move call to GetTopTransactionId() earlier in LockAcquire(), removing an infrequently occurring race condition in Hot Standby. An xid must be assigned before a lock appears in shared memory, rather than immediately after, else GetRunningTransactionLocks() may see InvalidTransactionId, causing assertion failures during lock processing on standby. Bug report and diagnosis by Fujii Masao, fix by me.	2010-11-29 01:08:02 +00:00
Peter Eisentraut	fc946c39ae	Remove useless whitespace at end of lines	2010-11-23 22:34:55 +02:00
Simon Riggs	52010027ef	Avoid spurious Hot Standby conflicts from btree delete records. Similar conflicts were already avoided for related record types. Massive over-caution resulted in a usability bug. Clear theoretical basis for doing this is now confirmed by me. Request to remove from Heikki (twice), over-caution by me.	2010-11-15 09:30:13 +00:00
Magnus Hagander	9f2e211386	Remove cvs keywords from all files.	2010-09-20 22:08:53 +02:00
Heikki Linnakangas	236b6bc29e	Simplify Windows implementation of latches. There's no need to keep a dynamic pool of event handles, we can permanently assign one for each shared latch. Thanks to that, we no longer need a separate shared memory block for latches, and we don't need to know in advance how many shared latches there is, so you no longer need to remember to update NumSharedLatches when you introduce a new latch to the system.	2010-09-15 10:06:21 +00:00
Heikki Linnakangas	2746e5f21d	Introduce latches. A latch is a boolean variable, with the capability to wait until it is set. Latches can be used to reliably wait until a signal arrives, which is hard otherwise because signals don't interrupt select() on some platforms, and even when they do, there's race conditions. On Unix, latches use the so called self-pipe trick under the covers to implement the sleep until the latch is set, without race conditions. On Windows, Windows events are used. Use the new latch abstraction to sleep in walsender, so that as soon as a transaction finishes, walsender is woken up to immediately send the WAL to the standby. This reduces the latency between master and standby, which is good. Preliminary work by Fujii Masao. The latch implementation is by me, with helpful comments from many people.	2010-09-11 15:48:04 +00:00
Tom Lane	174a51332f	Cosmetic fixes for KnownAssignedXidsGetOldestXmin, per Fujii Masao.	2010-08-30 17:30:44 +00:00
Simon Riggs	e24d1dc069	Teach GetOldestXmin() about KnownAssignedXids during recovery. Very minor issue, though this is required for a later patch. Reported by Heikki Linnakangas.	2010-08-30 14:16:48 +00:00
Heikki Linnakangas	e1cc96dbf0	Fix typo in comment.	2010-08-30 06:33:22 +00:00
Tom Lane	b9defe0405	Marginal code cleanup for streaming replication. There is no reason that proc.c should have to get involved in this dirty hack for letting the postmaster know which children are walsenders. Revert that file to the way it was, and confine the kluge to pmsignal.c and postmaster.c.	2010-08-23 17:20:01 +00:00
Tom Lane	79dc97a401	Bring some sanity to the trace_recovery_messages code and docs. Per gripe from Fujii Masao, though this is not exactly his proposed patch. Categorize as DEVELOPER_OPTIONS and set context PGC_SIGHUP, as per Fujii, but set the default to LOG because higher values aren't really sensible (see the code for trace_recovery()). Fix the documentation to agree with the code and to try to explain what the variable actually does. Get rid of no-op calls trace_recovery(LOG), which accomplish nothing except to demonstrate that this option confuses even its author.	2010-08-19 22:55:01 +00:00
Robert Haas	30c22eb8fc	Correct sundry errors in Hot Standby-related comments. Fujii Masao	2010-08-12 23:24:54 +00:00
Bruce Momjian	239d769e7e	pgindent run for 9.0, second run	2010-07-06 19:19:02 +00:00
Tom Lane	aceedd88f6	Make vacuum_defer_cleanup_age be PGC_SIGHUP level, since it's not sensible to have different values in different processes of the primary server. Also put it into the "Streaming Replication" GUC category; it doesn't belong in "Standby Servers" because you use it on the master not the standby. In passing also correct guc.c's idea of wal_keep_segments' category.	2010-07-03 21:23:58 +00:00
Tom Lane	e76c1a0f4d	Replace max_standby_delay with two parameters, max_standby_archive_delay and max_standby_streaming_delay, and revise the implementation to avoid assuming that timestamps found in WAL records can meaningfully be compared to clock time on the standby server. Instead, the delay limits are compared to the elapsed time since we last obtained a new WAL segment from archive or since we were last "caught up" to WAL data arriving via streaming replication. This avoids problems with clock skew between primary and standby, as well as other corner cases that the original coding would misbehave in, such as the primary server having significant idle time between transactions. Per my complaint some time ago and considerable ensuing discussion. Do some desultory editing on the hot standby documentation, too.	2010-07-03 20:43:58 +00:00
Itagaki Takahiro	9e3cd37576	Remove max_standby_delay message from ps display of recovery process in waiting status. The parameter is not so interesting in ps display because it is referable in postgresql.conf.	2010-06-14 00:49:24 +00:00
Simon Riggs	f9dbac9476	HS Defer buffer pin deadlock check until deadlock_timeout has expired. During Hot Standby we need to check for buffer pin deadlocks when the Startup process begins to wait, in case it never wakes up again. We previously made the deadlock check immediately on the basis it was cheap, though clearer thinking and prima facie evidence shows that was too simple. Refactor existing code to make it easy to add in deferral of deadlock check until deadlock_timeout allowing a good reduction in deadlock checks since far few buffer pins are held for that duration. It's worth doing anyway, though major goal is to prevent further reports of context switching with high numbers of users on occasional tests.	2010-05-26 19:52:52 +00:00
Simon Riggs	fd34374b17	Add many new Asserts in code and fix simple bug that slipped through without them, related to previous commit. Report by Bruce Momjian.	2010-05-14 07:11:49 +00:00
Simon Riggs	8431e296ea	Cleanup initialization of Hot Standby. Clarify working with reanalysis of requirements and documentation on LogStandbySnapshot(). Fixes two minor bugs reported by Tom Lane that would lead to an incorrect snapshot after transaction wraparound. Also fix two other problems discovered that would give incorrect snapshots in certain cases. ProcArrayApplyRecoveryInfo() substantially rewritten. Some minor refactoring of xact_redo_apply() and ExpireTreeKnownAssignedTransactionIds().	2010-05-13 11:15:38 +00:00
Tom Lane	f9ed327f76	Clean up some awkward, inaccurate, and inefficient processing around MaxStandbyDelay. Use the GUC units mechanism for the value, and choose more appropriate timestamp functions for performing tests with it. Make the ps_activity manipulation in ResolveRecoveryConflictWithVirtualXIDs have behavior similar to ps_activity code elsewhere, notably not updating the display when update_process_title is off and not truncating the display contents at an arbitrarily-chosen length. Improve the docs to be explicit about what MaxStandbyDelay actually measures, viz the difference between primary and standby servers' clocks, and the possible hazards if their clocks aren't in sync.	2010-05-02 02:10:33 +00:00
Tom Lane	f0488bd57c	Rename the parameter recovery_connections to hot_standby, to reduce possible confusion with streaming-replication settings. Also, change its default value to "off", because of concern about executing new and poorly-tested code during ordinary non-replicating operation. Per discussion. In passing do some minor editing of related documentation.	2010-04-29 21:36:19 +00:00
Tom Lane	77acab75df	Modify ShmemInitStruct and ShmemInitHash to throw errors internally, rather than returning NULL for some-but-not-all failures as they used to. Remove now-redundant tests for NULL from call sites. We had to do something about this because many call sites were failing to check for NULL; and changing it like this seems a lot more useful and mistake-proof than adding checks to the call sites without them.	2010-04-28 16:54:16 +00:00
Heikki Linnakangas	9b8a73326e	Introduce wal_level GUC to explicitly control if information needed for archival or hot standby should be WAL-logged, instead of deducing that from other options like archive_mode. This replaces recovery_connections GUC in the primary, where it now has no effect, but it's still used in the standby to enable/disable hot standby. Remove the WAL-logging of "unlogged operations", like creating an index without WAL-logging and fsyncing it at the end. Instead, we keep a copy of the wal_mode setting and the settings that affect how much shared memory a hot standby server needs to track master transactions (max_connections, max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings change, at server restart, write a WAL record noting the new settings and update pg_control. This allows us to notice the change in those settings in the standby at the right moment, they used to be included in checkpoint records, but that meant that a changed value was not reflected in the standby until the first checkpoint after the change. Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to the sequence it used to follow, before hot standby and subsequent patches changed it to 0x9003.	2010-04-28 16:10:43 +00:00
Tom Lane	2871b4618a	Replace the KnownAssignedXids hash table with a sorted-array data structure, and be more tense about the locking requirements for it, to improve performance in Hot Standby mode. In passing fix a few bugs and improve a number of comments in the existing HS code. Simon Riggs, with some editorialization by Tom	2010-04-28 00:09:05 +00:00
Robert Haas	33980a0640	Fix various instances of "the the". Two of these were pointed out by Erik Rijkers; the rest I found.	2010-04-23 23:21:44 +00:00

... 6 7 8 9 10 ...

775 Commits