mirror of https://github.com/postgres/postgres.git synced 2025-11-13 16:22:44 +03:00
Commit Graph

1601 Commits

Author SHA1 Message Date
Tom Lane
bed756a820 Suppress some unused-variable complaints in new LOCK_DEBUG code.
Jeff Janes
2015-03-26 12:00:30 -04:00
Kevin Grittner
2ed5b87f96 Reduce pinning and buffer content locking for btree scans.
Even though the main benefit of the Lehman and Yao algorithm for
btrees is that no locks need be held between page reads in an
index search, we were holding a buffer pin on each leaf page after
it was read until we were ready to read the next one.  The reason
was so that we could treat this as a weak lock to create an
"interlock" with vacuum's deletion of heap line pointers, even
though our README file pointed out that this was not necessary for
a scan using an MVCC snapshot.

The main goal of this patch is to reduce the blocking of vacuum
processes by in-progress btree index scans (including a cursor
which is idle), but the code rearrangement also allows for one
less buffer content lock to be taken when a forward scan steps from
one page to the next, which results in a small but consistent
performance improvement in many workloads.

This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits.  These unchanged cases are a scan using a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged.  If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page.

Reviewed by Heikki Linnakangas and Kyotaro Horiguchi
2015-03-25 14:24:43 -05:00
Robert Haas
372b97097e Remove ill-advised pre-check for DSM segment exhaustion.
dsm_control->nitems never decreases, so this is testing whether the
server has *ever* run out of DSM segments, not whether it is
*currently* out of DSM segments.

Reported off-list by Amit Kapila.
2015-03-23 09:58:56 -04:00
Bruce Momjian
34afbba84e Use mmap MAP_NOSYNC option to limit shared memory writes
mmap() is rarely used for shared memory, but when it is, this option is
useful, particularly on the BSDs.

Patch by Sean Chittenden
2015-03-21 22:06:19 -04:00
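A minimal sketch of the idea behind the commit above, assuming an anonymous shared mapping and a hypothetical helper name (the real code maps PostgreSQL's shared memory, and the MAP_ANONYMOUS spelling varies across platforms):

    /* Hypothetical helper: fold MAP_NOSYNC into the mmap flags where it exists. */
    #include <sys/mman.h>
    #include <stddef.h>

    static void *
    map_shared_region(size_t size)
    {
        int flags = MAP_SHARED | MAP_ANONYMOUS;

    #ifdef MAP_NOSYNC
        flags |= MAP_NOSYNC;    /* don't eagerly flush dirty pages to backing store */
    #endif

        return mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    }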
Peter Eisentraut
28beb69f8b Fix whitespace 2015-03-19 22:18:46 -04:00
Robert Haas
12968cf408 Add flags argument to dsm_create.
Right now, there's only one flag, DSM_CREATE_NULL_IF_MAXSEGMENTS,
which suppresses the error that would normally be thrown when the
maximum number of segments has already been created, returning NULL instead.
It might be useful to add more flags in the future, such as one to
ignore allocation errors, but I haven't done that here.
2015-03-19 13:03:03 -04:00
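A minimal sketch of a caller using the new flag from the commit above (hypothetical function name and fallback behavior):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Hypothetical caller: tolerate running out of DSM segments. */
    static dsm_segment *
    try_create_segment(Size size)
    {
        dsm_segment *seg;

        /* With DSM_CREATE_NULL_IF_MAXSEGMENTS, exhaustion yields NULL instead of ERROR. */
        seg = dsm_create(size, DSM_CREATE_NULL_IF_MAXSEGMENTS);
        if (seg == NULL)
            elog(LOG, "no free dynamic shared memory segments, using a fallback path");
        return seg;
    }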
Andres Freund
bc208a5a2f Guard against spurious signals in LockBufferForCleanup.
When LockBufferForCleanup() has to wait to get a cleanup lock on a
buffer, it does so by setting a flag in the buffer header and then
waiting for other backends to signal it using ProcWaitForSignal().
Unfortunately LockBufferForCleanup() missed that ProcWaitForSignal() can
return for reasons other than the signal it is hoping for. If such a
spurious signal arrives the wait flags on the buffer header will still
be set. That then triggers "ERROR: multiple backends attempting to wait
for pincount 1".

The fix is simple: unset the flag if it is still set when retrying. That
implies an additional spinlock acquisition/release, but that's unlikely
to matter given the cost of waiting for a cleanup lock.  Alternatively
it would have been possible to move responsibility for maintaining the
relevant flag to the waiter altogether, but that might have had
negative consequences due to possible floods of signals, besides being
more invasive.

This looks to be a very longstanding bug. The relevant code in
LockBufferForCleanup() hasn't changed materially since its introduction
and ProcWaitForSignal() was documented to return for unrelated reasons
since 8.2.  The master-only patch series removing ImmediateInterruptOK
made it much easier to hit, though, as ProcSendSignal/ProcWaitForSignal
now use a latch shared with other tasks.

Per discussion with Kevin Grittner, Tom Lane and me.

Backpatch to all supported branches.

Discussion: 11553.1423805224@sss.pgh.pa.us
2015-02-23 16:14:14 +01:00
Tom Lane
2e211211a7 Use FLEXIBLE_ARRAY_MEMBER in a number of other places.
I think we're about done with this...
2015-02-21 16:12:14 -05:00
Tom Lane
33a3b03d63 Use FLEXIBLE_ARRAY_MEMBER in some more places.
Fix a batch of structs that are only visible within individual .c files.

Michael Paquier
2015-02-20 17:32:01 -05:00
Tom Lane
e38b1eb098 Use FLEXIBLE_ARRAY_MEMBER in struct varlena.
This forces some minor coding adjustments in tuptoaster.c and inv_api.c,
but the new coding there is cleaner anyway.

Michael Paquier
2015-02-20 16:51:53 -05:00
Andres Freund
2505ce0be0 Remove remnants of ImmediateInterruptOK handling.
Now that nothing sets ImmediateInterruptOK to true anymore, we can
remove all the supporting code.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:47 +01:00
Andres Freund
d06995710b Remove the option to service interrupts during PGSemaphoreLock().
The remaining caller (lwlocks) doesn't need that facility, and we plan
to remove ImmediateInterruptOK entirely. That means that interrupts
can't be serviced race-free and portably anyway, so there's little
reason for keeping the feature.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:25:00 +01:00
Andres Freund
6753333f55 Move deadlock and other interrupt handling in proc.c out of signal handlers.
Deadlock checking was performed inside signal handlers up to
now. While it's a remarkable feat to have made this work reliably,
it's quite complex to understand why that is the case. Partially it
worked due to the assumption that semaphores are signal safe - which
is not actually documented to be the case for sysv semaphores.

The reason we had to rely on performing this work inside signal
handlers is that semaphores aren't guaranteed to be interruptable by
signals on all platforms. But now that latches provide a somewhat
similar API, which actually has the guarantee of being interruptible,
we can avoid doing so.

Signalling between ProcSleep, ProcWakeup, ProcWaitForSignal and
ProcSendSignal is now done using latches. This increases the
likelihood of spurious wakeups. As spurious wakeups were already
possible and aren't likely to be frequent enough to be an actual
problem, this seems acceptable.

This change would allow for further simplification of the deadlock
checking, now that it doesn't have to run in a signal handler. But
even if I were motivated to do so right now, it would still be better
to do that separately. Such a cleanup shouldn't have to be reviewed at
the same time as the more fundamental changes in this commit.

There is one possible usability regression due to this commit. Namely
it is more likely than before that log_lock_waits messages are output
more than once.

Reviewed-By: Heikki Linnakangas
2015-02-03 23:24:38 +01:00
Andres Freund
4f85fde8eb Introduce and use infrastructure for interrupt processing during client reads.
Up to now, large swathes of backend code ran inside signal handlers
while reading commands from the client, to allow for speedy reaction to
asynchronous events, most prominently shared invalidation and NOTIFY
handling. That meant that complex code like the starting/stopping of
transactions was run in signal handlers.  The required code was
fragile and verbose, and was likely to contain bugs.

That approach also severely limited what could be done while
communicating with the client. As the read might be from within
openssl, it wasn't safely possible to trigger an error, e.g. to cancel
a backend in idle-in-transaction state. We did that in some cases,
namely fatal errors, nonetheless.

Now that FE/BE communication in the backend employs non-blocking
sockets and latches to block, we can quite simply interrupt reads from
signal handlers by setting the latch. That allows us to signal an
interrupted read, which is supposed to be retried after returning from
within the ssl library.

As signal handlers now only need to set the latch to guarantee timely
interrupt processing, remove a fair amount of complicated & fragile
code from async.c and sinval.c.

We could now actually start to process some kinds of interrupts, like
sinval ones, more often than before, but that seems better done
separately.

This work will hopefully allow cases like being blocked while sending
data, interrupting idle transactions, and similar to be handled without
too much effort.  In addition to allowing us to get rid of
ImmediateInterruptOK, that is.

Author: Andres Freund
Reviewed-By: Heikki Linnakangas
2015-02-03 22:25:20 +01:00
Heikki Linnakangas
809d9a260b Refactor page compactifying code.
The logic to compact away removed tuples from a page was duplicated with
small differences in PageRepairFragmentation, PageIndexMultiDelete, and
PageIndexDeleteNoCompact. Put it into a common function.

Reviewed by Peter Geoghegan.
2015-02-03 14:09:29 +02:00
Heikki Linnakangas
2b3a8b20c2 Be more careful to not lose sync in the FE/BE protocol.
If any error occurred while we were in the middle of reading a protocol
message from the client, we could lose sync, and incorrectly try to
interpret a part of another message as a new protocol message. That will
usually lead to an "invalid frontend message" error that terminates the
connection. However, this is a security issue because an attacker might
be able to deliberately cause an error, inject a Query message in what's
supposed to be just user data, and have the server execute it.

We were quite careful to not have CHECK_FOR_INTERRUPTS() calls or other
operations that could ereport(ERROR) in the middle of processing a message,
but a query cancel interrupt or statement timeout could nevertheless cause
it to happen. Also, the V2 fastpath and COPY handling were not so careful.
It's very difficult to recover in the V2 COPY protocol, so we will just
terminate the connection on error. In practice, that's what happened
previously anyway, as we lost protocol sync.

To fix, add a new variable in pqcomm.c, PqCommReadingMsg, that is set
whenever we're in the middle of reading a message. When it's set, we cannot
safely ERROR out and continue running, because we might've read only part
of a message. PqCommReadingMsg acts somewhat similarly to critical sections
in that if an error occurs while it's set, the error handler will force the
connection to be terminated, as if the error was FATAL. It's not
implemented by promoting ERROR to FATAL in elog.c, like ERROR is promoted
to PANIC in critical sections, because we want to be able to use
PG_TRY/CATCH to recover and regain protocol sync. pq_getmessage() takes
advantage of that to prevent an OOM error from terminating the connection.

To prevent unnecessary connection terminations, add a holdoff mechanism
similar to HOLD/RESUME_INTERRUPTS() that can be used to hold off query cancel
interrupts, but still allow die interrupts. The rules on which interrupts
are processed when are now a bit more complicated, so refactor
ProcessInterrupts() and the calls to it in signal handlers so that the
signal handlers always call it if ImmediateInterruptOK is set, and
ProcessInterrupts() can decide to not do anything if the other conditions
are not met.

Reported by Emil Lenngren. Patch reviewed by Noah Misch and Andres Freund.
Backpatch to all supported versions.

Security: CVE-2015-0244
2015-02-02 17:09:53 +02:00
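A minimal sketch of the holdoff pattern the commit above introduces (hypothetical caller; the real uses live in the protocol-reading code):

    #include "postgres.h"
    #include "miscadmin.h"

    /* Hypothetical reader: defer query-cancel interrupts, still allow die interrupts. */
    static void
    read_rest_of_message(void)
    {
        HOLD_CANCEL_INTERRUPTS();

        /* ... read the remaining bytes of the current protocol message ... */

        RESUME_CANCEL_INTERRUPTS();
    }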
Andres Freund
17792bfc5b Properly terminate the array returned by GetLockConflicts().
GetLockConflicts() has for a long time not properly terminated the
returned array. During normal processing the returned array is zero
initialized, which, while not pretty, is sufficient to be recognized as
an invalid virtual transaction id. But the HotStandby case is more than
aesthetically broken: The allocated (and reused) array is neither
zeroed upon allocation, nor reinitialized, nor terminated.

Not having a terminating element means that the end of the array will
not be recognized and that recovery conflict handling will thus read
ahead into adjacent memory, only terminating when it hits memory
content that looks like an invalid virtual transaction id.  Luckily
this seems so far not to have caused significant problems, besides making
recovery conflict handling more expensive.

Discussion: 20150127142713.GD29457@awork2.anarazel.de

Backpatch into all supported branches.
2015-01-29 22:48:45 +01:00
Andres Freund
ed127002d8 Align buffer descriptors to cache line boundaries.
Benchmarks have shown that aligning the buffer descriptor array to
cache lines is important for scalability; especially on bigger,
multi-socket, machines.

Currently the array sometimes already ends up aligned by
happenstance, depending on how large previous shared memory allocations
were. That can lead to wildly varying performance results after minor
configuration changes.

In addition to aligning the start of descriptor array, also force the
size of individual descriptors to be of a common cache line size (64
bytes). That happens to already be the case on 64bit platforms, but
this way we can change the struct BufferDesc more easily.

As the alignment primarily matters in highly concurrent workloads
which probably all are 64bit these days, and the space wastage of
element alignment would be a bit more noticeable on 32bit systems, we
don't force the stride to be cacheline sized on 32bit platforms for
now. If somebody does actual performance testing, we can reevaluate
that decision by changing the definition of BUFFERDESC_PADDED_SIZE.

Discussion: 20140202151319.GD32123@awork2.anarazel.de

Per discussion with Bruce Momjian, Tom Lane, Robert Haas, and Peter
Geoghegan.
2015-01-29 22:48:45 +01:00
Andres Freund
7142bfbbd3 Fix #ifdef'ed-out code to compile again. 2015-01-29 22:48:45 +01:00
Heikki Linnakangas
acc2b1e843 Fix typo in comment. 2015-01-28 10:26:30 +02:00
Andres Freund
2d115e47c8 Fix various shortcomings of the new PrivateRefCount infrastructure.
As noted by Tom Lane the improvements in 4b4b680c3d had the problem
that in some situations we searched, entered and modified entries in
the private refcount hash while holding a spinlock. I had tried to
keep the logic entirely local to PinBuffer_Locked(), but that's not
really possible given it's called with a spinlock held...

Besides being disadvantageous from a performance point of view, this
also has problems with error handling safety. If we failed inserting
an entry into the hashtable due to an out of memory error, we'd error
out with a held spinlock. Not good.

Change the way private refcounts are manipulated: Before a buffer can
be tracked an entry has to be reserved using
ReservePrivateRefCountEntry(); then, if an entry is not found using
GetPrivateRefCountEntry(), it can be entered with
NewPrivateRefCountEntry().

Also take advantage of the fact that PinBuffer_Locked() currently is
never called for buffers that already have been pinned by the current
backend and don't search the private refcount entries for preexisting
local pins. That results in a small, but measurable, performance
improvement.

Additionally make ReleaseBuffer() always call UnpinBuffer() for shared
buffers. That avoids duplicating work in an eventual UnpinBuffer()
call that already has been done in ReleaseBuffer() and also saves some
code.

Per discussion with Tom Lane.

Discussion: 15028.1418772313@sss.pgh.pa.us
2015-01-19 23:59:41 +01:00
Andres Freund
59f71a0d0b Add a default local latch for use in signal handlers.
To do so, move InitializeLatchSupport() into the new common process
initialization functions, and add a new global variable MyLatch.

MyLatch is usable as soon as InitPostmasterChild() has been called
(i.e. very early during startup). Initially it points to a process
local latch that exists in all processes. InitProcess/InitAuxiliaryProcess
then replaces that local latch with PGPROC->procLatch. During shutdown
the reverse happens.

This is primarily advantageous for two reasons: For one it simplifies
dealing with the shared process latch, especially in signal handlers,
because instead of having to check for MyProc, MyLatch can be used
unconditionally. For another, a later patch that makes FE/BE
communication use latches can now rely on the existence of a latch,
even before having gone through InitProcess.

Discussion: 20140927191243.GD5423@alap3.anarazel.de
2015-01-14 18:45:22 +01:00
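A minimal sketch of the simplification described in the commit above, assuming the WaitLatch() signature of that era (no wait-event argument) and hypothetical handler/loop names:

    #include "postgres.h"

    #include <signal.h>

    #include "miscadmin.h"
    #include "storage/latch.h"

    static volatile sig_atomic_t got_sigusr1 = false;

    /* Hypothetical handler: MyLatch always exists, so no "if (MyProc)" check is needed. */
    static void
    my_sigusr1_handler(SIGNAL_ARGS)
    {
        got_sigusr1 = true;
        SetLatch(MyLatch);
    }

    /* Hypothetical wait loop. */
    static void
    wait_for_work(void)
    {
        int rc;

        rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
        if (rc & WL_LATCH_SET)
            ResetLatch(MyLatch);
    }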
Andres Freund
14e8803f10 Add barriers to the latch code.
Since their introduction latches have required barriers in SetLatch
and ResetLatch - but when they were introduced there wasn't any
barrier abstraction. Instead latches were documented to rely on the
callsites to provide barrier semantics.

Now that the barrier support looks halfway complete, add the necessary
barriers to both latch implementations.

Also remove a now superfluous lock acquisition from syncrep.c and a
superfluous (and insufficient) barrier from freelist.c. There might be
other cases that can now be simplified, but those are the only ones
I've seen on a quick scan.

We might want to backpatch this at some later point, but right now the
barrier infrastructure in the backbranches isn't totally on par with
master.

Discussion: 20150112154026.GB2092@awork2.anarazel.de
2015-01-13 12:58:43 +01:00
Stephen Frost
1bf4a84d0f Skip dead backends in MinimumActiveBackends
Back in ed0b409, PGPROC was split and moved to static variables in
procarray.c, with procs in ProcArrayStruct replaced by an array of
integers representing process numbers (pgprocnos), with -1 indicating a
dead process which has yet to be removed.  Access to procArray is
generally done under ProcArrayLock and therefore most code does not have
to concern itself with -1 entries.

However, MinimumActiveBackends intentionally does not take
ProcArrayLock, which means it has to be extra careful when accessing
procArray.  Prior to ed0b409, this was handled by checking for a NULL
in the pointer array, but that check was no longer valid after the
split.  Coverity pointed out that the check could never happen and so
it was removed in 5592eba.  That didn't make anything worse, but it
didn't fix the issue either.

The correct fix is to check for pgprocno == -1 and skip over that entry
if it is encountered.

Back-patch to 9.2, since there can be attempts to access the arrays
prior to their start otherwise.  Note that the changes prior to 9.4 will
look a bit different due to the change in 5592eba.

Note that MinimumActiveBackends only returns a bool for heuristic
purposes, any pre-array accesses are strictly read-only, and so there
is no security implication; the lack of complaints from the field
indicates it's very unlikely to run into issues due to this.

Pointed out by Noah.
2015-01-12 11:31:57 -05:00
Andres Freund
f454144a34 Remove comment that was intended to have been removed before commit.
Noticed by Amit Kapila
2015-01-08 13:16:31 +01:00
Bruce Momjian
4baaf863ec Update copyright for 2015
Backpatch certain files through 9.0
2015-01-06 11:43:47 -05:00
Andres Freund
740a4ec7f4 Blindly fix a dtrace probe in lwlock.c for a removed local variable.
Per buildfarm member locust.
2014-12-25 19:48:46 +01:00
Andres Freund
d72731a704 Lockless StrategyGetBuffer clock sweep hot path.
StrategyGetBuffer() has proven to be a bottleneck in a number of
buffer acquisition heavy workloads. To some degree this has already
been alleviated by 5d7962c6, but it still can be quite a heavy
bottleneck.  The problem is that in unfortunate usage patterns a
single StrategyGetBuffer() call will have to look at a large number of
buffers - in turn making it likely that the process will be put to
sleep while still holding the spinlock.

Replace most of the usage of the buffer_strategy_lock spinlock for the
clock sweep with an atomic nextVictimBuffer variable. That variable,
modulo NBuffers, is the current hand of the clock sweep. The buffer
clock-sweep then only needs to acquire the spinlock after a
wraparound. And even then only in the process that did the wrapping
around. That alleviates nearly all the contention on the relevant
spinlock, although significant contention on the cacheline can still
exist.

Reviewed-By: Robert Haas and Amit Kapila

Discussion: 20141010160020.GG6670@alap3.anarazel.de,
    20141027133218.GA2639@awork2.anarazel.de
2014-12-25 18:26:25 +01:00
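A rough sketch of the lockless hand described above, heavily simplified from the real ClockSweepTick() (which also handles wraparound of the counter itself under the spinlock):

    #include "postgres.h"
    #include "port/atomics.h"

    /* Simplified: the hand is an ever-growing counter taken modulo the buffer count. */
    static inline uint32
    clock_sweep_tick(pg_atomic_uint32 *nextVictimBuffer, uint32 nbuffers)
    {
        uint32 victim = pg_atomic_fetch_add_u32(nextVictimBuffer, 1);

        return victim % nbuffers;
    }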
Andres Freund
ab5194e6f6 Improve LWLock scalability.
The old LWLock implementation had the problem that concurrent lock
acquisitions required exclusively acquiring a spinlock. Often that
could lead to acquirers waiting behind the spinlock, even if the
actual LWLock was free.

The new implementation doesn't acquire the spinlock when acquiring the
lock itself. Instead the new atomic operations are used to atomically
manipulate the state. Only the waitqueue, used solely in the slow
path, is still protected by the spinlock. Check lwlock.c's header for
an explanation about the used algorithm.

For some common workloads on larger machines this can yield
significant performance improvements, particularly in read-mostly
workloads.

Reviewed-By: Amit Kapila and Robert Haas
Author: Andres Freund

Discussion: 20130926225545.GB26663@awork2.anarazel.de
2014-12-25 17:24:30 +01:00
Andres Freund
7882c3b0b9 Convert the PGPROC->lwWaitLink list into a dlist instead of open coding it.
Besides being shorter and much easier to read, it changes the logic in
LWLockRelease() to release all shared lockers when waking up any. This
can yield some significant performance improvements - and the fairness
isn't really much worse than before, as we always allowed new shared
lockers to jump the queue.
2014-12-25 17:24:30 +01:00
Andres Freund
37de8de9e3 Prevent potentially hazardous compiler/cpu reordering during lwlock release.
In LWLockRelease() (and in 9.4+ LWLockUpdateVar()) we release enqueued
waiters using PGSemaphoreUnlock(). As there are other sources of such
unlocks, backends only wake up if MyProc->lwWaiting is set to false,
which is only done in the aforementioned functions.

Before this commit there were dangers because the store to lwWaiting
could become visible before the store to lwWaitLink. This could happen
both due to compiler reordering (on most compilers) and, on some
platforms, due to the CPU reordering stores.

The possible consequence of this is that a backend stops waiting
before lwWaitLink is set to NULL. If that backend then tries to
acquire another lock and has to wait there the list could become
corrupted once the lwWaitLink store is finally performed.

Add a write memory barrier to prevent that issue.

Unfortunately, barrier support was only added in 9.2. Given
that the issue has not knowingly been observed in practice, it seems
sufficient to prohibit compiler reordering using volatile for 9.0 and
9.1. Actual problems due to compiler reordering are more likely
anyway.

Discussion: 20140210134625.GA15246@awork2.anarazel.de
2014-12-19 14:29:52 +01:00
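A minimal sketch of the ordering the new barrier enforces, using the field and semaphore APIs as they looked in those branches (simplified from the actual release path):

    #include "postgres.h"
    #include "storage/barrier.h"
    #include "storage/proc.h"

    /* Simplified wakeup: reset the link before the waiter can observe !lwWaiting. */
    static void
    wake_up_waiter(PGPROC *waiter)
    {
        waiter->lwWaitLink = NULL;

        pg_write_barrier();        /* keep the two stores in this order */

        waiter->lwWaiting = false;
        PGSemaphoreUnlock(&waiter->sem);
    }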
Tom Lane
4a14f13a0a Improve hash_create's API for selecting simple-binary-key hash functions.
Previously, if you wanted anything besides C-string hash keys, you had to
specify a custom hashing function to hash_create().  Nearly all such
callers were specifying tag_hash or oid_hash; which is tedious, and rather
error-prone, since a caller could easily miss the opportunity to optimize
by using hash_uint32 when appropriate.  Replace this with a design whereby
callers using simple binary-data keys just specify HASH_BLOBS and don't
need to mess with specific support functions.  hash_create() itself will
take care of optimizing when the key size is four bytes.

This nets out saving a few hundred bytes of code space, and offers
a measurable performance improvement in tidbitmap.c (which was not
exploiting the opportunity to use hash_uint32 for its 4-byte keys).
There might be some wins elsewhere too, but I didn't analyze closely.

In future we could look into offering a similar optimized hashing function
for 8-byte keys.  Under this design that could be done in a centralized
and machine-independent fashion, whereas getting it right for keys of
platform-dependent sizes would've been notationally painful before.

For the moment, the old way still works fine, so as not to break source
code compatibility for loadable modules.  Eventually we might want to
remove tag_hash and friends from the exported API altogether, since there's
no real need for them to be explicitly referenced from outside dynahash.c.

Teodor Sigaev and Tom Lane
2014-12-18 13:36:36 -05:00
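A minimal sketch of the new calling convention described above (hypothetical table and entry type):

    #include "postgres.h"
    #include "utils/hsearch.h"

    /* Hypothetical entry keyed by an Oid; the key must be the first field. */
    typedef struct MyCacheEntry
    {
        Oid relid;
        int usage_count;
    } MyCacheEntry;

    static HTAB *
    create_my_cache(void)
    {
        HASHCTL ctl;

        MemSet(&ctl, 0, sizeof(ctl));
        ctl.keysize = sizeof(Oid);
        ctl.entrysize = sizeof(MyCacheEntry);

        /* HASH_BLOBS: simple binary key; hash_create() picks hash_uint32 for 4-byte keys. */
        return hash_create("my relid cache", 128, &ctl, HASH_ELEM | HASH_BLOBS);
    }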
Alvaro Herrera
73c986adde Keep track of transaction commit timestamps
Transactions can now set their commit timestamp directly as they commit,
or an external transaction commit timestamp can be fed from an outside
system using the new function TransactionTreeSetCommitTsData().  This
data is crash-safe, and truncated at Xid freeze point, same as pg_clog.

This module is disabled by default because it causes a performance hit,
but can be enabled in postgresql.conf requiring only a server restart.

A new test in src/test/modules is included.

Catalog version bumped due to the new subdirectory within PGDATA and a
couple of new SQL functions.

Authors: Álvaro Herrera and Petr Jelínek

Reviewed to varying degrees by Michael Paquier, Andres Freund, Robert
Haas, Amit Kapila, Fujii Masao, Jaime Casanova, Simon Riggs, Steven
Singer, Peter Eisentraut
2014-12-03 11:53:02 -03:00
Heikki Linnakangas
2c03216d83 Revamp the WAL record format.
Each WAL record now carries information about the modified relation and
block(s) in a standardized format. That makes it easier to write tools that
need that information, like pg_rewind, prefetching the blocks to speed up
recovery, etc.

There's a whole new API for building WAL records, replacing the XLogRecData
chains used previously. The new API consists of XLogRegister* functions,
which are called for each buffer and chunk of data that is added to the
record. The new API also gives more control over when a full-page image is
written, by passing flags to the XLogRegisterBuffer function.

This also simplifies the XLogReadBufferForRedo() calls. The function can dig
the relation and block number from the WAL record, so they no longer need to
be passed as arguments.

For the convenience of redo routines, XLogReader now dissects each WAL record
after reading it, copying the main data part and the per-block data into
MAXALIGNed buffers. The data chunks are not aligned within the WAL record,
but the redo routines can assume that the pointers returned by XLogRecGet*
functions are. Redo routines are now passed the XLogReaderState, which
contains the record in the already-dissected format, instead of the plain
XLogRecord.

The new record format also makes the fixed size XLogRecord header smaller,
by removing the xl_len field. The length of the "main data" portion is now
stored at the end of the WAL record, and there's a separate header after
XLogRecord for it. The alignment padding at the end of XLogRecord is also
removed. This compensates for the fact that the new format would otherwise
be more bulky than the old format.

Reviewed by Andres Freund, Amit Kapila, Michael Paquier, Alvaro Herrera,
Fujii Masao.
2014-11-20 18:46:41 +02:00
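A minimal sketch of the new insertion API from the commit above (hypothetical redo-logged operation; the rmgr id and info byte are passed in rather than invented):

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"

    /* Hypothetical helper: log a change to one buffer plus some "main data". */
    static void
    log_my_change(RmgrId rmid, uint8 info, Buffer buf, char *main_data, int len)
    {
        XLogRecPtr recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buf, REGBUF_STANDARD);    /* block reference #0 */
        XLogRegisterData(main_data, len);               /* the record's main data */

        recptr = XLogInsert(rmid, info);
        PageSetLSN(BufferGetPage(buf), recptr);
    }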
Andres Freund
98ec7fd903 Sync unlogged relations to disk after they have been reset.
Unlogged relations are only reset when performing an unclean
restart. That means they have to be synced to disk during clean
shutdowns. During normal processing that's achieved by registering a
buffer's file to be fsynced at the next checkpoint when flushed. But
ResetUnloggedRelations() doesn't go through the buffer manager, so
nothing will force the reset relations out to disk before the next shutdown
checkpoint.

So just make ResetUnloggedRelations() fsync the newly created main
forks to disk.

Discussion: 20140912112246.GA4984@alap3.anarazel.de

Backpatch to 9.1 where unlogged tables were introduced.

Abhijit Menon-Sen and Andres Freund
2014-11-15 01:19:31 +01:00
Heikki Linnakangas
81c4508196 Fix race condition between hot standby and restoring a full-page image.
There was a window in RestoreBackupBlock where a page would be zeroed out,
but not yet locked. If a backend pinned and locked the page in that window,
it saw the zeroed page instead of the old page or new page contents, which
could lead to missing rows in a result set, or errors.

To fix, replace RBM_ZERO with RBM_ZERO_AND_LOCK, which atomically pins,
zeroes, and locks the page, if it's not in the buffer cache already.

In stable branches, the old RBM_ZERO constant is renamed to RBM_DO_NOT_USE,
to avoid breaking any 3rd party extensions that might use RBM_ZERO. More
importantly, this avoids renumbering the other enum values, which would
cause even bigger confusion in extensions that use ReadBufferExtended, but
haven't been recompiled.

Backpatch to all supported versions; this has been racy since hot standby
was introduced.
2014-11-13 20:02:37 +02:00
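A minimal sketch of the new mode from a redo routine's perspective (hypothetical helper):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Hypothetical helper: pin, zero, and exclusively lock a block in one step. */
    static Buffer
    read_block_zeroed_and_locked(Relation rel, BlockNumber blkno)
    {
        return ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_ZERO_AND_LOCK, NULL);
    }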
Alvaro Herrera
7516f52594 BRIN: Block Range Indexes
BRIN is a new index access method intended to accelerate scans of very
large tables, without the maintenance overhead of btrees or other
traditional indexes.  They work by maintaining "summary" data about
block ranges.  Bitmap index scans work by reading each summary tuple and
comparing it with the query quals; all pages in the range are returned
in a lossy TID bitmap if the quals are consistent with the values in the
summary tuple, otherwise not.  Normal index scans are not supported
because these indexes do not store TIDs.

As new tuples are added into the index, the summary information is
updated (if the block range in which the tuple is added is already
summarized) or not; in the latter case, a subsequent pass of VACUUM or
the brin_summarize_new_values() function will create the summary
information.

For data types with natural 1-D sort orders, the summary info consists
of the maximum and the minimum values of each indexed column within each
page range.  This type of operator class we call "Minmax", and we
supply a bunch of them for most data types with B-tree opclasses.
Since the BRIN code is generalized, other approaches are possible for
things such as arrays, geometric types, ranges, etc; even for things
such as enum types we could do something different than minmax with
better results.  In this commit I only include minmax.

Catalog version bumped due to new builtin catalog entries.

There's more that could be done here, but this is a good step forwards.

Loosely based on ideas from Simon Riggs; code mostly by Álvaro Herrera,
with contribution by Heikki Linnakangas.

Patch reviewed by: Amit Kapila, Heikki Linnakangas, Robert Haas.
Testing help from Jeff Janes, Erik Rijkers, Emanuel Calvo.

PS:
  The research leading to these results has received funding from the
  European Union's Seventh Framework Programme (FP7/2007-2013) under
  grant agreement n° 318633.
2014-11-07 16:38:14 -03:00
Heikki Linnakangas
2076db2aea Move the backup-block logic from XLogInsert to a new file, xloginsert.c.
xlog.c is huge; this makes it a little bit smaller, which is nice. Functions
related to putting together the WAL record are in xloginsert.c, and the
lower level stuff for managing WAL buffers and such are in xlog.c.

Also move the definition of XLogRecord to a separate header file. This
causes churn in the #includes of all the files that write WAL records, and
redo routines, but it avoids pulling in xlog.h into most places.

Reviewed by Michael Paquier, Alvaro Herrera, Andres Freund and Amit Kapila.
2014-11-06 13:55:36 +02:00
Robert Haas
f7102b0463 Extend dsm API with a new function dsm_unpin_mapping.
This reassociates a dynamic shared memory handle previously passed to
dsm_pin_mapping with the current resource owner, so that it will be
cleaned up at the end of the current query.

Patch by me.  Review of the function name by Andres Freund, Amit
Kapila, Jim Nasby, Petr Jelinek, and Álvaro Herrera.
2014-10-30 14:55:23 -04:00
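A minimal sketch of the pin/unpin pairing (hypothetical helpers; dsm_pin_mapping is the rename described in the next entry below):

    #include "postgres.h"
    #include "storage/dsm.h"

    /* Hypothetical: keep the mapping beyond the current resource owner... */
    static void
    keep_mapping_for_now(dsm_segment *seg)
    {
        dsm_pin_mapping(seg);
    }

    /* ...and later hand it back, so it is unmapped at the end of the current query. */
    static void
    release_at_query_end(dsm_segment *seg)
    {
        dsm_unpin_mapping(seg);
    }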
Robert Haas
6057c212f3 "Pin", rather than "keep", dynamic shared memory mappings and segments.
Nobody seemed concerned about this naming when it originally went in,
but there's a pending patch that implements the opposite of
dsm_keep_mapping, and the term "unkeep" was judged unpalatable.
"unpin" has existing precedent in the PostgreSQL code base, and the
English language, so use this terminology instead.

Per discussion, back-patch to 9.4.
2014-10-30 11:35:55 -04:00
Heikki Linnakangas
18f158ef69 Remove unnecessary assignment.
Reported by MauMau.
2014-10-28 20:26:20 +02:00
Andres Freund
7dbb606938 Flush unlogged table's buffers when copying or moving databases.
CREATE DATABASE and ALTER DATABASE .. SET TABLESPACE copy the source
database directory on the filesystem level. To ensure the on disk
state is consistent they block out users of the affected database and
force a checkpoint to flush out all data to disk. Unfortunately, up to
now, that checkpoint didn't flush out dirty buffers from unlogged
relations.

That bug means there could be leftover dirty buffers in either the
template database or the database in its old location, leading to
problems when accessing relations in an inconsistent state, and to
possible problems during shutdown in the SET TABLESPACE case because
buffers belonging to files that don't exist anymore are flushed.

This was reported in bug #10675 by Maxim Boguk.

Fix by Pavan Deolasee, modified somewhat by me. Reviewed by MauMau and
Fujii Masao.

Backpatch to 9.1 where unlogged tables were introduced.
2014-10-20 23:43:46 +02:00
Heikki Linnakangas
e0d97d77bf Fix deadlock with LWLockAcquireWithVar and LWLockWaitForVar.
LWLockRelease should release all backends waiting with LWLockWaitForVar,
even when another backend has already been woken up to acquire the lock,
i.e. when releaseOK is false. LWLockWaitForVar can return as soon as the
protected value changes, even if the other backend will acquire the lock.
Fix that by resetting releaseOK to true in LWLockWaitForVar whenever
it adds itself to the wait queue.

This should fix the bug reported by MauMau, where the system occasionally
hangs when there is a lot of concurrent WAL activity and a checkpoint.
Backpatch to 9.4, where this code was added.
2014-10-14 10:06:47 +03:00
Robert Haas
7bb0e97407 Extend shm_mq API with new functions shm_mq_sendv, shm_mq_set_handle.
shm_mq_sendv sends a message to the queue assembled from multiple
locations.  This is expected to be used by forthcoming patches to
allow frontend/backend protocol messages to be sent via shm_mq, but
might be useful for other purposes as well.

shm_mq_set_handle associates a BackgroundWorkerHandle with an
already-existing shm_mq_handle.  This solves a timing problem when
creating a shm_mq to communicate with a newly-launched background
worker: if you attach to the queue first, and the background worker
fails to start, you might block forever trying to do I/O on the queue;
but if you start the background worker first, but then die before
attaching to the queue, the background worker might block forever
trying to do I/O on the queue.  This lets you attach before starting
the worker (so that the worker is protected) and then associate the
BackgroundWorkerHandle later (so that you are also protected).

Patch by me, reviewed by Stephen Frost.
2014-10-08 14:38:31 -04:00
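A minimal sketch of both new functions together, assuming the four-argument shm_mq_sendv() of that time and a hypothetical message layout:

    #include "postgres.h"
    #include "postmaster/bgworker.h"
    #include "storage/shm_mq.h"

    /* Hypothetical sender: bind the queue to the worker, then send a two-part message. */
    static shm_mq_result
    send_tagged_message(shm_mq_handle *mqh, BackgroundWorkerHandle *wh,
                        char *tag, Size taglen, char *body, Size bodylen)
    {
        shm_mq_iovec iov[2];

        shm_mq_set_handle(mqh, wh);     /* detach instead of hanging if the worker never starts */

        iov[0].data = tag;
        iov[0].len = taglen;
        iov[1].data = body;
        iov[1].len = bodylen;

        return shm_mq_sendv(mqh, iov, 2, false);    /* nowait = false: block until sent */
    }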
Robert Haas
3acc10c997 Increase the number of buffer mapping partitions to 128.
Testing by Amit Kapila, Andres Freund, and myself, with and without
other patches that also aim to improve scalability, seems to indicate
that this change is a significant win over the current value and over
smaller values such as 64.  It's not clear how high we can push this
value before it starts to have negative side-effects elsewhere, but
going this far looks OK.
2014-10-02 13:58:50 -04:00
Andres Freund
b64d92f1a5 Add a basic atomic ops API abstracting away platform/architecture details.
Several upcoming performance/scalability improvements require atomic
operations. This new API avoids the need to splatter compiler and
architecture dependent code over all the locations employing atomic
ops.

For several of the potential usages it'd be problematic to maintain
both an atomics-using implementation and one using spinlocks or
similar. In all likelihood one of the implementations would not get
tested regularly under concurrency. To avoid that scenario the new API
provides an automatic fallback of atomic operations to spinlocks. All
properties of atomic operations are maintained. This fallback,
obviously, isn't as fast as just using atomic ops, but it's not bad
either. For one of the future users the atomics-on-top-of-spinlocks
implementation was actually slightly faster than the old purely
spinlock using implementation. That's important because it reduces the
fear of regressing older platforms when improving the scalability for
new ones.

The API, loosely modeled after the C11 atomics support, currently
provides 'atomic flags' and 32 bit unsigned integers. If the platform
efficiently supports atomic 64 bit unsigned integers those are also
provided.

To implement atomics support for a platform/architecture/compiler,
32-bit compare-and-exchange needs to be implemented. If available and
more efficient, native support for flags, 32-bit atomic addition, and
corresponding 64-bit operations may also be provided. Additional
useful atomic operations are implemented generically on top of these.

The implementations for various versions of gcc, msvc and sun studio have
been tested. Additional existing stub implementations for
* Intel icc
* HP-UX acc
* IBM xlc
are included but have never been tested. These will likely require
fixes based on buildfarm and user feedback.

As atomic operations also require barriers for some operations the
existing barrier support has been moved into the atomics code.

Author: Andres Freund with contributions from Oskari Saarenmaa
Reviewed-By: Amit Kapila, Robert Haas, Heikki Linnakangas and Álvaro Herrera
Discussion: CA+TgmoYBW+ux5-8Ja=Mcyuy8=VXAnVRHp3Kess6Pn3DMXAPAEA@mail.gmail.com,
    20131015123303.GH5300@awork2.anarazel.de,
    20131028205522.GI20248@awork2.anarazel.de
2014-09-25 23:49:05 +02:00
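A minimal sketch of the 32-bit integer part of the API introduced above (hypothetical shared counter):

    #include "postgres.h"
    #include "port/atomics.h"

    /* Hypothetical counter living in shared memory. */
    typedef struct SharedCounter
    {
        pg_atomic_uint32 value;
    } SharedCounter;

    static void
    counter_init(SharedCounter *c)
    {
        pg_atomic_init_u32(&c->value, 0);
    }

    static uint32
    counter_increment(SharedCounter *c)
    {
        /* returns the value before the add; falls back to spinlock emulation if needed */
        return pg_atomic_fetch_add_u32(&c->value, 1);
    }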
Robert Haas
5d7962c679 Change locking regimen around buffer replacement.
Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency.  Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released.  If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.

This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads.  On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased.  However, that seems like
material for a separate commit.  We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.

Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
2014-09-25 10:43:24 -04:00
Robert Haas
df4077cda2 Remove volatile qualifiers from lwlock.c.
Now that spinlocks (hopefully!) act as compiler barriers, as of commit
0709b7ee72, this should be safe.  This
serves as a demonstration of the new coding style, and may be optimized
better on some machines as well.
2014-09-22 16:42:14 -04:00
Robert Haas
68e66923ff Add missing volatile qualifier.
Yet another silly mistake in 0709b7ee72,
again found by buildfarm member castoroides.
2014-09-11 09:07:32 -04:00
Robert Haas
0709b7ee72 Change the spinlock primitives to function as compiler barriers.
Previously, they functioned as barriers against CPU reordering but not
compiler reordering, an odd API that required extensive use of volatile
everywhere that spinlocks are used.  That's error-prone and has negative
implications for performance, so change it.

In theory, this makes it safe to remove many of the uses of volatile
that we currently have in our code base, but we may find that there are
some bugs in this effort when we do.  In the long run, though, this
should make for much more maintainable code.

Patch by me.  Review by Andres Freund.
2014-09-09 17:48:50 -04:00