postgres

mirror of https://github.com/postgres/postgres.git synced 2025-10-28 11:55:03 +03:00

Author	SHA1	Message	Date
Alvaro Herrera	21c09e99dc	Split heapam_xlog.h from heapam.h The heapam XLog functions are used by other modules, not all of which are interested in the rest of the heapam API. With this, we let them get just the XLog stuff in which they are interested and not pollute them with unrelated includes. Also, since heapam.h no longer requires xlog.h, many files that do include heapam.h no longer get xlog.h automatically, including a few headers. This is useful because heapam.h is getting pulled in by execnodes.h, which is in turn included by a lot of files.	2012-08-28 19:02:00 -04:00
Alvaro Herrera	45326c5a11	Split resowner.h This lets files that are mere users of ResourceOwner not automatically include the headers for stuff that is managed by the resowner mechanism.	2012-08-28 18:02:07 -04:00
Tom Lane	10685ec082	Avoid somewhat-theoretical overflow risks in RecordIsValid(). This improves on commit `51fed14d73` by eliminating the assumption that we can form <some pointer value> + <some offset> without overflow. The entire point of those tests is that we don't trust the offset value, so coding them in a way that could wrap around if the buffer happens to be near the top of memory doesn't seem sound. Instead, track the remaining space as a size_t variable and compare offsets against that. Also, improve comment about why we need the extra early check on xl_tot_len.	2012-08-21 18:41:52 -04:00
Heikki Linnakangas	51fed14d73	Don't get confused if a WAL partial record header has xl_tot_len == 0. If a WAL record header was split across pages, but xl_tot_len was 0, we would get confused and conclude that we had already read the whole record, and proceed to CRC check it. That can lead to a crash in RecordIsValid(), which isn't careful to not read beyond end-of-record, as defined by xl_tot_len. Add an explicit sanity check for xl_tot_len <= SizeOfXlogRecord. Also, make RecordIsValid() more robust by checking in each step that it doesn't try to access memory beyond end of record, even if a length field in the record's or a backup block's header is bogus. Per report and analysis by Tom Lane.	2012-08-20 19:58:21 +03:00
Bruce Momjian	3325957656	Delete inaccurate C comment about FSM and adding pages, per Robert Haas.	2012-08-16 19:02:58 -04:00
Heikki Linnakangas	89911b3ab8	Fix GiST buffering build bug, which caused "failed to re-find parent" errors. We use a hash table to track the parents of inner pages, but when inserting to a leaf page, the caller of gistbufferinginserttuples() must pass a correct block number of the leaf's parent page. Before gistProcessItup() descends to a child page, it checks if the downlink needs to be adjusted to accommodate the new tuple, and updates the downlink if necessary. However, updating the downlink might require splitting the page, which might move the downlink to a page to the right. gistProcessItup() doesn't realize that, so when it descends to the leaf page, it might pass an out-of-date parent block number as a result. Fix that by returning the block a tuple was inserted to from gistbufferinginserttuples(). This fixes the bug reported by Zdeněk Jílovec.	2012-08-16 12:56:24 +03:00
Bruce Momjian	41fa3dfb0a	Update C comment to NOTICE to reflect previous commit changing the error level, per report from Tom.	2012-08-15 19:09:37 -04:00
Simon Riggs	8143a56854	Fix minor bug in XLogFileRead() that accidentally worked. Cascading replication copied the incoming file into pg_xlog but didn't set path correctly, so the first attempt to open file failed causing it to loop around and look for file in pg_xlog. So the earlier coding worked, but accidentally rather than by design. Spotted by Fujii Masao, fix by Fujii Masao and Simon Riggs	2012-08-08 21:25:23 +01:00
Tom Lane	db108349bf	Fix TwoPhaseGetDummyBackendId(). This was broken in commit `ed0b409d22`, which revised the GlobalTransactionData struct to not include the associated PGPROC as its first member, but overlooked one place where a cast was used in reliance on that equivalence. The most effective way of fixing this seems to be to create a new function that looks up the GlobalTransactionData struct given the XID, and make both TwoPhaseGetDummyBackendId and TwoPhaseGetDummyProc rely on that. Per report from Robert Ross.	2012-08-08 11:52:02 -04:00
Simon Riggs	0f04fc67f7	fsync backup_label after pg_start_backup() Dave Kerr	2012-08-07 16:19:13 +01:00
Tom Lane	f786e91a75	Improve underdocumented btree_xlog_delete_get_latestRemovedXid() code. As noted by Noah Misch, btree_xlog_delete_get_latestRemovedXid is critically dependent on the assumption that it's examining a consistent state of the database. This was undocumented though, so the seemingly-unrelated check for no active HS sessions might be thought to be merely an optional optimization. Improve comments, and add an explicit check of reachedConsistency just to be sure. This function returns InvalidTransactionId (thereby killing all HS transactions) in several cases that are not nearly unlikely enough for my taste. This commit doesn't attempt to fix those deficiencies, just document them. Back-patch to 9.2, not from any real functional need but just to keep the branches more closely synced to simplify possible future back-patching.	2012-08-03 15:41:18 -04:00
Tom Lane	c1793f2e0c	In SPGiST replay, do conflict resolution before modifying the page. In yesterday's commit `962e0cc71e`, I added the ResolveRecoveryConflictWithSnapshot call in the wrong place. I correctly put it before spgRedoVacuumRedirect itself would modify the index page --- but not before RestoreBkpBlocks, so replay of a record with a full-page image would modify the page before kicking off any conflicting HS transactions. Oops.	2012-08-03 15:23:14 -04:00
Tom Lane	962e0cc71e	Fix race conditions associated with SPGiST redirection tuples. The correct test for whether a redirection tuple is removable is whether tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding failed to protect index searches being done in concurrent transactions that have no XID. This mirrors the recent fix in btree's page recycling logic made in commit `d3abbbebe5`. Also, WAL-log the newest XID of any removed redirection tuple on an index page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL replay. This protects against concurrent Hot Standby transactions possibly needing to see the redirection tuple(s). Per my query of 2012-03-12 and subsequent discussion.	2012-08-02 15:34:14 -04:00
Tom Lane	4a9c30a8a1	Fix management of pendingOpsTable in auxiliary processes. mdinit() was misusing IsBootstrapProcessingMode() to decide whether to create an fsync pending-operations table in the current process. This led to creating a table not only in the startup and checkpointer processes as intended, but also in the bgwriter process, not to mention other auxiliary processes such as walwriter and walreceiver. Creation of the table in the bgwriter is fatal, because it absorbs fsync requests that should have gone to the checkpointer; instead they just sit in bgwriter local memory and are never acted on. So writes performed by the bgwriter were not being fsync'd which could result in data loss after an OS crash. I think there is no live bug with respect to walwriter and walreceiver because those never perform any writes of shared buffers; but the potential is there for future breakage in those processes too. To fix, make AuxiliaryProcessMain() export the current process's AuxProcType as a global variable, and then make mdinit() test directly for the types of aux process that should have a pendingOpsTable. Having done that, we might as well also get rid of the random bool flags such as am_walreceiver that some of the aux processes had grown. (Note that we could not have fixed the bug by examining those variables in mdinit(), because it's called from BaseInit() which is run by AuxiliaryProcessMain() before entering any of the process-type-specific code.) Back-patch to 9.2, where the problem was introduced by the split-up of bgwriter and checkpointer processes. The bogus pendingOpsTable exists in walwriter and walreceiver processes in earlier branches, but absent any evidence that it causes actual problems there, I'll leave the older branches alone.	2012-07-18 15:28:10 -04:00
Peter Eisentraut	dd16f9480a	Remove unreachable code The Solaris Studio compiler warns about these instances, unlike more mainstream compilers such as gcc. But manual inspection showed that the code is clearly not reachable, and we hope no worthy compiler will complain about removing this code.	2012-07-16 22:15:03 +03:00
Tom Lane	1a9405d265	Cosmetic cleanup of ginInsertValue(). Make it clearer that the passed stack mustn't be empty, and that we are not supposed to fall off the end of the stack in the main loop. Tighten the loop that extracts the root block number, too. Markus Wanner and Tom Lane	2012-07-13 11:37:39 -04:00
Robert Haas	3cf39e6ddb	Fix a stupid bug I introduced into XLogFlush(). Commit `f11e8be3e8` broke this; it was right in Peter's original patch, but I messed it up before committing.	2012-07-02 15:33:59 -04:00
Robert Haas	3bb592bb20	Fix position of WalSndWakeupRequest call. This avoids discriminating against wal_sync_method = open_sync or open_datasync. Fujii Masao, reviewed by Andres Freund	2012-07-02 14:44:10 -04:00
Peter Eisentraut	2b44306315	Assorted message style improvements	2012-07-02 21:12:46 +03:00
Robert Haas	82cdd2df75	Work a little harder on comments for walsender wakeup patch. Per gripe from Tom Lane.	2012-07-02 11:28:53 -04:00
Robert Haas	f11e8be3e8	Make commit_delay much smarter. Instead of letting every backend participating in a group commit wait independently, have the first one that becomes ready to flush WAL wait for the configured delay, and let all the others wait just long enough for that first process to complete its flush. This greatly increases the chances of being able to configure a commit_delay setting that actually improves performance. As a side consequence of this change, commit_delay now affects all WAL flushes, rather than just commits. There was some discussion on pgsql-hackers about whether to rename the GUC to, say, wal_flush_delay, but in the absence of consensus I am leaving it alone for now. Peter Geoghegan, with some changes, mostly to the documentation, by me.	2012-07-02 10:26:31 -04:00
Robert Haas	f83b59997d	Make walsender more responsive. Per testing by Andres Freund, this improves replication performance and reduces replication latency and latency jitter. I was a bit concerned about moving more work into XLogInsert, but testing seems to show that it's not a problem in practice. Along the way, improve comments for WaitLatchOrSocket. Andres Freund. Review and stylistic cleanup by me.	2012-07-02 09:41:01 -04:00
Heikki Linnakangas	567787f216	Validate xlog record header before enlarging the work area to store it. If the record header is garbled, we're now quite likely to notice it before we try to make a bogus memory allocation and run out of memory. That can still happen, if the xlog record is split across pages (we cannot verify the record header until reading the next page in that scenario), but this reduces the chances. An out-of-memory is treated as a corrupt record anyway, so this isn't a correctness issue, just a case of giving a better error message. Per Amit Kapila's suggestion.	2012-06-30 23:14:35 +03:00
Heikki Linnakangas	7a5c9ca93a	Initialize shared memory copy of ckptXidEpoch correctly when not in recovery. This bug was introduced by commit `20d98ab6e4`, so backpatch this to 9.0-9.2 like that one. This fixes bug #6710, reported by Tarvi Pillessaar	2012-06-29 19:32:15 +03:00
Heikki Linnakangas	8f85667a86	Update outdated commit; xlp_rem_len field is in page header now. Spotted by Amit Kapila	2012-06-28 20:35:18 +03:00
Heikki Linnakangas	a8f97b39c7	Fix two more neglected comments, still referring to log/seg. Fujii Masao	2012-06-27 19:11:26 +03:00
Heikki Linnakangas	ec786c6c81	I neglected many comments in the log+seg -> 64-bit segno patch. Fix. Reported by Amit Kapila.	2012-06-27 17:53:53 +03:00
Tom Lane	757773602c	Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on text. The original coding failed miserably for BLCKSZ of 4K or less, as reported by Josh Kupershmidt. With the present design for text indexes, a given inner tuple could have up to 256 labels (requiring either 3K or 4K bytes depending on MAXALIGN), which means that we can't positively guarantee no failures for smaller blocksizes. But we can at least make it behave sanely so long as there are few enough labels to fit on a page. Considering that btree is also more prone to "index tuple too large" failures when BLCKSZ is small, it's not clear that we should expend more work than this on this case.	2012-06-26 14:36:25 -04:00
Robert Haas	76837c1507	Reduce use of heavyweight locking inside hash AM. Avoid using LockPage(rel, 0, lockmode) to protect against changes to the bucket mapping. Instead, an exclusive buffer content lock is now viewed as sufficient permission to modify the metapage, and a shared buffer content lock is used when such modifications need to be prevented. This more relaxed locking regimen makes it possible that, when we're busy getting a heavyweight bucket on the bucket we intend to search or insert into, a bucket split might occur underneath us. To compenate for that possibility, we use a loop-and-retry system: release the metapage content lock, acquire the heavyweight lock on the target bucket, and then reacquire the metapage content lock and check that the bucket mapping has not changed. Normally it hasn't, and we're done. But if by chance it has, we simply unlock the metapage, release the heavyweight lock we acquired previously, lock the new bucket, and loop around again. Even in the worst case we cannot loop very many times here, since we don't split the same bucket again until we've split all the other buckets, and 2^N gets big pretty fast. This results in greatly improved concurrency, because we're effectively replacing two lwlock acquire-and-release cycles in exclusive mode (on one of the lock manager locks) with a single acquire-and-release cycle in shared mode (on the metapage buffer content lock). Testing shows that it's still not quite as good as btree; for that, we'd probably have to find some way of getting rid of the heavyweight bucket locks as well, which does not appear straightforward. Patch by me, review by Jeff Janes.	2012-06-26 06:56:10 -04:00
Alvaro Herrera	77ed0c6950	Tighten up includes in sinvaladt.h, twophase.h, proc.h Remove proc.h from sinvaladt.h and twophase.h; also replace xlog.h in proc.h with xlogdefs.h.	2012-06-25 18:40:40 -04:00
Peter Eisentraut	b8b2e3b2de	Replace int2/int4 in C code with int16/int32 The latter was already the dominant use, and it's preferable because in C the convention is that intXX means XX bits. Therefore, allowing mixed use of int2, int4, int8, int16, int32 is obviously confusing. Remove the typedefs for int2 and int4 for now. They don't seem to be widely used outside of the PostgreSQL source tree, and the few uses can probably be cleaned up by the time this ships.	2012-06-25 01:51:46 +03:00
Heikki Linnakangas	a218e23a08	Oops. Remove stray paren. I didn't notice this on my laptop as I don't HAVE_FSYNC_WRITETHROUGH.	2012-06-24 20:03:57 +03:00
Heikki Linnakangas	0ab9d1c4b3	Replace XLogRecPtr struct with a 64-bit integer. This simplifies code that needs to do arithmetic on XLogRecPtrs. To avoid changing on-disk format of data pages, the LSN on data pages is still stored in the old format. That should keep pg_upgrade happy. However, we have XLogRecPtrs embedded in the control file, and in the structs that are sent over the replication protocol, so this changes breaks compatibility of pg_basebackup and server. I didn't do anything about this in this patch, per discussion on -hackers, the right thing to do would to be to change the replication protocol to be architecture-independent, so that you could use a newer version of pg_receivexlog, for example, against an older server version.	2012-06-24 19:19:45 +03:00
Heikki Linnakangas	061e7efb1b	Allow WAL record header to be split across pages. This saves a few bytes of WAL space, but the real motivation is to make it predictable how much WAL space a record requires, as it no longer depends on whether we need to waste the last few bytes at end of WAL page because the header doesn't fit. The total length field of WAL record, xl_tot_len, is moved to the beginning of the WAL record header, so that it is still always found on the first page where a WAL record begins. Bump WAL version number again as this is an incompatible change.	2012-06-24 18:35:56 +03:00
Heikki Linnakangas	20ba5ca64c	Move WAL continuation record information to WAL page header. The continuation record only contained one field, xl_rem_len, so it makes things simpler to just include it in the WAL page header. This wastes four bytes on pages that don't begin with a continuation from previos page, plus four bytes on every page, because of padding. The motivation of this is to make it easier to calculate how much space a WAL record needs. Before this patch, it depended on how many page boundaries the record crosses. The motivation of that, in turn, is to separate the allocation of space in the WAL from the copying of the record data to the allocated space. Keeping the calculation of space required simple helps to keep the critical section of allocating the space from WAL short. But that's not included in this patch yet. Bump WAL version number again, as this is an incompatible change.	2012-06-24 18:35:30 +03:00
Heikki Linnakangas	dfda6ebaec	Don't waste the last segment of each 4GB logical log file. The comments claimed that wasting the last segment made it easier to do calculations with XLogRecPtrs, because you don't have problems representing last-byte-position-plus-1 that way. In my experience, however, it only made things more complicated, because the there was two ways to represent the boundary at the beginning of a logical log file: logid = n+1 and xrecoff = 0, or as xlogid = n and xrecoff = 4GB - XLOG_SEG_SIZE. Some functions were picky about which representation was used. Also, use a 64-bit segment number instead of the log/seg combination, to point to a certain WAL segment. We assume that all platforms have a working 64-bit integer type nowadays. This is an incompatible change in WAL format, so bumping WAL version number.	2012-06-24 18:35:29 +03:00
Peter Eisentraut	15b1918e7d	Improve reporting of permission errors for array types Because permissions are assigned to element types, not array types, complaining about permission denied on an array type would be misleading to users. So adjust the reporting to refer to the element type instead. In order not to duplicate the required logic in two dozen places, refactor the permission denied reporting for types a bit. pointed out by Yeb Havinga during the review of the type privilege feature	2012-06-15 22:55:03 +03:00
Robert Haas	8507c2f856	Improve readability and error messages in pg_backup_start_time. Gurjeet Singh, with corrections by me.	2012-06-14 15:20:08 -04:00
Robert Haas	68de499bda	New SQL functons pg_backup_in_progress() and pg_backup_start_time() Darold Gilles, reviewed by Gabriele Bartolini and others, rebased by Marco Nenciarini. Stylistic cleanup and OID fixes by me.	2012-06-14 13:25:43 -04:00
Robert Haas	cd80073445	During transaction cleanup, release locks before deleting files. There's no need to hold onto the locks until the files are needed, and by doing it this way, we reduce the impact on other backends who may be awaiting locks we hold. Noah Misch	2012-06-14 10:19:33 -04:00
Robert Haas	6cd015bea3	Add new function log_newpage_buffer. When I implemented the ginbuildempty() function as part of implementing unlogged tables, I falsified the note in the header comment for log_newpage. Although we could fix that up by changing the comment, it seems cleaner to add a new function which is specifically intended to handle this case. So do that.	2012-06-14 10:11:16 -04:00
Robert Haas	d2c86a1ccd	Remove RELKIND_UNCATALOGED. This may have been important at some point in the past, but it no longer does anything useful. Review by Tom Lane.	2012-06-14 09:47:30 -04:00
Tom Lane	b8b69d8990	Revert "Reduce checkpoints and WAL traffic on low activity database server" This reverts commit `18fb9d8d21`. Per discussion, it does not seem like a good idea to allow committed changes to go un-checkpointed indefinitely, as could happen in a low-traffic server; that makes us entirely reliant on the WAL stream with no redundancy that might aid data recovery in case of disk failure. This re-introduces the original problem of hot-standby setups generating a small continuing stream of WAL traffic even when idle, but there are other ways to address that without compromising crash recovery, so we'll revisit that issue in a future release cycle.	2012-06-13 18:48:44 -04:00
Bruce Momjian	927d61eeff	Run pgindent on 9.2 source tree in preparation for first 9.3 commit-fest.	2012-06-10 15:20:04 -04:00
Tom Lane	ece01aae47	Scan the buffer pool just once, not once per fork, during relation drop. This provides a speedup of about 4X when NBuffers is large enough. There is also a useful reduction in sinval traffic, since we only do CacheInvalidateSmgr() once not once per fork. Simon Riggs, reviewed and somewhat revised by Tom Lane	2012-06-07 17:43:11 -04:00
Simon Riggs	2c8a4e9be2	Wake WALSender to reduce data loss at failover for async commit. WALSender now woken up after each background flush by WALwriter, avoiding multi-second replication delay for an all-async commit workload. Replication delay reduced from 7s with default settings to 200ms and often much less, allowing significantly reduced data loss at failover. Andres Freund and Simon Riggs	2012-06-07 19:22:47 +01:00
Robert Haas	b50991eedb	Fix more crash-safe visibility map bugs, and improve comments. In lazy_scan_heap, we could issue bogus warnings about incorrect information in the visibility map, because we checked the visibility map bit before locking the heap page, creating a race condition. Fix by rechecking the visibility map bit before we complain. Rejigger some related logic so that we rely on the possibly-outdated all_visible_according_to_vm value as little as possible. In heap_multi_insert, it's not safe to clear the visibility map bit before beginning the critical section. The visibility map is not crash-safe unless we treat clearing the bit as a critical operation. Specifically, if the transaction were to error out after we set the bit and before entering the critical section, we could end up writing the heap page to disk (with the bit cleared) and crashing before the visibility map page made it to disk. That would be bad. heap_insert has this correct, but somehow the order of operations got rearranged when heap_multi_insert was added. Also, add some more comments to visibilitymap_test, lazy_scan_heap, and IndexOnlyNext, expounding on concurrency issues. Per extensive code review by Andres Freund, and further review by Tom Lane, who also made the original report about the bogus warnings.	2012-06-07 12:48:13 -04:00
Simon Riggs	d3abbbebe5	Avoid early reuse of btree pages, causing incorrect query results. When we allowed read-only transactions to skip assigning XIDs we introduced the possibility that a fully deleted btree page could be reused. This broke the index link sequence which could then lead to indexscans silently returning fewer rows than would have been correct. The actual incidence of silent errors from this is thought to be very low because of the exact workload required and locking pre-conditions. Fix is to remove pages only if index page opaque->btpo.xact precedes RecentGlobalXmin. Noah Misch, reviewed by Simon Riggs	2012-06-01 12:21:45 +01:00
Tom Lane	a04dc87db1	Improve comment for GetStableLatestTransactionId().	2012-05-31 11:20:02 -04:00
Simon Riggs	a2b516dab9	Only throw recovery conflicts when InHotStandby. Bug fix to recent patch to allow Index Only Scans on Hot Standby. Bug report from Jaime Casanova	2012-05-31 13:11:47 +01:00

1 2 3 4 5 ...

2033 Commits