postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-16 15:02:33 +03:00

Author	SHA1	Message	Date
Tom Lane	591b9b091c	Use ftruncate() not truncate() in mdunlink. Seems Windows doesn't support the latter.	2007-11-15 21:49:47 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	6cc4451b5c	Prevent re-use of a deleted relation's relfilenode until after the next checkpoint. This guards against an unlikely data-loss scenario in which we re-use the relfilenode, then crash, then replay the deletion and recreation of the file. Even then we'd be OK if all insertions into the new relation had been WAL-logged ... but that's not guaranteed given all the no-WAL-logging optimizations that have recently been added. Patch by Heikki Linnakangas, per a discussion last month.	2007-11-15 20:36:40 +00:00
Tom Lane	69500b05d6	Prevent continuing disk-space bloat when profiling (with PROFILE_PID_DIR enabled) and autovacuum is on. Since there will be a steady stream of autovac worker processes exiting and dropping gmon.out files, allowing them to make separate subdirectories results in serious bloat; and it seems unlikely that anyone will care about those profiles anyway. Limit the damage by forcing all autovac workers to dump in one subdirectory, PGDATA/gprof/avworker/. Per report from Jörg Beyer and subsequent discussion.	2007-11-04 17:55:15 +00:00
Alvaro Herrera	acac68b2bc	Allow an autovacuum worker to be interrupted automatically when it is found to be locking another process (except when it's working to prevent Xid wraparound problems).	2007-10-26 20:45:10 +00:00
Alvaro Herrera	745c1b2c2a	Rearrange vacuum-related bits in PGPROC as a bitmask, to better support having several of them. Add two more flags: whether the process is executing an ANALYZE, and whether a vacuum is for Xid wraparound (which is obviously only set by autovacuum). Sneakily move the worker's recently-acquired PostAuthDelay to a more useful place.	2007-10-24 20:55:36 +00:00
Tom Lane	7a315a09dc	Dept. of second thoughts: fix loop in BgBufferSync so that the exit when bgwriter_lru_maxpages is exceeded leaves the loop variables in the expected state. In the original coding, we'd fail to advance next_to_clean, causing that buffer to be probably-uselessly rechecked next time, and also have an off-by-one idea of the number of buffers scanned.	2007-09-25 22:11:48 +00:00
Tom Lane	6f5c38dcd0	Just-in-time background writing strategy. This code avoids re-scanning buffers that cannot possibly need to be cleaned, and estimates how many buffers it should try to clean based on moving averages of recent allocation requests and density of reusable buffers. The patch also adds a couple more columns to pg_stat_bgwriter to help measure the effectiveness of the bgwriter. Greg Smith, building on his own work and ideas from several other people, in particular a much older patch from Itagaki Takahiro.	2007-09-25 20:03:38 +00:00
Tom Lane	1b3d400cac	TransactionIdIsInProgress can skip scanning the ProcArray if the target XID is later than latestCompletedXid, per Florian Pflug. Also some minor improvements in the XIDCACHE_DEBUG code --- make sure each call of TransactionIdIsInProgress is counted one way or another.	2007-09-23 18:50:38 +00:00
Tom Lane	cc59049daf	Improve handling of prune/no-prune decisions by storing a page's oldest unpruned XMAX in its header. At the cost of 4 bytes per page, this keeps us from performing heap_page_prune when there's no chance of pruning anything. Seems to be necessary per Heikki's preliminary performance testing.	2007-09-21 21:25:42 +00:00
Tom Lane	da072ab2ab	Make some simple performance improvements in TransactionIdIsInProgress(). For XIDs of our own transaction and subtransactions, it's cheaper to ask TransactionIdIsCurrentTransactionId() than to look in shared memory. Also, the xids[] work array is always the same size within any given process, so malloc it just once instead of doing a palloc/pfree on every call; aside from being faster this lets us get rid of some goto's, since we no longer have any end-of-function pfree to do. Both ideas by Heikki.	2007-09-21 17:36:53 +00:00
Tom Lane	282d2a03dd	HOT updates. When we update a tuple without changing any of its indexed columns, and the new version can be stored on the same heap page, we no longer generate extra index entries for the new version. Instead, index searches follow the HOT-chain links to ensure they find the correct tuple version. In addition, this patch introduces the ability to "prune" dead tuples on a per-page basis, without having to do a complete VACUUM pass to recover space. VACUUM is still needed to clean up dead index entries, however. Pavan Deolasee, with help from a bunch of other people.	2007-09-20 17:56:33 +00:00
Tom Lane	6889303531	Redefine the lp_flags field of item pointers as having four states, rather than two independent bits (one of which was never used in heap pages anyway, or at least hadn't been in a very long time). This gives us flexibility to add the HOT notions of redirected and dead item pointers without requiring anything so klugy as magic values of lp_off and lp_len. The state values are chosen so that for the states currently in use (pre-HOT) there is no change in the physical representation.	2007-09-12 22:10:26 +00:00
Tom Lane	6bd4f401b0	Replace the former method of determining snapshot xmax --- to wit, calling ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid" variable that is updated during transaction commit or abort. Since latestCompletedXid is written only in places that had to lock ProcArrayLock exclusively anyway, and is read only in places that had to lock ProcArrayLock shared anyway, it adds no new locking requirements to the system despite being cluster-wide. Moreover, removing ReadNewTransactionId from snapshot acquisition eliminates the need to take both XidGenLock and ProcArrayLock at the same time. Since XidGenLock is sometimes held across I/O this can be a significant win. Some preliminary benchmarking suggested that this patch has no effect on average throughput but can significantly improve the worst-case transaction times seen in pgbench. Concept by Florian Pflug, implementation by Tom Lane.	2007-09-08 20:31:15 +00:00
Tom Lane	0a51e7073c	Don't take ProcArrayLock while exiting a transaction that has no XID; there is no need for serialization against snapshot-taking because the xact doesn't affect anyone else's snapshot anyway. Per discussion. Also, move various info about the interlocking of transactions and snapshots out of code comments and into a hopefully-more-cohesive discussion in access/transam/README. Also, remove a couple of now-obsolete comments about having to force some WAL to be written to persuade RecordTransactionCommit to do its thing.	2007-09-07 20:59:26 +00:00
Tom Lane	cd1aae5864	Allow CREATE INDEX CONCURRENTLY to disregard transactions in other databases, per gripe from hubert depesz lubaczewski. Patch from Simon Riggs.	2007-09-07 00:58:57 +00:00
Tom Lane	0ecb4ea773	Volatile-qualify the ProcArray PGPROC pointer in a bunch of routines that examine fields that could change under them. This is just to make really sure that when we are fetching a value 'only once', that's what actually happens. Possibly this is a bug that should be back-patched, but in the absence of solid evidence that it's needed, I won't bother.	2007-09-05 21:11:19 +00:00
Tom Lane	295e63983d	Implement lazy XID allocation: transactions that do not modify any database rows will normally never obtain an XID at all. We already did things this way for subtransactions, but this patch extends the concept to top-level transactions. In applications where there are lots of short read-only transactions, this should improve performance noticeably; not so much from removal of the actual XID-assignments, as from reduction of overhead that's driven by the rate of XID consumption. We add a concept of a "virtual transaction ID" so that active transactions can be uniquely identified even if they don't have a regular XID. This is a much lighter-weight concept: uniqueness of VXIDs is only guaranteed over the short term, and no on-disk record is made about them. Florian Pflug, with some editorialization by Tom.	2007-09-05 18:10:48 +00:00
Tom Lane	24d4517b3b	Improve behavior of log_lock_waits patch. Ensure that something gets logged even if the "deadlock detected" ERROR message is suppressed by an exception catcher. Be clearer about the event sequence when a soft deadlock is fixed: the fixing process might or might not still have to wait, so log that separately. Fix race condition when someone releases us from the lock partway through printing all this junk --- we'd not get confused about our state, but the log message sequence could have been misleading, ie, a "still waiting" message with no subsequent "acquired" message. Greg Stark and Tom Lane.	2007-08-28 03:23:44 +00:00
Tom Lane	e4f4a7f5a4	Remove FileUnlink(), which wasn't being used anywhere and interacted poorly with the recent patch to log temp file sizes at removal time. Doesn't seem worth fixing since it's unused. In passing, make a few elog messages conform to the message style guide.	2007-07-26 15:15:18 +00:00
Tom Lane	82eed4dba2	Arrange to put TOAST tables belonging to temporary tables into special schemas named pg_toast_temp_nnn, alongside the pg_temp_nnn schemas used for the temp tables themselves. This allows low-level code such as the relcache to recognize that these tables are indeed temporary, which enables various optimizations such as not WAL-logging changes and using local rather than shared buffers for access. Aside from obvious performance benefits, this provides a solution to bug #3483, in which other backends unexpectedly held open file references to temporary tables. The scheme preserves the property that TOAST tables are not in any schema that's normally in the search path, so they don't conflict with user table names. initdb forced because of changes in system view definitions.	2007-07-25 22:16:18 +00:00
Tom Lane	fdb5b69e9c	Suppress warning when compiling with -DPROFILE_PID_DIR: sys/stat.h is supposed to be included when using mkdir().	2007-07-25 19:58:56 +00:00
Tom Lane	04fbe29a83	Fix WAL replay of truncate operations to cope with the possibility that the truncated relation was deleted later in the WAL sequence. Since replay normally auto-creates a relation upon its first reference by a WAL log entry, failure is seen only if the truncate entry happens to be the first reference after the checkpoint we're restarting from; which is a pretty unusual case but of course not impossible. Fix by making truncate entries auto-create like the other ones do. Per report and test case from Dharmendra Goyal.	2007-07-20 16:29:53 +00:00
Tom Lane	82b3684672	Add comments spelling out why it's a good idea to release multiple partition locks in reverse order.	2007-07-16 21:09:50 +00:00
Tom Lane	b09cb0cf12	Remove the pgstat_drop_relation() call from smgr_internal_unlink(), because we don't know at that point which relation OID to tell pgstat to forget. The code was passing the relfilenode, which is incorrect, and could possibly cause some other relation's stats to be zeroed out. While we could try to clean this up, it seems much simpler and more reliable to let the next invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it worked before I introduced the buggy code into 8.1.3 and later :-(. Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.	2007-07-08 22:23:16 +00:00
Tom Lane	83aaebba63	Fix incorrect comment about the timing of AbsorbFsyncRequests() during checkpoint. The comment claimed that we could do this anytime after setting the checkpoint REDO point, but actually BufferSync is relying on the assumption that buffers dumped by other backends will be fsync'd too. So we really could not do it any sooner than we are doing it.	2007-07-03 14:51:24 +00:00
Tom Lane	beba73763b	Fix comments not updated in recent patch.	2007-07-01 02:22:23 +00:00
Tom Lane	9fc25c0511	Improve logging of checkpoints. Patch by Greg Smith, worked over by Heikki and a little bit by me.	2007-06-30 19:12:02 +00:00
Alvaro Herrera	10af02b912	Arrange for SIGINT in autovacuum workers to cancel the current table and continue with the schedule. Change current uses of SIGINT to abort a worker into SIGTERM, which keeps the old behaviour of terminating the process. Patch from ITAGAKI Takahiro, with some editorializing of my own.	2007-06-29 17:07:39 +00:00
Tom Lane	867e2c91a0	Implement "distributed" checkpoints in which the checkpoint I/O is spread over a fairly long period of time, rather than being spat out in a burst. This happens only for background checkpoints carried out by the bgwriter; other cases, such as a shutdown checkpoint, are still done at full speed. Remove the "all buffers" scan in the bgwriter, and associated stats infrastructure, since this seems no longer very useful when the checkpoint itself is properly throttled. Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas, and some minor API editorialization by me.	2007-06-28 00:02:40 +00:00
Tom Lane	9cce91dba0	Only log 'process acquired lock' if we actually did get the lock. This test seems inessential right now since the only control path for not getting the lock is via CHECK_FOR_INTERRUPTS which won't return control to ProcSleep, but it would be important if we ever allow the deadlock code to kill someone else's transaction instead of our own.	2007-06-19 22:01:15 +00:00
Tom Lane	6e07228728	Code review for log_lock_waits patch. Don't try to issue log messages from within a signal handler (this might be safe given the relatively narrow code range in which the interrupt is enabled, but it seems awfully risky); do issue more informative log messages that tell what is being waited for and the exact length of the wait; minor other code cleanup. Greg Stark and Tom Lane	2007-06-19 20:13:22 +00:00
Tom Lane	de6a6383a7	Update obsolete comment: it's no longer the case that mdread() will allow reads beyond EOF, except by special coercion.	2007-06-18 00:47:20 +00:00
Tom Lane	e976fd43c6	Add some simple defenses against null fields in pg_largeobject, and add comments noting that there's an alignment assumption now that the data field could be in 1-byte-header format. Per discussion with Greg Stark.	2007-06-12 19:46:24 +00:00
Tom Lane	a04a423599	Arrange for large sequential scans to synchronize with each other, so that when multiple backends are scanning the same relation concurrently, each page is (ideally) read only once. Jeff Davis, with review by Heikki and Tom.	2007-06-08 18:23:53 +00:00
Tom Lane	6d6d14b6d5	Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state, which is the only state in which it's safe to initiate database queries. It turns out that all but two of the callers thought that's what it meant; and the other two were using it as a proxy for "will GetTopTransactionId() return a nonzero XID"? Since it was in fact an unreliable guide to that, make those two just invoke GetTopTransactionId() always, then deal with a zero result if they get one.	2007-06-07 21:45:59 +00:00
Tom Lane	24ee8af573	Rework temp_tablespaces patch so that temp tablespaces are assigned separately for each temp file, rather than once per sort or hashjoin; this allows spreading the data of a large sort or join across multiple tablespaces. (I remain dubious that this will make any difference in practice, but certain people insisted.) Arrange to cache the results of parsing the GUC variable instead of recomputing from scratch on every demand, and push usage of the cache down to the bottommost fd.c level.	2007-06-07 19:19:57 +00:00
Tom Lane	acfce502ba	Create a GUC parameter temp_tablespaces that allows selection of the tablespace(s) in which to store temp tables and temporary files. This is a list to allow spreading the load across multiple tablespaces (a random list element is chosen each time a temp object is to be created). Temp files are not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace directories. Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.	2007-06-03 17:08:34 +00:00
Tom Lane	964ec46cfe	Fix aboriginal bug in BufFileDumpBuffer that would cause it to write the wrong data when dumping a bufferload that crosses a component-file boundary. This probably has not been seen in the wild because (a) component files are normally 1GB apiece and (b) non-block-aligned buffer usage is relatively rare. But it's fairly easy to reproduce a problem if one reduces RELSEG_SIZE in a test build. Kudos to Kurt Harriman for spotting the bug.	2007-06-01 23:43:11 +00:00
Tom Lane	bd0a260928	Make CREATE/DROP/RENAME DATABASE wait a little bit to see if other backends will exit before failing because of conflicting DB usage. Per discussion, this seems a good idea to help mask the fact that backend exit takes nonzero time. Remove a couple of thereby-obsoleted sleeps in contrib and PL regression test sequences.	2007-06-01 19:38:07 +00:00
Tom Lane	d526575f89	Make large sequential scans and VACUUMs work in a limited-size "ring" of buffers, rather than blowing out the whole shared-buffer arena. Aside from avoiding cache spoliation, this fixes the problem that VACUUM formerly tended to cause a WAL flush for every page it modified, because we had it hacked to use only a single buffer. Those flushes will now occur only once per ring-ful. The exact ring size, and the threshold for seqscans to switch into the ring usage pattern, remain under debate; but the infrastructure seems done. The key bit of infrastructure is a new optional BufferAccessStrategy object that can be passed to ReadBuffer operations; this replaces the former StrategyHintVacuum API. This patch also changes the buffer usage-count methodology a bit: we now advance usage_count when first pinning a buffer, rather than when last unpinning it. To preserve the behavior that a buffer's lifetime starts to decrease when it's released, the clock sweep code is modified to not decrement usage_count of pinned buffers. Work not done in this commit: teach GiST and GIN indexes to use the vacuum BufferAccessStrategy for vacuum-driven fetches. Original patch by Simon, reworked by Heikki and again by Tom.	2007-05-30 20:12:03 +00:00
Tom Lane	77947c51c0	Fix up pgstats counting of live and dead tuples to recognize that committed and aborted transactions have different effects; also teach it not to assume that prepared transactions are always committed. Along the way, simplify the pgstats API by tying counting directly to Relations; I cannot detect any redeeming social value in having stats pointers in HeapScanDesc and IndexScanDesc structures. And fix a few corner cases in which counts might be missed because the relation's pgstat_info pointer hadn't been set.	2007-05-27 03:50:39 +00:00
Tom Lane	63735ca815	Dept. of second thoughts: add comments cautioning against using ReadOrZeroBuffer to fetch pages from beyond physical EOF. This would usually work, but would cause problems for md.c if writes occurred beyond a segment boundary when the previous segment file hadn't been fully extended.	2007-05-02 23:34:48 +00:00
Tom Lane	8c3cc86e7b	During WAL recovery, when reading a page that we intend to overwrite completely from the WAL data, don't bother to physically read it; just have bufmgr.c return a zeroed-out buffer instead. This speeds recovery significantly, and also avoids unnecessary failures when a page-to-be-overwritten has corrupt page headers on disk. This replaces a former kluge that accomplished the latter by pretending zero_damaged_pages was always ON during WAL recovery; which was OK when the kluge was put in, but is unsafe when restoring a WAL log that was written with full_page_writes off. Heikki Linnakangas	2007-05-02 23:18:03 +00:00
Bruce Momjian	1c8302cab3	Add comment on why deadlock detection error messages only prints numbers.	2007-04-20 20:15:52 +00:00
Alvaro Herrera	e2a186b03c	Add a multi-worker capability to autovacuum. This allows multiple worker processes to be running simultaneously. Also, now autovacuum processes do not count towards the max_connections limit; they are counted separately from regular processes, and are limited by the new GUC variable autovacuum_max_workers. The launcher now has intelligence to launch workers on each database every autovacuum_naptime seconds, limited only on the max amount of worker slots available. Also, the global worker I/O utilization is limited by the vacuum cost-based delay feature. Workers are "balanced" so that the total I/O consumption does not exceed the established limit. This part of the patch was contributed by ITAGAKI Takahiro. Per discussion.	2007-04-16 18:30:04 +00:00
Tom Lane	995ba280c1	Rearrange mdsync() looping logic to avoid the problem that a sufficiently fast flow of new fsync requests can prevent mdsync() from ever completing. This was an unforeseen consequence of a patch added in Mar 2006 to prevent the fsync request queue from overflowing. Problem identified by Heikki Linnakangas and independently by ITAGAKI Takahiro; fix based on ideas from Takahiro-san, Heikki, and Tom. Back-patch as far as 8.1 because a previous back-patch introduced the problem into 8.1 ...	2007-04-12 17:10:55 +00:00
Tom Lane	3e23b68dac	Support varlena fields with single-byte headers and unaligned storage. This commit breaks any code that assumes that the mere act of forming a tuple (without writing it to disk) does not "toast" any fields. While all available regression tests pass, I'm not totally sure that we've fixed every nook and cranny, especially in contrib. Greg Stark with some help from Tom Lane	2007-04-06 04:21:44 +00:00
Tom Lane	9c9b619473	Remove the CheckpointStartLock in favor of having backends show whether they are in their commit critical sections via flags in the ProcArray. Checkpoint can watch the ProcArray to determine when it's safe to proceed. This is a considerably better solution to the original problem of race conditions between checkpoint and transaction commit: it speeds up commit, since there's one less lock to fool with, and it prevents the problem of checkpoint being delayed indefinitely when there's a constant flow of commits. Heikki, with some kibitzing from Tom.	2007-04-03 16:34:36 +00:00
Magnus Hagander	335feca441	Add some instrumentation to the bgwriter, through the stats collector. New view pg_stat_bgwriter, and the functions required to build it.	2007-03-30 18:34:56 +00:00


1012 Commits