1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-16 15:02:33 +03:00
Commit Graph

1221 Commits

Author SHA1 Message Date
Tom Lane
6f906905b1 Fix WaitOnLock() to ensure that the process's "waiting" flag is reset after
erroring out of a wait.  We can use a PG_TRY block for this, but add a comment
explaining why it'd be a bad idea to use it for any other state cleanup.

Back-patch to 8.2.  Prior releases had the same issue, but only with respect
to the process title, which is likely to get reset almost immediately anyway
after the transaction aborts, so it seems not worth changing them.  In 8.2
and HEAD, the pg_stat_activity "waiting" flag could remain set incorrectly
for a long time.

Per report from Gurjeet Singh.
2008-02-02 22:26:17 +00:00
Tom Lane
6322e84430 Change StatementCancelHandler() to check the DoingCommandRead flag to decide
whether to execute an immediate interrupt, rather than testing whether
LockWaitCancel() cancelled a lock wait.  The old way misclassified the case
where we were blocked in ProcWaitForSignal(), and arguably would misclassify
any other future additions of new ImmediateInterruptOK states too.  This
allows reverting the old kluge that gave LockWaitCancel() a return value,
since no callers care anymore.  Improve comments in the various
implementations of PGSemaphoreLock() to explain that on some platforms, the
assumption that semop() exits after a signal is wrong, and so we must ensure
that the signal handler itself throws elog if we want cancel or die interrupts
to be effective.  Per testing related to bug #3883, though this patch doesn't
solve those problems fully.

Perhaps this change should be back-patched, but since pre-8.3 branches aren't
really relying on autovacuum to respond to SIGINT, it doesn't seem critical
for them.
2008-01-26 19:55:08 +00:00
Tom Lane
ceb9360067 Fix CREATE INDEX CONCURRENTLY to not deadlock against an automatic or manual
VACUUM that is blocked waiting to get lock on the table being indexed.
Per report and fix suggestion from Greg Stark.
2008-01-09 21:52:36 +00:00
Tom Lane
da3df47c84 lmgr.c:DescribeLockTag was never taught about virtual xids, per Greg Stark.
Also a couple of minor tweaks to try to future-proof the code a bit better
against future locktag additions.
2008-01-08 23:18:51 +00:00
Bruce Momjian
9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Peter Eisentraut
5ca3d50db7 Clarify log messages 2007-12-13 11:55:44 +00:00
Tom Lane
895a94de6d Avoid incrementing the CommandCounter when CommandCounterIncrement is called
but no database changes have been made since the last CommandCounterIncrement.
This should result in a significant improvement in the number of "commands"
that can typically be performed within a transaction before hitting the 2^32
CommandId size limit.  In particular this buys back (and more) the possible
adverse consequences of my previous patch to fix plan caching behavior.

The implementation requires tracking whether the current CommandCounter
value has been "used" to mark any tuples.  CommandCounter values stored into
snapshots are presumed not to be used for this purpose.  This requires some
small executor changes, since the executor used to conflate the curcid of
the snapshot it was using with the command ID to mark output tuples with.
Separating these concepts allows some small simplifications in executor APIs.

Something for the TODO list: look into having CommandCounterIncrement not do
AcceptInvalidationMessages.  It seems fairly bogus to be doing it there,
but exactly where to do it instead isn't clear, and I'm disinclined to mess
with asynchronous behavior during late beta.
2007-11-30 21:22:54 +00:00
Tom Lane
eae7e00f1f Fix stupid typo in recently-added code :-( 2007-11-16 00:57:55 +00:00
Bruce Momjian
f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Tom Lane
591b9b091c Use ftruncate() not truncate() in mdunlink. Seems Windows doesn't
support the latter.
2007-11-15 21:49:47 +00:00
Bruce Momjian
fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane
6cc4451b5c Prevent re-use of a deleted relation's relfilenode until after the next
checkpoint.  This guards against an unlikely data-loss scenario in which
we re-use the relfilenode, then crash, then replay the deletion and
recreation of the file.  Even then we'd be OK if all insertions into the
new relation had been WAL-logged ... but that's not guaranteed given all
the no-WAL-logging optimizations that have recently been added.

Patch by Heikki Linnakangas, per a discussion last month.
2007-11-15 20:36:40 +00:00
Tom Lane
69500b05d6 Prevent continuing disk-space bloat when profiling (with PROFILE_PID_DIR
enabled) and autovacuum is on.  Since there will be a steady stream of autovac
worker processes exiting and dropping gmon.out files, allowing them to make
separate subdirectories results in serious bloat; and it seems unlikely that
anyone will care about those profiles anyway.  Limit the damage by forcing all
autovac workers to dump in one subdirectory, PGDATA/gprof/avworker/.

Per report from Jšrg Beyer and subsequent discussion.
2007-11-04 17:55:15 +00:00
Alvaro Herrera
acac68b2bc Allow an autovacuum worker to be interrupted automatically when it is found
to be locking another process (except when it's working to prevent Xid
wraparound problems).
2007-10-26 20:45:10 +00:00
Alvaro Herrera
745c1b2c2a Rearrange vacuum-related bits in PGPROC as a bitmask, to better support
having several of them.  Add two more flags: whether the process is
executing an ANALYZE, and whether a vacuum is for Xid wraparound (which
is obviously only set by autovacuum).

Sneakily move the worker's recently-acquired PostAuthDelay to a more useful
place.
2007-10-24 20:55:36 +00:00
Tom Lane
7a315a09dc Dept. of second thoughts: fix loop in BgBufferSync so that the exit when
bgwriter_lru_maxpages is exceeded leaves the loop variables in the
expected state.  In the original coding, we'd fail to advance
next_to_clean, causing that buffer to be probably-uselessly rechecked next
time, and also have an off-by-one idea of the number of buffers scanned.
2007-09-25 22:11:48 +00:00
Tom Lane
6f5c38dcd0 Just-in-time background writing strategy. This code avoids re-scanning
buffers that cannot possibly need to be cleaned, and estimates how many
buffers it should try to clean based on moving averages of recent allocation
requests and density of reusable buffers.  The patch also adds a couple
more columns to pg_stat_bgwriter to help measure the effectiveness of the
bgwriter.

Greg Smith, building on his own work and ideas from several other people,
in particular a much older patch from Itagaki Takahiro.
2007-09-25 20:03:38 +00:00
Tom Lane
1b3d400cac TransactionIdIsInProgress can skip scanning the ProcArray if the target XID is
later than latestCompletedXid, per Florian Pflug.  Also some minor
improvements in the XIDCACHE_DEBUG code --- make sure each call of
TransactionIdIsInProgress is counted one way or another.
2007-09-23 18:50:38 +00:00
Tom Lane
cc59049daf Improve handling of prune/no-prune decisions by storing a page's oldest
unpruned XMAX in its header.  At the cost of 4 bytes per page, this keeps us
from performing heap_page_prune when there's no chance of pruning anything.
Seems to be necessary per Heikki's preliminary performance testing.
2007-09-21 21:25:42 +00:00
Tom Lane
da072ab2ab Make some simple performance improvements in TransactionIdIsInProgress().
For XIDs of our own transaction and subtransactions, it's cheaper to ask
TransactionIdIsCurrentTransactionId() than to look in shared memory.
Also, the xids[] work array is always the same size within any given
process, so malloc it just once instead of doing a palloc/pfree on every
call; aside from being faster this lets us get rid of some goto's, since
we no longer have any end-of-function pfree to do.  Both ideas by Heikki.
2007-09-21 17:36:53 +00:00
Tom Lane
282d2a03dd HOT updates. When we update a tuple without changing any of its indexed
columns, and the new version can be stored on the same heap page, we no longer
generate extra index entries for the new version.  Instead, index searches
follow the HOT-chain links to ensure they find the correct tuple version.

In addition, this patch introduces the ability to "prune" dead tuples on a
per-page basis, without having to do a complete VACUUM pass to recover space.
VACUUM is still needed to clean up dead index entries, however.

Pavan Deolasee, with help from a bunch of other people.
2007-09-20 17:56:33 +00:00
Tom Lane
6889303531 Redefine the lp_flags field of item pointers as having four states, rather
than two independent bits (one of which was never used in heap pages anyway,
or at least hadn't been in a very long time).  This gives us flexibility to
add the HOT notions of redirected and dead item pointers without requiring
anything so klugy as magic values of lp_off and lp_len.  The state values
are chosen so that for the states currently in use (pre-HOT) there is no
change in the physical representation.
2007-09-12 22:10:26 +00:00
Tom Lane
6bd4f401b0 Replace the former method of determining snapshot xmax --- to wit, calling
ReadNewTransactionId from GetSnapshotData --- with a "latestCompletedXid"
variable that is updated during transaction commit or abort.  Since
latestCompletedXid is written only in places that had to lock ProcArrayLock
exclusively anyway, and is read only in places that had to lock ProcArrayLock
shared anyway, it adds no new locking requirements to the system despite being
cluster-wide.  Moreover, removing ReadNewTransactionId from snapshot
acquisition eliminates the need to take both XidGenLock and ProcArrayLock at
the same time.  Since XidGenLock is sometimes held across I/O this can be a
significant win.  Some preliminary benchmarking suggested that this patch has
no effect on average throughput but can significantly improve the worst-case
transaction times seen in pgbench.  Concept by Florian Pflug, implementation
by Tom Lane.
2007-09-08 20:31:15 +00:00
Tom Lane
0a51e7073c Don't take ProcArrayLock while exiting a transaction that has no XID; there is
no need for serialization against snapshot-taking because the xact doesn't
affect anyone else's snapshot anyway.  Per discussion.  Also, move various
info about the interlocking of transactions and snapshots out of code comments
and into a hopefully-more-cohesive discussion in access/transam/README.

Also, remove a couple of now-obsolete comments about having to force some WAL
to be written to persuade RecordTransactionCommit to do its thing.
2007-09-07 20:59:26 +00:00
Tom Lane
cd1aae5864 Allow CREATE INDEX CONCURRENTLY to disregard transactions in other
databases, per gripe from hubert depesz lubaczewski.  Patch from
Simon Riggs.
2007-09-07 00:58:57 +00:00
Tom Lane
0ecb4ea773 Volatile-qualify the ProcArray PGPROC pointer in a bunch of routines
that examine fields that could change under them.  This is just to make
really sure that when we are fetching a value 'only once', that's what
actually happens.  Possibly this is a bug that should be back-patched,
but in the absence of solid evidence that it's needed, I won't bother.
2007-09-05 21:11:19 +00:00
Tom Lane
295e63983d Implement lazy XID allocation: transactions that do not modify any database
rows will normally never obtain an XID at all.  We already did things this way
for subtransactions, but this patch extends the concept to top-level
transactions.  In applications where there are lots of short read-only
transactions, this should improve performance noticeably; not so much from
removal of the actual XID-assignments, as from reduction of overhead that's
driven by the rate of XID consumption.  We add a concept of a "virtual
transaction ID" so that active transactions can be uniquely identified even
if they don't have a regular XID.  This is a much lighter-weight concept:
uniqueness of VXIDs is only guaranteed over the short term, and no on-disk
record is made about them.

Florian Pflug, with some editorialization by Tom.
2007-09-05 18:10:48 +00:00
Tom Lane
24d4517b3b Improve behavior of log_lock_waits patch. Ensure that something gets logged
even if the "deadlock detected" ERROR message is suppressed by an exception
catcher.  Be clearer about the event sequence when a soft deadlock is fixed:
the fixing process might or might not still have to wait, so log that
separately.  Fix race condition when someone releases us from the lock partway
through printing all this junk --- we'd not get confused about our state, but
the log message sequence could have been misleading, ie, a "still waiting"
message with no subsequent "acquired" message.  Greg Stark and Tom Lane.
2007-08-28 03:23:44 +00:00
Tom Lane
e4f4a7f5a4 Remove FileUnlink(), which wasn't being used anywhere and interacted poorly
with the recent patch to log temp file sizes at removal time.  Doesn't seem
worth fixing since it's unused.
In passing, make a few elog messages conform to the message style guide.
2007-07-26 15:15:18 +00:00
Tom Lane
82eed4dba2 Arrange to put TOAST tables belonging to temporary tables into special schemas
named pg_toast_temp_nnn, alongside the pg_temp_nnn schemas used for the temp
tables themselves.  This allows low-level code such as the relcache to
recognize that these tables are indeed temporary, which enables various
optimizations such as not WAL-logging changes and using local rather than
shared buffers for access.  Aside from obvious performance benefits, this
provides a solution to bug #3483, in which other backends unexpectedly held
open file references to temporary tables.  The scheme preserves the property
that TOAST tables are not in any schema that's normally in the search path,
so they don't conflict with user table names.

initdb forced because of changes in system view definitions.
2007-07-25 22:16:18 +00:00
Tom Lane
fdb5b69e9c Suppress warning when compiling with -DPROFILE_PID_DIR: sys/stat.h is
supposed to be included when using mkdir().
2007-07-25 19:58:56 +00:00
Tom Lane
04fbe29a83 Fix WAL replay of truncate operations to cope with the possibility that the
truncated relation was deleted later in the WAL sequence.  Since replay
normally auto-creates a relation upon its first reference by a WAL log entry,
failure is seen only if the truncate entry happens to be the first reference
after the checkpoint we're restarting from; which is a pretty unusual case but
of course not impossible.  Fix by making truncate entries auto-create like
the other ones do.  Per report and test case from Dharmendra Goyal.
2007-07-20 16:29:53 +00:00
Tom Lane
82b3684672 Add comments spelling out why it's a good idea to release multiple
partition locks in reverse order.
2007-07-16 21:09:50 +00:00
Tom Lane
b09cb0cf12 Remove the pgstat_drop_relation() call from smgr_internal_unlink(), because
we don't know at that point which relation OID to tell pgstat to forget.
The code was passing the relfilenode, which is incorrect, and could possibly
cause some other relation's stats to be zeroed out.  While we could try to
clean this up, it seems much simpler and more reliable to let the next
invocation of pgstat_vacuum_tabstat() fix things; which indeed is how it
worked before I introduced the buggy code into 8.1.3 and later :-(.
Problem noticed by Itagaki Takahiro, fix is per subsequent discussion.
2007-07-08 22:23:16 +00:00
Tom Lane
83aaebba63 Fix incorrect comment about the timing of AbsorbFsyncRequests() during
checkpoint.  The comment claimed that we could do this anytime after
setting the checkpoint REDO point, but actually BufferSync is relying
on the assumption that buffers dumped by other backends will be fsync'd
too.  So we really could not do it any sooner than we are doing it.
2007-07-03 14:51:24 +00:00
Tom Lane
beba73763b Fix comments not updated in recent patch. 2007-07-01 02:22:23 +00:00
Tom Lane
9fc25c0511 Improve logging of checkpoints. Patch by Greg Smith, worked over
by Heikki and a little bit by me.
2007-06-30 19:12:02 +00:00
Alvaro Herrera
10af02b912 Arrange for SIGINT in autovacuum workers to cancel the current table and
continue with the schedule.  Change current uses of SIGINT to abort a worker
into SIGTERM, which keeps the old behaviour of terminating the process.

Patch from ITAGAKI Takahiro, with some editorializing of my own.
2007-06-29 17:07:39 +00:00
Tom Lane
867e2c91a0 Implement "distributed" checkpoints in which the checkpoint I/O is spread
over a fairly long period of time, rather than being spat out in a burst.
This happens only for background checkpoints carried out by the bgwriter;
other cases, such as a shutdown checkpoint, are still done at full speed.

Remove the "all buffers" scan in the bgwriter, and associated stats
infrastructure, since this seems no longer very useful when the checkpoint
itself is properly throttled.

Original patch by Itagaki Takahiro, reworked by Heikki Linnakangas,
and some minor API editorialization by me.
2007-06-28 00:02:40 +00:00
Tom Lane
9cce91dba0 Only log 'process acquired lock' if we actually did get the lock. This
test seems inessential right now since the only control path for not
getting the lock is via CHECK_FOR_INTERRUPTS which won't return control
to ProcSleep, but it would be important if we ever allow the deadlock
code to kill someone else's transaction instead of our own.
2007-06-19 22:01:15 +00:00
Tom Lane
6e07228728 Code review for log_lock_waits patch. Don't try to issue log messages from
within a signal handler (this might be safe given the relatively narrow code
range in which the interrupt is enabled, but it seems awfully risky); do issue
more informative log messages that tell what is being waited for and the exact
length of the wait; minor other code cleanup.  Greg Stark and Tom Lane
2007-06-19 20:13:22 +00:00
Tom Lane
de6a6383a7 Update obsolete comment: it's no longer the case that mdread() will allow
reads beyond EOF, except by special coercion.
2007-06-18 00:47:20 +00:00
Tom Lane
e976fd43c6 Add some simple defenses against null fields in pg_largeobject, and add
comments noting that there's an alignment assumption now that the data
field could be in 1-byte-header format.  Per discussion with Greg Stark.
2007-06-12 19:46:24 +00:00
Tom Lane
a04a423599 Arrange for large sequential scans to synchronize with each other, so that
when multiple backends are scanning the same relation concurrently, each page
is (ideally) read only once.

Jeff Davis, with review by Heikki and Tom.
2007-06-08 18:23:53 +00:00
Tom Lane
6d6d14b6d5 Redefine IsTransactionState() to only return true for TRANS_INPROGRESS state,
which is the only state in which it's safe to initiate database queries.
It turns out that all but two of the callers thought that's what it meant;
and the other two were using it as a proxy for "will GetTopTransactionId()
return a nonzero XID"?  Since it was in fact an unreliable guide to that,
make those two just invoke GetTopTransactionId() always, then deal with a
zero result if they get one.
2007-06-07 21:45:59 +00:00
Tom Lane
24ee8af573 Rework temp_tablespaces patch so that temp tablespaces are assigned separately
for each temp file, rather than once per sort or hashjoin; this allows
spreading the data of a large sort or join across multiple tablespaces.
(I remain dubious that this will make any difference in practice, but certain
people insisted.)  Arrange to cache the results of parsing the GUC variable
instead of recomputing from scratch on every demand, and push usage of the
cache down to the bottommost fd.c level.
2007-06-07 19:19:57 +00:00
Tom Lane
acfce502ba Create a GUC parameter temp_tablespaces that allows selection of the
tablespace(s) in which to store temp tables and temporary files.  This is a
list to allow spreading the load across multiple tablespaces (a random list
element is chosen each time a temp object is to be created).  Temp files are
not stored in per-database pgsql_tmp/ directories anymore, but per-tablespace
directories.

Jaime Casanova and Albert Cervera, with review by Bernd Helmle and Tom Lane.
2007-06-03 17:08:34 +00:00
Tom Lane
964ec46cfe Fix aboriginal bug in BufFileDumpBuffer that would cause it to write the
wrong data when dumping a bufferload that crosses a component-file boundary.
This probably has not been seen in the wild because (a) component files are
normally 1GB apiece and (b) non-block-aligned buffer usage is relatively
rare.  But it's fairly easy to reproduce a problem if one reduces RELSEG_SIZE
in a test build.  Kudos to Kurt Harriman for spotting the bug.
2007-06-01 23:43:11 +00:00
Tom Lane
bd0a260928 Make CREATE/DROP/RENAME DATABASE wait a little bit to see if other backends
will exit before failing because of conflicting DB usage.  Per discussion,
this seems a good idea to help mask the fact that backend exit takes nonzero
time.  Remove a couple of thereby-obsoleted sleeps in contrib and PL
regression test sequences.
2007-06-01 19:38:07 +00:00
Tom Lane
d526575f89 Make large sequential scans and VACUUMs work in a limited-size "ring" of
buffers, rather than blowing out the whole shared-buffer arena.  Aside from
avoiding cache spoliation, this fixes the problem that VACUUM formerly tended
to cause a WAL flush for every page it modified, because we had it hacked to
use only a single buffer.  Those flushes will now occur only once per
ring-ful.  The exact ring size, and the threshold for seqscans to switch into
the ring usage pattern, remain under debate; but the infrastructure seems
done.  The key bit of infrastructure is a new optional BufferAccessStrategy
object that can be passed to ReadBuffer operations; this replaces the former
StrategyHintVacuum API.

This patch also changes the buffer usage-count methodology a bit: we now
advance usage_count when first pinning a buffer, rather than when last
unpinning it.  To preserve the behavior that a buffer's lifetime starts to
decrease when it's released, the clock sweep code is modified to not decrement
usage_count of pinned buffers.

Work not done in this commit: teach GiST and GIN indexes to use the vacuum
BufferAccessStrategy for vacuum-driven fetches.

Original patch by Simon, reworked by Heikki and again by Tom.
2007-05-30 20:12:03 +00:00