This patch introduces two additional lock modes for tuples: "SELECT FOR
KEY SHARE" and "SELECT FOR NO KEY UPDATE". These don't block each
other, in contrast with already existing "SELECT FOR SHARE" and "SELECT
FOR UPDATE". UPDATE commands that do not modify the values stored in
the columns that are part of the key of the tuple now grab a SELECT FOR
NO KEY UPDATE lock on the tuple, allowing them to proceed concurrently
with tuple locks of the FOR KEY SHARE variety.
Foreign key triggers now use FOR KEY SHARE instead of FOR SHARE; this
means the concurrency improvement applies to them, which is the whole
point of this patch.
The added tuple lock semantics require some rejiggering of the multixact
module, so that the locking level that each transaction is holding can
be stored alongside its Xid. Also, multixacts now need to persist
across server restarts and crashes, because they can now represent not
only tuple locks, but also tuple updates. This means we need more
careful tracking of lifetime of pg_multixact SLRU files; since they now
persist longer, we require more infrastructure to figure out when they
can be removed. pg_upgrade also needs to be careful to copy
pg_multixact files over from the old server to the new, or at least part
of multixact.c state, depending on the versions of the old and new
servers.
Tuple time qualification rules (HeapTupleSatisfies routines) need to be
careful not to consider tuples with the "is multi" infomask bit set as
being only locked; they might need to look up MultiXact values (i.e.
possibly do pg_multixact I/O) to find out the Xid that updated a tuple,
whereas they previously were assured to only use information readily
available from the tuple header. This is considered acceptable, because
the extra I/O would involve cases that would previously cause some
commands to block waiting for concurrent transactions to finish.
Another important change is the fact that locking tuples that have
previously been updated causes the future versions to be marked as
locked, too; this is essential for correctness of foreign key checks.
This causes additional WAL-logging, also (there was previously a single
WAL record for a locked tuple; now there are as many as updated copies
of the tuple there exist.)
With all this in place, contention related to tuples being checked by
foreign key rules should be much reduced.
As a bonus, the old behavior that a subtransaction grabbing a stronger
tuple lock than the parent (sub)transaction held on a given tuple and
later aborting caused the weaker lock to be lost, has been fixed.
Many new spec files were added for isolation tester framework, to ensure
overall behavior is sane. There's probably room for several more tests.
There were several reviewers of this patch; in particular, Noah Misch
and Andres Freund spent considerable time in it. Original idea for the
patch came from Simon Riggs, after a problem report by Joel Jacobson.
Most code is from me, with contributions from Marti Raudsepp, Alexander
Shulgin, Noah Misch and Andres Freund.
This patch was discussed in several pgsql-hackers threads; the most
important start at the following message-ids:
AANLkTimo9XVcEzfiBR-ut3KVNDkjm2Vxh+t8kAmWjPuv@mail.gmail.com1290721684-sup-3951@alvh.no-ip.org1294953201-sup-2099@alvh.no-ip.org1320343602-sup-2290@alvh.no-ip.org1339690386-sup-8927@alvh.no-ip.org4FE5FF020200002500048A3D@gw.wicourts.gov4FEAB90A0200002500048B7D@gw.wicourts.gov
Commit 8cb53654dbdb4c386369eb988062d0bbb6de725e, which introduced DROP
INDEX CONCURRENTLY, managed to break CREATE INDEX CONCURRENTLY via a poor
choice of catalog state representation. The pg_index state for an index
that's reached the final pre-drop stage was the same as the state for an
index just created by CREATE INDEX CONCURRENTLY. This meant that the
(necessary) change to make RelationGetIndexList ignore about-to-die indexes
also made it ignore freshly-created indexes; which is catastrophic because
the latter do need to be considered in HOT-safety decisions. Failure to
do so leads to incorrect index entries and subsequently wrong results from
queries depending on the concurrently-created index.
To fix, add an additional boolean column "indislive" to pg_index, so that
the freshly-created and about-to-die states can be distinguished. (This
change obviously is only possible in HEAD. This patch will need to be
back-patched, but in 9.2 we'll use a kluge consisting of overloading the
formerly-impossible state of indisvalid = true and indisready = false.)
In addition, change CREATE/DROP INDEX CONCURRENTLY so that the pg_index
flag changes they make without exclusive lock on the index are made via
heap_inplace_update() rather than a normal transactional update. The
latter is not very safe because moving the pg_index tuple could result in
concurrent SnapshotNow scans finding it twice or not at all, thus possibly
resulting in index corruption. This is a pre-existing bug in CREATE INDEX
CONCURRENTLY, which was copied into the DROP code.
In addition, fix various places in the code that ought to check to make
sure that the indexes they are manipulating are valid and/or ready as
appropriate. These represent bugs that have existed since 8.2, since
a failed CREATE INDEX CONCURRENTLY could leave a corrupt or invalid
index behind, and we ought not try to do anything that might fail with
such an index.
Also fix RelationReloadIndexInfo to ensure it copies all the pg_index
columns that are allowed to change after initial creation. Previously we
could have been left with stale values of some fields in an index relcache
entry. It's not clear whether this actually had any user-visible
consequences, but it's at least a bug waiting to happen.
In addition, do some code and docs review for DROP INDEX CONCURRENTLY;
some cosmetic code cleanup but mostly addition and revision of comments.
This will need to be back-patched, but in a noticeably different form,
so I'm committing it to HEAD before working on the back-patch.
Problem reported by Amit Kapila, diagnosis by Pavan Deolassee,
fix by Tom Lane and Andres Freund.
... and have sepgsql use it to determine whether to check permissions
during certain operations. Indexes that are being created as a result
of REINDEX, for instance, do not need to have their permissions checked;
they were already checked when the index was created.
Author: KaiGai Kohei, slightly revised by me
The latter was already the dominant use, and it's preferable because
in C the convention is that intXX means XX bits. Therefore, allowing
mixed use of int2, int4, int8, int16, int32 is obviously confusing.
Remove the typedefs for int2 and int4 for now. They don't seem to be
widely used outside of the PostgreSQL source tree, and the few uses
can probably be cleaned up by the time this ships.
This doesn't do anything useful just yet, but is intended as supporting
infrastructure for allowing sepgsql to sensibly check DROP permissions.
KaiGai Kohei and Robert Haas
In commit 7b0d0e9356963d5c3e4d329a917f5fbb82a2ef05, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one. However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID. The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID. (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)
To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value. This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway. We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.
Per bug #6393 from Maxim Boguk. Back-patch to 9.0, same as the previous
patch.
In the previous coding, a user could queue up for an AccessExclusiveLock
on a table they did not have permission to cluster, thus potentially
interfering with access by authorized users who got stuck waiting behind
the AccessExclusiveLock. This approach avoids that. cluster() has the
same permissions-checking requirements as REINDEX TABLE, so this commit
moves the now-shared callback to tablecmds.c and renames it, per
discussion with Noah Misch.
Previously, renaming a table, sequence, view, index, foreign table,
column, or trigger checked permissions before locking the object, which
meant that if permissions were revoked during the lock wait, we would
still allow the operation. Similarly, if the original object is dropped
and a new one with the same name is created, the operation will be allowed
if we had permissions on the old object; the permissions on the new
object don't matter. All this is now fixed.
Along the way, attempting to rename a trigger on a foreign table now gives
the same error message as trying to create one there in the first place
(i.e. that it's not a table or view) rather than simply stating that no
trigger by that name exists.
Patch by me; review by Noah Misch.
Add a column pg_class.relallvisible to remember the number of pages that
were all-visible according to the visibility map as of the last VACUUM
(or ANALYZE, or some other operations that update pg_class.relpages).
Use relallvisible/relpages, instead of an arbitrary constant, to estimate
how many heap page fetches can be avoided during an index-only scan.
This is pretty primitive and will no doubt see refinements once we've
acquired more field experience with the index-only scan mechanism, but
it's way better than using a constant.
Note: I had to adjust an underspecified query in the window.sql regression
test, because it was changing answers when the plan changed to use an
index-only scan. Some of the adjacent tests perhaps should be adjusted
as well, but I didn't do that here.
This works around the problem that a catalog cache entry might contain a
toast pointer that we try to dereference just as a VACUUM FULL completes
on that catalog. We will see the sinval message on the cache entry when
we acquire lock on the toast table, but by that point we've already told
tuptoaster.c "here's the pointer to fetch", so it's difficult from a code
structural standpoint to update the pointer before we use it. Much less
painful to ensure that toast pointers are not invalidated in the first
place. We have to add a bit of code to deal with the case that a value
that previously wasn't toasted becomes so; but that should be a
seldom-exercised corner case, so the inefficiency shouldn't be significant.
Back-patch to 9.0. In prior versions, we didn't allow CLUSTER on system
catalogs, and VACUUM FULL didn't result in reassignment of toast OIDs, so
there was no problem.
Some callers were creating copies of tuple descriptors to pass to that
function, stating in code comments that it was necessary because it
modified the passed descriptor. Code inspection reveals this not to be
true, and indeed not all callers are passing copies in the first place.
So remove the extra ones and the misleading comments about this behavior
as well.
Truncating or dropping a table is treated like deletion of all tuples, and
check for conflicts accordingly. If a table is clustered or rewritten by
ALTER TABLE, all predicate locks on the heap are promoted to relation-level
locks, because the tuple or page ids of any existing tuples will change and
won't be valid after rewriting the table. Arguably ALTER TABLE should be
treated like a mass-UPDATE of every row, but if you e.g change the datatype
of a column, you could also argue that it's just a change to the physical
layout, not a logical change. Reindexing promotes all locks on the index to
relation-level lock on the heap.
Kevin Grittner, with a lot of cosmetic changes by me.
We must lock out autovacuuming of the old toast table before computing the
OldestXmin horizon we will use. Otherwise, autovacuum could start on the
toast table later, compute a later OldestXmin horizon, and remove as DEAD
toast tuples that we still need (because we think their parent tuples are
only RECENTLY_DEAD). Per further thought about bug #5998.
Per bug #5988, reported by Marko Tiikkaja, and further analyzed by Tom
Lane, the previous coding was broken in several respects: even if the
target table already existed, a subsequent CREATE TABLE IF NOT EXISTS
might try to add additional constraints or sequences-for-serial
specified in the new CREATE TABLE statement.
In passing, this also fixes a minor information leak: it's no longer
possible to figure out whether a schema to which you don't have CREATE
access contains a sequence named like "x_y_seq" by attempting to create a
table in that schema called "x" with a serial column called "y".
Some more refactoring of this code in the future might be warranted,
but that will need to wait for a later major release.
There can never be a need to push the indcheckxmin horizon forward, since
any HOT chains that are actually broken with respect to the index must
pre-date its original creation. So we can just avoid changing pg_index
altogether during a REINDEX operation.
This offers a cleaner solution than my previous patch for the problem
found a few days ago that we mustn't try to update pg_index while we are
reindexing it. System catalog indexes will always be created with
indcheckxmin = false during initdb, and with this modified code we should
never try to change their pg_index entries. This avoids special-casing
system catalogs as the former patch did, and should provide a performance
benefit for many cases where REINDEX formerly caused an index to be
considered unusable for a short time.
Back-patch to 8.3 to cover all versions containing HOT. Note that this
patch changes the API for index_build(), but I believe it is unlikely that
any add-on code is calling that directly.
For what seem entirely historical reasons, a bitmask "flags" argument was
recently added to reindex_relation without subsuming its existing boolean
argument into that bitmask. This seems a bit bizarre, so fold them
together.
This patch adds the server infrastructure to support extensions.
There is still one significant loose end, namely how to make it play nice
with pg_upgrade, so I am not yet committing the changes that would make
all the contrib modules depend on this feature.
In passing, fix a disturbingly large amount of breakage in
AlterObjectNamespace() and callers.
Dimitri Fontaine, reviewed by Anssi Kääriäinen,
Itagaki Takahiro, Tom Lane, and numerous others
Failure to do so can lead to constraint violations. This was broken by
commit 1ddc2703a936d03953657f43345460b9242bbed1 on 2010-02-07, so
back-patch to 9.0.
Noah Misch. Regression test by me.
The only use we have had for amindexnulls is in determining whether an
index is safe to cluster on; but since the addition of the amclusterable
flag, that usage is pretty redundant.
In passing, clean up assorted sloppiness from the last patch that touched
pg_am.h: Natts_pg_am was wrong, and ambuildempty was not documented.
This commit replaces pg_class.relistemp with pg_class.relpersistence;
and also modifies the RangeVar node type to carry relpersistence rather
than istemp. It also removes removes rd_istemp from RelationData and
instead performs the correct computation based on relpersistence.
For clarity, we add three new macros: RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we
can clarify the purpose of each check that previous depended on
rd_istemp.
This is intended as infrastructure for the upcoming unlogged tables
patch, as well as for future possible work on global temporary tables.
This is a heavily revised version of builtin_knngist_core-0.9. The
ordering operators are no longer mixed in with actual quals, which would
have confused not only humans but significant parts of the planner.
Instead, ordering operators are carried separately throughout planning and
execution.
Since the API for ambeginscan and amrescan functions had to be changed
anyway, this commit takes the opportunity to rationalize that a bit.
RelationGetIndexScan no longer forces a premature index_rescan call;
instead, callers of index_beginscan must call index_rescan too. Aside from
making the AM-side initialization logic a bit less peculiar, this has the
advantage that we do not make a useless extra am_rescan call when there are
runtime key values. AMs formerly could not assume that the key values
passed to amrescan were actually valid; now they can.
Teodor Sigaev and Tom Lane
This patch resurrects some of the information that could be logged by the
old, now-dead implementation of VACUUM FULL, in particular counts of live
and dead tuples and the time taken for the table rebuild proper. There's
still no logging about the ensuing index rebuilds, though.
Itagaki Takahiro
archival or hot standby should be WAL-logged, instead of deducing that from
other options like archive_mode. This replaces recovery_connections GUC in
the primary, where it now has no effect, but it's still used in the standby
to enable/disable hot standby.
Remove the WAL-logging of "unlogged operations", like creating an index
without WAL-logging and fsyncing it at the end. Instead, we keep a copy of
the wal_mode setting and the settings that affect how much shared memory a
hot standby server needs to track master transactions (max_connections,
max_prepared_xacts, max_locks_per_xact) in pg_control. Whenever the settings
change, at server restart, write a WAL record noting the new settings and
update pg_control. This allows us to notice the change in those settings in
the standby at the right moment, they used to be included in checkpoint
records, but that meant that a changed value was not reflected in the
standby until the first checkpoint after the change.
Bump PG_CONTROL_VERSION and XLOG_PAGE_MAGIC. Whack XLOG_PAGE_MAGIC back to
the sequence it used to follow, before hot standby and subsequent patches
changed it to 0x9003.
The purpose of this change is to eliminate the need for every caller
of SearchSysCache, SearchSysCacheCopy, SearchSysCacheExists,
GetSysCacheOid, and SearchSysCacheList to know the maximum number
of allowable keys for a syscache entry (currently 4). This will
make it far easier to increase the maximum number of keys in a
future release should we choose to do so, and it makes the code
shorter, too.
Design and review by Tom Lane.
Move rd_targblock, rd_fsm_nblocks, and rd_vm_nblocks from relcache to the smgr
relation entries, so that they will get reset to InvalidBlockNumber whenever
an smgr-level flush happens. Because we now send smgr invalidation messages
immediately (not at end of transaction) when a relation truncation occurs,
this ensures that other backends will reset their values before they next
access the relation. We no longer need the unreliable assumption that a
VACUUM that's doing a truncation will hold its AccessExclusive lock until
commit --- in fact, we can intentionally release that lock as soon as we've
completed the truncation. This patch therefore reverts (most of) Alvaro's
patch of 2009-11-10, as well as my marginal hacking on it yesterday. We can
also get rid of assorted no-longer-needed relcache flushes, which are far more
expensive than an smgr flush because they kill a lot more state.
In passing this patch fixes smgr_redo's failure to perform visibility-map
truncation, and cleans up some rather dubious assumptions in freespace.c and
visibilitymap.c about when rd_fsm_nblocks and rd_vm_nblocks can be out of
date.
as per my recent proposal.
First, teach IndexBuildHeapScan to not wait for INSERT_IN_PROGRESS or
DELETE_IN_PROGRESS tuples to commit unless the index build is checking
uniqueness/exclusion constraints. If it isn't, there's no harm in just
indexing the in-doubt tuple.
Second, modify VACUUM FULL/CLUSTER to suppress reverifying
uniqueness/exclusion constraint properties while rebuilding indexes of
the target relation. This is reasonable because these commands aren't
meant to deal with corrupted-data situations. Constraint properties
will still be rechecked when an index is rebuilt by a REINDEX command.
This gets us out of the problem that new-style VACUUM FULL would often
wait for other transactions while holding exclusive lock on a system
catalog, leading to probable deadlock because those other transactions
need to look at the catalogs too. Although the real ultimate cause of
the problem is a debatable choice to release locks early after modifying
system catalogs, changing that choice would require pretty serious
analysis and is not something to be undertaken lightly or on a tight
schedule. The present patch fixes the problem in a fairly reasonable
way and should also improve the speed of VACUUM FULL/CLUSTER a little bit.
of shared or nailed system catalogs. This has two key benefits:
* The new CLUSTER-based VACUUM FULL can be applied safely to all catalogs.
* We no longer have to use an unsafe reindex-in-place approach for reindexing
shared catalogs.
CLUSTER on nailed catalogs now works too, although I left it disabled on
shared catalogs because the resulting pg_index.indisclustered update would
only be visible in one database.
Since reindexing shared system catalogs is now fully transactional and
crash-safe, the former special cases in REINDEX behavior have been removed;
shared catalogs are treated the same as non-shared.
This commit does not do anything about the recently-discussed problem of
deadlocks between VACUUM FULL/CLUSTER on a system catalog and other
concurrent queries; will address that in a separate patch. As a stopgap,
parallel_schedule has been tweaked to run vacuum.sql by itself, to avoid
such failures during the regression tests.
of old and new toast tables can be done either at the logical level (by
swapping the heaps' reltoastrelid links) or at the physical level (by swapping
the relfilenodes of the toast tables and their indexes). This is necessary
infrastructure for upcoming changes to support CLUSTER/VAC FULL on shared
system catalogs, where we cannot change reltoastrelid. The physical swap
saves a few catalog updates too.
We unfortunately have to keep the logical-level swap logic because in some
cases we will be adding or deleting a toast table, so there's no possibility
of a physical swap. However, that only happens as a consequence of schema
changes in the table, which we do not need to support for system catalogs,
so such cases aren't an obstacle for that.
In passing, refactor the cluster support functions a little bit to eliminate
unnecessarily-duplicated code; and fix the problem that while CLUSTER had
been taught to rename the final toast table at need, ALTER TABLE had not.
relations (they don't live in pg_toast). This caused an Assert failure in
assert-enabled builds. So far as I can see, in a non-assert build it would
only have messed up the checks for conflicting names, so a failure would be
quite improbable but perhaps not impossible.
that would've been WAL-logged if archiving was enabled. If we encounter
such records in archive recovery anyway, we know that some data is
missing from the log. A WARNING is emitted in that case.
Original patch by Fujii Masao, with changes by me.
This includes two new kinds of postmaster processes, walsenders and
walreceiver. Walreceiver is responsible for connecting to the primary server
and streaming WAL to disk, while walsender runs in the primary server and
streams WAL from disk to the client.
Documentation still needs work, but the basics are there. We will probably
pull the replication section to a new chapter later on, as well as the
sections describing file-based replication. But let's do that as a separate
patch, so that it's easier to see what has been added/changed. This patch
also adds a new section to the chapter about FE/BE protocol, documenting the
protocol used by walsender/walreceivxer.
Bump catalog version because of two new functions,
pg_last_xlog_receive_location() and pg_last_xlog_replay_location(), for
monitoring the progress of replication.
Fujii Masao, with additional hacking by me
VACUUM FULL was renamed to VACUUM FULL INPLACE. Also added a new
option -i, --inplace for vacuumdb to perform FULL INPLACE vacuuming.
Since the new VACUUM FULL uses CLUSTER infrastructure, we cannot
use it for system tables. VACUUM FULL for system tables always
fall back into VACUUM FULL INPLACE silently.
Itagaki Takahiro, reviewed by Jeff Davis and Simon Riggs.
the privileges that will be applied to subsequently-created objects.
Such adjustments are always per owning role, and can be restricted to objects
created in particular schemas too. A notable benefit is that users can
override the traditional default privilege settings, eg, the PUBLIC EXECUTE
privilege traditionally granted by default for functions.
Petr Jelinek