1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-09 06:21:09 +03:00
Commit Graph

1938 Commits

Author SHA1 Message Date
Heikki Linnakangas
2502f45979 When a GiST page is split during index build, it might not have a buffer.
Previously it was thought that it's impossible as the code stands, because
insertions create buffers as tuples are cascaded downwards, and index
split also creaters buffers eagerly for all halves. But the example from
Jay Levitt demonstrates that it can happen, when the root page is split.
It's in fact OK if the buffer doesn't exist, so we just need to remove the
sanity check. In fact, we've been discussing the possibility of destroying
empty buffers to conserve memory, which would render the sanity check
completely useless anyway.

Fix by Alexander Korotkov
2012-03-02 13:16:09 +02:00
Peter Eisentraut
973e9fb294 Add const qualifiers where they are accidentally cast away
This only produces warnings under -Wcast-qual, but it's more correct
and consistent in any case.
2012-02-28 12:42:08 +02:00
Tom Lane
1b630751d0 Fix some more bugs in GIN's WAL replay logic.
In commit 4016bdef8a I fixed a bunch of
ginxlog.c bugs having to do with not handling XLogReadBuffer failures
correctly.  However, in ginRedoUpdateMetapage and ginRedoDeleteListPages,
I unaccountably thought that failure to read the metapage would be
impossible and just put in an elog(PANIC) call.  This is of course wrong:
failure is exactly what will happen if the index got dropped (or rebuilt)
between creation of the WAL record and the crash we're trying to recover
from.  I believe this explains Nicholas Wilson's recent report of these
errors getting reached.

Also, fix memory leak in forgetIncompleteSplit.  This wasn't of much
concern when the code was written, but in a long-running standby server
page split records could be expected to accumulate indefinitely.

Back-patch to 8.4 --- before that, GIN didn't have a metapage.
2012-02-26 15:12:17 -05:00
Peter Eisentraut
9cfd800aab Add some enumeration commas, for consistency 2012-02-24 11:04:45 +02:00
Tom Lane
593a9631a7 Don't clear btpo_cycleid during _bt_vacuum_one_page.
When "vacuuming" a single btree page by removing LP_DEAD tuples, we are not
actually within a vacuum operation, but rather in an ordinary insertion
process that could well be running concurrently with a vacuum.  So clearing
the cycleid is incorrect, and could cause the concurrent vacuum to miss
removing tuples that it needs to remove.  This is a longstanding bug
introduced by commit e6284649b9 of
2006-07-25.  I believe it explains Maxim Boguk's recent report of index
corruption, and probably some other previously unexplained reports.

In 9.0 and up this is a one-line fix; before that we need to introduce a
flag to tell _bt_delitems what to do.
2012-02-21 15:03:36 -05:00
Tom Lane
9789c99d01 Cosmetic cleanup for commit a760893dbd.
Mostly, fixing overlooked comments.
2012-02-21 14:14:16 -05:00
Heikki Linnakangas
21b1634275 Fix heap_multi_insert to set t_self field in the caller's tuples.
If tuples were toasted, heap_multi_insert didn't update the ctid on the
original tuples. This caused a failure if there was an after trigger
(including a foreign key), on the table, and a tuple got toasted.

Per off-list report and test case from Ted Phelps
2012-02-13 10:20:50 +02:00
Tom Lane
331bf6712c Throw error sooner for unlogged GiST indexes.
Throwing an error only after we've built the main index fork is pretty
unfriendly when the table already contains data.  Per gripe from Jay
Levitt.
2012-02-08 16:19:27 -05:00
Heikki Linnakangas
1a01560cbb Rename LWLockWaitUntilFree to LWLockAcquireOrWait.
LWLockAcquireOrWait makes it more clear that the lock is acquired if it's
free.
2012-02-08 09:17:13 +02:00
Tom Lane
c6d76d7c82 Add locking around WAL-replay modification of shared-memory variables.
Originally, most of this code assumed that no Postgres backends could be
running concurrently with it, and so no locking could be needed.  That
assumption fails in Hot Standby.  While it's still true that Hot Standby
backends should never change values like nextXid, they can examine them,
and consistency is important in some cases such as when computing a
snapshot.  Therefore, prudence requires that WAL replay code obtain the
relevant locks when modifying such variables, even though it can examine
them without taking a lock.  We were following that coding rule in some
places but not all.  This commit applies the coding rule uniformly to all
updates of ShmemVariableCache and MultiXactState fields; a search of the
replay routines did not find any other cases that seemed to be at risk.

In addition, this commit fixes a longstanding thinko in replay of NEXTOID
and checkpoint records: we tried to advance nextOid only if it was behind
the value in the WAL record, but the comparison would draw the wrong
conclusion if OID wraparound had occurred since the previous value.
Better to just unconditionally assign the new value, since OID assignment
shouldn't be happening during replay anyway.

The additional locking seems to be more in the nature of future-proofing
than fixing any live bug, so I am not going to back-patch it.  The NEXTOID
fix will be back-patched separately.
2012-02-06 12:34:10 -05:00
Tom Lane
17118825b8 Fix transient clobbering of shared buffers during WAL replay.
RestoreBkpBlocks was in the habit of zeroing and refilling the target
buffer; which was perfectly safe when the code was written, but is unsafe
during Hot Standby operation.  The reason is that we have coding rules
that allow backends to continue accessing a tuple in a heap relation while
holding only a pin on its buffer.  Such a backend could see transiently
zeroed data, if WAL replay had occasion to change other data on the page.
This has been shown to be the cause of bug #6425 from Duncan Rance (who
deserves kudos for developing a sufficiently-reproducible test case) as
well as Bridget Frey's re-report of bug #6200.  It most likely explains the
original report as well, though we don't yet have confirmation of that.

To fix, change the code so that only bytes that are supposed to change will
change, even transiently.  This actually saves cycles in RestoreBkpBlocks,
since it's not writing the same bytes twice.

Also fix seq_redo, which has the same disease, though it has to work a bit
harder to meet the requirement.

So far as I can tell, no other WAL replay routines have this type of bug.
In particular, the index-related replay routines, which would certainly be
broken if they had to meet the same standard, are not at risk because we
do not have coding rules that allow access to an index page when not
holding a buffer lock on it.

Back-patch to 9.0 where Hot Standby was added.
2012-02-05 15:49:17 -05:00
Tom Lane
ee68a44106 Improve comment. 2012-02-04 22:37:34 -05:00
Robert Haas
b4e0741727 Avoid re-checking for visibility map extension too frequently.
When testing bits (but not when setting or clearing them), we now
won't check whether the map has been extended.  This significantly
improves performance in the case where the visibility map doesn't
exist yet, by avoiding an extra system call per tuple.  To make
sure backends notice eventually, send an smgr inval on VM extension.

Dean Rasheed, with minor modifications by me.
2012-02-01 20:35:42 -05:00
Heikki Linnakangas
9b38d46d9f Make group commit more effective.
When a backend needs to flush the WAL, and someone else is already flushing
the WAL, wait until it releases the WALInsertLock and check if we still need
to do the flush or if the other backend already did the work for us, before
acquiring WALInsertLock. This helps group commit, because when the WAL flush
finishes, all the backends that were waiting for it can be woken up in one
go, and the can all concurrently observe that they're done, rather than
waking them up one by one in a cascading fashion.

This is based on a new LWLock function, LWLockWaitUntilFree(), which has
peculiar semantics. If the lock is immediately free, it grabs the lock and
returns true. If it's not free, it waits until it is released, but then
returns false without grabbing the lock. This is used in XLogFlush(), so
that when the lock is acquired, the backend flushes the WAL, but if it's
not, the backend first checks the current flush location before retrying.

Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
this patch as committed ended up being very different from that.
2012-01-30 16:53:48 +02:00
Tom Lane
ad10853b30 Assorted comment fixes, mostly just typos, but some obsolete statements.
YAMAMOTO Takashi
2012-01-29 19:23:56 -05:00
Simon Riggs
8366c7803e Allow pg_basebackup from standby node with safety checking.
Base backup follows recommended procedure, plus goes to great
lengths to ensure that partial page writes are avoided.

Jun Ishizuka and Fujii Masao, with minor modifications
2012-01-25 18:02:04 +00:00
Robert Haas
754b8140a1 fastgetattr is in access/htup.h, not access/heapam.h
Noted by Peter Geoghegan
2012-01-16 20:37:01 -05:00
Simon Riggs
5530623d03 Correctly initialise shared recoveryLastRecPtr in recovery.
Previously we used ReadRecPtr rather than EndRecPtr, which was
not a serious error but caused pg_stat_replication to report
incorrect replay_location until at least one WAL record is replayed.

Fujii Masao
2012-01-13 13:02:44 +00:00
Tom Lane
21b446dd09 Fix CLUSTER/VACUUM FULL for toast values owned by recently-updated rows.
In commit 7b0d0e9356, I made CLUSTER and
VACUUM FULL try to preserve toast value OIDs from the original toast table
to the new one.  However, if we have to copy both live and recently-dead
versions of a row that has a toasted column, those versions may well
reference the same toast value with the same OID.  The patch then led to
duplicate-key failures as we tried to insert the toast value twice with the
same OID.  (The previous behavior was not very desirable either, since it
would have silently inserted the same value twice with different OIDs.
That wastes space, but what's worse is that the toast values inserted for
already-dead heap rows would not be reclaimed by subsequent ordinary
VACUUMs, since they go into the new toast table marked live not deleted.)

To fix, check if the copied OID already exists in the new toast table, and
if so, assume that it stores the desired value.  This is reasonably safe
since the only case where we will copy an OID from a previous toast pointer
is when toast_insert_or_update was given that toast pointer and so we just
pulled the data from the old table; if we got two different values that way
then we have big problems anyway.  We do have to assume that no other
backend is inserting items into the new toast table concurrently, but
that's surely safe for CLUSTER and VACUUM FULL.

Per bug #6393 from Maxim Boguk.  Back-patch to 9.0, same as the previous
patch.
2012-01-12 16:40:14 -05:00
Heikki Linnakangas
1b9dea04b5 Remove useless 'needlock' argument from GetXLogInsertRecPtr. It was always
passed as 'true'.
2012-01-11 11:01:47 +02:00
Heikki Linnakangas
9c808f89c2 Refactor XLogInsert a bit. The rdata entries for backup blocks are now
constructed before acquiring WALInsertLock, which slightly reduces the time
the lock is held. Although I could not measure any benefit in benchmarks,
the code is more readable this way.
2012-01-11 11:01:47 +02:00
Robert Haas
33aaa139e6 Make the number of CLOG buffers adaptive, based on shared_buffers.
Previously, this was hardcoded: we always had 8.  Performance testing
shows that isn't enough, especially on big SMP systems, so we allow it
to scale up as high as 32 when there's adequate memory.  On the flip
side, when shared_buffers is very small, drop the number of CLOG buffers
down to as little as 4, so that we can start the postmaster even
when very little shared memory is available.

Per extensive discussion with Simon Riggs, Tom Lane, and others on
pgsql-hackers.
2012-01-06 14:32:18 -05:00
Bruce Momjian
e126958c2e Update copyright notices for year 2012. 2012-01-01 18:01:58 -05:00
Simon Riggs
64233902d2 Send new protocol keepalive messages to standby servers.
Allows streaming replication users to calculate transfer latency
and apply delay via internal functions. No external functions yet.
2011-12-31 13:30:26 +00:00
Tom Lane
472d3935a2 Rethink representation of index clauses' mapping to index columns.
In commit e2c2c2e8b1 I made use of nested
list structures to show which clauses went with which index columns, but
on reflection that's a data structure that only an old-line Lisp hacker
could love.  Worse, it adds unnecessary complication to the many places
that don't much care which clauses go with which index columns.  Revert
to the previous arrangement of flat lists of clauses, and instead add a
parallel integer list of column numbers.  The places that care about the
pairing can chase both lists with forboth(), while the places that don't
care just examine one list the same as before.

The only real downside to this is that there are now two more lists that
need to be passed to amcostestimate functions in case they care about
column matching (which btcostestimate does, so not passing the info is not
an option).  Rather than deal with 11-argument amcostestimate functions,
pass just the IndexPath and expect the functions to extract fields from it.
That gets us down to 7 arguments which is better than 11, and it seems
more future-proof against likely additions to the information we keep
about an index path.
2011-12-24 19:03:21 -05:00
Robert Haas
0e4611c023 Add a security_barrier option for views.
When a view is marked as a security barrier, it will not be pulled up
into the containing query, and no quals will be pushed down into it,
so that no function or operator chosen by the user can be applied to
rows not exposed by the view.  Views not configured with this
option cannot provide robust row-level security, but will perform far
better.

Patch by KaiGai Kohei; original problem report by Heikki Linnakangas
(in October 2009!).  Review (in earlier versions) by Noah Misch and
others.  Design advice by Tom Lane and myself.  Further review and
cleanup by me.
2011-12-22 16:16:31 -05:00
Tom Lane
d0024cd188 Avoid crashing when we have problems unlinking files post-commit.
smgrdounlink takes care to not throw an ERROR if it fails to unlink
something, but that caution was rendered useless by commit
3396000684, which put an smgrexists call in
front of it; smgrexists *does* throw error if anything looks funny, such
as getting a permissions error from trying to open the file.  If that
happens post-commit, you get a PANIC, and what's worse the same logic
appears in the WAL replay code, so the database even fails to restart.

Restore the intended behavior by removing the smgrexists call --- it isn't
accomplishing anything that we can't do better by adjusting mdunlink's
ideas of whether it ought to warn about ENOENT or not.

Per report from Joseph Shraibman of unrecoverable crash after trying to
drop a table whose FSM fork had somehow gotten chmod'd to 000 permissions.
Backpatch to 8.4, where the bogus coding was introduced.
2011-12-20 15:00:36 -05:00
Peter Eisentraut
729205571e Add support for privileges on types
This adds support for the more or less SQL-conforming USAGE privilege
on types and domains.  The intent is to be able restrict which users
can create dependencies on types, which restricts the way in which
owners can alter types.

reviewed by Yeb Havinga
2011-12-20 00:05:19 +02:00
Tom Lane
8f57b064fd Rename updateNodeLink to spgUpdateNodeLink.
On reflection, the original name seems way too generic for a global
symbol.  A quick check shows this is the only exported function name
in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the
rest of them seem all right.
2011-12-19 15:38:32 -05:00
Tom Lane
9220362493 Teach SP-GiST to do index-only scans.
Operator classes can specify whether or not they support this; this
preserves the flexibility to use lossy representations within an index.

In passing, move constant data about a given index into the rd_amcache
cache area, instead of doing fresh lookups each time we start an index
operation.  This is mainly to try to make sure that spgcanreturn() has
insignificant cost; I still don't have any proof that it matters for
actual index accesses.  Also, get rid of useless copying of FmgrInfo
pointers; we can perfectly well use the relcache's versions in-place.
2011-12-19 14:58:41 -05:00
Tom Lane
3695a55513 Replace simple constant pg_am.amcanreturn with an AM support function.
The need for this was debated when we put in the index-only-scan feature,
but at the time we had no near-term expectation of having AMs that could
support such scans for only some indexes; so we kept it simple.  However,
the SP-GiST AM forces the issue, so let's fix it.

This patch only installs the new API; no behavior actually changes.
2011-12-18 15:50:37 -05:00
Tom Lane
b7a0e8fb4d Defend against null scankeys in spgist searches.
Should've thought of that one earlier.
2011-12-17 19:08:28 -05:00
Tom Lane
dd45d3ad33 Fix some long-obsolete references to XLogOpenRelation.
These were missed in commit a213f1ee6c,
which removed that function.
2011-12-17 18:26:52 -05:00
Tom Lane
85df5dbf5a Fix compiler warning seen on 64-bit machine. 2011-12-17 16:51:36 -05:00
Tom Lane
8daeb5ddd6 Add SP-GiST (space-partitioned GiST) index access method.
SP-GiST is comparable to GiST in flexibility, but supports non-balanced
partitioned search structures rather than balanced trees.  As described at
PGCon 2011, this new indexing structure can beat GiST in both index build
time and query speed for search problems that it is well matched to.

There are a number of areas that could still use improvement, but at this
point the code seems committable.

Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane
2011-12-17 16:42:30 -05:00
Tom Lane
2dd9322ba6 Move BKP_REMOVABLE bit from individual WAL records to WAL page headers.
Removing this bit from xl_info allows us to restore the old limit of four
(not three) separate pages touched by a WAL record, which is needed for the
upcoming SP-GiST feature, and will likely be useful elsewhere in future.

When we implemented XLR_BKP_REMOVABLE in 2007, we had to do it like that
because no special WAL-visible action was taken when starting a backup.
However, now we force a segment switch when starting a backup, so a
compressing WAL archiver (such as pglesslog) that uses the state shown in
the current page header will not be fooled as to removability of backup
blocks.  The only downside is that the archiver will not return to
compressing mode for up to one WAL page after the backup is over, which is
a small price to pay for getting back the extra xl_info bit.  In any case
the archiver could look for XLOG_BACKUP_END records if it thought it was
worth the trouble to do so.

Bump XLOG_PAGE_MAGIC since this is effectively a change in WAL format.
2011-12-12 16:22:14 -05:00
Heikki Linnakangas
9f0d2bdc88 Don't set reachedMinRecoveryPoint during crash recovery. In crash recovery,
we don't reach consistency before replaying all of the WAL. Rename the
variable to reachedConsistency, to make its intention clearer.

In master, that was an active bug because of the recent patch to
immediately PANIC if a reference to a missing page is found in WAL after
reaching consistency, as Tom Lane's test case demonstrated. In 9.1 and 9.0,
the only consequence was a misleading "consistent recovery state reached at
%X/%X" message in the log at the beginning of crash recovery (the database
is not consistent at that point yet). In 8.4, the log message was not
printed in crash recovery, even though there was a similar
reachedMinRecoveryPoint local variable that was also set early. So,
backpatch to 9.1 and 9.0.
2011-12-09 15:21:12 +02:00
Tom Lane
c6e3ac11b6 Create a "sort support" interface API for faster sorting.
This patch creates an API whereby a btree index opclass can optionally
provide non-SQL-callable support functions for sorting.  In the initial
patch, we only use this to provide a directly-callable comparator function,
which can be invoked with a bit less overhead than the traditional
SQL-callable comparator.  While that should be of value in itself, the real
reason for doing this is to provide a datatype-extensible framework for
more aggressive optimizations, as in Peter Geoghegan's recent work.

Robert Haas and Tom Lane
2011-12-07 00:19:39 -05:00
Heikki Linnakangas
1e616f6391 During recovery, if we reach consistent state and still have entries in the
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.

Also refrain from creating a restartpoint if there are invalid-page entries
in the hash table. Restarting recovery from such a restartpoint would not
see the invalid references, and wouldn't be able to cross-check them when
consistency is reached. That wouldn't matter when things are going smoothly,
but the more sanity checks you have the better.

Fujii Masao
2011-12-02 10:49:54 +02:00
Robert Haas
2ad36c4e44 Improve table locking behavior in the face of current DDL.
In the previous coding, callers were faced with an awkward choice:
look up the name, do permissions checks, and then lock the table; or
look up the name, lock the table, and then do permissions checks.
The first choice was wrong because the results of the name lookup
and permissions checks might be out-of-date by the time the table
lock was acquired, while the second allowed a user with no privileges
to interfere with access to a table by users who do have privileges
(e.g. if a malicious backend queues up for an AccessExclusiveLock on
a table on which AccessShareLock is already held, further attempts
to access the table will be blocked until the AccessExclusiveLock
is obtained and the malicious backend's transaction rolls back).

To fix, allow callers of RangeVarGetRelid() to pass a callback which
gets executed after performing the name lookup but before acquiring
the relation lock.  If the name lookup is retried (because
invalidation messages are received), the callback will be re-executed
as well, so we get the best of both worlds.  RangeVarGetRelid() is
renamed to RangeVarGetRelidExtended(); callers not wishing to supply
a callback can continue to invoke it as RangeVarGetRelid(), which is
now a macro.  Since the only one caller that uses nowait = true now
passes a callback anyway, the RangeVarGetRelid() macro defaults nowait
as well.  The callback can also be used for supplemental locking - for
example, REINDEX INDEX needs to acquire the table lock before the index
lock to reduce deadlock possibilities.

There's a lot more work to be done here to fix all the cases where this
can be a problem, but this commit provides the general infrastructure
and fixes the following specific cases: REINDEX INDEX, REINDEX TABLE,
LOCK TABLE, and and DROP TABLE/INDEX/SEQUENCE/VIEW/FOREIGN TABLE.

Per discussion with Noah Misch and Alvaro Herrera.
2011-11-30 10:27:00 -05:00
Tom Lane
9f4563f743 Use IEEE infinity, not 1e10, for null-and-not-null case in gistpenalty().
Use of a randomly chosen large value was never exactly graceful, and
now that there are penalty functions that are intentionally using infinity,
it doesn't seem like a good idea for null-vs-not-null to be using something
less.
2011-11-27 17:12:54 -05:00
Heikki Linnakangas
dea5f6cefe Take fillfactor into account in the new COPY bulk heap insert code.
Jeff Janes
2011-11-26 12:11:00 +02:00
Tom Lane
877b67c38b Fix erroneous replay of GIN_UPDATE_META_PAGE WAL records.
A simple thinko in ginRedoUpdateMetapage, namely failing to increment a
loop counter, led to inserting records into the last pending-list page in
the wrong order (the opposite of that intended).  So far as I can tell,
this would not upset the code that eventually flushes pending items into
the main part of the GIN index.  But it did break the code that searched
the pending list for matches, resulting in transient failure to find
matching entries during index lookups, as illustrated in bug #6307 from
Maksym Boguk.

Back-patch to 8.4 where the incorrect code was introduced.
2011-11-25 13:58:59 -05:00
Robert Haas
ed0b409d22 Move "hot" members of PGPROC into a separate PGXACT array.
This speeds up snapshot-taking and reduces ProcArrayLock contention.
Also, the PGPROC (and PGXACT) structures used by two-phase commit are
now allocated as part of the main array, rather than in a separate
array, and we keep ProcArray sorted in pointer order.  These changes
are intended to minimize the number of cache lines that must be pulled
in to take a snapshot, and testing shows a substantial increase in
performance on both read and write workloads at high concurrencies.

Pavan Deolasee, Heikki Linnakangas, Robert Haas
2011-11-25 08:02:10 -05:00
Simon Riggs
2d2841a56c Continue to allow VACUUM to mark last block of index dirty
even when there is no work to do. Further analysis required.
Revert of patch c1458cc495
2011-11-22 09:48:06 +00:00
Simon Riggs
c1458cc495 Avoid marking buffer dirty when VACUUM has no work to do.
When wal_level = 'hot_standby' we touched the last page of the
relation during a VACUUM, even if nothing else had happened.
That would alter the LSN of the last block and set the mtime
of the relation file unnecessarily. Noted by Thom Brown.
2011-11-18 16:06:53 +00:00
Simon Riggs
4de82f7d7c Wakeup WALWriter as needed for asynchronous commit performance.
Previously we waited for wal_writer_delay before flushing WAL. Now
we also wake WALWriter as soon as a WAL buffer page has filled.
Significant effect observed on performance of asynchronous commits
by Robert Haas, attributed to the ability to set hint bits on tuples
earlier and so reducing contention caused by clog lookups.
2011-11-13 09:00:57 +00:00
Heikki Linnakangas
2e02280726 Fix another bug in the redo of COPY batches.
I got alignment wrong in the redo routine. Spotted by redoing the log
genereated by copy regression test.
2011-11-10 12:21:43 +02:00
Heikki Linnakangas
f81648cb1e Fix bugs in the COPY heap-insert batching patch.
Forgot to call RestoreBkpBlocks() in the redo-function, as pointed out by
Simon Riggs. In redo of a regular heap insert, it's taken care of in
heap_redo(), but this new record type uses the heap2 RM, and heap2_redo()
does not take care of that for you.

Also, failed to reset the vmbuffer and all_visibile_cleared local variables
after switching to a new buffer.
2011-11-09 21:28:25 +02:00
Heikki Linnakangas
d326d9e8ea In COPY, insert tuples to the heap in batches.
This greatly reduces the WAL volume, especially when the table is narrow.
The overhead of locking the heap page is also reduced. Reduced WAL traffic
also makes it scale a lot better, if you run multiple COPY processes at
the same time.
2011-11-09 10:54:41 +02:00