1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-13 16:22:44 +03:00
Commit Graph

3624 Commits

Author SHA1 Message Date
Heikki Linnakangas
a4ad9afec2 Update obsolete comments.
We no longer have a TLI field in the page header.
2014-04-23 14:41:51 +03:00
Heikki Linnakangas
8fbfbf1472 Fix typos in comment. 2014-04-23 12:56:41 +03:00
Heikki Linnakangas
4fafc4ecd9 Cleanup of new b-tree page deletion code.
When marking a branch as half-dead, a pointer to the top of the branch is
stored in the leaf block's hi-key. During normal operation, the high key
was left in place, and the block number was just stored in the ctid field
of the high key tuple, but in WAL replay, the high key was recreated as a
truncated tuple with zero columns. For the sake of easier debugging, also
truncate the tuple in normal operation, so that the page is identical
after WAL replay. Also, rename the 'downlink' field in the WAL record to
'topparent', as that seems like a more descriptive name. And make sure
it's set to invalid when unlinking the leaf page.
2014-04-23 10:19:54 +03:00
Tom Lane
c6a4ace5bf Fix broken logic in logical_heap_rewrite_flush_mappings().
It's blatantly obvious that commit 4d0d607a45
wasn't tested.  The leak's real enough, though.
2014-04-22 22:33:35 -04:00
Bruce Momjian
cee850c403 revert 4d0d607a45
Revert due to contrib/test_decoding regression failure
2014-04-22 22:21:54 -04:00
Bruce Momjian
4d0d607a45 release memory used while flushing logical mappings
Patch by Ants Aasma
2014-04-22 18:05:44 -04:00
Heikki Linnakangas
4a5d55ec2b Fix bug in the new B-tree incomplete-split code.
Forgot to update LSN of left sibling's page, when creating a new root.
I fixed this for regular insertions and page splits earlier, but missed
new root creation.
2014-04-22 22:40:44 +03:00
Heikki Linnakangas
45e67a2ad7 Fix Gin README.
The README incorrectly claimed that GIN posting tree pages contain an array
of uncompressed items in addition to compressed posting lists. Earlier
versions of the GIN posting list compression patch worked that way, but not
the one that was committed.
2014-04-22 22:39:50 +03:00
Heikki Linnakangas
77fe2b6d79 Fix bug in new B-tree page deletion code.
When modifying a page, must hold an exclusive lock. A shared lock is
obviously not good enough.
2014-04-22 15:34:54 +03:00
Heikki Linnakangas
7e30c186da Retain original physical order of tuples in redo of b-tree splits.
It makes no difference to the system, but minimizing the differences
between a master and standby makes debugging simpler.
2014-04-22 13:03:37 +03:00
Heikki Linnakangas
7d98054f0d Fix rm_desc routine of b-tree page delete records.
A couple of typos from my refactoring of the page deletion patch.
2014-04-22 13:02:52 +03:00
Robert Haas
fab6170cab Fix typo.
Etsuro Fujita
2014-04-20 16:30:55 +02:00
Magnus Hagander
66b1084e2c Fix typo
Amit Langote
2014-04-18 12:49:54 +02:00
Bruce Momjian
83defef8c7 report stat() error in trigger file check
Permissions might prevent the existence of the trigger file from being
checked.

Per report from Andres Freund
2014-04-17 11:55:57 -04:00
Heikki Linnakangas
848b9f05ab Use correctly-sized buffer when zero-filling a WAL file.
I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is
allocated a couple of weeks ago. With the default settings, they are both
8k, but they can be changed at compile-time.
2014-04-16 10:26:36 +03:00
Heikki Linnakangas
f1dadd34fa Set pd_lower on internal GIN posting tree pages.
This allows squeezing out the unused space in full-page writes. And more
importantly, it can be a useful debugging aid.

In hindsight we should've done this back when GIN was added - we wouldn't
need the 'maxoff' field in the page opaque struct if we had used pd_lower
and pd_upper like on normal pages. But as long as there can be pages in the
index that have been binary-upgraded from pre-9.4 versions, we can't rely
on that, and have to continue using 'maxoff'.

Most of the code churn comes from renaming some macros, now that they're
used on internal pages, too.

This change is completely backwards-compatible, no effect on pg_upgrade.
2014-04-14 21:13:19 +03:00
Tom Lane
4dfb065b3a Fix bogus handling of bad strategy number in GIST consistent() functions.
Make sure we throw an error instead of silently doing the wrong thing when
fed a strategy number we don't recognize.  Also, in the places that did
already throw an error, spell the error message in a way more consistent
with our message style guidelines.

Per report from Paul Jones.  Although this is a bug, it won't occur unless
a superuser tries to do something he shouldn't, so it doesn't seem worth
back-patching.
2014-04-14 11:18:47 -04:00
Heikki Linnakangas
e3e6e3af56 Remove dead checks for invalid left page in ginDeletePage.
In some places, the function assumes the left page is valid, and in others,
it checks if it is valid. Remove all the checks.
2014-04-14 15:27:32 +03:00
Heikki Linnakangas
1bd3842163 GIN entry pages follow the standard page layout - tell XLogInsert.
The entry B-tree pages all follow the standard page layout. The 9.3 code has
this right. I inadvertently changed this at some point during the big
refactorings in git master.
2014-04-14 14:51:28 +03:00
Heikki Linnakangas
614167c6d7 Fix bugs in GIN "fast scan" with partial match.
There were a couple of bugs here. First, if the fuzzy limit was exceeded,
the loop in entryGetItem might drop out too soon if a whole block needs to
be skipped because it's < advancePast ("continue" in a while-loop checks the
loop condition too). Secondly, the loop checked when stepping to a new page
that there is at least one offset on the page < advancePast, but we cannot
rely on that on subsequent calls of entryGetItem, because advancePast might
change in between. That caused the skipping loop to read bogus items in the
TbmIterateResult's offset array.

First item and fix by Alexander Korotkov, second bug pointed out by Fabrízio
de Royes Mello, by a small variation of Alexander's test query.
2014-04-10 23:42:04 +03:00
Heikki Linnakangas
787064cd00 Fix typo in comment.
Tomonari Katsumata
2014-04-10 13:11:49 +03:00
Heikki Linnakangas
7ca32e255b Fix hot standby bug with GiST scans.
Don't reset the rightlink of a page when replaying a page update record.
This was a leftover from pre-hot standby days, when it was not possible to
have scans concurrent with WAL replay. Resetting the right-link was not
necessary back then either, but it was done for the sake of tidiness. But
with hot standby, it's wrong, because a concurrent scan might still need it.

Backpatch all versions with hot standby, 9.0 and above.
2014-04-08 14:51:40 +03:00
Heikki Linnakangas
38a2b95c34 Zero padding byte at end of GIN posting list.
This isn't strictly necessary, but helps debugging.
2014-04-07 19:49:03 +03:00
Heikki Linnakangas
594bac4272 Fix WAL replay bug in the new GIN incomplete-split code.
Forgot to set the incomplete-split flag on the left page half, in redo of a
page split.

Spotted this by comparing the page contents on master and standby, after
inserting/applying each WAL record.
2014-04-07 14:37:30 +03:00
Heikki Linnakangas
ffbba6ee12 Fix another palloc in critical section.
Also add a regression test for a GIN index with enough items with the same
key, so that a GIN posting tree gets created. Apparently none of the
existing GIN tests were large enough for that.

This code is new, no backpatching required.
2014-04-05 22:15:58 +03:00
Robert Haas
59202fae04 Fix some compiler warnings that clang emits with -pedantic.
Andres Freund
2014-04-04 11:29:50 -04:00
Heikki Linnakangas
b1236f4b7b Move multixid allocation out of critical section.
It can fail if you run out of memory.

This call was added in 9.3, so backpatch to 9.3 only.
2014-04-04 18:20:22 +03:00
Heikki Linnakangas
d9e7873bbb In checkpoint, move the check for in-progress xacts out of critical section.
GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical
section. I thought I covered this case with the exemption for the
checkpointer, but CreateCheckPoint is also called from the startup process.
2014-04-04 17:31:22 +03:00
Heikki Linnakangas
877b088785 Avoid allocations in critical sections.
If a palloc in a critical section fails, it becomes a PANIC.
2014-04-04 13:35:44 +03:00
Heikki Linnakangas
04e298b826 Avoid palloc in critical section in GiST WAL-logging.
Memory allocation can fail if you run out of memory, and inside a critical
section that will lead to a PANIC. Use conservatively-sized arrays in stack
instead.

There was previously no explicit limit on the number of pages a GiST split
can produce, it was only limited by the number of LWLocks that can be held
simultaneously (100 at the moment). This patch adds an explicit limit of 75
pages. That should be plenty, a typical split shouldn't produce more than
2-3 page halves.

The bug has been there forever, but only backpatch down to 9.1. The code
was changed significantly in 9.1, and it doesn't seem worth the risk or
trouble to adapt this for 9.0 and 8.4.
2014-04-03 15:43:50 +03:00
Heikki Linnakangas
8bbbcb91ba Fix bug in the new GIN incomplete-split code.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated and it needs to be marked dirty. The codepath for an insertion got
this right, but the case where the internal node is split because of
inserting the new downlink missed that.
2014-04-01 22:49:47 +03:00
Heikki Linnakangas
cfe992e7eb Remove dead check for backup block, replace with Assert.
We don't use backup blocks with GIN vacuum records anymore, the page is
always recreated from scratch.
2014-04-01 21:16:10 +03:00
Heikki Linnakangas
954523cdfe Fix bug in the new B-tree incomplete-split code.
Inserting a downlink to an internal page clears the incomplete-split flag
of the child's left sibling, so the left sibling's LSN also needs to be
updated.
2014-04-01 19:19:47 +03:00
Heikki Linnakangas
14d02f0bb3 Rewrite the way GIN posting lists are packed on a page, to reduce WAL volume.
Inserting (in retail) into the new 9.4 format GIN posting tree created much
larger WAL records than in 9.3. The previous strategy to WAL logging was
basically to log the whole page on each change, with the exception of
completely unmodified segments up to the first modified one. That was not
too bad when appending to the end of the page, as only the last segment had
to be WAL-logged, but per Fujii Masao's testing, even that produced 2x the
WAL volume that 9.3 did.

The new strategy is to keep track of changes to the posting lists in a more
fine-grained fashion, and also make the repacking" code smarter to avoid
decoding and re-encoding segments unnecessarily.
2014-03-31 15:23:50 +03:00
Heikki Linnakangas
0cfa34c25a Rename GinLogicValue to GinTernaryValue.
It's more descriptive. Also, get rid of the enum, and use #defines instead,
per Greg Stark's suggestion.
2014-03-31 10:26:38 +03:00
Heikki Linnakangas
c2a6724823 Pass more than the first XLogRecData entry to rm_desc, with WAL_DEBUG.
If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to
only pass the first XLogRecData entry to the rm_desc routine. I think the
original assumprion was that the first XLogRecData entry contains all the
necessary information for the rm_desc routine, but that's a pretty shaky
assumption. At least standby_redo didn't get the memo.

To fix, piece together all the data in a temporary buffer, and pass that to
the rm_desc routine.

It's been like this forever, but the patch didn't apply cleanly to
back-branches. Probably wouldn't be hard to fix the conflicts, but it's
not worth the trouble.
2014-03-26 18:17:53 +02:00
Fujii Masao
49638868f8 Don't forget to flush XLOG_PARAMETER_CHANGE record.
Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.
2014-03-26 02:12:39 +09:00
Heikki Linnakangas
bb42e21be2 Change ginMergeItemPointers to return a palloc'd array.
That seems nicer than making it the caller's responsibility to pass a
suitable-sized array. All the callers were just palloc'ing an array anyway.
2014-03-24 18:44:40 +02:00
Heikki Linnakangas
2f3afc0979 Remove dead code and add comments.
'cbuffer' variable was left over from an earlier version of the patch to
rewrite the incomplete split handling.
2014-03-24 11:02:23 +02:00
Heikki Linnakangas
3ed249b741 Fix "the the" typos.
Erik Rijkers
2014-03-24 08:42:13 +02:00
Noah Misch
c31305de5f Address ccvalid/ccnoinherit in TupleDesc support functions.
equalTupleDescs() neglected both of these ConstrCheck fields, and
CreateTupleDescCopyConstr() neglected ccnoinherit.  At this time, the
only known behavior defect resulting from these omissions is constraint
exclusion disregarding a CHECK constraint validated by an ALTER TABLE
VALIDATE CONSTRAINT statement issued earlier in the same transaction.
Back-patch to 9.2, where these fields were introduced.
2014-03-23 02:13:43 -04:00
Heikki Linnakangas
68a2e52bba Replace the XLogInsert slots with regular LWLocks.
The special feature the XLogInsert slots had over regular LWLocks is the
insertingAt value that was updated atomically with releasing backends
waiting on it. Add new functions to the LWLock API to do that, and replace
the slots with LWLocks. This reduces the amount of duplicated code.
(There's still some duplication, but at least it's all in lwlock.c now.)

Reviewed by Andres Freund.
2014-03-21 15:10:48 +01:00
Alvaro Herrera
f88d4cfc9d Setup error context callback for transaction lock waits
With this in place, a session blocking behind another one because of
tuple locks will get a context line mentioning the relation name, tuple
TID, and operation being done on tuple.  For example:

LOG:  process 11367 still waiting for ShareLock on transaction 717 after 1000.108 ms
DETAIL:  Process holding the lock: 11366. Wait queue: 11367.
CONTEXT:  while updating tuple (0,2) in relation "foo"
STATEMENT:  UPDATE foo SET value = 3;

Most usefully, the new line is displayed by log entries due to
log_lock_waits, although of course it will be printed by any other log
message as well.

Author: Christian Kruse, some tweaks by Álvaro Herrera
Reviewed-by: Amit Kapila, Andres Freund, Tom Lane, Robert Haas
2014-03-19 15:10:36 -03:00
Heikki Linnakangas
59a5ab3f42 Remove rm_safe_restartpoint machinery.
It is no longer used, none of the resource managers have multi-record
actions that would make it unsafe to perform a restartpoint.

Also don't allow rm_cleanup to write WAL records, it's also no longer
required. Move the call to rm_cleanup routines to make it more symmetric
with rm_startup.
2014-03-18 22:10:35 +02:00
Heikki Linnakangas
40dae7ec53 Make the handling of interrupted B-tree page splits more robust.
Splitting a page consists of two separate steps: splitting the child page,
and inserting the downlink for the new right page to the parent. Previously,
we handled the case that you crash in between those steps with a cleanup
routine after the WAL recovery had finished, which finished the incomplete
split. However, that doesn't help if the page split is interrupted but the
database doesn't crash, so that you don't perform WAL recovery. That could
happen for example if you run out of disk space.

Remove the end-of-recovery cleanup step. Instead, when a page is split, the
left page is marked with a new INCOMPLETE_SPLIT flag, and when the downlink
is inserted to the parent, the flag is cleared again. If an insertion sees
a page with the flag set, it knows that the split was interrupted for some
reason, and inserts the missing downlink before proceeding.

I used the same approach to fix GIN and GiST split algorithms earlier. This
was the last WAL cleanup routine, so we could get rid of that whole
machinery now, but I'll leave that for a separate patch.

Reviewed by Peter Geoghegan.
2014-03-18 20:50:44 +02:00
Heikki Linnakangas
d663d4399e Fix thinko: have trueTriConsistentFn return GIN_TRUE.
While we're at it, also improve comments in ginlogic.c.
2014-03-17 17:29:04 +02:00
Fujii Masao
2bccced110 Fix typos in comments.
Thom Brown
2014-03-17 20:47:28 +09:00
Heikki Linnakangas
efada2b8e9 Fix race condition in B-tree page deletion.
In short, we don't allow a page to be deleted if it's the rightmost child
of its parent, but that situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its
parent, and then lock its left sibling, the page itself, its right sibling,
and the parent, in that order. However, if the parent page is split after
the check but before acquiring the locks, the target page might become the
rightmost child, if the split happens at the right place. That leads to an
error in vacuum (I reproduced this by setting a breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and throw
the above error if it's not. We could easily just give up rather than throw
an error, but that approach doesn't scale to half-dead pages. To recap,
although we don't normally allow deleting the rightmost child, if the page
is the *only* child of its parent, we delete the child page and mark the
parent page as half-dead in one atomic operation. But before we do that, we
check that the parent can later be deleted, by checking that it in turn is
not the rightmost child of the grandparent (potentially recursing all the
way up to the root). But the same situation can arise there - the
grandparent can be split while we're not holding the locks. We end up with
a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been
transferred to its right sibling. As the README points out, the keyspace at
the grandparent level is "out-of-whack" until the half-dead page is deleted,
and if enough tuples with keys in the transferred keyspace are inserted, the
page might get split and a downlink might be inserted into the grandparent
that is out-of-order. That might not cause any serious problem if it's
transient (as the README ponders), but is surely bad if it stays that way.

Solution
--------

This patch changes the page deletion algorithm to avoid that problem. After
checking that the topmost page in the chain of to-be-deleted pages is not
the rightmost child of its parent, and then deleting the pages from bottom
up, unlink the pages from top to bottom. This way, the intermediate stages
are similar to the intermediate stages in page splitting, and there is no
transient stage where the keyspace is "out-of-whack". The topmost page in
the to-be-deleted chain doesn't have a downlink pointing to it, like a page
split before the downlink has been inserted.

This also allows us to get rid of the cleanup step after WAL recovery, if we
crash during page deletion. The deletion will be continued at next VACUUM,
but the tree is consistent for searches and insertions at every step.

This bug is old, all supported versions are affected, but this patch is too
big to back-patch (and changes the WAL record formats of related records).
We have not heard any reports of the bug from users, so clearly it's not
easy to bump into. Maybe backpatch later, after this has had some field
testing.

Reviewed by Kevin Grittner and Peter Geoghegan.
2014-03-14 16:07:19 +02:00
Bruce Momjian
886c0be3f6 C comments: remove odd blank lines after #ifdef WIN32 lines 2014-03-13 01:34:42 -04:00
Heikki Linnakangas
a3115f0d9e Only WAL-log the modified portion in an UPDATE, if possible.
When a row is updated, and the new tuple version is put on the same page as
the old one, only WAL-log the part of the new tuple that's not identical to
the old. This saves significantly on the amount of WAL that needs to be
written, in the common case that most fields are not modified.

Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
2014-03-12 23:28:36 +02:00