When gin_fuzzy_search_limit was used, we could jump out of startScan()
without calling startScanKey(). That was harmless in 9.3 and below, because
startScanKey()() didn't do anything interesting, but in 9.4 it initializes
information needed for skipping entries (aka GIN fast scans), and you
readily get a segfault if it's not done. Nevertheless, it was clearly wrong
all along, so backpatch all the way to 9.1 where the early return was
introduced.
(AFAICS startScanKey() did nothing useful in 9.3 and below, because the
fields it initialized were already initialized in ginFillScanKey(), but I
don't dare to change that in a minor release. ginFillScanKey() is always
called in gingetbitmap() even though there's a check there to see if the
scan keys have already been initialized, because they never are; ginrescan()
free's them.)
In the passing, remove unnecessary if-check from the second inner loop in
startScan(). We already check in the first loop that the condition is true
for all entries.
Reported by Olaf Gawenda, bug #12694, Backpatch to 9.1 and above, although
AFAICS it causes a live bug only in 9.4.
When returning rows from a bitmap, as done with partial match queries, we
would get stuck in an infinite loop if the bitmap contained a lossy page
reference.
This bug is new in master, it was introduced by the patch to allow skipping
items refuted by other entries in GIN scans.
Report and fix by Alexander Korotkov
There were a couple of bugs here. First, if the fuzzy limit was exceeded,
the loop in entryGetItem might drop out too soon if a whole block needs to
be skipped because it's < advancePast ("continue" in a while-loop checks the
loop condition too). Secondly, the loop checked when stepping to a new page
that there is at least one offset on the page < advancePast, but we cannot
rely on that on subsequent calls of entryGetItem, because advancePast might
change in between. That caused the skipping loop to read bogus items in the
TbmIterateResult's offset array.
First item and fix by Alexander Korotkov, second bug pointed out by Fabrízio
de Royes Mello, by a small variation of Alexander's test query.
The shimTriConstistentFn, which calls the opclass's consistent function with
all combinations of TRUE/FALSE for any MAYBE argument, modifies the entryRes
array passed by the caller. Change startScanKey to re-initialize it between
each call to accommodate that.
It's actually a bad habit by shimTriConsistentFn to modify its argument. But
the only caller that doesn't already re-initialize the entryRes array was
startScanKey, and it's easy for startScanKey to do so. Add a comment to
shimTriConsistentFn about that.
Note: this does not give a free pass to opclass-provided consistent
functions to modify the entryRes argument; shimTriConsistent assumes that
they don't, even though it does it itself.
While at it, refactor startScanKey to allocate the requiredEntries and
additionalEntries after it knows exactly how large they need to be. Saves a
little bit of memory, and looks nicer anyway.
Per complaint by Tom Lane, buildfarm and the pg_trgm regression test.
If you have a GIN query like "rare & frequent", we currently fetch all the
items that match either rare or frequent, call the consistent function for
each item, and let the consistent function filter out items that only match
one of the terms. However, if we can deduce that "rare" must be present for
the overall qual to be true, we can scan all the rare items, and for each
rare item, skip over to the next frequent item with the same or greater TID.
That greatly speeds up "rare & frequent" type queries.
To implement that, introduce the concept of a tri-state consistent function,
where the 3rd value is MAYBE, indicating that we don't know if that term is
present. Operator classes only provide a boolean consistent function, so we
simulate the tri-state consistent function by calling the boolean function
several times, with the MAYBE arguments set to all combinations of TRUE and
FALSE. Testing all combinations is only feasible for a small number of MAYBE
arguments, but it is envisioned that we'll provide a way for operator
classes to provide a native tri-state consistent function, which can be much
more efficient. But that is not included in this patch.
We were already using that trick to for lossy pages, calling the consistent
function with the lossy entry set to TRUE and FALSE. Now that we have the
tri-state consistent function, use it for lossy pages too.
Alexander Korotkov, with fair amount of refactoring by me.
When skipping over some items in a posting tree, re-find the new location
by descending the tree from root, rather than walking the right links.
This can save a lot of I/O.
Heavily modified from Alexander Korotkov's fast scan patch.
If we're skipping past a certain TID, avoid decoding posting list segments
that only contain smaller TIDs.
Extracted from Alexander Korotkov's fast scan patch, heavily modified.
In a multi-key search, ie. something like "col @> 'foo' AND col @> 'bar'",
as soon as we find the next item that matches the first criteria, we don't
need to check the second criteria for TIDs smaller the first match. That
saves a lot of effort, especially if one of the terms is rare, while the
second occurs very frequently.
Based on ideas from Alexander Korotkov's fast scan patch.
GIN posting lists are now encoded using varbyte-encoding, which allows them
to fit in much smaller space than the straight ItemPointer array format used
before. The new encoding is used for both the lists stored in-line in entry
tree items, and in posting tree leaf pages.
To maintain backwards-compatibility and keep pg_upgrade working, the code
can still read old-style pages and tuples. Posting tree leaf pages in the
new format are flagged with GIN_COMPRESSED flag, to distinguish old and new
format pages. Likewise, entry tree tuples in the new format have a
GIN_ITUP_COMPRESSED flag set in a bit that was previously unused.
This patch bumps GIN_CURRENT_VERSION from 1 to 2. New indexes created with
version 9.4 will therefore have version number 2 in the metapage, while old
pg_upgraded indexes will have version 1. The code treats them the same, but
it might be come handy in the future, if we want to drop support for the
uncompressed format.
Alexander Korotkov and me. Reviewed by Tomas Vondra and Amit Langote.
Separate the insertion payload from the more static portions of GinBtree.
GinBtree now only contains information related to searching the tree, and
the information of what to insert is passed separately.
Add root block number to GinBtree, instead of passing it around all the
functions as argument.
Split off ginFinishSplit() from ginInsertValue(). ginFinishSplit is
responsible for finding the parent and inserting the downlink to it.
If a page is deleted, and reused for something else, just as a search is
following a rightlink to it from its left sibling, the search would continue
scanning whatever the new contents of the page are. That could lead to
incorrect query results, or even something more curious if the page is
reused for a different kind of a page.
To fix, modify the search algorithm to lock the next page before releasing
the previous one, and refrain from deleting pages from the leftmost branch
of the tree.
Add a new Concurrency section to the README, explaining why this works.
There is a lot more one could say about concurrency in GIN, but that's for
another patch.
Backpatch to all supported versions.
It makes for cleaner code to have separate Get/Add functions for PostingItems
and ItemPointers. A few callsites that have to deal with both types need to
be duplicated because of this, but all the callers have to know which one
they're dealing with anyway. Overall, this reduces the amount of casting
required.
Extracted from Alexander Korotkov's larger patch to change the data page
format.
The Solaris Studio compiler warns about these instances, unlike more
mainstream compilers such as gcc. But manual inspection showed that
the code is clearly not reachable, and we hope no worthy compiler will
complain about removing this code.
Experimentation with contrib/btree_gist shows that the majority of the GIST
support functions potentially need collation information. Safest policy
seems to be to pass it to all of them, instead of making assumptions about
which ones could possibly need it.
Since collation is effectively an argument, not a property of the function,
FmgrInfo is really the wrong place for it; and this becomes critical in
cases where a cached FmgrInfo is used for varying purposes that might need
different collation settings. Fix by passing it in FunctionCallInfoData
instead. In particular this allows a clean fix for bug #5970 (record_cmp
not working). This requires touching a bit more code than the original
method, but nobody ever thought that collations would not be an invasive
patch...
The original coding could combine duplicate entries only when they
originated from the same qual condition. In particular it could not
combine cases where multiple qual conditions all give rise to full-index
scan requests, which is an expensive case well worth optimizing. Refactor
so that duplicates are recognized across all the quals.
Per my recent proposal(s). Null key datums can now be returned by
extractValue and extractQuery functions, and will be stored in the index.
Also, placeholder entries are made for indexable items that are NULL or
contain no keys according to extractValue. This means that the index is
now always complete, having at least one entry for every indexed heap TID,
and so we can get rid of the prohibition on full-index scans. A full-index
scan is implemented much the same way as partial-match scans were already:
we build a bitmap representing all the TIDs found in the index, and then
drive the results off that.
Also, introduce a concept of a "search mode" that can be requested by
extractQuery when the operator requires matching to empty items (this is
just as cheap as matching to a single key) or requires a full index scan
(which is not so cheap, but it sure beats failing or giving wrong answers).
The behavior remains backward compatible for opclasses that don't return
any null keys or request a non-default search mode.
Using these features, we can now make the GIN index opclass for anyarray
behave in a way that matches the actual anyarray operators for &&, <@, @>,
and = ... which it failed to do before in assorted corner cases.
This commit fixes the core GIN code and ginarrayprocs.c, updates the
documentation, and adds some simple regression test cases for the new
behaviors using the array operators. The tsearch and contrib GIN opclass
support functions still need to be looked over and probably fixed.
Another thing I intend to fix separately is that this is pretty inefficient
for cases where more than one scan condition needs a full-index search:
we'll run duplicate GinScanEntrys, each one of which builds a large bitmap.
There is some existing logic to merge duplicate GinScanEntrys but it needs
refactoring to make it work for entries belonging to different scan keys.
Note that most of gin.h has been split out into a new file gin_private.h,
so that gin.h doesn't export anything that's not supposed to be used by GIN
opclasses or the rest of the backend. I did quite a bit of other code
beautification work as well, mostly fixing comments and choosing more
appropriate names for things.
The GIN code has absolutely no business exporting GIN-specific functions
with names as generic as compareItemPointers() or newScanKey(); that's
just trouble waiting to happen. I got annoyed about this again just now
and decided to fix it. This commit ensures that all global symbols
defined in access/gin/ have names including "gin" or "Gin". There were a
couple of cases, like names involving "PostingItem", where arguably the
names were already sufficiently nongeneric; but I figured as long as I was
risking creating merge problems for unapplied GIN patches I might as well
impose a uniform policy.
I didn't touch any static symbol names. There might be some places
where it'd be appropriate to rename some static functions to match
siblings that are exported, but I'll leave that for another time.
Although the key-combining code claimed to work correctly if its input
contained both lossy and exact pointers for a single page in a single TID
stream, in fact this did not work, and could not work without pretty
fundamental redesign. Modify keyGetItem so that it will not return such a
stream, by handling lossy-pointer cases a bit more explicitly than we did
before.
Per followup investigation of a gripe from Artur Dabrowski.
An example of a query that failed given his data set is
select count(*) from search_tab where
(to_tsvector('german', keywords ) @@ to_tsquery('german', 'ee:* | dd:*')) and
(to_tsvector('german', keywords ) @@ to_tsquery('german', 'aa:*'));
Back-patch to 8.4 where the lossy pointer code was introduced.
routines to make them behave better in the presence of "lossy" index pointers.
The previous coding was outright incorrect for some cases, as recently
reported by Artur Dabrowski: scanGetItem would fail to return index entries in
cases where one index key had multiple exact pointers on the same page as
another key had a lossy pointer. Also, keyGetItem was extremely inefficient
for cases where a single index key generates multiple "entry" streams, such as
an @@ operator with a multiple-clause tsquery. The presence of a lossy page
pointer in any one stream defeated its ability to use the opclass
consistentFn, resulting in probing many heap pages that didn't really need to
be visited. In Artur's example case, a query like
WHERE tsvector @@ to_tsquery('a & b')
was about 50X slower than the theoretically equivalent
WHERE tsvector @@ to_tsquery('a') AND tsvector @@ to_tsquery('b')
The way that I chose to fix this was to have GIN call the consistentFn
twice with both TRUE and FALSE values for the in-doubt entry stream,
returning a hit if either call produces TRUE, but not if they both return
FALSE. The code handles this for the case of a single in-doubt entry stream,
but punts (falling back to the stupid behavior) if there's more than one lossy
reference to the same page. The idea could be scaled up to deal with multiple
lossy references, but I think that would probably be wasted complexity. At
least to judge by Artur's example, such cases don't occur often enough to be
worth trying to optimize.
Back-patch to 8.4. 8.3 did not have lossy GIN index pointers, so not
subject to these problems.
User-defined consistent functions believes the check array
contains at least one true element which was not a true for
scanning pending list.
Per report from Yury Don <yura@vpcit.ru>
method to pass extra data to the consistent() and comparePartial() methods.
This is the core infrastructure needed to support the soon-to-appear
contrib/btree_gin module. The APIs are still upward compatible with the
definitions used in 8.3 and before, although *not* with the previous 8.4devel
function definitions.
catversion bump for changes in pg_proc entries (although these are just
cosmetic, since GIN doesn't actually look at the function signature before
calling it...)
Teodor Sigaev and Oleg Bartunov
multiple index entries in a holding area before adding them to the main index
structure. This helps because bulk insert is (usually) significantly faster
than retail insert for GIN.
This patch also removes GIN support for amgettuple-style index scans. The
API defined for amgettuple is difficult to support with fastupdate, and
the previously committed partial-match feature didn't really work with
it either. We might eventually figure a way to put back amgettuple
support, but it won't happen for 8.4.
catversion bumped because of change in GIN's pg_am entry, and because
the format of GIN indexes changed on-disk (there's a metapage now,
and possibly a pending list).
Teodor Sigaev
bitmap. This is extracted from Greg Stark's posix_fadvise patch; it seems
worth committing separately, since it's potentially useful independently of
posix_fadvise.
corresponding struct definitions. This allows other headers to avoid including
certain highly-loaded headers such as rel.h and relscan.h, instead using just
relcache.h, heapam.h or genam.h, which are more lightweight and thus cause less
unnecessary dependencies.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.