We must filter out hashtable entries with frequencies less than those
specified by the algorithm, else we risk emitting junk entries whose
actual frequency is much less than other lexemes that did not get
tabulated. This is bad enough by itself, but even worse is that
tsquerysel() believes that the minimum frequency seen in pg_statistic is a
hard upper bound for lexemes not included, and was thus underestimating
the frequency of non-MCEs.
Also, set the threshold frequency to something with a little bit of theory
behind it, to wit assume that the input distribution is approximately
Zipfian. This might need adjustment in future, but some preliminary
experiments suggest that it's not too unreasonable.
Back-patch to 8.4, where this code was introduced.
Jan Urbanski, with some editorialization by Tom
are not an alphabetic character although they are not word-breakers too.
So, treat them as part of word.
Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and
and Bal Krishna Bal <balkrishna7bal@gmail.com> about Nepali language and
Devanagari alphabet.
- pg_wchar and wchar_t could have different size, so char2wchar
doesn't call pg_mb2wchar_with_len to prevent out-of-bound
memory bug
- make char2wchar/wchar2char symmetric, now they should not be
called with C-locale because mbstowcs/wcstombs oftenly doesn't
work correct with C-locale.
- Text parser uses pg_mb2wchar_with_len directly in case of
C-locale and multibyte encoding
Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and
following discussion.
Backpatch up to 8.2 when multybyte support was implemented in tsearch.
to 10, to compensate for the recent change in default statistics target.
The original number was pulled out of the air anyway :-(, but it was picked
in the context of the old default, so holding the default size of the
MCELEM array constant seems the best thing. Per discussion.
on the most common individual lexemes in place of the mostly-useless default
behavior of counting duplicate tsvectors. Future work: create selectivity
estimation functions that actually do something with these stats.
(Some other things we ought to look at doing: using the Lossy Counting
algorithm in compute_minimal_stats, and using the element-counting idea for
stats on regular arrays.)
Jan Urbanski
by installing an error context subroutine that will provide the file name
and line number for all errors detected while reading a config file.
Some of the reader routines were already doing that in an ad-hoc way for
errors detected directly in the reader, but it didn't help for problems
detected in subroutines, such as encoding violations.
Back-patch to 8.3 because 8.3 is where people will be trying to debug
configuration files.
unnecessary #include lines in it. Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.
For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.
While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
strings. This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString. A number of
existing macros that provided variants on these themes were removed.
Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed. There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).
This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.
Brendan Jurd and Tom Lane
a unused memory holes in tsquery.
Per report by Richard Huxton <dev@archonet.com>.
It was working well because in fact tsquery->size is not used for any
kind of operation except comparing tsqueries. So, in HEAD it's enough to
fix to_tsquery function, but for previous version it's needed to
remove optimization in CompareTSQ to prevent requirement of renew all
stored tsquery.
regis. Correct the latter's oversight that a bracket-expression needs to be
terminated. Reduce the ereports to elogs, since they are now not expected to
ever be hit (thus addressing Alvaro's original complaint).
In passing, const-ify the string argument to RS_compile.
subtlety that this function only returns a null terminator if it's
fed input that includes one; which, in the usage here, it's not.
This probably fixes bugs reported by Thomas Haegi.
Allow tag and entity names that follow XML rules. Provide for hexadecimal
as well as decimal numeric entities. Adjust code names to coincide with
new descriptions.
gives the old behavior; selecting false allows the dictionary to be used
as a filter ahead of other dictionaries, because it will pass on rather
than accept words that aren't in its stopword list.
Jan Urbanski