1
0
mirror of https://github.com/postgres/postgres.git synced 2025-08-31 17:02:12 +03:00
Commit Graph

75 Commits

Author SHA1 Message Date
Tom Lane
f8fc6082b4 Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats().
We must filter out hashtable entries with frequencies less than those
specified by the algorithm, else we risk emitting junk entries whose
actual frequency is much less than other lexemes that did not get
tabulated.  This is bad enough by itself, but even worse is that
tsquerysel() believes that the minimum frequency seen in pg_statistic is a
hard upper bound for lexemes not included, and was thus underestimating
the frequency of non-MCEs.

Also, set the threshold frequency to something with a little bit of theory
behind it, to wit assume that the input distribution is approximately
Zipfian.  This might need adjustment in future, but some preliminary
experiments suggest that it's not too unreasonable.

Back-patch to 8.4, where this code was introduced.

Jan Urbanski, with some editorialization by Tom
2010-05-30 21:59:09 +00:00
Tom Lane
b382cbb044 Avoid core dump on empty thesaurus dictionary.
Per report from Robert Gravsjö.
2009-11-30 16:38:40 +00:00
Peter Eisentraut
4272c8724f Make text search parser accept underscores in XML attributes (bug #5075) 2009-11-15 13:55:42 +00:00
Tom Lane
ba5317237f Remove duplicate variable initializations identified by clang static checker.
One of these represents a nontrivial bug (a promptly-leaked palloc), so
backpatch.

Greg Stark
2009-08-30 16:53:37 +00:00
Bruce Momjian
d747140279 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list
provided by Andrew.
2009-06-11 14:49:15 +00:00
Tom Lane
a734979e0a Fix tsquerysel() to not fail on an empty TSQuery. Per report from
Tatsuo Ishii.
2009-06-03 18:42:13 +00:00
Teodor Sigaev
e43bb5beb7 Some languages have symbols with zero display's width or/and vowels/signs which
are not an alphabetic character although they are not word-breakers too.
So, treat them as part of word.

Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and
and Bal Krishna Bal <balkrishna7bal@gmail.com> about Nepali language and
Devanagari alphabet.
2009-03-11 16:03:40 +00:00
Teodor Sigaev
42831729f7 Prevent recursion during parse of email-like string with multiple '@'.
Patch by Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
2009-03-10 17:32:14 +00:00
Teodor Sigaev
32032d42b5 Fix usage of char2wchar/wchar2char. Changes:
- pg_wchar and wchar_t could have different size, so char2wchar
  doesn't call pg_mb2wchar_with_len to prevent out-of-bound
  memory bug
- make char2wchar/wchar2char symmetric, now they should not be
  called with C-locale because mbstowcs/wcstombs oftenly doesn't
  work correct with C-locale.
- Text parser uses pg_mb2wchar_with_len directly in case of
  C-locale and multibyte encoding

Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and
following discussion.

Backpatch up to 8.2 when multybyte support was implemented in tsearch.
2009-03-02 15:10:09 +00:00
Teodor Sigaev
b5b3134813 Fix incorrect dereferencing of char* to array's index.
Per Tommy Gildseth <tommy.gildseth@usit.uio.no> report
2009-01-29 16:22:10 +00:00
Teodor Sigaev
41d17e042b Fix URL generation in headline. Only tag lexeme will be replaced by space.
Per http://archives.postgresql.org/pgsql-bugs/2008-12/msg00013.php
2009-01-15 16:33:59 +00:00
Teodor Sigaev
8fd07a35ba Fix generation too long headline with ShortWords.
Per http://archives.postgresql.org/pgsql-hackers/2008-09/msg01088.php
2009-01-15 16:33:28 +00:00
Bruce Momjian
511db38ace Update copyright for 2009. 2009-01-01 17:24:05 +00:00
Tom Lane
301194f8ea Reduce the scaling factor for attstattarget to number-of-lexemes from 100
to 10, to compensate for the recent change in default statistics target.
The original number was pulled out of the air anyway :-(, but it was picked
in the context of the old default, so holding the default size of the
MCELEM array constant seems the best thing.  Per discussion.
2008-12-15 15:06:31 +00:00
Tom Lane
65e3ea7641 Increase the default value of default_statistics_target from 10 to 100,
and its maximum value from 1000 to 10000.  ALTER TABLE SET STATISTICS
similarly now allows a value up to 10000.  Per discussion.
2008-12-13 19:13:44 +00:00
Heikki Linnakangas
a93b3b98cd Fix bug in the tsvector stats collection function, which caused a crash if
the sample contains just a one tsvector, containing only one lexeme.
2008-11-27 21:17:39 +00:00
Tom Lane
2b74d45c1b pg_do_encoding_conversion cannot return NULL (at least not unless the input
is NULL), so remove some useless tests for the case.
2008-11-10 15:18:40 +00:00
Teodor Sigaev
2a0083ede8 Improve headeline generation. Now headline can contain
several fragments a-la Google.

Sushant Sinha <sushant354@gmail.com>
2008-10-17 18:05:19 +00:00
Teodor Sigaev
906b7e5f6c Fix small bug in headline generation.
Patch from Sushant Sinha <sushant354@gmail.com>
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php
2008-10-17 17:27:46 +00:00
Tom Lane
4e57668da4 Create a selectivity estimation function for the text search @@ operator.
Jan Urbanski
2008-09-19 19:03:41 +00:00
Tom Lane
6f6d863258 Create a type-specific typanalyze routine for tsvector, which collects stats
on the most common individual lexemes in place of the mostly-useless default
behavior of counting duplicate tsvectors.  Future work: create selectivity
estimation functions that actually do something with these stats.

(Some other things we ought to look at doing: using the Lossy Counting
algorithm in compute_minimal_stats, and using the element-counting idea for
stats on regular arrays.)

Jan Urbanski
2008-07-14 00:51:46 +00:00
Tom Lane
30dc388a0d Fix a few places that were non-multibyte-safe in tsearch configuration file
parsing.  Per bug #4253 from Giorgio Valoti.
2008-06-19 16:52:24 +00:00
Tom Lane
fbeb9da22b Improve error reporting for problems in text search configuration files
by installing an error context subroutine that will provide the file name
and line number for all errors detected while reading a config file.
Some of the reader routines were already doing that in an ad-hoc way for
errors detected directly in the reader, but it didn't help for problems
detected in subroutines, such as encoding violations.

Back-patch to 8.3 because 8.3 is where people will be trying to debug
configuration files.
2008-06-18 20:55:42 +00:00
Bruce Momjian
9de09c087d Move wchar2char() and char2wchar() from tsearch into /mb to be easier to
use for other modules;  also move pnstrdup().

Clean up code slightly.
2008-06-18 18:42:54 +00:00
Bruce Momjian
dc69c0362f Move USE_WIDE_UPPER_LOWER define to c.h, and remove TS_USE_WIDE and use
USE_WIDE_UPPER_LOWER instead.
2008-06-17 16:09:06 +00:00
Tom Lane
e6dbcb72fa Extend GIN to support partial-match searches, and extend tsquery to support
prefix matching using this facility.

Teodor Sigaev and Oleg Bartunov
2008-05-16 16:31:02 +00:00
Alvaro Herrera
f8c4d7db60 Restructure some header files a bit, in particular heapam.h, by removing some
unnecessary #include lines in it.  Also, move some tuple routine prototypes and
macros to htup.h, which allows removal of heapam.h inclusion from some .c
files.

For this to work, a new header file access/sysattr.h needed to be created,
initially containing attribute numbers of system columns, for pg_dump usage.

While at it, make contrib ltree, intarray and hstore header files more
consistent with our header style.
2008-05-12 00:00:54 +00:00
Tom Lane
220db7ccd8 Simplify and standardize conversions between TEXT datums and ordinary C
strings.  This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString.  A number of
existing macros that provided variants on these themes were removed.

Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed.  There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).

This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring.  We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.

Brendan Jurd and Tom Lane
2008-03-25 22:42:46 +00:00
Tom Lane
7953fdcd9e Add a CaseSensitive parameter to synonym dictionaries.
Simon Riggs
2008-03-10 03:01:28 +00:00
Teodor Sigaev
3b8bca335d Fix memory arrangement of tsquery after removing stop words. It causes
a unused memory holes in tsquery.

Per report by Richard Huxton <dev@archonet.com>.

It was working well because in fact tsquery->size is not used for any
kind of operation except comparing tsqueries. So, in HEAD it's enough to
fix to_tsquery function, but for previous version it's needed to
remove optimization in CompareTSQ to prevent requirement of renew all
stored tsquery.
2008-03-07 14:30:20 +00:00
Bruce Momjian
910bc51862 When text search string is too long, in error message report actual and
maximum number of bytes allowed.
2008-03-05 15:50:37 +00:00
Peter Eisentraut
0474dcb608 Refactor backend makefiles to remove lots of duplicate code 2008-02-19 10:30:09 +00:00
Peter Eisentraut
a345dcd2f7 Observe errors in makefile 2008-02-18 16:04:32 +00:00
Tom Lane
716e8b8374 Fix RS_isRegis() to agree exactly with RS_compile()'s idea of what's a valid
regis.  Correct the latter's oversight that a bracket-expression needs to be
terminated.  Reduce the ereports to elogs, since they are now not expected to
ever be hit (thus addressing Alvaro's original complaint).
In passing, const-ify the string argument to RS_compile.
2008-01-21 02:46:11 +00:00
Teodor Sigaev
cd42dd5a17 Fix core dump with buffer-overrun by too long infinitive. Add checking of using
fixed length arrays to prevent array's overrun. Per report by
Hannes Dorbath <light@theendofthetunnel.de> and comments by Tom.
2008-01-16 13:01:03 +00:00
Tom Lane
deb7deda26 Tweak new error message to conform to style guidelines. 2008-01-15 18:22:47 +00:00
Teodor Sigaev
f7807f1de8 Add check of headline method presence. Per report by Yoshiyuki Asaba <y-asaba@sraoss.co.jp> 2008-01-15 17:16:01 +00:00
Bruce Momjian
9098ab9e32 Update copyrights in source tree to 2008. 2008-01-01 19:46:01 +00:00
Peter Eisentraut
f5f1355dc4 Wording improvements 2007-12-27 13:02:48 +00:00
Tom Lane
bb0e3011f8 Make a cleanup pass over error reports in tsearch code. Use ereport
for user-facing errors, fix some poor choices of errcode, adhere to
message style guide.
2007-11-28 21:56:30 +00:00
Peter Eisentraut
a238bd146d Proper capitalization of Ispell 2007-11-28 15:42:46 +00:00
Peter Eisentraut
2609345c85 Improve terminology 2007-11-28 13:30:36 +00:00
Bruce Momjian
43e082fc98 Change a stop word on the right-hand-side in the thesaurus file to be an
ERROR, not NOTICE.
2007-11-28 04:24:38 +00:00
Andrew Dunstan
5575826b70 Allow for X as well as x to be the prefix for hexadecimal character ref entity numbers,
as in HTML.
2007-11-25 19:35:41 +00:00
Andrew Dunstan
3de1f0daac Fix XML tag namespace change inadvertantly missed from previous fix. Add
regression test for XML names and numeric entities.
2007-11-25 15:37:11 +00:00
Tom Lane
ae3ff7adf7 Fix (I think) broken usage of MultiByteToWideChar. I had missed the
subtlety that this function only returns a null terminator if it's
fed input that includes one; which, in the usage here, it's not.
This probably fixes bugs reported by Thomas Haegi.
2007-11-24 21:20:07 +00:00
Andrew Dunstan
1157f3cc81 Change descriptions of entity and tag objects to "XML entity" and "XML tag".
Allow tag and entity names that follow XML rules. Provide for hexadecimal
as well as decimal numeric entities. Adjust code names to coincide with
new descriptions.
2007-11-20 02:25:22 +00:00
Bruce Momjian
f6e8730d11 Re-run pgindent with updated list of typedefs. (Updated README should
avoid this problem in the future.)
2007-11-15 22:25:18 +00:00
Bruce Momjian
fdf5a5efb7 pgindent run for 8.3. 2007-11-15 21:14:46 +00:00
Tom Lane
ca450a07ee Add an Accept parameter to "simple" dictionaries. The default of true
gives the old behavior; selecting false allows the dictionary to be used
as a filter ahead of other dictionaries, because it will pass on rather
than accept words that aren't in its stopword list.
Jan Urbanski
2007-11-14 18:36:37 +00:00