postgres

mirror of https://github.com/postgres/postgres.git synced 2025-08-31 17:02:12 +03:00

Author	SHA1	Message	Date
Tom Lane	f8fc6082b4	Fix misuse of Lossy Counting (LC) algorithm in compute_tsvector_stats(). We must filter out hashtable entries with frequencies less than those specified by the algorithm, else we risk emitting junk entries whose actual frequency is much less than other lexemes that did not get tabulated. This is bad enough by itself, but even worse is that tsquerysel() believes that the minimum frequency seen in pg_statistic is a hard upper bound for lexemes not included, and was thus underestimating the frequency of non-MCEs. Also, set the threshold frequency to something with a little bit of theory behind it, to wit assume that the input distribution is approximately Zipfian. This might need adjustment in future, but some preliminary experiments suggest that it's not too unreasonable. Back-patch to 8.4, where this code was introduced. Jan Urbanski, with some editorialization by Tom	2010-05-30 21:59:09 +00:00
Tom Lane	b382cbb044	Avoid core dump on empty thesaurus dictionary. Per report from Robert Gravsjö.	2009-11-30 16:38:40 +00:00
Peter Eisentraut	4272c8724f	Make text search parser accept underscores in XML attributes (bug #5075)	2009-11-15 13:55:42 +00:00
Tom Lane	ba5317237f	Remove duplicate variable initializations identified by clang static checker. One of these represents a nontrivial bug (a promptly-leaked palloc), so backpatch. Greg Stark	2009-08-30 16:53:37 +00:00
Bruce Momjian	d747140279	8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew.	2009-06-11 14:49:15 +00:00
Tom Lane	a734979e0a	Fix tsquerysel() to not fail on an empty TSQuery. Per report from Tatsuo Ishii.	2009-06-03 18:42:13 +00:00
Teodor Sigaev	e43bb5beb7	Some languages have symbols with zero display width and/or vowels/signs which are not alphabetic characters, although they are not word-breakers either. So, treat them as part of a word. Per off-list discussion with Dibyendra Hyoju <dibyendra@gmail.com> and Bal Krishna Bal <balkrishna7bal@gmail.com> about the Nepali language and Devanagari alphabet.	2009-03-11 16:03:40 +00:00
Teodor Sigaev	42831729f7	Prevent recursion during parse of email-like string with multiple '@'. Patch by Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>	2009-03-10 17:32:14 +00:00
Teodor Sigaev	32032d42b5	Fix usage of char2wchar/wchar2char. Changes: - pg_wchar and wchar_t could have different sizes, so char2wchar doesn't call pg_mb2wchar_with_len, to prevent an out-of-bounds memory bug - make char2wchar/wchar2char symmetric; they should no longer be called with C-locale because mbstowcs/wcstombs often don't work correctly with C-locale - Text parser uses pg_mb2wchar_with_len directly in case of C-locale and multibyte encoding. Per bug report by Hiroshi Inoue <inoue@tpf.co.jp> and following discussion. Backpatch up to 8.2, when multibyte support was implemented in tsearch.	2009-03-02 15:10:09 +00:00
Teodor Sigaev	b5b3134813	Fix incorrect dereferencing of char* to array's index. Per Tommy Gildseth <tommy.gildseth@usit.uio.no> report	2009-01-29 16:22:10 +00:00
Teodor Sigaev	41d17e042b	Fix URL generation in headline. Only tag lexeme will be replaced by space. Per http://archives.postgresql.org/pgsql-bugs/2008-12/msg00013.php	2009-01-15 16:33:59 +00:00
Teodor Sigaev	8fd07a35ba	Fix generation of too-long headline with ShortWords. Per http://archives.postgresql.org/pgsql-hackers/2008-09/msg01088.php	2009-01-15 16:33:28 +00:00
Bruce Momjian	511db38ace	Update copyright for 2009.	2009-01-01 17:24:05 +00:00
Tom Lane	301194f8ea	Reduce the scaling factor for attstattarget to number-of-lexemes from 100 to 10, to compensate for the recent change in default statistics target. The original number was pulled out of the air anyway :-(, but it was picked in the context of the old default, so holding the default size of the MCELEM array constant seems the best thing. Per discussion.	2008-12-15 15:06:31 +00:00
Tom Lane	65e3ea7641	Increase the default value of default_statistics_target from 10 to 100, and its maximum value from 1000 to 10000. ALTER TABLE SET STATISTICS similarly now allows a value up to 10000. Per discussion.	2008-12-13 19:13:44 +00:00
Heikki Linnakangas	a93b3b98cd	Fix bug in the tsvector stats collection function, which caused a crash if the sample contains just one tsvector, containing only one lexeme.	2008-11-27 21:17:39 +00:00
Tom Lane	2b74d45c1b	pg_do_encoding_conversion cannot return NULL (at least not unless the input is NULL), so remove some useless tests for the case.	2008-11-10 15:18:40 +00:00
Teodor Sigaev	2a0083ede8	Improve headline generation. Now headline can contain several fragments a la Google. Sushant Sinha <sushant354@gmail.com>	2008-10-17 18:05:19 +00:00
Teodor Sigaev	906b7e5f6c	Fix small bug in headline generation. Patch from Sushant Sinha <sushant354@gmail.com> http://archives.postgresql.org/pgsql-hackers/2008-07/msg00785.php	2008-10-17 17:27:46 +00:00
Tom Lane	4e57668da4	Create a selectivity estimation function for the text search @@ operator. Jan Urbanski	2008-09-19 19:03:41 +00:00
Tom Lane	6f6d863258	Create a type-specific typanalyze routine for tsvector, which collects stats on the most common individual lexemes in place of the mostly-useless default behavior of counting duplicate tsvectors. Future work: create selectivity estimation functions that actually do something with these stats. (Some other things we ought to look at doing: using the Lossy Counting algorithm in compute_minimal_stats, and using the element-counting idea for stats on regular arrays.) Jan Urbanski	2008-07-14 00:51:46 +00:00
Tom Lane	30dc388a0d	Fix a few places that were non-multibyte-safe in tsearch configuration file parsing. Per bug #4253 from Giorgio Valoti.	2008-06-19 16:52:24 +00:00
Tom Lane	fbeb9da22b	Improve error reporting for problems in text search configuration files by installing an error context subroutine that will provide the file name and line number for all errors detected while reading a config file. Some of the reader routines were already doing that in an ad-hoc way for errors detected directly in the reader, but it didn't help for problems detected in subroutines, such as encoding violations. Back-patch to 8.3 because 8.3 is where people will be trying to debug configuration files.	2008-06-18 20:55:42 +00:00
Bruce Momjian	9de09c087d	Move wchar2char() and char2wchar() from tsearch into /mb to be easier to use for other modules; also move pnstrdup(). Clean up code slightly.	2008-06-18 18:42:54 +00:00
Bruce Momjian	dc69c0362f	Move USE_WIDE_UPPER_LOWER define to c.h, and remove TS_USE_WIDE and use USE_WIDE_UPPER_LOWER instead.	2008-06-17 16:09:06 +00:00
Tom Lane	e6dbcb72fa	Extend GIN to support partial-match searches, and extend tsquery to support prefix matching using this facility. Teodor Sigaev and Oleg Bartunov	2008-05-16 16:31:02 +00:00
Alvaro Herrera	f8c4d7db60	Restructure some header files a bit, in particular heapam.h, by removing some unnecessary #include lines in it. Also, move some tuple routine prototypes and macros to htup.h, which allows removal of heapam.h inclusion from some .c files. For this to work, a new header file access/sysattr.h needed to be created, initially containing attribute numbers of system columns, for pg_dump usage. While at it, make contrib ltree, intarray and hstore header files more consistent with our header style.	2008-05-12 00:00:54 +00:00
Tom Lane	220db7ccd8	Simplify and standardize conversions between TEXT datums and ordinary C strings. This patch introduces four support functions cstring_to_text, cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and two macros CStringGetTextDatum and TextDatumGetCString. A number of existing macros that provided variants on these themes were removed. Most of the places that need to make such conversions now require just one function or macro call, in place of the multiple notational layers that used to be needed. There are no longer any direct calls of textout or textin, and we got most of the places that were using handmade conversions via memcpy (there may be a few still lurking, though). This commit doesn't make any serious effort to eliminate transient memory leaks caused by detoasting toasted text objects before they reach text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few places where it was easy, but much more could be done. Brendan Jurd and Tom Lane	2008-03-25 22:42:46 +00:00
Tom Lane	7953fdcd9e	Add a CaseSensitive parameter to synonym dictionaries. Simon Riggs	2008-03-10 03:01:28 +00:00
Teodor Sigaev	3b8bca335d	Fix memory arrangement of tsquery after removing stop words. It caused unused memory holes in tsquery. Per report by Richard Huxton <dev@archonet.com>. It was working well because in fact tsquery->size is not used for any kind of operation except comparing tsqueries. So, in HEAD it's enough to fix the to_tsquery function, but for previous versions it's necessary to remove the optimization in CompareTSQ to avoid requiring a renewal of all stored tsqueries.	2008-03-07 14:30:20 +00:00
Bruce Momjian	910bc51862	When text search string is too long, in error message report actual and maximum number of bytes allowed.	2008-03-05 15:50:37 +00:00
Peter Eisentraut	0474dcb608	Refactor backend makefiles to remove lots of duplicate code	2008-02-19 10:30:09 +00:00
Peter Eisentraut	a345dcd2f7	Observe errors in makefile	2008-02-18 16:04:32 +00:00
Tom Lane	716e8b8374	Fix RS_isRegis() to agree exactly with RS_compile()'s idea of what's a valid regis. Correct the latter's oversight that a bracket-expression needs to be terminated. Reduce the ereports to elogs, since they are now not expected to ever be hit (thus addressing Alvaro's original complaint). In passing, const-ify the string argument to RS_compile.	2008-01-21 02:46:11 +00:00
Teodor Sigaev	cd42dd5a17	Fix core dump with buffer-overrun by too long infinitive. Add checking of using fixed length arrays to prevent array's overrun. Per report by Hannes Dorbath <light@theendofthetunnel.de> and comments by Tom.	2008-01-16 13:01:03 +00:00
Tom Lane	deb7deda26	Tweak new error message to conform to style guidelines.	2008-01-15 18:22:47 +00:00
Teodor Sigaev	f7807f1de8	Add check of headline method presence. Per report by Yoshiyuki Asaba <y-asaba@sraoss.co.jp>	2008-01-15 17:16:01 +00:00
Bruce Momjian	9098ab9e32	Update copyrights in source tree to 2008.	2008-01-01 19:46:01 +00:00
Peter Eisentraut	f5f1355dc4	Wording improvements	2007-12-27 13:02:48 +00:00
Tom Lane	bb0e3011f8	Make a cleanup pass over error reports in tsearch code. Use ereport for user-facing errors, fix some poor choices of errcode, adhere to message style guide.	2007-11-28 21:56:30 +00:00
Peter Eisentraut	a238bd146d	Proper capitalization of Ispell	2007-11-28 15:42:46 +00:00
Peter Eisentraut	2609345c85	Improve terminology	2007-11-28 13:30:36 +00:00
Bruce Momjian	43e082fc98	Change a stop word on the right-hand-side in the thesaurus file to be an ERROR, not NOTICE.	2007-11-28 04:24:38 +00:00
Andrew Dunstan	5575826b70	Allow for X as well as x to be the prefix for hexadecimal character ref entity numbers, as in HTML.	2007-11-25 19:35:41 +00:00
Andrew Dunstan	3de1f0daac	Fix XML tag namespace change inadvertently missed from previous fix. Add regression test for XML names and numeric entities.	2007-11-25 15:37:11 +00:00
Tom Lane	ae3ff7adf7	Fix (I think) broken usage of MultiByteToWideChar. I had missed the subtlety that this function only returns a null terminator if it's fed input that includes one; which, in the usage here, it's not. This probably fixes bugs reported by Thomas Haegi.	2007-11-24 21:20:07 +00:00
Andrew Dunstan	1157f3cc81	Change descriptions of entity and tag objects to "XML entity" and "XML tag". Allow tag and entity names that follow XML rules. Provide for hexadecimal as well as decimal numeric entities. Adjust code names to coincide with new descriptions.	2007-11-20 02:25:22 +00:00
Bruce Momjian	f6e8730d11	Re-run pgindent with updated list of typedefs. (Updated README should avoid this problem in the future.)	2007-11-15 22:25:18 +00:00
Bruce Momjian	fdf5a5efb7	pgindent run for 8.3.	2007-11-15 21:14:46 +00:00
Tom Lane	ca450a07ee	Add an Accept parameter to "simple" dictionaries. The default of true gives the old behavior; selecting false allows the dictionary to be used as a filter ahead of other dictionaries, because it will pass on rather than accept words that aren't in its stopword list. Jan Urbanski	2007-11-14 18:36:37 +00:00
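
Commit f8fc6082b4 in the list above concerns the final filtering step of the Lossy Counting algorithm. As a rough illustration only, here is a generic Python sketch of the textbook algorithm, not the PostgreSQL C code in compute_tsvector_stats(). The function names, the `support` and `epsilon` parameters, and the sample data are all hypothetical and chosen for this example. It shows why entries left in the table at the end of the stream must still be checked against the `(support - epsilon) * n` cutoff before being reported.

```python
import math


def lossy_count(stream, epsilon):
    """Textbook Lossy Counting pass (illustrative; names are hypothetical).

    Returns (entries, n), where entries maps item -> (count, delta) and
    delta is the maximum number of occurrences that may have been missed.
    """
    bucket_width = math.ceil(1 / epsilon)
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / bucket_width)  # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            # A new entry may have been seen (and pruned) in earlier buckets.
            entries[item] = (1, bucket - 1)
        if n % bucket_width == 0:
            # Bucket boundary: drop entries that cannot be frequent enough.
            entries = {k: (c, d) for k, (c, d) in entries.items()
                       if c + d > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Final filter: keep only entries whose count reaches the cutoff."""
    cutoff = (support - epsilon) * n
    return {k: c for k, (c, _delta) in entries.items() if c >= cutoff}


if __name__ == "__main__":
    # Two common lexemes followed by 25 lexemes that each occur once.
    sample = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(25)]
    table, total = lossy_count(sample, epsilon=0.1)
    print(sorted(table))  # still contains a few rare leftovers
    print(frequent_items(table, total, support=0.2, epsilon=0.1))
```

In this sample the stream does not end on a bucket boundary, so the table still holds a few lexemes that were seen only once. Without the final cutoff they would be reported next to the genuinely common ones, which is the kind of junk entry the commit message describes.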

75 Commits