Mirror of https://github.com/postgres/postgres.git (synced 2025-07-05 07:21:24 +03:00)
If the input word exceeds 1000 bytes, don't pass it to the stemmer;
just return it as-is after case folding.  Such an input is surely
not a word in any human language, so whatever the stemmer might
do to it would be pretty dubious in the first place.  Adding this
restriction protects us against a known recursion-to-stack-overflow
problem in the Turkish stemmer, and it seems like good insurance
against any other safety or performance issues that may exist in
the Snowball stemmers.  (I note, for example, that they contain no
CHECK_FOR_INTERRUPTS calls, so we really don't want them running
for a long time.)  The threshold of 1000 bytes is arbitrary.

An alternative definition could have been to treat such words as
stopwords, but that seems like a bigger break from the old behavior.

Per report from Egor Chindyaskin and Alexander Lakhin.
Thanks to Olly Betts for the recommendation to fix it this way.

Discussion: https://postgr.es/m/1661334672.728714027@f473.i.mail.ru
src/backend/snowball/README

Snowball-Based Stemming
=======================

This module uses the word stemming code developed by the Snowball project,
http://snowballstem.org (formerly http://snowball.tartarus.org)
which is released by them under a BSD-style license.

The Snowball project does not often make formal releases; it's best
to pull from their git repository

	git clone https://github.com/snowballstem/snowball.git

and then building the derived files is as simple as

	cd snowball
	make

At least on Linux, no platform-specific adjustment is needed.

Postgres' files under src/backend/snowball/libstemmer/ and
src/include/snowball/libstemmer/ are taken directly from the Snowball
files, with only some minor adjustments of file inclusions.  Note
that most of these files are in fact derived files, not original source.
The original sources are in the Snowball language, and are built using
the Snowball-to-C compiler that is also part of the Snowball project.
We choose to include the derived files in the PostgreSQL distribution
because most installations will not have the Snowball compiler available.

We are currently synced with the Snowball git commit
4764395431c8f2a0b4fe18b816ab1fc966a45837 (tag v2.1.0)
of 2021-01-21.

To update the PostgreSQL sources from a new Snowball version:

0. If you didn't do it already, "make -C snowball".

1. Copy the *.c files in snowball/src_c/ to src/backend/snowball/libstemmer
with replacement of "../runtime/header.h" by "header.h", for example

	for f in .../snowball/src_c/*.c
	do
		sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f`
	done

Do not copy stemmers that are listed in libstemmer/modules.txt as
nonstandard, such as "german2" or "lovins".

2. Copy the *.c files in snowball/runtime/ to
src/backend/snowball/libstemmer, and edit them to remove direct inclusions
of system headers such as <stdio.h> --- they should only include "header.h".
(This removal avoids portability problems on some platforms where <stdio.h>
is sensitive to largefile compilation options.)

3. Copy the *.h files in snowball/src_c/ and snowball/runtime/
to src/include/snowball/libstemmer.  At this writing the header files
do not require any changes.

4. Check whether any stemmer modules have been added or removed.  If so, edit
the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the
stemmer_modules[] table in dict_snowball.c, as well as the list in the
documentation in textsearch.sgml.  You might also need to change
the LANGUAGES list in Makefile and tsearch_config_languages in initdb.c.

5. The various stopword files in stopwords/ must be downloaded
individually from pages on the snowballstem.org website.
Be careful that these files must be stored in UTF-8 encoding.
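
Steps 2 and 3 above could be scripted roughly as follows.  This is only a
sketch under assumptions, not part of the documented procedure: the function
name sync_runtime and its two arguments (the Snowball checkout and the top of
the PostgreSQL source tree) are hypothetical, and the sed expression that
deletes every #include <...> line is one guess at what "edit them" amounts to
in step 2.  Step 1 and the checks in step 4 are not covered.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL or Snowball):
#   sync_runtime SNOWBALL_DIR PG_SRC_DIR
# Sketches steps 2 and 3 only.
sync_runtime() {
    snowball=$1     # path to the Snowball checkout (assumed already built)
    pgsrc=$2        # path to the top of the PostgreSQL source tree
    cdir="$pgsrc/src/backend/snowball/libstemmer"
    hdir="$pgsrc/src/include/snowball/libstemmer"

    # Step 2: copy runtime/*.c, dropping direct system-header inclusions
    # (lines of the form #include <...>); #include "header.h" is kept.
    # Assumption: every angle-bracket include should go.
    for f in "$snowball"/runtime/*.c
    do
        sed '/^[[:space:]]*#[[:space:]]*include[[:space:]]*<.*>/d' "$f" \
            >"$cdir/$(basename "$f")" || return 1
    done

    # Step 3: copy the header files unchanged.
    cp "$snowball"/src_c/*.h "$snowball"/runtime/*.h "$hdir/" || return 1
}
```

It might be invoked as "sync_runtime ../snowball ." from the top of the
PostgreSQL tree (both paths are placeholders).  Since the include removal is
purely textual, it seems wise to review the resulting diff by hand before
committing.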