mirror of
https://github.com/postgres/postgres.git
synced 2025-06-13 07:41:39 +03:00
When we grabbed this file off the Snowball project's website, we mistakenly supposed that it was in LATIN1 encoding, but evidently it was actually in LATIN2. This resulted in ő (o-double-acute, U+0151, which is code 0xF5 in LATIN2) being misconverted into õ (o-tilde, U+00F5), as complained of in bug #10589 from Zoltán Sörös. We'd have messed up u-double-acute too, but there aren't any of those in the file. Other characters used in the file have the same codes in LATIN1 and LATIN2, which no doubt helped hide the problem for so long. The error is not only ours: the Snowball project also was confused about which encoding is required for Hungarian. But dealing with that will require source-code changes that I'm not at all sure we'll wish to back-patch. Fixing the stopword file seems reasonably safe to back-patch however.
src/backend/snowball/README Snowball-Based Stemming ======================= This module uses the word stemming code developed by the Snowball project, http://snowball.tartarus.org/ which is released by them under a BSD-style license. The files under src/backend/snowball/libstemmer/ and src/include/snowball/libstemmer/ are taken directly from their libstemmer_c distribution, with only some minor adjustments of file inclusions. Note that most of these files are in fact derived files, not master source. The master sources are in the Snowball language, and are available along with the Snowball-to-C compiler from the Snowball project. We choose to include the derived files in the PostgreSQL distribution because most installations will not have the Snowball compiler available. To update the PostgreSQL sources from a new Snowball libstemmer_c distribution: 1. Copy the *.c files in libstemmer_c/src_c/ to src/backend/snowball/libstemmer with replacement of "../runtime/header.h" by "header.h", for example for f in libstemmer_c/src_c/*.c do sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f` done (Alternatively, if you rebuild the stemmer files from the master Snowball sources, just omit "-r ../runtime" from the Snowball compiler switches.) 2. Copy the *.c files in libstemmer_c/runtime/ to src/backend/snowball/libstemmer, and edit them to remove direct inclusions of system headers such as <stdio.h> --- they should only include "header.h". (This removal avoids portability problems on some platforms where <stdio.h> is sensitive to largefile compilation options.) 3. Copy the *.h files in libstemmer_c/src_c/ and libstemmer_c/runtime/ to src/include/snowball/libstemmer. At this writing the header files do not require any changes. 4. Check whether any stemmer modules have been added or removed. If so, edit the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the stemmer_modules[] table in dict_snowball.c. 5. The various stopword files in stopwords/ must be downloaded individually from pages on the snowball.tartarus.org website. Be careful that these files must be stored in UTF-8 encoding.