mirror of https://github.com/postgres/postgres.git synced 2025-07-27 12:41:57 +03:00

Improve support of Hunspell in ispell dictionary.

It is now possible to load recent versions of Hunspell dictionaries for
several languages. To handle these dictionaries, the patch adds support for:
* FLAG long - sets the double extended ASCII character flag type
* FLAG num - sets the decimal number flag type (from 1 to 65535)
* AF parameter - alias for a set of flags

It also moves the test dictionaries into a separate directory.

Author: Artur Zakirov with editing by me
Teodor Sigaev
2016-03-04 20:08:10 +03:00
parent 9445db925e
commit d78a7d9c7f
15 changed files with 1105 additions and 91 deletions

@ -2615,18 +2615,41 @@ SELECT plainto_tsquery('supernova star');
</para>
<para>
To create an <application>Ispell</> dictionary, use the built-in
<literal>ispell</literal> template and specify several parameters:
To create an <application>Ispell</> dictionary perform these steps:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Download the dictionary configuration files. <productname>OpenOffice</>
extension files have the <filename>.oxt</> extension. Extract the
<filename>.aff</> and <filename>.dic</> files and change their
extensions to <filename>.affix</> and <filename>.dict</>. Some
dictionary files must also be converted to the UTF-8 encoding, for
example, for the Norwegian language dictionary:
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_ispell (
TEMPLATE = ispell,
DictFile = english,
AffFile = english,
StopWords = english
);
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
</programlisting>
</para>
</listitem>
<listitem>
<para>
Copy the files to the <filename>$SHAREDIR/tsearch_data</> directory.
</para>
</listitem>
<listitem>
<para>
Load the files into PostgreSQL with the following command:
<programlisting>
CREATE TEXT SEARCH DICTIONARY english_hunspell (
TEMPLATE = ispell,
DictFile = en_us,
AffFile = en_us,
Stopwords = english);
</programlisting>
</para>
</listitem>
</itemizedlist>
<para>
Here, <literal>DictFile</>, <literal>AffFile</>, and <literal>StopWords</>
@ -2642,6 +2665,56 @@ CREATE TEXT SEARCH DICTIONARY english_ispell (
example, a Snowball dictionary, which recognizes everything.
</para>
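The file-preparation step described earlier (extract the <filename>.aff</> and <filename>.dic</> files, rename them, convert them to UTF-8) can also be scripted. The following is a minimal Python sketch, not part of the patch. It relies on an <filename>.oxt</> file being a zip archive; the function name <literal>extract_hunspell_files</> and its parameters are hypothetical, and the default source encoding simply mirrors the <literal>ISO_8859-1</> used in the <literal>iconv</> example.

```python
import zipfile
from pathlib import Path


def extract_hunspell_files(oxt_path, out_dir, source_encoding="ISO-8859-1"):
    """Pull .aff/.dic files out of an .oxt archive and write UTF-8 copies
    named <lowercase stem>.affix / <lowercase stem>.dict into out_dir.

    Hypothetical helper: the source encoding must be supplied by the caller,
    because it differs from dictionary to dictionary.
    """
    renames = {".aff": ".affix", ".dic": ".dict"}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(oxt_path) as archive:
        for member in archive.namelist():
            name = Path(member)
            new_ext = renames.get(name.suffix.lower())
            if new_ext is None:
                continue  # not a dictionary file (or a directory entry)
            # Decode from the original encoding, then re-encode as UTF-8.
            text = archive.read(member).decode(source_encoding)
            # Lowercase the file name, as nn_NO.aff -> nn_no.affix does.
            target = out_dir / (name.stem.lower() + new_ext)
            target.write_bytes(text.encode("utf-8"))
            written.append(target)
    return written
```

The resulting files would then be copied to <filename>$SHAREDIR/tsearch_data</> as in the second step.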
<para>
The <filename>.affix</> file of <application>Ispell</> has the following
structure:
<programlisting>
prefixes
flag *A:
. > RE # As in enter > reenter
suffixes
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
</programlisting>
</para>
<para>
And the <filename>.dict</> file has the following structure:
<programlisting>
lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS
</programlisting>
</para>
<para>
The format of the <filename>.dict</> file is:
<programlisting>
basic_form/affix_class_name
</programlisting>
</para>
<para>
In the <filename>.affix</> file every affix flag is described in the
following format:
<programlisting>
condition > [-stripping_letters,] adding_affix
</programlisting>
</para>
<para>
Here, condition has a format similar to the format of regular expressions.
It can use groupings <literal>[...]</> and <literal>[^...]</>.
For example, <literal>[AEIOU]Y</> means that the last letter of the word
is <literal>"y"</> and the penultimate letter is <literal>"a"</>,
<literal>"e"</>, <literal>"i"</>, <literal>"o"</> or <literal>"u"</>.
<literal>[^EY]</> means that the last letter is neither <literal>"e"</>
nor <literal>"y"</>.
</para>
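To make the rule format concrete, here is a minimal Python sketch of applying one <application>Ispell</> suffix rule to a word. It is an illustration only, not how PostgreSQL implements it. The function name <literal>apply_ispell_suffix</> is hypothetical, and the sketch assumes that the condition can be treated as a regular expression anchored at the end of the word and that matching is case-insensitive. Prefix rules are not handled.

```python
import re


def apply_ispell_suffix(word, rule):
    """Apply a single 'condition > [-strip,] add' suffix rule.

    Returns the derived word in lowercase, or None when the condition
    (or the letters to strip) does not match the end of the word.
    """
    rule = rule.split("#", 1)[0]                  # drop a trailing comment
    condition, action = (part.strip() for part in rule.split(">", 1))
    strip, add = "", action
    if action.startswith("-"):                    # optional "-strip,add" form
        strip, add = (part.strip() for part in action[1:].split(",", 1))
    # Assumption: the condition is regex-like and constrains the word's end.
    if not re.search(condition + "$", word, re.IGNORECASE):
        return None
    stem = word.lower()
    if strip:
        if not stem.endswith(strip.lower()):
            return None
        stem = stem[: -len(strip)]
    return stem + add.lower()


# Rules taken from the .affix example shown earlier.
print(apply_ispell_suffix("dirty", "[^AEIOU]Y > -Y,IEST  # As in dirty > dirtiest"))
print(apply_ispell_suffix("gray", "[AEIOU]Y > EST"))
print(apply_ispell_suffix("dirty", "[AEIOU]Y > EST"))   # condition fails
```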
<para>
Ispell dictionaries support splitting compound words;
a useful feature.
@ -2663,6 +2736,65 @@ SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
</programlisting>
</para>
<para>
The <application>MySpell</> format is a subset of the <application>Hunspell</> format.
The <filename>.affix</> file of <application>Hunspell</> has the following
structure:
<programlisting>
PFX A Y 1
PFX A 0 re .
SFX T N 4
SFX T 0 st e
SFX T y iest [^aeiou]y
SFX T 0 est [aeiou]y
SFX T 0 est [^ey]
</programlisting>
</para>
<para>
The first line of an affix class is the header. The fields of each affix
rule listed after the header are:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
parameter name (PFX or SFX)
</para>
</listitem>
<listitem>
<para>
flag (name of the affix class)
</para>
</listitem>
<listitem>
<para>
characters to strip from the beginning (for a prefix) or the end (for a
suffix) of the word
</para>
</listitem>
<listitem>
<para>
affix to add
</para>
</listitem>
<listitem>
<para>
condition that has a format similar to the format of regular expressions.
</para>
</listitem>
</itemizedlist>
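The five fields listed above map directly onto a small record. The following Python sketch, offered as an illustration rather than as PostgreSQL's actual parser, reads one affix class in the <application>Hunspell</> layout and applies a rule to a word. The names <literal>AffixRule</>, <literal>parse_affix_class</> and <literal>apply_rule</> are hypothetical. The sketch assumes that <literal>0</> in the stripping field means "strip nothing", as the example rules suggest, and that the condition can be used as a regular expression.

```python
import re
from typing import NamedTuple


class AffixRule(NamedTuple):
    kind: str        # parameter name: "PFX" or "SFX"
    flag: str        # name of the affix class
    strip: str       # characters to strip ("" when the file says "0")
    add: str         # affix to add
    condition: str   # regex-like condition


def parse_affix_class(lines):
    """Skip the header (first line) and parse the remaining rule lines."""
    rules = []
    for line in lines[1:]:
        kind, flag, strip, add, condition = line.split()[:5]
        rules.append(AffixRule(kind, flag, "" if strip == "0" else strip, add, condition))
    return rules


def apply_rule(word, rule):
    """Return the derived word, or None if the rule does not apply."""
    if rule.kind == "SFX":
        # Suffix: condition and stripped characters refer to the word's end.
        if not re.search(rule.condition + "$", word) or not word.endswith(rule.strip):
            return None
        return word[: len(word) - len(rule.strip)] + rule.add
    # Prefix: condition and stripped characters refer to the word's start.
    if not re.match(rule.condition, word) or not word.startswith(rule.strip):
        return None
    return rule.add + word[len(rule.strip):]


suffixes = parse_affix_class([
    "SFX T N 4",
    "SFX T 0 st e",
    "SFX T y iest [^aeiou]y",
    "SFX T 0 est [aeiou]y",
    "SFX T 0 est [^ey]",
])
print(apply_rule("dirty", suffixes[1]))
```

Only the first matching rule of a class would normally be wanted for a given word; choosing among rules is left out of this sketch.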
<para>
The <filename>.dict</> file looks like the <filename>.dict</> file of
<application>Ispell</>:
<programlisting>
larder/M
lardy/RT
large/RSPMYT
largehearted
</programlisting>
</para>
<note>
<para>
<application>MySpell</> does not support compound words.