mirror of https://github.com/postgres/postgres.git synced 2025-07-28 23:42:10 +03:00

Add SQL functions for Unicode normalization

This adds SQL expressions NORMALIZE() and IS NORMALIZED to convert and
check Unicode normal forms, per SQL standard.

To support fast IS NORMALIZED tests, we pull in a new data file
DerivedNormalizationProps.txt from Unicode and build a lookup table
from that, using techniques similar to ones already used for other
Unicode data.  make update-unicode will keep it up to date.  We only
build and use these tables for the NFC and NFKC forms, because the
tables would be too big for NFD and NFKD and the improvement is not
significant enough there.
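
The check-before-normalize idea can be illustrated outside the server. The following is a minimal sketch in Python, not the PostgreSQL implementation: it relies on the standard library's unicodedata module (is_normalized needs Python 3.8 or later), and the helper name normalize_if_needed is hypothetical.

```python
import unicodedata

# The four forms accepted by NORMALIZE() and IS NORMALIZED.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_if_needed(s: str, form: str = "NFC") -> str:
    """Return s in the given form, skipping the conversion if s already conforms."""
    if form not in FORMS:
        raise ValueError(f"unknown normalization form: {form}")
    # Cheap test first; only convert when the string is not yet normalized.
    if unicodedata.is_normalized(form, s):
        return s
    return unicodedata.normalize(form, s)


# 'a' followed by COMBINING DIAERESIS, the same input as U&'\0061\0308bc'.
decomposed = "a\u0308bc"
composed = normalize_if_needed(decomposed, "NFC")
```

Here decomposed is already in NFD but not in NFC, so the call above performs the conversion, while a second call on composed returns its argument unchanged.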

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
Peter Eisentraut
2020-03-26 08:14:00 +01:00
parent 070c3d3937
commit 2991ac5fc9
20 changed files with 6764 additions and 7 deletions


@@ -934,6 +934,16 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
such as pattern matching operations. Therefore, they should be used
only in cases where they are specifically wanted.
</para>
<tip>
<para>
To handle text in different Unicode normalization forms, you can also
use the functions/expressions
<function>normalize</function> and <literal>is normalized</literal> to
preprocess or check the strings, instead of using nondeterministic
collations. Each approach has different trade-offs.
</para>
</tip>
</sect3>
</sect2>
</sect1>


@@ -1560,6 +1560,30 @@
<entry><literal>Value: 42</literal></entry>
</row>
<row>
<entry>
<indexterm>
<primary>normalized</primary>
</indexterm>
<indexterm>
<primary>Unicode normalization</primary>
</indexterm>
<literal><parameter>string</parameter> is <optional>not</optional> <optional><parameter>form</parameter></optional> normalized</literal>
</entry>
<entry><type>boolean</type></entry>
<entry>
Checks whether the string is in the specified Unicode normalization
form. The optional parameter specifies the form:
<literal>NFC</literal> (default), <literal>NFD</literal>,
<literal>NFKC</literal>, <literal>NFKD</literal>. This expression can
only be used if the server encoding is <literal>UTF8</literal>. Note
that checking for normalization using this expression is often faster
than normalizing possibly already normalized strings.
</entry>
<entry><literal>U&amp;'\0061\0308bc' IS NFD NORMALIZED</literal></entry>
<entry><literal>true</literal></entry>
</row>
<row>
<entry>
<indexterm>
@@ -1610,6 +1634,30 @@
<entry><literal>tom</literal></entry>
</row>
<row>
<entry>
<indexterm>
<primary>normalize</primary>
</indexterm>
<indexterm>
<primary>Unicode normalization</primary>
</indexterm>
<literal><function>normalize(<parameter>string</parameter> <type>text</type>
<optional>, <parameter>form</parameter> </optional>)</function></literal>
</entry>
<entry><type>text</type></entry>
<entry>
Converts the string in the first argument to the specified Unicode
normalization form. The optional second argument specifies the form
as an identifier: <literal>NFC</literal> (default),
<literal>NFD</literal>, <literal>NFKC</literal>,
<literal>NFKD</literal>. This function can only be used if the server
encoding is <literal>UTF8</literal>.
</entry>
<entry><literal>normalize(U&amp;'\0061\0308bc', NFC)</literal></entry>
<entry><literal>U&amp;'\00E4bc'</literal></entry>
</row>
<row>
<entry>
<indexterm>