postgres

mirror of https://github.com/postgres/postgres.git synced 2025-11-16 15:02:33 +03:00

Jeff Davis 27bdec0684 Optimization for lower(), upper(), casefold() functions.

Improve performance and reduce table sizes for case mapping.

The main case mapping table stores only 16-bit offsets, which can be
used to look up the mapped code point in any of the case tables (fold,
lower, upper, or title case). Simple case pairs point to the same
offsets.

Generate a function in generate-unicode_case_table.pl that consists of
nested branches testing for specific codepoint ranges, which determine
the offset in the main table.

Other approaches were considered, such as representing these ranges as
a separate data structure (rather than branches in a generated
function), a radix tree, or perfect hashing. The author implemented
and tested these alternatives and settled on the generated branches.

Author: Alexander Borisov <lex.borisov@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com

2025-03-15 13:00:50 -07:00
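
The table layout described in the commit message can be illustrated
with a minimal C sketch. It is not the code generated by
generate-unicode_case_table.pl: the names toy_case_index, toy_case_map,
toy_lower and toy_upper are hypothetical, only ASCII letters are
covered, and offset 0 is assumed here to mean "no mapping".

```c
#include <assert.h>
#include <stdint.h>

typedef enum { TOY_LOWER, TOY_UPPER } ToyCaseKind;

/*
 * Case tables, indexed by offset. Offset 0 is reserved for "no mapping"
 * (an assumption of this sketch). Assumes an ASCII-compatible charset.
 */
static const uint32_t toy_lower[27] = {
    0,
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
};

static const uint32_t toy_upper[27] = {
    0,
    'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
    'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'
};

/*
 * Main table: 16-bit offsets into the case tables. Slots 1..26 belong to
 * 'A'..'Z' and slots 27..52 to 'a'..'z'. Both members of a simple case
 * pair store the same offset.
 */
static const uint16_t toy_case_map[53] = {
    0,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
    14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
    14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
};

/* Nested range tests that pick the slot in the main table (0 = none). */
static uint16_t
toy_case_index(uint32_t cp)
{
    if (cp < 0x80)
    {
        if (cp >= 'A' && cp <= 'Z')
            return (uint16_t) (cp - 'A' + 1);
        if (cp >= 'a' && cp <= 'z')
            return (uint16_t) (cp - 'a' + 27);
    }
    return 0;
}

/* Map one code point; code points without a mapping are returned as-is. */
static uint32_t
toy_case_convert(uint32_t cp, ToyCaseKind kind)
{
    uint16_t    slot = toy_case_index(cp);
    uint16_t    offset;

    assert(slot < sizeof(toy_case_map) / sizeof(toy_case_map[0]));
    offset = toy_case_map[slot];
    if (offset == 0)
        return cp;
    return (kind == TOY_LOWER) ? toy_lower[offset] : toy_upper[offset];
}
```

In this sketch toy_case_convert('A', TOY_LOWER) and
toy_case_convert('a', TOY_LOWER) both resolve to the same offset, so a
single entry per case table serves both members of the pair.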

File                                     Last commit                                               Date
.gitignore                               Update src/common/unicode/.gitignore                      2024-04-22 09:16:33 +02:00
case_test.c                              Add support for Unicode case folding.                     2025-01-23 09:06:50 -08:00
category_test.c                          Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-norm_test_table.pl              Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_case_table.pl           Optimization for lower(), upper(), casefold() functions.  2025-03-15 13:00:50 -07:00
generate-unicode_category_table.pl       Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_east_asian_fw_table.pl  Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_nonspacing_table.pl     Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_norm_table.pl           Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_normprops_table.pl      Update copyright for 2025                                 2025-01-01 11:21:55 -05:00
generate-unicode_version.pl                                                                        2025-01-01 11:21:55 -05:00
Makefile                                 Add support for Unicode case folding.                     2025-01-23 09:06:50 -08:00
meson.build                              Add support for Unicode case folding.                     2025-01-23 09:06:50 -08:00
norm_test.c                                                                                        2025-01-01 11:21:55 -05:00
README                                   Update src/common/unicode/README.                         2024-03-18 16:39:29 -07:00

README

This directory contains tools to download new Unicode data files and
generate static tables. These tables are used to normalize or
determine various properties of Unicode data.

The generated header files are copied to src/include/common/, and
included in the source tree, so these tools are not normally required
to build PostgreSQL.

Update Unicode Version
----------------------

Edit src/Makefile.global.in and src/common/unicode/meson.build
to update the UNICODE_VERSION.

Then, generate the new header files with:

    make update-unicode

or if using meson:

    ninja update-unicode

from the top level of the source tree. Examine the result to make sure
the changes look reasonable (that is, that the size and scope of the
diff are comparable to the Unicode changes since the last update), and
then commit it.

Tests
-----

Normalization tests:

The Unicode Consortium publishes a comprehensive test suite for the
normalization algorithm in a file called NormalizationTest.txt. This
directory also contains a Perl script and some C code that run our
normalization code against all the test strings in
NormalizationTest.txt. To download NormalizationTest.txt and run the
tests:

    make normalization-check

This is also run as part of the update-unicode target.
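
As a rough illustration of the input these tests consume: each data
line of NormalizationTest.txt holds five semicolon-separated columns
(source, NFC, NFD, NFKC, NFKD), each a space-separated list of
hexadecimal code points, followed by a comment. The C sketch below
parses one such line. It is not the code in this directory;
NormTestCase, parse_norm_test_line and the 32-code-point limit are
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NORM_COLUMNS    5
#define MAX_CODEPOINTS  32      /* arbitrary limit for this sketch */

typedef struct
{
    uint32_t    cp[NORM_COLUMNS][MAX_CODEPOINTS];
    size_t      len[NORM_COLUMNS];
} NormTestCase;

/*
 * Parse one line such as
 *     1E0A;1E0A;0044 0307;1E0A;0044 0307; # comment
 * Returns 1 on success, 0 for comments, "@Part" headers, blank lines,
 * or malformed input.
 */
static int
parse_norm_test_line(const char *line, NormTestCase *out)
{
    const char *p = line;

    assert(line != NULL && out != NULL);

    if (*p == '#' || *p == '@' || *p == '\n' || *p == '\0')
        return 0;

    for (int col = 0; col < NORM_COLUMNS; col++)
    {
        out->len[col] = 0;

        /* read hex code points until the column's closing ';' */
        while (*p != ';')
        {
            char           *end;
            unsigned long   cp = strtoul(p, &end, 16);

            if (end == p || cp > 0x10FFFF ||
                out->len[col] == MAX_CODEPOINTS)
                return 0;
            out->cp[col][out->len[col]++] = (uint32_t) cp;

            p = end;
            while (*p == ' ')
                p++;
        }
        p++;                    /* skip ';' */
    }
    return 1;
}
```

A conformance check would then compare, for example, the NFC form of
column 1 against column 2 of the parsed line.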

Category, Property and Case tests:

The files case_test.c and category_test.c test Unicode categories,
properties, and case mapping by exhaustively comparing results with
ICU. For these tests to be effective, the version of the Unicode data
files must be similar to the version of Unicode on which ICU is
based. Mismatched Unicode versions will cause the tests to skip over
codepoints that are assigned in one version and not the other, and may
falsely report failures. These tests are run as part of the
update-unicode target.