1
0
mirror of https://github.com/postgres/postgres.git synced 2025-11-15 03:41:20 +03:00
Commit Graph

70 Commits

Author SHA1 Message Date
Jeff Davis
3853a6956c Use C11 char16_t and char32_t for Unicode code points.
Reviewed-by: Tatsuo Ishii <ishii@postgresql.org>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/bedcc93d06203dfd89815b10f815ca2de8626e85.camel%40j-davis.com
2025-10-29 14:17:13 -07:00
Jeff Davis
90260e2ec6 Fix INITCAP() word boundaries for PG_UNICODE_FAST.
Word boundaries are based on whether a character is alphanumeric or
not. For the PG_UNICODE_FAST collation, alphanumeric includes
non-ASCII digits; whereas for the PG_C_UTF8 collation, it only
includes digits 0-9. Pass down the right information from the
pg_locale_t into initcap_wbnext to differentiate the behavior.

Reported-by: Noah Misch <noah@leadboat.com>
Reviewed-by: Noah Misch <noah@leadboat.com>
Discussion: https://postgr.es/m/20250417135841.33.nmisch@google.com
2025-04-21 12:34:58 -07:00
Peter Eisentraut
82a46cca99 Update Unicode data to Unicode 16.0.0
Reviewed-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://www.postgresql.org/message-id/flat/146349e4-4687-4321-91af-f235572490a8@eisentraut.org
2025-04-03 12:00:09 +02:00
Peter Eisentraut
bbf24fe2f1 Update code comment
Commit 4e7f62bc38 added a new input file to a script but didn't
update the comment listing the input files.
2025-04-03 09:20:25 +02:00
Peter Eisentraut
34f04aa653 Fix update-unicode make target
The addition of SpecialCasing.txt by commit 286a365b9c was not added
to the make target dependencies, so the invoked script would fail
because the required file wasn't downloaded first.  (The meson version
appears to work correctly.)
2025-04-03 09:20:25 +02:00
Jeff Davis
549ea06e42 Fix headerscheck warning.
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/93731.1742310701@sss.pgh.pa.us
2025-03-18 08:37:07 -07:00
Andrew Dunstan
5eabd91a83 Silence perl critic
Commit 27bdec0684 uses a loop variable that is not strictly local to
the loop. Perlcritic disapproves, and there's really no reason as the
variable is not used outside the loop.

Per buildfarm animals koel and crake.
2025-03-15 17:41:54 -04:00
Jeff Davis
27bdec0684 Optimization for lower(), upper(), casefold() functions.
Improve performance and reduce table sizes for case mapping.

The main case mapping table stores only 16-bit offsets, which can be
used to look up the mapped code point in any of the case tables (fold,
lower, upper, or title case). Simple case pairs point to the same
offsets.

Generate a function in generate-unicode_case_table.pl that consists of
a nested branches to test for specific codepoint ranges that determine
the offset in the main table.

Other approaches were considered, such as representing these ranges as
another structure (rather than branches in a generated function), or a
different approach such as a radix tree, or perfect hashing. The
author implemented and tested these alternatives and settled on the
generated branches.

Author: Alexander Borisov <lex.borisov@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: https://postgr.es/m/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com
2025-03-15 13:00:50 -07:00
Jeff Davis
4e7f62bc38 Add support for Unicode case folding.
Expand case mapping tables to include entries for case folding, which
are parsed from CaseFolding.txt.

Discussion: https://postgr.es/m/a1886ddfcd8f60cb3e905c93009b646b4cfb74c5.camel%40j-davis.com
2025-01-23 09:06:50 -08:00
Jeff Davis
286a365b9c Support Unicode full case mapping and conversion.
Generate tables from Unicode SpecialCasing.txt to support more
sophisticated case mapping behavior:

 * support case mappings to multiple codepoints, such as "ß"
   uppercasing to "SS"
 * support conditional case mappings, such as the "final sigma"
 * support titlecase variants, such as "dž" uppercasing to "DŽ" but
   titlecasing to "Dž"

Discussion: https://postgr.es/m/ddfd67928818f138f51635712529bc5e1d25e4e7.camel@j-davis.com
Discussion: https://postgr.es/m/27bb0e52-801d-4f73-a0a4-02cfdd4a9ada@eisentraut.org
Reviewed-by: Peter Eisentraut, Daniel Verite
2025-01-17 15:56:20 -08:00
Bruce Momjian
50e6eb731d Update copyright for 2025
Backpatch-through: 13
2025-01-01 11:21:55 -05:00
Álvaro Herrera
e6c32d9fad Clean up newlines following left parentheses
Most came in during the 17 cycle, so backpatch there.  Some
(particularly reorderbuffer.h) are very old, but backpatching doesn't
seem useful.

Like commits c9d2977519, c4f113e8fe.
2024-11-26 17:10:07 +01:00
Peter Eisentraut
2845cd1ca0 meson: Add missing dependency to unicode test programs
The test programs in src/common/unicode/ (case_test, category_test,
norm_test), don't build with meson if the nls option is enabled,
because a libintl dependency is missing.  Fix that.  (The makefiles
are ok.)

Reviewed-by: Aleksander Alekseev <aleksander@timescale.com>
Discussion: https://www.postgresql.org/message-id/flat/52db1d2b-4b96-473e-b323-a4b16a950fba%40eisentraut.org
2024-10-30 08:30:00 +01:00
Jeff Davis
a7f2f6adc2 Whitespace fixup from generated unicode tables.
When running the 'update-unicode' build target, generate files that
conform to pgindent whitespace rules.
2024-10-16 12:21:13 -07:00
Peter Eisentraut
cc70e170c0 Make all Perl warnings fatal, catch-up
Apply c538592959 to new Perl files that had missed the note.
2024-05-15 10:10:19 +02:00
David Rowley
8cb653b245 Fix incorrect year in some copyright notices added this year
A few patches that were written <= 2023 and committed in 2024 still
contained 2023 copyright year.  Fix that.

Discussion: https://postgr.es/m/CAApHDvr5egyW3FmHbAg-Uq2p_Aizwco1Zjs6Vbq18KqN64-hRA@mail.gmail.com
2024-05-15 15:01:21 +12:00
Peter Eisentraut
7e44ac3741 Update src/common/unicode/.gitignore
for new downloaded files and new build results.
2024-04-22 09:16:33 +02:00
Jeff Davis
b0be28761e Run perltidy on generate-unicode_version.pl. 2024-03-27 13:21:29 -07:00
Jeff Davis
503c0ad976 Fix convert_case(), introduced in 5c40364dd6.
Check source length before checking for NUL terminator to avoid
reading one byte past the string end. Also fix unreachable bug when
caller does not expect NUL-terminated result.

Add unit test coverage of convert_case() in case_test.c, which makes
it easier to reproduce the valgrind failure.

Discussion: https://postgr.es/m/7a9fd36d-7a38-4dc2-e676-fc939491a95a@gmail.com
Reported-by: Alexander Lakhin
2024-03-24 16:28:34 -07:00
Jeff Davis
f9f3fb1cb7 Update src/common/unicode/README.
Change the test description to include the case mapping
test. Oversight in 5c40364dd6.
2024-03-18 16:39:29 -07:00
Jeff Davis
5c40364dd6 Unicode case mapping tables and functions.
Implements Unicode simple case mapping, in which all code points map
to exactly one other code point unconditionally.

These tables are generated from UnicodeData.txt, which is already
being used by other infrastructure in src/common/unicode. The tables
are checked into the source tree, so they only need to be regenerated
when we update the Unicode version.

In preparation for the builtin collation provider, and possibly useful
for other callers.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Peter Eisentraut, Daniel Verite, Jeremy Schneider
2024-03-07 11:15:06 -08:00
Jeff Davis
ad49994538 Add Unicode property tables.
Provide functions to test for Unicode properties, such as Alphabetic
or Cased. These functions use tables derived from Unicode data files,
similar to the tables for Unicode normalization or general category,
and those tables can be updated with the 'update-unicode' build
target.

Use Unicode properties to provide functions to test for regex
character classes, like 'punct' or 'alnum'.

Infrastructure in preparation for a builtin collation provider, and
may also be useful for other callers.

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel%40j-davis.com
Reviewed-by: Daniel Verite, Peter Eisentraut, Jeremy Schneider
2024-03-06 12:50:01 -08:00
Jeff Davis
cf64d4e99f Cleanup for unicode-update build target and test.
In preparation for adding more Unicode tables.

Discussion: https://postgr.es/m/63cd8625-68fa-4760-844a-6b7f643336f2@ardentperf.com
Reviewed-by: Jeremy Schneider
2024-01-11 12:35:29 -08:00
Bruce Momjian
29275b1d17 Update copyright for 2024
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz

Backpatch-through: 12
2024-01-03 20:49:05 -05:00
Peter Eisentraut
c538592959 Make all Perl warnings fatal
There are a lot of Perl scripts in the tree, mostly code generation
and TAP tests.  Occasionally, these scripts produce warnings.  These
are probably always mistakes on the developer side (true positives).
Typical examples are warnings from genbki.pl or related when you make
a mess in the catalog files during development, or warnings from tests
when they massage a config file that looks different on different
hosts, or mistakes during merges (e.g., duplicate subroutine
definitions), or just mistakes that weren't noticed because there is a
lot of output in a verbose build.

This changes all warnings into fatal errors, by replacing

    use warnings;

by

    use warnings FATAL => 'all';

in all Perl files.

Discussion: https://www.postgresql.org/message-id/flat/06f899fd-1826-05ab-42d6-adeb1fd5e200%40eisentraut.org
2023-12-29 18:20:00 +01:00
Jeff Davis
719b342d36 Shrink Unicode category table.
Missing entries can implicitly be considered "unassigned".

Discussion: https://postgr.es/m/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.camel@j-davis.com
2023-12-07 15:44:03 -08:00
Peter Eisentraut
721856ff24 Remove distprep
A PostgreSQL release tarball contains a number of prebuilt files, in
particular files produced by bison, flex, perl, and well as html and
man documentation.  We have done this consistent with established
practice at the time to not require these tools for building from a
tarball.  Some of these tools were hard to get, or get the right
version of, from time to time, and shipping the prebuilt output was a
convenience to users.

Now this has at least two problems:

One, we have to make the build system(s) work in two modes: Building
from a git checkout and building from a tarball.  This is pretty
complicated, but it works so far for autoconf/make.  It does not
currently work for meson; you can currently only build with meson from
a git checkout.  Making meson builds work from a tarball seems very
difficult or impossible.  One particular problem is that since meson
requires a separate build directory, we cannot make the build update
files like gram.h in the source tree.  So if you were to build from a
tarball and update gram.y, you will have a gram.h in the source tree
and one in the build tree, but the way things work is that the
compiler will always use the one in the source tree.  So you cannot,
for example, make any gram.y changes when building from a tarball.
This seems impossible to fix in a non-horrible way.

Second, there is increased interest nowadays in precisely tracking the
origin of software.  We can reasonably track contributions into the
git tree, and users can reasonably track the path from a tarball to
packages and downloads and installs.  But what happens between the git
tree and the tarball is obscure and in some cases non-reproducible.

The solution for both of these issues is to get rid of the step that
adds prebuilt files to the tarball.  The tarball now only contains
what is in the git tree (*).  Getting the additional build
dependencies is no longer a problem nowadays, and the complications to
keep these dual build modes working are significant.  And of course we
want to get the meson build system working universally.

This commit removes the make distprep target altogether.  The make
dist target continues to do its job, it just doesn't call distprep
anymore.

(*) - The tarball also contains the INSTALL file that is built at make
dist time, but not by distprep.  This is unchanged for now.

The make maintainer-clean target, whose job it is to remove the
prebuilt files in addition to what make distclean does, is now just an
alias to make distprep.  (In practice, it is probably obsolete given
that git clean is available.)

The following programs are now hard build requirements in configure
(they were already required by meson.build):

- bison
- flex
- perl

Reviewed-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/e07408d9-e5f2-d9fd-5672-f53354e9305e@eisentraut.org
2023-11-06 15:18:04 +01:00
Jeff Davis
a02b37fc08 Additional unicode primitive functions.
Introduce unicode_version(), icu_unicode_version(), and
unicode_assigned().

The latter requires introducing a new lookup table for the Unicode
General Category, which is generated along with the other Unicode
lookup tables.

Discussion: https://postgr.es/m/CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
Reviewed-by: Peter Eisentraut
2023-11-01 22:47:06 -07:00
Peter Eisentraut
9d17e5f16f Update Unicode data to Unicode 15.1.0 2023-09-18 07:26:34 +02:00
Peter Eisentraut
5c08927d36 Make Unicode script fit for future versions
Between Unicode 15.0.0 and 15.1.0, the whitespace in
EastAsianWidth.txt has changed a bit, such as from

0020;Na          # Zs         SPACE

to

0020           ; Na # Zs         SPACE

with space around the semicolon.  Adjust the script to be able to
parse that.
2023-09-18 07:25:46 +02:00
Andres Freund
a1cd982098 meson: Add dependencies to perl modules to various script invocations
Eventually it is likely worth trying to deal with this in a more expansive
way, by generating dependency files generated within the scripts. But it's not
entirely obvious how to do that in perl and is work more suitable for 17
anyway.

Reported-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Reviewed-by: Tristan Partin <tristan@neon.tech>
Discussion: https://postgr.es/m/87v8g7s6bf.fsf@wibble.ilmari.org
2023-06-09 20:12:16 -07:00
Tom Lane
0245f8db36 Pre-beta mechanical code beautification.
Run pgindent, pgperltidy, and reformat-dat-files.

This set of diffs is a bit larger than typical.  We've updated to
pg_bsd_indent 2.1.2, which properly indents variable declarations that
have multi-line initialization expressions (the continuation lines are
now indented one tab stop).  We've also updated to perltidy version
20230309 and changed some of its settings, which reduces its desire to
add whitespace to lines to make assignments etc. line up.  Going
forward, that should make for fewer random-seeming changes to existing
code.

Discussion: https://postgr.es/m/20230428092545.qfb3y5wcu4cm75ur@alvherre.pgsql
2023-05-19 17:24:48 -04:00
Andres Freund
401874ab02 meson: don't require 'touch' binary, make use of 'cp' optional
We already didn't use touch (some earlier version of the meson build did ),
and cp is only used for updating unicode files. The latter already depends on
the optional availability of 'wget', so doing the same for 'cp' makes sense.

Eventually we probably want a portable command for updating source code as
part of a target, but for now...

Reported-by: Andrew Dunstan <andrew@dunslane.net>
Discussion: https://postgr.es/m/70e96c34-64ee-e549-8c4a-f91a7a668804@dunslane.net
2023-03-07 18:44:42 -08:00
Bruce Momjian
c8e1ba736b Update copyright for 2023
Backpatch-through: 11
2023-01-02 15:00:37 -05:00
Andrew Dunstan
8284cf5f74 Add copyright notices to meson files
Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
2022-12-20 07:54:39 -05:00
Andres Freund
e6927270cd meson: Add initial version of meson based build system
Autoconf is showing its age, fewer and fewer contributors know how to wrangle
it. Recursive make has a lot of hard to resolve dependency issues and slow
incremental rebuilds. Our home-grown MSVC build system is hard to maintain for
developers not using Windows and runs tests serially. While these and other
issues could individually be addressed with incremental improvements, together
they seem best addressed by moving to a more modern build system.

After evaluating different build system choices, we chose to use meson, to a
good degree based on the adoption by other open source projects.

We decided that it's more realistic to commit a relatively early version of
the new build system and mature it in tree.

This commit adds an initial version of a meson based build system. It supports
building postgres on at least AIX, FreeBSD, Linux, macOS, NetBSD, OpenBSD,
Solaris and Windows (however only gcc is supported on aix, solaris). For
Windows/MSVC postgres can now be built with ninja (faster, particularly for
incremental builds) and msbuild (supporting the visual studio GUI, but
building slower).

Several aspects (e.g. Windows rc file generation, PGXS compatibility, LLVM
bitcode generation, documentation adjustments) are done in subsequent commits
requiring further review. Other aspects (e.g. not installing test-only
extensions) are not yet addressed.

When building on Windows with msbuild, builds are slower when using a visual
studio version older than 2019, because those versions do not support
MultiToolTask, required by meson for intra-target parallelism.

The plan is to remove the MSVC specific build system in src/tools/msvc soon
after reaching feature parity. However, we're not planning to remove the
autoconf/make build system in the near future. Likely we're going to keep at
least the parts required for PGXS to keep working around until all supported
versions build with meson.

Some initial help for postgres developers is at
https://wiki.postgresql.org/wiki/Meson

With contributions from Thomas Munro, John Naylor, Stone Tickle and others.

Author: Andres Freund <andres@anarazel.de>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Author: Peter Eisentraut <peter@eisentraut.org>
Reviewed-By: Peter Eisentraut <peter.eisentraut@enterprisedb.com>
Discussion: https://postgr.es/m/20211012083721.hvixq4pnh2pixr3j@alap3.anarazel.de
2022-09-21 22:37:17 -07:00
John Naylor
0bd9c62973 Treat Unicode codepoints of category "Format" as non-spacing
Commit d8594d123 updated the list of non-spacing codepoints used
for calculating display width, but in doing so inadvertently removed
some, since the script used for that commit only considered combining
characters.

For complete coverage for zero-width characters, include codepoints in
the category Cf (Format). To reflect the wider purpose, also rename files
and update comments that referred specifically to combining characters.

Some of these ranges have been missing since v12, but due to lack of
field complaints it was determined not important enough to justify adding
special-case logic the backbranches.

Kyotaro Horiguchi

Report by Pavel Stehule
Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRBE8yvpQ0FSkPCoe0Ny1jAAsAQ6j3qMgVwWvkqAoaaNmQ%40mail.gmail.com
2022-09-13 16:13:33 +07:00
Andres Freund
c8a9246e09 Add output directory argument to generate-unicode_norm_table.pl
This is in preparation for building postgres with meson / ninja.

When building with meson, commands are run at the root of the build tree. Add
an option to put build output into the appropriate place.

Author: Andres Freund <andres@anarazel.de>
Author: Peter Eisentraut <peter@eisentraut.org>
Author: Nazir Bilal Yavuz <byavuz81@gmail.com>
Discussion: https://postgr.es/m/5e216522-ba3c-f0e6-7f97-5276d0270029@enterprisedb.com
2022-07-18 12:24:39 -07:00
Peter Eisentraut
c64fb698d0 Make update-unicode target work in vpath builds
Author: Andres Freund <andres@anarazel.de>
Discussion: https://www.postgresql.org/message-id/616c6873-83b5-85c0-93cb-548977c39c60@enterprisedb.com
2022-03-25 09:47:50 +01:00
Bruce Momjian
27b77ecf9f Update copyright for 2022
Backpatch-through: 10
2022-01-07 19:04:57 -05:00
John Naylor
5bc429aacb Extend collection of Unicode combining characters to beyond the BMP
The former limit was perhaps a carryover from an older hand-coded
table. Since commit bab982161 we have enough space in mbinterval to
store larger codepoints, so collect all combining characters.

Discussion: https://www.postgresql.org/message-id/49ad1fa0-174e-c901-b14c-c484b60907f1%40enterprisedb.com
2021-08-26 13:07:34 -04:00
John Naylor
bab982161e Update display widths as part of updating Unicode
The hardcoded "wide character" set in ucs_wcwidth() was last updated
around the Unicode 5.0 era.  This led to misalignment when printing
emojis and other codepoints that have since been designated
wide or full-width.

To fix and keep up to date, extend update-unicode to download the list
of wide and full-width codepoints from the offical sources.

In passing, remove some comments about non-spacing characters that
haven't been accurate since we removed the former hardcoded logic.

Jacob Champion

Reported and reviewed by Pavel Stehule
Discussion: https://www.postgresql.org/message-id/flat/CAFj8pRCeX21O69YHxmykYySYyprZAqrKWWg0KoGKdjgqcGyygg@mail.gmail.com
2021-08-26 10:53:56 -04:00
John Naylor
1563ecbc1b Revert "Rename unicode_combining_table to unicode_width_table"
This reverts commit eb0d0d2c73.

After I had committed eb0d0d2c7 and 78ab944cd, I decided to add
a sanity check for a "can't happen" scenario just to be cautious.
It turned out that it already happened in the official Unicode source
data, namely that a character can be both wide and a combining
character. This fact renders the aforementioned commits unnecessary,
so revert both of them.

Discussion: https://www.postgresql.org/message-id/CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g%40mail.gmail.com
2021-08-26 10:06:12 -04:00
John Naylor
f8c8a8bccc Revert "Change mbbisearch to return the character range"
This reverts commit 78ab944cd4.

After I had committed eb0d0d2c7 and 78ab944cd, I decided to add
a sanity check for a "can't happen" scenario just to be cautious.
It turned out that it already happened in the official Unicode source
data, namely that a character can be both wide and a combining
character. This fact renders the aforementioned commits unnecessary,
so revert both of them.

Discussion:
https://www.postgresql.org/message-id/CAFBsxsH5ejH4-1xaTLpSK8vWoK1m6fA1JBtTM6jmBsLfmDki1g%40mail.gmail.com
2021-08-26 09:58:28 -04:00
John Naylor
78ab944cd4 Change mbbisearch to return the character range
Add a width field to mbinterval and have mbbisearch return a
pointer to the found range rather than just bool for success.
A future commit will add another width besides zero, and this
will allow that to use the same search.

Reviewed by Jacob Champion
Discussion: https://www.postgresql.org/message-id/CAFBsxsGOCpzV7c-f3a8ADsA1n4uZ%3D8puCctQp%2Bx7W0vgkv%3Dw%2Bg%40mail.gmail.com
2021-08-25 13:08:11 -04:00
John Naylor
eb0d0d2c73 Rename unicode_combining_table to unicode_width_table
No functional changes. A future commit will use this table for
other purposes besides combining characters.
2021-08-25 13:01:35 -04:00
Bruce Momjian
ca3b37487b Update copyright for 2021
Backpatch-through: 9.5
2021-01-02 13:06:25 -05:00
Michael Paquier
ceaeac54f7 Fix minor issues with new unicode {de,re}composition code
The table generation script would incorrectly complain in the
recomposition sorting when matching code points.  This would not have
caused the generation of an incorrect table.  Note that this condition
is not reachable yet, but could have been reached with future updates.

pg_bswap.h does not need to be included in the frontend.x

Author: John Naylor
Discussion: https://postgr.es/m/CAFBsxsGWmExpvv=61vtDKCs7+kBbhkwBDL2Ph9CacziFKnV_yw@mail.gmail.com
2020-11-07 10:15:58 +09:00
Michael Paquier
783f0cc64d Improve performance of Unicode {de,re}composition in the backend
This replaces the existing binary search with two perfect hash functions
for the composition and the decomposition in the backend code, at the
cost of slightly-larger binaries there (35kB in libpgcommon_srv.a).  Per
the measurements done, this improves the speed of the recomposition and
decomposition by up to 30~40 times for the NFC and NFKC conversions,
while all other operations get at least 40% faster.  This is not as
"good" as what libicu has, but it closes the gap a lot as per the
feedback from Daniel Verite.

The decomposition table remains the same, getting used for the binary
search in the frontend code, where we care more about the size of the
libraries like libpq over performance as this gets involved only in code
paths related to the SCRAM authentication.  In consequence, note that
the perfect hash function for the recomposition needs to use a new
inverse lookup array back to to the existing decomposition table.

The size of all frontend deliverables remains unchanged, even with
--enable-debug, including libpq.

Author: John Naylor
Reviewed-by: Michael Paquier, Tom Lane
Discussion: https://postgr.es/m/CAFBsxsHUuMFCt6-pU+oG-F1==CmEp8wR+O+bRouXWu6i8kXuqA@mail.gmail.com
2020-10-23 11:05:46 +09:00
Michael Paquier
80f8eb79e2 Use perfect hash for NFC and NFKC Unicode Normalization quick check
This makes the normalization quick check about 30% faster for NFC and
50% faster for NFKC than the binary search used previously.  The hash
lookup reuses the existing array of bit fields used for the binary
search to get the quick check property and is generated as part of "make
update-unicode" in src/common/unicode/.

Author: John Naylor
Reviewed-by: Mark Dilger, Michael Paquier
Discussion: https://postgr.es/m/CACPNZCt4fbJ0_bGrN5QPt34N4whv=mszM0LMVQdoa2rC9UMRXA@mail.gmail.com
2020-10-11 19:09:01 +09:00