Replace the mapping tables used to convert between UTF-8 and other
character encodings with new radix tree-based maps. Looking up an entry in
a radix tree is much faster than a binary search in the old maps. As a
bonus, the radix tree representation is also more compact, making the
binaries slightly smaller.
The "combined" maps work the same as before, with binary search. They are
much smaller than the main tables, so it doesn't matter so much. However,
the "combined" maps are now stored in the same .map files as the main
tables. This seems more clear, since they're always used together, and
generated from the same source files.
Patch by Kyotaro Horiguchi, with lot of hacking by me at various stages.
Reviewed by Michael Paquier and Daniel Gustafsson.
Discussion: https://www.postgresql.org/message-id/20170306.171609.204324917.horiguchi.kyotaro%40lab.ntt.co.jp
Generate EUC_CN mappings from gb-18030-2000.xml, because GB2312.TXT is no
longer available.
Get UHC from windows-949-2000.xml, it's more up-to-date.
Plus tons more small changes. With these changes, the perl scripts
faithfully produce the *.map files we have in the repository, from the
external source files.
In the passing, fix the Makefile to also download CP932.TXT and CP950.TXT.
Based on patches by Kyotaro Horiguchi, reviewed by Daniel Gustafsson.
Discussion: https://postgr.es/m/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
The common style is to pad to 4 digits.
Running the current perl scripts to generate these map files would override
this change, but the next commit will rewrite the perl scripts to produce
this style. I'm doing this as a separate commit, to make it more clear what
non-cosmetic changes the next commit makes to the map files.
Discussion: https://postgr.es/m/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
PostgreSQL treats characters with < 0x80 leading byte as plain ASCII, and
they are not even passed to the conversion routines. There is no point in
having them in the conversion tables.
Everything in the tables were direct ASCII-ASCII mappings, except for two:
* SHIFT_JIS_2004 code point 0x5C (backslash in ASCII) was mapped to Unicode
YEN SIGN character.
* Unicode 0x5C (backslash again) was mapped to "REVERSE SOLIDUS" in
SHIFT_JIS_2004
These mappings never had any effect, so there's no functional change from
removing them.
Discussion: https://postgr.es/m/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi
This serves as implicit documentation and is handy if someone wants to
tweak things. The rules are not part of a normal build, like this
entire directory.
0xc19c is not a valid UTF-8 byte sequence. It doesn't do any harm, AFAICS,
but it's surely not intentional. No backpatching though, just to be sure.
In the passing, also add a file header comment to the file, like the
UCS_to_SJIS.pl script would produce. (The file was originally created with
UCS_to_SJIS.pl, but has been modified by hand since then. That's
questionable, but I'll leave fixing that for later.)
Kyotaro Horiguchi
Discussion: <20160907.155050.233844095.horiguchi.kyotaro@lab.ntt.co.jp>
Some of the Unicode/*.map files had identification comments added to them,
evidently by hand. Others did not. Modify the generating scripts to
produce these comments automatically, and update the generated files that
lacked them.
This is just minor cleanup as a by-product of trying to verify that the
*.map files can indeed be reproduced from authoritative data. There are a
depressingly large number that fail to reproduce from the claimed sources.
I have not touched those in this commit, except for the JIS 2004-related
files which required only a single comment update to match.
Since this only affects comments, no need to consider a back-patch.
Our previous code for GB18030 <-> UTF8 conversion only covered Unicode code
points up to U+FFFF, but the actual spec defines conversions for all code
points up to U+10FFFF. That would be rather impractical as a lookup table,
but fortunately there is a simple algorithmic conversion between the
additional code points and the equivalent GB18030 byte patterns. Make use
of the just-added callback facility in LocalToUtf/UtfToLocal to perform the
additional conversions.
Having created the infrastructure to do that, we can use the same code to
map certain linearly-related subranges of the Unicode space below U+FFFF,
allowing removal of the corresponding lookup table entries. This more
than halves the lookup table size, which is a substantial savings;
utf8_and_gb18030.so drops from nearly a megabyte to about half that.
In support of doing that, replace ISO10646-GB18030.TXT with the data file
gb-18030-2000.xml (retrieved from
http://source.icu-project.org/repos/icu/data/trunk/charset/data/xml/ )
in which these subranges have been deleted from the simple lookup entries.
Per bug #12845 from Arjen Nienhuis. The conversion code added here is
based on his proposed patch, though I whacked it around rather heavily.
Until now, these functions have only supported encoding conversions using
lookup tables, which is fine as long as there's not too many code points
to convert. However, GB18030 expects all 1.1 million Unicode code points
to be convertible, which would require a ridiculously-sized lookup table.
Fortunately, a large fraction of those conversions can be expressed through
arithmetic, ie the conversions are one-to-one in certain defined ranges.
To support that, provide a callback function that is used after consulting
the lookup tables. (This patch doesn't actually change anything about the
GB18030 conversion behavior, just provide infrastructure for fixing it.)
Since this requires changing the APIs of UtfToLocal/LocalToUtf anyway,
take the opportunity to rearrange their argument lists into what seems
to me a saner order. And beautify the call sites by using lengthof()
instead of error-prone sizeof() arithmetic.
In passing, also mark all the lookup tables used by these calls "const".
This moves an impressive amount of stuff into the text segment, at least
on my machine, and is safer anyhow.