MDEV-31071 Refactor case folding data types in Unicode collations

This is a non-functional change. It changes how case folding data
and weight data (for simple Unicode collations) are stored:

- Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO
- Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead.

This patch changes simple Unicode collations in the same way
that MDEV-30695 previously changed Asian collations.

No new MTR tests are needed. The underlying code is thoroughly
covered by a number of ctype_*_ws.test and ctype_*_casefold.test
files, which were added recently in preparation for this change.

Old and new Unicode data layout
-------------------------------

Case folding data is now stored in separate tables
consisting of MY_CASEFOLD_CHARACTER elements with two members:

    typedef struct casefold_info_char_t
    {
      uint32 toupper;
      uint32 tolower;
    } MY_CASEFOLD_CHARACTER;

while weight data (for simple non-UCA collations xxx_general_ci
and xxx_general_mysql500_ci) is stored in separate arrays of
uint16 elements.
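
To illustrate the split layout, here is a minimal sketch of how case
folding and weights could be looked up from separate tables. It is not
the actual MariaDB code: the demo_* names, the 256-character page size,
the NULL-page-means-identity rule and the 0xFFFD out-of-range weight
are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t uint32;   /* stand-ins for MariaDB's integer typedefs */
typedef uint16_t uint16;

typedef struct casefold_info_char_t
{
  uint32 toupper;
  uint32 tolower;
} MY_CASEFOLD_CHARACTER;

/* Hypothetical container: one pointer per 256-character page */
typedef struct
{
  uint32 maxchar;
  const MY_CASEFOLD_CHARACTER *const *page;
} demo_casefold_info;

static MY_CASEFOLD_CHARACTER demo_page00[256];              /* case folding */
static uint16 demo_weights00[256];                          /* weights only */
static const MY_CASEFOLD_CHARACTER *demo_fold_pages[256];   /* NULL = no data */
static const uint16 *demo_weight_pages[256];
static const demo_casefold_info demo_info= { 0xFFFF, demo_fold_pages };

/* Fill page 0 with ASCII-only demo data and register it in both tables */
void demo_init(void)
{
  uint32 i;
  for (i= 0; i < 256; i++)
  {
    demo_page00[i].toupper= (i >= 'a' && i <= 'z') ? i - 32 : i;
    demo_page00[i].tolower= (i >= 'A' && i <= 'Z') ? i + 32 : i;
    demo_weights00[i]= (uint16) demo_page00[i].toupper;
  }
  demo_fold_pages[0]= demo_page00;
  demo_weight_pages[0]= demo_weights00;
}

/* Case folding consults only the MY_CASEFOLD_CHARACTER tables */
uint32 demo_toupper(const demo_casefold_info *info, uint32 wc)
{
  const MY_CASEFOLD_CHARACTER *p;
  assert(info != NULL);
  if (wc > info->maxchar)
    return wc;                       /* out of range: unchanged */
  p= info->page[wc >> 8];
  return p ? p[wc & 0xFF].toupper : wc;
}

/* Weights come from a separate uint16 array, used by simple collations */
uint16 demo_weight(const uint16 *const *pages, uint32 maxchar, uint32 wc)
{
  const uint16 *p;
  if (wc > maxchar)
    return 0xFFFD;                   /* assumed replacement weight */
  p= pages[wc >> 8];
  return p ? p[wc & 0xFF] : (uint16) wc;
}
```

In this sketch a collation that does not need simple weights never
touches the uint16 arrays.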

Before this change, case folding data and simple weight data were
stored together, in tables of elements with three members:

    typedef struct unicase_info_char_st
    {
      uint32 toupper;
      uint32 tolower;
      uint32 sort;          /* weights for simple collations */
    } MY_UNICASE_CHARACTER;

This data format was redundant, because weights (the "sort" member) were
needed only for these two simple Unicode collations:
- xxx_general_ci
- xxx_general_mysql500_ci

Adding case folding information for Unicode-14.0.0 using the old
format would waste memory for no benefit.
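
The per-element difference can be made concrete with a small sketch.
The 256-character page size (DEMO_PAGE_CHARS) and the demo_* helpers
are assumptions for illustration, and the byte counts in the comments
hold on common platforms where three uint32 members need no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t uint32;   /* stand-ins for MariaDB's integer typedefs */
typedef uint16_t uint16;

typedef struct unicase_info_char_st
{
  uint32 toupper;
  uint32 tolower;
  uint32 sort;             /* weights for simple collations */
} MY_UNICASE_CHARACTER;

typedef struct casefold_info_char_t
{
  uint32 toupper;
  uint32 tolower;
} MY_CASEFOLD_CHARACTER;

#define DEMO_PAGE_CHARS 256   /* assumed page size, illustration only */

/* Old format: 12 bytes per character, 3072 bytes per page */
size_t demo_old_page_bytes(void)
{
  return DEMO_PAGE_CHARS * sizeof(MY_UNICASE_CHARACTER);
}

/* New format: 8 bytes per character, 2048 bytes per page */
size_t demo_new_page_bytes(void)
{
  return DEMO_PAGE_CHARS * sizeof(MY_CASEFOLD_CHARACTER);
}

/* Separate weights: 2 bytes per character, 512 bytes per page */
size_t demo_weight_page_bytes(void)
{
  return DEMO_PAGE_CHARS * sizeof(uint16);
}

/* Bytes saved per page of case folding data that needs no weights */
size_t demo_saved_bytes_per_page(void)
{
  assert(demo_old_page_bytes() >= demo_new_page_bytes());
  return demo_old_page_bytes() - demo_new_page_bytes();
}
```

Under these assumptions, a page of case folding data that carries no
weights is one third smaller than in the old format.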

Detailed changes
----------------
- Changing the underlying data types as described above

- Adding unidata-dump.c to the sources.
  This program was earlier used to dump UnicodeData.txt
  (e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt)
  into MySQL / MariaDB source files.
  It was originally written in 2002, but until now it has not been
  distributed with the MySQL / MariaDB sources.

- Removing the old-format Unicode data previously dumped from UnicodeData.txt
  (versions 3.0.0 and 5.2.0) from ctype-utf8.c.
  Adding Unicode data in the new format into separate header files,
  to make the code easier to maintain:

    - ctype-unicode300-casefold.h
    - ctype-unicode300-casefold-tr.h
    - ctype-unicode300-general_ci.h
    - ctype-unicode300-general_mysql500_ci.h
    - ctype-unicode520-casefold.h

- Adding a new file ctype-unidata.c as an aggregator for
  the header files listed above.
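
As an illustration of the kind of input that a dump of UnicodeData.txt
works from, here is a minimal sketch that extracts the simple case
mappings from one line of that file. It is not taken from
unidata-dump.c: the demo_* function names are hypothetical. The field
positions follow the UnicodeData.txt layout (field 0 is the code point,
field 12 the simple uppercase mapping, field 13 the simple lowercase
mapping), and an empty mapping field is treated as "maps to itself".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t uint32;   /* stand-in for MariaDB's uint32 */

typedef struct casefold_info_char_t
{
  uint32 toupper;
  uint32 tolower;
} MY_CASEFOLD_CHARACTER;

/* Parse a hex field of the given length; empty field gives fallback */
static uint32 demo_hex_field(const char *s, size_t len, uint32 fallback)
{
  char buf[16];
  if (len == 0 || len >= sizeof(buf))
    return fallback;
  memcpy(buf, s, len);
  buf[len]= '\0';
  return (uint32) strtoul(buf, NULL, 16);
}

/*
  Split one UnicodeData.txt line on ';' (empty fields are significant,
  so strtok() is not suitable) and fill in the case mappings.
  Returns 0 on success, -1 if the line has too few fields.
*/
int demo_parse_unidata_line(const char *line, uint32 *code,
                            MY_CASEFOLD_CHARACTER *out)
{
  const char *field= line;
  int idx;
  assert(line != NULL && code != NULL && out != NULL);
  for (idx= 0; idx <= 13; idx++)
  {
    size_t len= strcspn(field, ";\r\n");
    if (idx == 0)
    {
      if (len == 0)
        return -1;
      *code= demo_hex_field(field, len, 0);
      out->toupper= out->tolower= *code;      /* default: identity */
    }
    else if (idx == 12)
      out->toupper= demo_hex_field(field, len, *code);
    else if (idx == 13)
      out->tolower= demo_hex_field(field, len, *code);
    if (field[len] != ';')
      return idx >= 13 ? 0 : -1;              /* line ended here */
    field+= len + 1;
  }
  return 0;
}
```

For the line "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041" this
yields code 0x61 with toupper 0x41 and tolower 0x61.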

Alexander Barkov

2023-02-24 19:22:32 +04:00

parent 2ad287caad

commit 6075f12c65

29 changed files with 7471 additions and 5195 deletions

									
strings/conf_to_src.c
									
    @@ -409,7 +409,6 @@ void dispcset(FILE *f,CHARSET_INFO *cs)
      fprintf(f,"  NULL,                       /* from_uni      */\n");
      fprintf(f,"  NULL,                       /* casefold      */\n");
      fprintf(f,"  &my_unicase_default,        /* caseinfo      */\n");
      fprintf(f,"  NULL,                       /* state map     */\n");
      fprintf(f,"  NULL,                       /* ident map     */\n");
      fprintf(f,"  1,                          /* strxfrm_multiply*/\n");
