diff --git a/ext/icu/README.txt b/ext/icu/README.txt
new file mode 100644
index 0000000000..7073413d34
--- /dev/null
+++ b/ext/icu/README.txt
@@ -0,0 +1,164 @@
+
+This directory contains source code for the SQLite "ICU" extension, an
+integration of the "International Components for Unicode" library with
+SQLite. Documentation follows.
+
+    1. Features
+
+        1.1 SQL Scalars upper() and lower()
+        1.2 Unicode Aware LIKE Operator
+        1.3 ICU Collation Sequences
+        1.4 SQL REGEXP Operator
+
+    2. Compilation and Usage
+
+    3. Bugs, Problems and Security Issues
+
+        3.1 The "case_sensitive_like" Pragma
+        3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
+        3.3 Collation Sequence Security Issue
+
+
+1. FEATURES
+
+  1.1 SQL Scalars upper() and lower()
+
+    SQLite's built-in implementations of these two functions only
+    provide case mapping for the 26 letters used in the English
+    language. The ICU based functions provided by this extension
+    provide case mapping, where defined, for the full range of
+    unicode characters.
+
+    ICU provides two types of case mapping, "general" case mapping and
+    "language specific". Refer to ICU documentation for the differences
+    between the two. Specifically:
+
+        http://www.icu-project.org/userguide/caseMappings.html
+        http://www.icu-project.org/userguide/posix.html#case_mappings
+
+    To utilise "general" case mapping, the upper() or lower() scalar
+    functions are invoked with one argument:
+
+        upper('abc') -> 'ABC'
+        lower('ABC') -> 'abc'
+
+    To access ICU "language specific" case mapping, upper() or lower()
+    should be invoked with two arguments. The second argument is the name
+    of the locale to use. Passing an empty string ("") or SQL NULL value
+    as the second argument is the same as invoking the 1 argument version
+    of upper() or lower():
+
+        lower('I', 'en_us') -> 'i'
+        lower('I', 'tr_tr') -> 'ı' (small dotless i)
+
+  1.2 Unicode Aware LIKE Operator
+
+    Similarly to the upper() and lower() functions, the built-in SQLite LIKE
+    operator understands case equivalence for the 26 letters of the English
+    language alphabet. The implementation of LIKE included in this
+    extension uses the ICU function u_foldCase() to provide case
+    independent comparisons for the full range of unicode characters.
+
+    The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
+    dotless 'I' character used in the Turkish language is considered
+    to be in the same equivalence class as the dotted 'I' character
+    used by many languages (including English).
+
+  1.3 ICU Collation Sequences
+
+    A special SQL scalar function, icu_load_collation(), is provided that
+    may be used to register ICU collation sequences with SQLite. It
+    is always called with exactly two arguments, the ICU locale
+    identifying the collation sequence to ICU, and the name of the
+    SQLite collation sequence to create. For example, to create an
+    SQLite collation sequence named "turkish" using Turkish language
+    sorting rules, use the SQL statement:
+
+        SELECT icu_load_collation('tr_TR', 'turkish');
+
+    Or, for Australian English:
+
+        SELECT icu_load_collation('en_AU', 'australian');
+
+    The identifiers "turkish" and "australian" may then be used
+    as collation sequence identifiers in SQL statements:
+
+        CREATE TABLE aust_turkish_penpals(
+          australian_penpal_name TEXT COLLATE australian,
+          turkish_penpal_name TEXT COLLATE turkish
+        );
+
+  1.4 SQL REGEXP Operator
+
+    This extension provides an implementation of the SQL binary
+    comparison operator "REGEXP", based on the regular expression functions
+    provided by the ICU library. The syntax of the operator is as described
+    in SQLite documentation:
+
+        REGEXP
+
+    This extension uses the ICU defaults for regular expression matching
+    behaviour. Specifically, this means that:
+
+        * Matching is case-sensitive,
+        * Regular expression comments are not allowed within patterns, and
+        * The '^' and '$' characters match the beginning and end of the
+          argument, not the beginning and end of lines within
+          the argument.
+
+    Even more specifically, the value passed to the "flags" parameter
+    of ICU C function uregex_open() is 0.
+
+
+2. COMPILATION AND USAGE
+
+  The easiest way to compile and use the ICU extension is to build
+  and use it as a dynamically loadable SQLite extension.
+
+
+
+
+3. BUGS, PROBLEMS AND SECURITY ISSUES
+
+  3.1 The "case_sensitive_like" Pragma
+
+    This extension does not work well with the "case_sensitive_like"
+    pragma. If this pragma is used before the ICU extension is loaded,
+    then the pragma has no effect. If the pragma is used after the ICU
+    extension is loaded, then SQLite ignores the ICU implementation and
+    always uses the built-in LIKE operator.
+
+    The ICU extension LIKE operator is always case insensitive.
+
+  3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
+
+    Passing very long patterns to the built-in SQLite LIKE operator can
+    cause a stack overflow. To curb this problem, SQLite defines the
+    SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
+    pattern in bytes (irrespective of encoding). The default value is
+    defined in internal header file "limits.h".
+
+    The ICU extension LIKE implementation suffers from the same
+    problem and uses the same solution. However, since the ICU extension
+    code does not include the SQLite file "limits.h", modifying
+    the default value therein does not affect the ICU extension.
+    The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
+    the ICU extension LIKE operator is 50000, defined in source
+    file "icu.c".
+
+  3.3 Collation Sequence Security Issue
+
+    Internally, SQLite assumes that indices stored in database files
+    are sorted according to the collation sequence indicated by the
+    SQL schema. Changing the definition of a collation sequence after
+    an index has been built is therefore equivalent to database
+    corruption. The SQLite library is not very well tested under
+    these conditions, and may contain potential buffer overruns
+    or other programming errors that could be exploited by a malicious
+    programmer.
+
+    If the ICU extension is used in an environment where potentially
+    malicious users may execute arbitrary SQL (e.g. Gears), they
+    should be prevented from invoking the icu_load_collation() function,
+    possibly using the authorisation callback.
+
diff --git a/ext/icu/icu.c b/ext/icu/icu.c
index 7e9e8cfad8..0be817eaac 100644
--- a/ext/icu/icu.c
+++ b/ext/icu/icu.c
@@ -1,5 +1,16 @@
-
 /*
+** 2007 May 6
+**
+** The author disclaims copyright to this source code. In place of
+** a legal notice, here is a blessing:
+**
+**    May you do good and not evil.
+**    May you find forgiveness for yourself and forgive others.
+**    May you share freely, never taking more than you give.
+**
+*************************************************************************
+** $Id: icu.c,v 1.5 2007/06/11 08:00:00 danielk1977 Exp $
+**
 ** This file implements an integration between the ICU library
 ** ("International Components for Unicode", an open-source library
 ** for handling unicode data) and SQLite. The integration uses
@@ -8,16 +19,18 @@
 **   * An implementation of the SQL regexp() function (and hence REGEXP
 **     operator) using the ICU uregex_XX() APIs.
 **
-**   * Implementations of the SQL scalar upper() and lower()
-**     functions for case mapping.
+**   * Implementations of the SQL scalar upper() and lower() functions
+**     for case mapping.
 **
-**   * Collation sequences
+**   * Integration of ICU and SQLite collation sequences.
 **
-**   * LIKE
+**   * An implementation of the LIKE operator that uses ICU to
+**     provide case-independent matching.
 */
 
 #if !defined(SQLITE_CORE) || defined(SQLITE_ENABLE_ICU)
 
+/* Include ICU headers */
 #include
 #include
 #include
@@ -32,12 +45,12 @@
 #endif
 
 /*
-** Collation sequences:
-**
-**   ucol_open()
-**   ucol_strcoll()
-**   ucol_close()
+** Maximum length (in bytes) of the pattern in a LIKE or GLOB
+** operator.
 */
+#ifndef SQLITE_MAX_LIKE_PATTERN_LENGTH
+# define SQLITE_MAX_LIKE_PATTERN_LENGTH 50000
+#endif
 
 /*
 ** Version of sqlite3_free() that is always a function, never a macro.
@@ -52,7 +65,7 @@ static void xFree(void *p){
 ** false (0) if they are different.
 */
 static int icuLikeCompare(
-  const uint8_t *zPattern,   /* The UTF-8 LIKE pattern */
+  const uint8_t *zPattern,   /* LIKE pattern */
   const uint8_t *zString,    /* The UTF-8 string to compare against */
   const UChar32 uEsc         /* The escape character */
 ){
@@ -151,6 +164,15 @@ static void icuLikeFunc(
   const unsigned char *zB = sqlite3_value_text(argv[1]);
   UChar32 uEsc = 0;
 
+  /* Limit the length of the LIKE or GLOB pattern to avoid problems
+  ** of deep recursion and N*N behavior in patternCompare().
+  */
+  if( sqlite3_value_bytes(argv[0])>SQLITE_MAX_LIKE_PATTERN_LENGTH ){
+    sqlite3_result_error(context, "LIKE or GLOB pattern too complex", -1);
+    return;
+  }
+
+
+
   if( argc==3 ){
     /* The escape character string must consist of a single UTF-8 character.
     ** Otherwise, return an error.
@@ -291,7 +313,7 @@ static void icuRegexpFunc(sqlite3_context *p, int nArg, sqlite3_value **apArg){
 ** To access ICU "language specific" case mapping, upper() or lower()
 ** should be invoked with two arguments. The second argument is the name
 ** of the locale to use. Passing an empty string ("") or SQL NULL value
-** as the second argument is the smae as invoking the 1 argument version
+** as the second argument is the same as invoking the 1 argument version
 ** of upper() or lower().
 **
 **     lower('I', 'en_us') -> 'i'
diff --git a/manifest b/manifest
index 985dd9d6c8..8439c6d730 100644
--- a/manifest
+++ b/manifest
@@ -1,5 +1,5 @@
-C Define\sisnan()\son\swindows.\s\sTicket\s#2399.\s(CVS\s4054)
-D 2007-06-10T22:57:33
+C Add\sa\sREADME.txt\sfile\sfor\sthe\sICU\sextension.\s(CVS\s4055)
+D 2007-06-11T08:00:00
 F Makefile.in 31d9f7cd42c3d73ae117fcdb4b0ecd029fa8f50b
 F Makefile.linux-gcc 2d8574d1ba75f129aba2019f0b959db380a90935
 F README 9c4e2d6706bdcc3efdd773ce752a8cdab4f90028
@@ -44,7 +44,8 @@ F ext/fts2/fts2_porter.c 991a45463553c7318063fe7773368a6c0f39e35d
 F ext/fts2/fts2_tokenizer.h 4c5ffe31d63622869eb6eec1503df7f6996fd1bd
 F ext/fts2/fts2_tokenizer1.c 5c979fe8815f95396beb22b627571da895a025af
 F ext/fts2/mkfts2amal.tcl 2a9ec76b0760fe7f3669dca5bc0d60728bc1c977
-F ext/icu/icu.c 6b47f5bbaf32bce03112282ecca1f54bec969e42
+F ext/icu/README.txt a470afe5adf6534cc0bdafca31e6cf4d88c321fa
+F ext/icu/icu.c daab19e2c5221685688ecff2bb75bf9e0eea361d
 F install-sh 9d4de14ab9fb0facae2f48780b874848cbf2f895
 F ltmain.sh 56abb507100ed2d4261f6dd1653dec3cf4066387
 F main.mk 5bc9827b6fc59db504210bf68cbe335f3250588a
@@ -502,7 +503,7 @@ F www/tclsqlite.tcl bb0d1357328a42b1993d78573e587c6dcbc964b9
 F www/vdbe.tcl 87a31ace769f20d3627a64fa1fade7fed47b90d0
 F www/version3.tcl 890248cf7b70e60c383b0e84d77d5132b3ead42b
 F www/whentouse.tcl fc46eae081251c3c181bd79c5faef8195d7991a5
-P 4ca6cdae94f6d0a2c95755d4a250f9f3bc7a0d7b
-R 13a39a5ca48ff870fc2a261fa80a07e3
-U drh
-Z 60e96b24716296bff6dbc0c8d1c6203d
+P fed9373e27b9d5338159a41772f8983420b902b0
+R c45d0693cbf40360dc0b5addeae5d9aa
+U danielk1977
+Z 7dc46a81cb300982dd5e07eef7c7a05c
diff --git a/manifest.uuid b/manifest.uuid
index d42bce77bc..c06a82bb54 100644
--- a/manifest.uuid
+++ b/manifest.uuid
@@ -1 +1 @@
-fed9373e27b9d5338159a41772f8983420b902b0
\ No newline at end of file
+7b6927829f18d39052e67eebca4275e7aa496035
\ No newline at end of file
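
Note on section 3.3 of the README added by this change: it suggests using the authorisation callback to keep untrusted SQL from invoking icu_load_collation(). A minimal sketch of such a callback is given here. It is not part of this commit. The names denyIcuLoadCollation() and namesEqualNoCase() are hypothetical, and the three constants are repeated from sqlite3.h only so that the fragment compiles on its own (real code would include sqlite3.h instead). The sketch assumes that SQLite reports SQL function calls to the authorizer with the action code SQLITE_FUNCTION and passes the function name as the second string argument.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumption: these mirror the values in sqlite3.h. They are repeated here
** only so the sketch is self-contained; include sqlite3.h in real code. */
#ifndef SQLITE_OK
# define SQLITE_OK        0
#endif
#ifndef SQLITE_DENY
# define SQLITE_DENY      1
#endif
#ifndef SQLITE_FUNCTION
# define SQLITE_FUNCTION 31
#endif

/* Hypothetical helper: ASCII case-insensitive string equality.
** Returns 1 if the strings match, 0 otherwise (or if either is NULL). */
int namesEqualNoCase(const char *zA, const char *zB){
  if( zA==NULL || zB==NULL ) return 0;
  while( *zA && *zB ){
    if( tolower((unsigned char)*zA)!=tolower((unsigned char)*zB) ) return 0;
    zA++;
    zB++;
  }
  return *zA=='\0' && *zB=='\0';
}

/* Hypothetical authorizer callback with the signature expected by
** sqlite3_set_authorizer(). It denies any use of icu_load_collation()
** and allows every other action. */
int denyIcuLoadCollation(
  void *pUnused,          /* User data pointer (not used) */
  int eAction,            /* Authorizer action code */
  const char *zArg3,      /* NULL for SQLITE_FUNCTION */
  const char *zArg4,      /* Function name for SQLITE_FUNCTION */
  const char *zDb,        /* Database name (not used) */
  const char *zTrigger    /* Inner-most trigger or view (not used) */
){
  (void)pUnused; (void)zArg3; (void)zDb; (void)zTrigger;
  if( eAction==SQLITE_FUNCTION
   && namesEqualNoCase(zArg4, "icu_load_collation") ){
    return SQLITE_DENY;
  }
  return SQLITE_OK;
}
```

An application would register the callback with sqlite3_set_authorizer(db, denyIcuLoadCollation, 0) before running untrusted SQL. A statement that calls icu_load_collation() should then fail to prepare with an authorisation error, while trusted code can load collations before the authorizer is installed.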