mirror of
https://github.com/MariaDB/server.git
synced 2025-07-30 16:24:05 +03:00
Upgrading the bundled PCRE to 8.34
@ -23,25 +23,26 @@ man page, in case the conversion went wrong.
|
||||
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a>
|
||||
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
|
||||
<li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
|
||||
<li><a name="TOC11" href="#SEC11">VERTICAL BAR</a>
|
||||
<li><a name="TOC12" href="#SEC12">INTERNAL OPTION SETTING</a>
|
||||
<li><a name="TOC13" href="#SEC13">SUBPATTERNS</a>
|
||||
<li><a name="TOC14" href="#SEC14">DUPLICATE SUBPATTERN NUMBERS</a>
|
||||
<li><a name="TOC15" href="#SEC15">NAMED SUBPATTERNS</a>
|
||||
<li><a name="TOC16" href="#SEC16">REPETITION</a>
|
||||
<li><a name="TOC17" href="#SEC17">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC18" href="#SEC18">BACK REFERENCES</a>
|
||||
<li><a name="TOC19" href="#SEC19">ASSERTIONS</a>
|
||||
<li><a name="TOC20" href="#SEC20">CONDITIONAL SUBPATTERNS</a>
|
||||
<li><a name="TOC21" href="#SEC21">COMMENTS</a>
|
||||
<li><a name="TOC22" href="#SEC22">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC23" href="#SEC23">SUBPATTERNS AS SUBROUTINES</a>
|
||||
<li><a name="TOC24" href="#SEC24">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC25" href="#SEC25">CALLOUTS</a>
|
||||
<li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC27" href="#SEC27">SEE ALSO</a>
|
||||
<li><a name="TOC28" href="#SEC28">AUTHOR</a>
|
||||
<li><a name="TOC29" href="#SEC29">REVISION</a>
|
||||
<li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
|
||||
<li><a name="TOC12" href="#SEC12">VERTICAL BAR</a>
|
||||
<li><a name="TOC13" href="#SEC13">INTERNAL OPTION SETTING</a>
|
||||
<li><a name="TOC14" href="#SEC14">SUBPATTERNS</a>
|
||||
<li><a name="TOC15" href="#SEC15">DUPLICATE SUBPATTERN NUMBERS</a>
|
||||
<li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a>
|
||||
<li><a name="TOC17" href="#SEC17">REPETITION</a>
|
||||
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
|
||||
<li><a name="TOC19" href="#SEC19">BACK REFERENCES</a>
|
||||
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
|
||||
<li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
|
||||
<li><a name="TOC22" href="#SEC22">COMMENTS</a>
|
||||
<li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
|
||||
<li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
|
||||
<li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
|
||||
<li><a name="TOC26" href="#SEC26">CALLOUTS</a>
|
||||
<li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
|
||||
<li><a name="TOC28" href="#SEC28">SEE ALSO</a>
|
||||
<li><a name="TOC29" href="#SEC29">AUTHOR</a>
|
||||
<li><a name="TOC30" href="#SEC30">REVISION</a>
|
||||
</ul>
|
||||
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
|
||||
<P>
|
||||
@ -116,21 +117,33 @@ appearance causes an error.
|
||||
Unicode property support
|
||||
</b><br>
|
||||
<P>
|
||||
Another special sequence that may appear at the start of a pattern is
|
||||
<pre>
|
||||
(*UCP)
|
||||
</pre>
|
||||
Another special sequence that may appear at the start of a pattern is (*UCP).
|
||||
This has the same effect as setting the PCRE_UCP option: it causes sequences
|
||||
such as \d and \w to use Unicode properties to determine character types,
|
||||
instead of recognizing only characters with codes less than 128 via a lookup
|
||||
table.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling auto-possessification
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
|
||||
the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
|
||||
quantifiers possessive when what follows cannot match the repeated item. For
|
||||
example, by default a+b is treated as a++b. For more details, see the
|
||||
<a href="pcreapi.html"><b>pcreapi</b></a>
|
||||
documentation.
|
||||
</P>
|
||||
<br><b>
|
||||
Disabling start-up optimizations
|
||||
</b><br>
|
||||
<P>
|
||||
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
|
||||
PCRE_NO_START_OPTIMIZE option either at compile or matching time.
|
||||
PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables
|
||||
several optimizations for quickly reaching "no match" results. For more
|
||||
details, see the
|
||||
<a href="pcreapi.html"><b>pcreapi</b></a>
|
||||
documentation.
|
||||
<a name="newlines"></a></P>
|
||||
<br><b>
|
||||
Newline conventions
|
||||
@ -193,10 +206,10 @@ pattern of the form
|
||||
(*LIMIT_RECURSION=d)
|
||||
</pre>
|
||||
where d is any number of decimal digits. However, the value of the setting must
|
||||
be less than the value set by the caller of <b>pcre_exec()</b> for it to have
|
||||
any effect. In other words, the pattern writer can lower the limit set by the
|
||||
programmer, but not raise it. If there is more than one setting of one of these
|
||||
limits, the lower value is used.
|
||||
be less than the value set (or defaulted) by the caller of <b>pcre_exec()</b>
|
||||
for it to have any effect. In other words, the pattern writer can lower the
|
||||
limits set by the programmer, but not raise them. If there is more than one
|
||||
setting of one of these limits, the lower value is used.
|
||||
</P>
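The "lower but never raise" rule can be modelled in a few lines. This is a minimal sketch in Python, not PCRE code: the function name `effective_limit` and its arguments are hypothetical and only illustrate how a caller's limit combines with any in-pattern settings.

```python
from typing import Iterable


def effective_limit(caller_limit: int, pattern_limits: Iterable[int]) -> int:
    """Return the limit in force after in-pattern settings are considered.

    caller_limit   -- value set (or defaulted) by the caller of pcre_exec()
    pattern_limits -- values requested inside the pattern, in any order
    """
    limit = caller_limit
    for requested in pattern_limits:
        # A setting only has an effect when it is lower than the current
        # value, so a pattern can lower the limit but never raise it.
        if requested < limit:
            limit = requested
    return limit
```

With a caller limit of 1000, in-pattern settings of 500, 2000 and 250 leave a limit of 250 under this model, while a single setting of 5000 leaves the caller's 1000 untouched.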
|
||||
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
|
||||
<P>
|
||||
@ -283,10 +296,11 @@ backslash. All other characters (in particular, those whose codepoints are
|
||||
greater than 127) are treated as literals.
|
||||
</P>
|
||||
<P>
|
||||
If a pattern is compiled with the PCRE_EXTENDED option, white space in the
|
||||
pattern (other than in a character class) and characters between a # outside
|
||||
a character class and the next newline are ignored. An escaping backslash can
|
||||
be used to include a white space or # character as part of the pattern.
|
||||
If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
|
||||
pattern (other than in a character class), and characters between a # outside a
|
||||
character class and the next newline, inclusive, are ignored. An escaping
|
||||
backslash can be used to include a white space or # character as part of the
|
||||
pattern.
|
||||
</P>
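As a rough illustration of what is ignored under PCRE_EXTENDED, the sketch below removes the ignorable parts of a pattern string. It is a simplified Python model, not PCRE's parser: `strip_extended` is a hypothetical helper, and it deliberately does not handle a "]" that appears first in a class, POSIX class names, or \Q...\E quoting.

```python
WHITE_SPACE = " \t\n\r\f\v"


def strip_extended(pattern: str) -> str:
    """Drop the parts of `pattern` that PCRE_EXTENDED would ignore (simplified)."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # An escaping backslash keeps the next character, even if it is
            # white space or "#".
            out.append(pattern[i:i + 2])
            i += 2
        elif in_class:
            # Inside a character class nothing is ignored.
            out.append(ch)
            in_class = ch != "]"
            i += 1
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
        elif ch in WHITE_SPACE:
            i += 1
        elif ch == "#":
            # Skip the comment up to and including the next newline.
            newline = pattern.find("\n", i)
            i = len(pattern) if newline == -1 else newline + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

In this model the white space and "#" inside the class, and the escaped space, survive; everything else that is white space or comment text is dropped.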
|
||||
<P>
|
||||
If you want to remove the special meaning from a sequence of characters, you
|
||||
@ -324,7 +338,9 @@ one of the following escape sequences than the binary character it represents:
|
||||
\n linefeed (hex 0A)
|
||||
\r carriage return (hex 0D)
|
||||
\t tab (hex 09)
|
||||
\0dd character with octal code 0dd
|
||||
\ddd character with octal code ddd, or back reference
|
||||
\o{ddd..} character with octal code ddd..
|
||||
\xhh character with hex code hh
|
||||
\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
|
||||
\uhhhh character with hex code hhhh (JavaScript mode only)
|
||||
@ -347,42 +363,6 @@ the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
|
||||
characters also generate different values.
|
||||
</P>
|
||||
<P>
|
||||
By default, after \x, from zero to two hexadecimal digits are read (letters
|
||||
can be in upper or lower case). Any number of hexadecimal digits may appear
|
||||
between \x{ and }, but the character code is constrained as follows:
|
||||
<pre>
|
||||
8-bit non-UTF mode less than 0x100
|
||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
||||
16-bit non-UTF mode less than 0x10000
|
||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
||||
32-bit non-UTF mode less than 0x80000000
|
||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
||||
</pre>
|
||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||
"surrogate" codepoints), and 0xffef.
|
||||
</P>
|
||||
<P>
|
||||
If characters other than hexadecimal digits appear between \x{ and }, or if
|
||||
there is no terminating }, this form of escape is not recognized. Instead, the
|
||||
initial \x will be interpreted as a basic hexadecimal escape, with no
|
||||
following digits, giving a character whose value is zero.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
|
||||
as just described only when it is followed by two hexadecimal digits.
|
||||
Otherwise, it matches a literal "x" character. In JavaScript mode, support for
|
||||
code points greater than 256 is provided by \u, which must be followed by
|
||||
four hexadecimal digits; otherwise it matches a literal "u" character.
|
||||
Character codes specified by \u in JavaScript mode are constrained in the same
|
||||
way as those specified by \x in non-JavaScript mode.
|
||||
</P>
|
||||
<P>
|
||||
Characters whose value is less than 256 can be defined by either of the two
|
||||
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
|
||||
way they are handled. For example, \xdc is exactly the same as \x{dc} (or
|
||||
\u00dc in JavaScript mode).
|
||||
</P>
|
||||
<P>
|
||||
After \0 up to two further octal digits are read. If there are fewer than two
|
||||
digits, just those that are present are used. Thus the sequence \0\x\07
|
||||
specifies two binary zeros followed by a BEL character (code value 7). Make
|
||||
@ -390,9 +370,23 @@ sure you supply two digits after the initial zero if the pattern character that
|
||||
follows is itself an octal digit.
|
||||
</P>
|
||||
<P>
|
||||
The handling of a backslash followed by a digit other than 0 is complicated.
|
||||
Outside a character class, PCRE reads it and any following digits as a decimal
|
||||
number. If the number is less than 10, or if there have been at least that many
|
||||
The escape \o must be followed by a sequence of octal digits, enclosed in
|
||||
braces. An error occurs if this is not the case. This escape is a recent
|
||||
addition to Perl; it provides a way of specifying character code points as octal
|
||||
numbers greater than 0777, and it also allows octal numbers and back references
|
||||
to be unambiguously specified.
|
||||
</P>
|
||||
<P>
|
||||
For greater clarity and unambiguity, it is best to avoid following \ by a
|
||||
digit greater than zero. Instead, use \o{} or \x{} to specify character
|
||||
numbers, and \g{} to specify back references. The following paragraphs
|
||||
describe the old, ambiguous syntax.
|
||||
</P>
|
||||
<P>
|
||||
The handling of a backslash followed by a digit other than 0 is complicated,
|
||||
and Perl has changed in recent releases, causing PCRE also to change. Outside a
|
||||
character class, PCRE reads the digit and any following digits as a decimal
|
||||
number. If the number is less than 8, or if there have been at least that many
|
||||
previous capturing left parentheses in the expression, the entire sequence is
|
||||
taken as a <i>back reference</i>. A description of how this works is given
|
||||
<a href="#backreferences">later,</a>
|
||||
@ -400,12 +394,11 @@ following the discussion of
|
||||
<a href="#subpattern">parenthesized subpatterns.</a>
|
||||
</P>
|
||||
<P>
|
||||
Inside a character class, or if the decimal number is greater than 9 and there
|
||||
have not been that many capturing subpatterns, PCRE re-reads up to three octal
|
||||
digits following the backslash, and uses them to generate a data character. Any
|
||||
subsequent digits stand for themselves. The value of the character is
|
||||
constrained in the same way as characters specified in hexadecimal.
|
||||
For example:
|
||||
Inside a character class, or if the decimal number following \ is greater than
|
||||
7 and there have not been that many capturing subpatterns, PCRE handles \8 and
|
||||
\9 as the literal characters "8" and "9", and otherwise re-reads up to three
|
||||
octal digits following the backslash, using them to generate a data character.
|
||||
Any subsequent digits stand for themselves. For example:
|
||||
<pre>
|
||||
\040 is another way of writing an ASCII space
|
||||
\40 is the same, provided there are fewer than 40 previous capturing subpatterns
|
||||
@ -415,12 +408,53 @@ For example:
|
||||
\0113 is a tab followed by the character "3"
|
||||
\113 might be a back reference, otherwise the character with octal code 113
|
||||
\377 might be a back reference, otherwise the value 255 (decimal)
|
||||
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1"
|
||||
\81 is either a back reference, or the two characters "8" and "1"
|
||||
</pre>
|
||||
Note that octal values of 100 or greater must not be introduced by a leading
|
||||
zero, because no more than three octal digits are ever read.
|
||||
Note that octal values of 100 or greater that are specified using this syntax
|
||||
must not be introduced by a leading zero, because no more than three octal
|
||||
digits are ever read.
|
||||
</P>
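The rules for a backslash followed by digits, as described for release 8.34, can be summarised in a small decision function. This is a sketch in Python under stated assumptions: `interpret_backslash_digits` is a hypothetical name, `digits` is assumed to be the whole run of digits that follows the backslash, and the limits on character values are not checked.

```python
OCTAL_DIGITS = "01234567"


def _octal_prefix(digits: str) -> tuple:
    """Read up to three leading octal digits; the rest stand for themselves."""
    end = 0
    while end < len(digits) and end < 3 and digits[end] in OCTAL_DIGITS:
        end += 1
    return ("chars", chr(int(digits[:end], 8)) + digits[end:])


def interpret_backslash_digits(digits: str, groups_so_far: int,
                               in_class: bool = False) -> tuple:
    """Return ("backref", n) or ("chars", text) for a backslash plus `digits`."""
    if digits[0] == "0":
        # \0 plus up to two further octal digits is always a character.
        return _octal_prefix(digits)
    if not in_class:
        number = int(digits)
        # Numbers below 8, or numbers covered by earlier capturing
        # parentheses, are back references.
        if number < 8 or number <= groups_so_far:
            return ("backref", number)
    if digits[0] in "89":
        # \8 and \9 are the literal characters "8" and "9".
        return ("chars", digits)
    return _octal_prefix(digits)
```

Under this model, `interpret_backslash_digits("40", 0)` yields a space, the same input with 40 earlier groups yields back reference 40, and `"81"` with no such groups yields the two characters "8" and "1".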
|
||||
<P>
|
||||
By default, after \x that is not followed by {, from zero to two hexadecimal
|
||||
digits are read (letters can be in upper or lower case). Any number of
|
||||
hexadecimal digits may appear between \x{ and }. If a character other than
|
||||
a hexadecimal digit appears between \x{ and }, or if there is no terminating
|
||||
}, an error occurs.
|
||||
</P>
|
||||
<P>
|
||||
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
|
||||
as just described only when it is followed by two hexadecimal digits.
|
||||
Otherwise, it matches a literal "x" character. In JavaScript mode, support for
|
||||
code points greater than 256 is provided by \u, which must be followed by
|
||||
four hexadecimal digits; otherwise it matches a literal "u" character.
|
||||
</P>
|
||||
<P>
|
||||
Characters whose value is less than 256 can be defined by either of the two
|
||||
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
|
||||
way they are handled. For example, \xdc is exactly the same as \x{dc} (or
|
||||
\u00dc in JavaScript mode).
|
||||
</P>
|
||||
<br><b>
|
||||
Constraints on character values
|
||||
</b><br>
|
||||
<P>
|
||||
Characters that are specified using octal or hexadecimal numbers are
|
||||
limited to certain values, as follows:
|
||||
<pre>
|
||||
8-bit non-UTF mode less than 0x100
|
||||
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
|
||||
16-bit non-UTF mode less than 0x10000
|
||||
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
|
||||
32-bit non-UTF mode less than 0x100000000
|
||||
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
|
||||
</pre>
|
||||
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
|
||||
"surrogate" codepoints), and 0xffef.
|
||||
</P>
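The table of limits can be read as a simple range check. The following Python sketch is an illustration only: `is_allowed_code` is a hypothetical helper, U+10FFFF is assumed to be the highest accepted value in the UTF modes, and only the surrogate range is treated as invalid here (the extra single code point mentioned above is left out of the sketch).

```python
# Exclusive upper bounds for the non-UTF modes, keyed by code unit width.
NON_UTF_LIMIT = {8: 0x100, 16: 0x10000, 32: 0x100000000}


def is_allowed_code(value: int, width: int, utf: bool) -> bool:
    """Check a numerically specified character value against the mode limits."""
    if value < 0:
        return False
    if not utf:
        return value < NON_UTF_LIMIT[width]
    if 0xD800 <= value <= 0xDFFF:
        # Surrogate code points are never valid in a UTF mode.
        return False
    # Assumption: the Unicode maximum U+10FFFF itself is accepted.
    return value <= 0x10FFFF
```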
|
||||
<br><b>
|
||||
Escape sequences in character classes
|
||||
</b><br>
|
||||
<P>
|
||||
All the sequences that define a single character value can be used both inside
|
||||
and outside character classes. In addition, inside a character class, \b is
|
||||
interpreted as the backspace character (hex 08).
|
||||
@ -498,11 +532,14 @@ matching point is at the end of the subject string, all of them fail, because
|
||||
there is no character to match.
|
||||
</P>
|
||||
<P>
|
||||
For compatibility with Perl, \s does not match the VT character (code 11).
|
||||
This makes it different from the POSIX "space" class. The \s characters
|
||||
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
|
||||
included in a Perl script, \s may match the VT character. In PCRE, it never
|
||||
does.
|
||||
For compatibility with Perl, \s did not formerly match the VT character (code
|
||||
11), which made it different from the POSIX "space" class. However, Perl
|
||||
added VT at release 5.18, and PCRE followed suit at release 8.34. The default
|
||||
\s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
|
||||
(32), which are defined as white space in the "C" locale. This list may vary if
|
||||
locale-specific matching is taking place. For example, in some locales the
|
||||
"non-breaking space" character (\xA0) is recognized as white space, and in
|
||||
others the VT character is not.
|
||||
</P>
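The change in the default \s set can be written out as two small sets. This is a Python sketch of the lists given in the text for the "C" locale, not PCRE's lookup tables; the names are hypothetical and locale-specific matching is ignored.

```python
# Default \s code points before release 8.34: HT, LF, FF, CR, space.
SPACE_BEFORE_8_34 = frozenset({9, 10, 12, 13, 32})
# From release 8.34 the VT character (code 11) is included as well.
SPACE_FROM_8_34 = SPACE_BEFORE_8_34 | {11}


def matches_backslash_s(ch: str, from_8_34: bool = True) -> bool:
    """Model the default (non-UCP, "C" locale) meaning of \\s."""
    codes = SPACE_FROM_8_34 if from_8_34 else SPACE_BEFORE_8_34
    return ord(ch) in codes
```

The only difference between the two models is the VT character: `matches_backslash_s("\v")` is true, while `matches_backslash_s("\v", from_8_34=False)` is false.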
|
||||
<P>
|
||||
A "word" character is an underscore or any character that is a letter or digit.
|
||||
@ -513,21 +550,23 @@ place (see
|
||||
in the
|
||||
<a href="pcreapi.html"><b>pcreapi</b></a>
|
||||
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
|
||||
or "french" in Windows, some character codes greater than 128 are used for
|
||||
or "french" in Windows, some character codes greater than 127 are used for
|
||||
accented letters, and these are then matched by \w. The use of locales with
|
||||
Unicode is discouraged.
|
||||
</P>
|
||||
<P>
|
||||
By default, in a UTF mode, characters with values greater than 128 never match
|
||||
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain
|
||||
their original meanings from before UTF support was available, mainly for
|
||||
efficiency reasons. However, if PCRE is compiled with Unicode property support,
|
||||
and the PCRE_UCP option is set, the behaviour is changed so that Unicode
|
||||
properties are used to determine character types, as follows:
|
||||
By default, characters whose code points are greater than 127 never match \d,
|
||||
\s, or \w, and always match \D, \S, and \W, although this may vary for
|
||||
characters in the range 128-255 when locale-specific matching is happening.
|
||||
These escape sequences retain their original meanings from before Unicode
|
||||
support was available, mainly for efficiency reasons. If PCRE is compiled with
|
||||
Unicode property support, and the PCRE_UCP option is set, the behaviour is
|
||||
changed so that Unicode properties are used to determine character types, as
|
||||
follows:
|
||||
<pre>
|
||||
\d any character that \p{Nd} matches (decimal digit)
|
||||
\s any character that \p{Z} matches, plus HT, LF, FF, CR
|
||||
\w any character that \p{L} or \p{N} matches, plus underscore
|
||||
\d any character that matches \p{Nd} (decimal digit)
|
||||
\s any character that matches \p{Z} or \h or \v
|
||||
\w any character that matches \p{L} or \p{N}, plus underscore
|
||||
</pre>
|
||||
The upper case escapes match the inverse sets of characters. Note that \d
|
||||
matches only decimal digits, whereas \w matches any Unicode digit, as well as
|
||||
@ -538,7 +577,7 @@ is noticeably slower when PCRE_UCP is set.
|
||||
<P>
|
||||
The sequences \h, \H, \v, and \V are features that were added to Perl at
|
||||
release 5.10. In contrast to the other sequences, which match only ASCII
|
||||
characters by default, these always match certain high-valued codepoints,
|
||||
characters by default, these always match certain high-valued code points,
|
||||
whether or not PCRE_UCP is set. The horizontal space characters are:
|
||||
<pre>
|
||||
U+0009 Horizontal tab (HT)
|
||||
@ -913,9 +952,9 @@ PCRE's additional properties
|
||||
<P>
|
||||
As well as the standard Unicode properties described above, PCRE supports four
|
||||
more that make it possible to convert traditional escape sequences such as \w
|
||||
and \s and POSIX character classes to use Unicode properties. PCRE uses these
|
||||
non-standard, non-Perl properties internally when PCRE_UCP is set. However,
|
||||
they may also be used explicitly. These properties are:
|
||||
and \s to use Unicode properties. PCRE uses these non-standard, non-Perl
|
||||
properties internally when PCRE_UCP is set. However, they may also be used
|
||||
explicitly. These properties are:
|
||||
<pre>
|
||||
Xan Any alphanumeric character
|
||||
Xps Any POSIX space character
|
||||
@ -925,8 +964,9 @@ they may also be used explicitly. These properties are:
|
||||
Xan matches characters that have either the L (letter) or the N (number)
|
||||
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
|
||||
carriage return, and any other character that has the Z (separator) property.
|
||||
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
|
||||
same characters as Xan, plus underscore.
|
||||
Xsp is the same as Xps; it used to exclude vertical tab, for Perl
|
||||
compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
|
||||
matches the same characters as Xan, plus underscore.
|
||||
</P>
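The definitions of Xan, Xps, Xsp, and Xwd can be approximated with Python's `unicodedata` module, which exposes the Unicode general category of a character. This is a sketch of the definitions as worded above, not PCRE's implementation; the function names are hypothetical, and results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def has_major_category(ch: str, majors: str) -> bool:
    """True if the first letter of ch's general category is one of `majors`."""
    return unicodedata.category(ch)[0] in majors


def is_xan(ch: str) -> bool:
    # Letter (L) or number (N) property.
    return has_major_category(ch, "LN")


def is_xps(ch: str) -> bool:
    # Tab, linefeed, vertical tab, form feed, carriage return, or separator (Z).
    return ch in "\t\n\v\f\r" or has_major_category(ch, "Z")


# From release 8.34, Xsp is the same as Xps.
is_xsp = is_xps


def is_xwd(ch: str) -> bool:
    # Same characters as Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```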
|
||||
<P>
|
||||
There is another non-standard property, Xuc, which matches any character that
|
||||
@ -1218,7 +1258,9 @@ The minus (hyphen) character can be used to specify a range of characters in a
|
||||
character class. For example, [d-m] matches any letter between d and m,
|
||||
inclusive. If a minus character is required in a class, it must be escaped with
|
||||
a backslash or appear in a position where it cannot be interpreted as
|
||||
indicating a range, typically as the first or last character in the class.
|
||||
indicating a range, typically as the first or last character in the class, or
|
||||
immediately after a range. For example, [b-d-z] matches letters in the range b
|
||||
to d, a hyphen character, or z.
|
||||
</P>
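The [b-d-z] example can be tried out with Python's standard `re` module. Note that `re` is a different engine from PCRE; it is used here only because it happens to read a hyphen immediately after a range in the same way, so this is an illustration and not a test of PCRE itself.

```python
import re

# A range b-d, then a literal hyphen, then a literal z.
CLASS = re.compile(r"[b-d-z]")


def in_class(ch: str) -> bool:
    """True if the single character `ch` is matched by [b-d-z]."""
    return CLASS.fullmatch(ch) is not None
```

Here "c", "-", and "z" are matched, while "e" is not, because the second hyphen does not start a range from d to z.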
|
||||
<P>
|
||||
It is not possible to have the literal character "]" as the end character of a
|
||||
@ -1230,6 +1272,12 @@ followed by two other characters. The octal or hexadecimal representation of
|
||||
"]" can also be used to end a range.
|
||||
</P>
|
||||
<P>
|
||||
An error is generated if a POSIX character class (see below) or an escape
|
||||
sequence other than one that defines a single character appears at a point
|
||||
where a range ending character is expected. For example, [z-\xff] is valid,
|
||||
but [A-\d] and [A-[:digit:]] are not.
|
||||
</P>
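Python's `re` module, although a different engine from PCRE, applies a similar rule to escape sequences at the end of a range, so it can serve as a quick illustration. The helper `compiles` is hypothetical. The POSIX class case is not shown, because `re` does not support POSIX class names at all.

```python
import re


def compiles(pattern: str) -> bool:
    """True if Python's re module accepts `pattern`."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True
```

With this helper, `compiles(r"[z-\xff]")` is true because \xff defines a single character, whereas `compiles(r"[A-\d]")` is false because \d does not.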
|
||||
<P>
|
||||
Ranges operate in the collating sequence of character values. They can also be
|
||||
used for characters specified numerically, for example [\000-\037]. Ranges
|
||||
can include any characters that are valid for the current mode.
|
||||
@ -1269,9 +1317,9 @@ something AND NOT ...".
|
||||
The only metacharacters that are recognized in character classes are backslash,
|
||||
hyphen (only where it can be interpreted as specifying a range), circumflex
|
||||
(only at the start), opening square bracket (only when it can be interpreted as
|
||||
introducing a POSIX class name - see the next section), and the terminating
|
||||
closing square bracket. However, escaping other non-alphanumeric characters
|
||||
does no harm.
|
||||
introducing a POSIX class name, or for a special compatibility feature - see
|
||||
the next two sections), and the terminating closing square bracket. However,
|
||||
escaping other non-alphanumeric characters does no harm.
|
||||
</P>
|
||||
<br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
|
||||
<P>
|
||||
@ -1294,15 +1342,17 @@ are:
|
||||
lower lower case letters
|
||||
print printing characters, including space
|
||||
punct printing characters, excluding letters and digits and space
|
||||
space white space (not quite the same as \s)
|
||||
space white space (the same as \s from PCRE 8.34)
|
||||
upper upper case letters
|
||||
word "word" characters (same as \w)
|
||||
xdigit hexadecimal digits
|
||||
</pre>
|
||||
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
|
||||
space (32). Notice that this list includes the VT character (code 11). This
|
||||
makes "space" different to \s, which does not include VT (for Perl
|
||||
compatibility).
|
||||
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
|
||||
and space (32). If locale-specific matching is taking place, the list of space
|
||||
characters may be different; there may be fewer or more of them. "Space" used
|
||||
to be different to \s, which did not include VT, for Perl compatibility.
|
||||
However, Perl changed at release 5.18, and PCRE followed at release 8.34.
|
||||
"Space" and \s now match the same set of characters.
|
||||
</P>
|
||||
<P>
|
||||
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
|
||||
@ -1316,11 +1366,11 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
|
||||
supported, and an error is given if they are encountered.
|
||||
</P>
|
||||
<P>
|
||||
By default, in UTF modes, characters with values greater than 128 do not match
|
||||
any of the POSIX character classes. However, if the PCRE_UCP option is passed
|
||||
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode
|
||||
character properties are used. This is achieved by replacing the POSIX classes
|
||||
by other sequences, as follows:
|
||||
By default, characters with values greater than 127 do not match any of the
|
||||
POSIX character classes. However, if the PCRE_UCP option is passed to
|
||||
<b>pcre_compile()</b>, some of the classes are changed so that Unicode character
|
||||
properties are used. This is achieved by replacing certain POSIX classes by
|
||||
other sequences, as follows:
|
||||
<pre>
|
||||
[:alnum:] becomes \p{Xan}
|
||||
[:alpha:] becomes \p{L}
|
||||
@ -1331,11 +1381,56 @@ by other sequences, as follows:
|
||||
[:upper:] becomes \p{Lu}
|
||||
[:word:] becomes \p{Xwd}
|
||||
</pre>
|
||||
Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX
|
||||
classes are unchanged, and match only characters with code points less than
|
||||
128.
|
||||
Negated versions, such as [:^alpha:], use \P instead of \p. Three other POSIX
|
||||
classes are handled specially in UCP mode:
|
||||
</P>
|
||||
<br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br>
|
||||
<P>
|
||||
[:graph:]
|
||||
This matches characters that have glyphs that mark the page when printed. In
|
||||
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
|
||||
properties, except for:
|
||||
<pre>
|
||||
U+061C Arabic Letter Mark
|
||||
U+180E Mongolian Vowel Separator
|
||||
U+2066 - U+2069 Various "isolate"s
|
||||
|
||||
</PRE>
|
||||
</P>
|
||||
<P>
|
||||
[:print:]
|
||||
This matches the same characters as [:graph:] plus space characters that are
|
||||
not controls, that is, characters with the Zs property.
|
||||
</P>
|
||||
<P>
|
||||
[:punct:]
|
||||
This matches all characters that have the Unicode P (punctuation) property,
|
||||
plus those characters whose code points are less than 128 that have the S
|
||||
(Symbol) property.
|
||||
</P>
|
||||
<P>
|
||||
The other POSIX classes are unchanged, and match only characters with code
|
||||
points less than 128.
|
||||
</P>
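The UCP-mode definitions of [:graph:], [:print:], and [:punct:] given above can be approximated with Python's `unicodedata` module. This is a sketch of the wording in the text, not PCRE's code; the function names are hypothetical, and the general categories come from the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Code points excluded from [:graph:] despite their properties.
GRAPH_EXCLUDED = {0x061C, 0x180E, 0x2066, 0x2067, 0x2068, 0x2069}


def ucp_graph(ch: str) -> bool:
    """L, M, N, P, S, or Cf property, minus the listed exceptions."""
    if ord(ch) in GRAPH_EXCLUDED:
        return False
    category = unicodedata.category(ch)
    return category[0] in "LMNPS" or category == "Cf"


def ucp_print(ch: str) -> bool:
    """Everything in [:graph:] plus characters with the Zs property."""
    return ucp_graph(ch) or unicodedata.category(ch) == "Zs"


def ucp_punct(ch: str) -> bool:
    """P property, plus S property for code points below 128."""
    category = unicodedata.category(ch)
    return category[0] == "P" or (ord(ch) < 128 and category[0] == "S")
```

Under this model the dollar sign counts as punctuation because it is a symbol below 128, while the euro sign (also a symbol, but above 127) does not.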
|
||||
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
|
||||
<P>
|
||||
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
|
||||
syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
|
||||
word". PCRE treats these items as follows:
|
||||
<pre>
|
||||
[[:<:]] is converted to \b(?=\w)
|
||||
[[:>:]] is converted to \b(?<=\w)
|
||||
</pre>
|
||||
Only these exact character sequences are recognized. A sequence such as
|
||||
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
|
||||
not compatible with Perl. It is provided to help migrations from other
|
||||
environments, and is best not used in any new patterns. Note that \b matches
|
||||
at the start and the end of a word (see
|
||||
<a href="#smallassertions">"Simple assertions"</a>
|
||||
above), and in a Perl-style pattern the preceding or following character
|
||||
normally shows which is wanted, without the need for the assertions that are
|
||||
used above in order to give exactly the POSIX behaviour.
|
||||
</P>
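The two conversions can be illustrated with a plain string substitution followed by a match in Python's `re` module, which supports \b and the two assertions used. This is only a sketch: `re` is not PCRE, the helper names are hypothetical, and the naive replacement does not reproduce PCRE's error for sequences such as [a[:<:]b].

```python
import re


def translate_bsd_word_boundaries(pattern: str) -> str:
    """Rewrite the exact sequences [[:<:]] and [[:>:]] as described above."""
    return (pattern.replace("[[:<:]]", r"\b(?=\w)")
                   .replace("[[:>:]]", r"\b(?<=\w)"))


def boundary_positions(pattern: str, subject: str) -> list:
    """Offsets in `subject` where the translated pattern matches."""
    translated = translate_bsd_word_boundaries(pattern)
    return [m.start() for m in re.finditer(translated, subject)]
```

For the subject "ab cd", the start-of-word item matches at offsets 0 and 3, and the end-of-word item at offsets 2 and 5.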
|
||||
<br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
|
||||
<P>
|
||||
Vertical bar characters are used to separate alternative patterns. For example,
|
||||
the pattern
|
||||
@ -1350,7 +1445,7 @@ that succeeds is used. If the alternatives are within a subpattern
|
||||
"succeeds" means matching the rest of the main pattern as well as the
|
||||
alternative in the subpattern.
|
||||
</P>
|
||||
<br><a name="SEC12" href="#TOC1">INTERNAL OPTION SETTING</a><br>
|
||||
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
|
||||
<P>
|
||||
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
|
||||
PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
|
||||
@ -1413,7 +1508,7 @@ options, respectively. The (*UTF) sequence is a generic version that can be
|
||||
used with any of the libraries. However, the application can set the
|
||||
PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.
|
||||
<a name="subpattern"></a></P>
|
||||
<br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>
|
||||
<br><a name="SEC14" href="#TOC1">SUBPATTERNS</a><br>
|
||||
<P>
|
||||
Subpatterns are delimited by parentheses (round brackets), which can be nested.
|
||||
Turning part of a pattern into a subpattern does two things:
|
||||
@ -1469,7 +1564,7 @@ from left to right, and options are not reset until the end of the subpattern
|
||||
is reached, an option setting in one branch does affect subsequent branches, so
|
||||
the above patterns match "SUNDAY" as well as "Saturday".
|
||||
<a name="dupsubpatternnumber"></a></P>
|
||||
<br><a name="SEC14" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
|
||||
<br><a name="SEC15" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
|
||||
<P>
|
||||
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
|
||||
the same numbers for its capturing parentheses. Such a subpattern starts with
|
||||
@ -1513,7 +1608,7 @@ true if any of the subpatterns of that number have matched.
|
||||
An alternative approach to using this "branch reset" feature is to use
|
||||
duplicate named subpatterns, as described in the next section.
|
||||
</P>
|
||||
<br><a name="SEC15" href="#TOC1">NAMED SUBPATTERNS</a><br>
|
||||
<br><a name="SEC16" href="#TOC1">NAMED SUBPATTERNS</a><br>
|
||||
<P>
|
||||
Identifying capturing parentheses by number is simple, but it can be very hard
|
||||
to keep track of the numbers in complicated regular expressions. Furthermore,
|
||||
@ -1535,11 +1630,12 @@ and
|
||||
can be made by name as well as by number.
|
||||
</P>
|
||||
<P>
|
||||
Names consist of up to 32 alphanumeric characters and underscores. Named
|
||||
capturing parentheses are still allocated numbers as well as names, exactly as
|
||||
if the names were not present. The PCRE API provides function calls for
|
||||
extracting the name-to-number translation table from a compiled pattern. There
|
||||
is also a convenience function for extracting a captured substring by name.
|
||||
Names consist of up to 32 alphanumeric characters and underscores, but must
|
||||
start with a non-digit. Named capturing parentheses are still allocated numbers
|
||||
as well as names, exactly as if the names were not present. The PCRE API
|
||||
provides function calls for extracting the name-to-number translation table
|
||||
from a compiled pattern. There is also a convenience function for extracting a
|
||||
captured substring by name.
|
||||
</P>
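The naming rule can be expressed as a short validity check. This Python sketch assumes that "alphanumeric" means ASCII letters and digits; `is_valid_group_name` is a hypothetical helper, not part of the PCRE API.

```python
import re

# One non-digit (letter or underscore), then up to 31 more name characters,
# giving a maximum length of 32.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_group_name(name: str) -> bool:
    """True if `name` satisfies the subpattern naming rule described above."""
    return NAME.fullmatch(name) is not None
```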
<P>
By default, a name must be unique within a pattern, but it is possible to relax
@ -1568,9 +1664,23 @@ matched. This saves searching to find which numbered subpattern it was.
</P>
<P>
If you make a back reference to a non-unique named subpattern from elsewhere in
the pattern, the one that corresponds to the first occurrence of the name is
used. In the absence of duplicate numbers (see the previous section) this is
the one with the lowest number. If you use a named reference in a condition
the pattern, the subpatterns to which the name refers are checked in the order
in which they appear in the overall pattern. The first one that is set is used
for the reference. For example, this pattern matches both "foofoo" and
"barbar" but not "foobar" or "barfoo":
<pre>
(?:(?<n>foo)|(?<n>bar))\k<n>

</PRE>
</P>
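<P>
The "first one that is set is used" rule can be emulated in an engine that has
no duplicate names. The sketch below uses Python's standard <b>re</b> module,
which is not PCRE and does not support PCRE_DUPNAMES or (?J); the two
hypothetical names n1 and n2 stand in for the two groups that PCRE would both
call "n", and a conditional picks whichever one is set. It is an approximation
of the behaviour, not how PCRE implements it.
</P>

```python
import re

# n1 and n2 stand in for the two groups that PCRE would both name "n".
# The conditional checks the groups in pattern order and back-references
# the first one that is set.
emulated = re.compile(r"(?:(?P<n1>foo)|(?P<n2>bar))(?(n1)(?P=n1)|(?P=n2))")


def matches(subject):
    """Return True if the whole subject matches the emulated pattern."""
    return emulated.fullmatch(subject) is not None


results = {s: matches(s) for s in ("foofoo", "barbar", "foobar", "barfoo")}
```

<P>
Under these assumptions "foofoo" and "barbar" match, while "foobar" and
"barfoo" do not, which agrees with the PCRE example.
</P>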
<P>
If you make a subroutine call to a non-unique named subpattern, the one that
corresponds to the first occurrence of the name is used. In the absence of
duplicate numbers (see the previous section) this is the one with the lowest
number.
</P>
<P>
If you use a named reference in a condition
test (see the
<a href="#conditions">section about conditions</a>
below), either to check whether a subpattern has matched, or to check for
@ -1585,10 +1695,11 @@ documentation.
<b>Warning:</b> You cannot use different names to distinguish between two
subpatterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if different names
are given to subpatterns with the same number. However, you can give the same
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
are given to subpatterns with the same number. However, you can always give the
same name to subpatterns with the same number, even when PCRE_DUPNAMES is not
set.
</P>
<br><a name="SEC16" href="#TOC1">REPETITION</a><br>
<br><a name="SEC17" href="#TOC1">REPETITION</a><br>
<P>
Repetition is specified by quantifiers, which can follow any of the following
items:
@ -1756,7 +1867,7 @@ example, after
</pre>
matches "aba" the value of the second captured substring is "b".
<a name="atomicgroup"></a></P>
<br><a name="SEC17" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
<br><a name="SEC18" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
<P>
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item to be
@ -1860,7 +1971,7 @@ an atomic group, like this:
</pre>
sequences of non-digits cannot be broken, and failure happens quickly.
<a name="backreferences"></a></P>
<br><a name="SEC18" href="#TOC1">BACK REFERENCES</a><br>
<br><a name="SEC19" href="#TOC1">BACK REFERENCES</a><br>
<P>
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a back reference to a capturing subpattern earlier
@ -1988,7 +2099,7 @@ as an
Once the whole group has been matched, a subsequent matching failure cannot
cause backtracking into the middle of the group.
<a name="bigassertions"></a></P>
<br><a name="SEC19" href="#TOC1">ASSERTIONS</a><br>
<br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
<P>
An assertion is a test on the characters following or preceding the current
matching point that does not actually consume any characters. The simple
@ -2178,7 +2289,7 @@ preceded by "foo", while
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="conditions"></a></P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
<P>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
@ -2252,12 +2363,7 @@ Checking for a used subpattern by name
<P>
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
subpattern by name. For compatibility with earlier versions of PCRE, which had
this facility before Perl, the syntax (?(name)...) is also recognized. However,
there is a possible ambiguity with this syntax, because subpattern names may
consist entirely of digits. PCRE looks first for a named subpattern; if it
cannot find one and the name consists entirely of digits, PCRE looks for a
subpattern of that number, which must be greater than zero. Using subpattern
names that consist entirely of digits is not recommended.
this facility before Perl, the syntax (?(name)...) is also recognized.
</P>
<P>
Rewriting the above example to use a named subpattern gives this:
@ -2333,7 +2439,7 @@ subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
<a name="comments"></a></P>
<br><a name="SEC21" href="#TOC1">COMMENTS</a><br>
<br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE. In both cases, the start of the comment must not be in a character class,
@ -2362,7 +2468,7 @@ a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
<br><a name="SEC22" href="#TOC1">RECURSIVE PATTERNS</a><br>
<br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
@ -2577,7 +2683,7 @@ now match "b" and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \1 cannot access the externally set
value.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC23" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive subpattern call (either by number or by
name) is used outside the parentheses to which it refers, it operates like a
@ -2618,7 +2724,7 @@ different calls. For example, consider this pattern:
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC24" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
@ -2636,7 +2742,7 @@ plus or a minus sign it is taken as a relative reference. For example:
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
synonymous. The former is a back reference; the latter is a subroutine call.
</P>
<br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>
<br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
@ -2674,12 +2780,18 @@ During matching, when PCRE reaches a callout point, the external function is
called. It is provided with the number of the callout, the position in the
pattern, and, optionally, one item of data originally supplied by the caller of
the matching function. The callout function may cause matching to proceed, to
backtrack, or to fail altogether. A complete description of the interface to
the callout function is given in the
backtrack, or to fail altogether.
</P>
<P>
By default, PCRE implements a number of optimizations at compile time and
matching time, and one side-effect is that sometimes callouts are skipped. If
you need all possible callouts to happen, you need to set options that disable
the relevant optimizations. More details, and a complete description of the
interface to the callout function, are given in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
<br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
are still described in the Perl documentation as "experimental and subject to
@ -3026,7 +3138,7 @@ example:
<pre>
...(*COMMIT)(*PRUNE)...
</pre>
If there is a matching failure to the right, backtracking onto (*PRUNE) cases
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
it to be triggered, and its action is taken. There can never be a backtrack
onto (*COMMIT).
<a name="btrepeat"></a></P>
@ -3093,12 +3205,12 @@ the subroutine match to fail.
the subpattern that has alternatives. If there is no such group within the
subpattern, (*THEN) causes the subroutine match to fail.
</P>
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
<br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),
<b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16(3)</b>, <b>pcre32(3)</b>.
</P>
<br><a name="SEC28" href="#TOC1">AUTHOR</a><br>
<br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@ -3107,9 +3219,9 @@ University Computing Service
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
Last updated: 26 April 2013
Last updated: 03 December 2013
<br>
Copyright © 1997-2013 University of Cambridge.
<br>