mirror of
				https://github.com/MariaDB/server.git
				synced 2025-10-31 15:50:51 +03:00 
			
		
		
		
	
		
			
				
	
	
		
			234 lines
		
	
	
		
			9.5 KiB
		
	
	
	
		
			Groff
		
	
	
	
	
	
			
		
		
	
	
.TH REGEX 7 "7 Feb 1994"
.BY "Henry Spencer"
.SH NAME
regex \- POSIX 1003.2 regular expressions
.SH DESCRIPTION
Regular expressions (``RE''s),
as defined in POSIX 1003.2, come in two forms:
modern REs (roughly those of
.IR egrep ;
1003.2 calls these ``extended'' REs)
and obsolete REs (roughly those of
.IR ed ;
1003.2 ``basic'' REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
1003.2 leaves some aspects of RE syntax and semantics open;
`\(dg' marks decisions on these aspects that
may not be fully portable to other 1003.2 implementations.
.PP
A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is one\(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed
by a single\(dg `*', `+', `?', or \fIbound\fR.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
.PP
A \fIbound\fR is `{' followed by an unsigned decimal integer,
possibly followed by `,'
possibly followed by another unsigned decimal integer,
always followed by `}'.
The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
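.PP
The three forms of bound can be tried out with a short C sketch.
It assumes a POSIX regex(3) implementation reachable through <regex.h>;
the helper name \fIbound_matches\fR is hypothetical and not part of any library.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the modern (extended) RE `pattern` matches somewhere in
 * `text`, 0 if it does not, and -1 if regcomp() rejects the pattern.
 * Hypothetical helper, shown for illustration only. */
int bound_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
With the anchored pattern ``^a{2,3}$'' the helper should accept ``aa'' and ``aaa''
and reject ``a'' and ``aaaa''; ``^a{2}$'' accepts only ``aa'',
and ``^a{2,}$'' accepts any run of two or more.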
.PP
An atom is a regular expression enclosed in `()' (matching a match for the
regular expression),
an empty set of `()' (matching the null string)\(dg,
a \fIbracket expression\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of a line), `$' (matching the null string at the
end of a line), a `\e' followed by one of the characters
`^.[$()|*+?{\e'
(matching that character taken as an ordinary character),
a `\e' followed by any other character\(dg
(matching that character taken as an ordinary character,
as if the `\e' had not been present\(dg),
or a single character with no other significance (matching that character).
A `{' followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dg.
It is illegal to end an RE with `\e'.
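.PP
As an illustration of atoms, pieces and branches working together,
the following C sketch counts how many strings of a list a modern RE matches.
It assumes a POSIX regex(3) implementation;
\fIcount_matching\fR is a hypothetical name chosen for this example.

```c
#include <regex.h>
#include <stddef.h>

/* Count how many of the n strings in texts[] are matched by the modern
 * (extended) RE `pattern`.  Returns -1 if the pattern does not compile.
 * Hypothetical helper, shown for illustration only. */
int count_matching(const char *pattern, const char *const texts[], size_t n)
{
    regex_t re;
    size_t i;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    for (i = 0; i < n; i++)
        if (regexec(&re, texts[i], 0, NULL, 0) == 0)
            count++;
    regfree(&re);
    return count;
}
```

.PP
Given the strings ``ac'', ``abc'', ``abbc'', ``a.c'' and ``axc'',
the pattern ``^ab*c$'' should match three of them, ``^ab+c$'' two, and ``^ab?c$'' two.
An unescaped `.' in ``^a.c$'' matches three,
while the escaped form (a backslash before the `.') matches only ``a.c''.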
.PP
A \fIbracket expression\fR is a list of characters enclosed in `[]'.
It normally matches any single character from the list (but see below).
If the list begins with `^',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by `\-', this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g. `[0-9]' in ASCII matches any decimal digit.
It is illegal\(dg for two ranges to share an
endpoint, e.g. `a-c-e'.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal `]' in the list, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character,
or the second endpoint of a range.
To use a literal `\-' as the first endpoint of a range,
enclose it in `[.' and `.]' to make it a collating element (see below).
With the exception of these and some combinations using `[' (see next
paragraphs), all other special characters, including `\e', lose their
special significance within a bracket expression.
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in `[.' and `.]' stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a `ch' collating element,
then the RE `[[.ch.]]*c' matches the first five characters
of `chchcc'.
.PP
Within a bracket expression, a collating element enclosed in `[=' and
`=]' is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `[.' and `.]'.)
For example, if o and \o'o^' are the members of an equivalence class,
then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
An equivalence class may not\(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in `[:' and `:]' stands for the list of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.nf
.ta 3c 6c 9c
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.fi
.RE
.PP
These stand for the character classes defined in
.IR ctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
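.PP
Bracket expressions with character classes, and with a literal `]' and `\-',
can be exercised with a sketch like the one below.
It assumes a POSIX regex(3) implementation running in the default ``C'' locale;
the function names are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the modern RE `pattern` matches `text`, 0 if not,
 * -1 on a compile error.  Patterns carry their own ^ and $ anchors.
 * Hypothetical helper, shown for illustration only. */
static int anchored_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* A letter or underscore, followed by any number of alnum characters
 * or underscores (character classes inside bracket expressions). */
int is_identifier(const char *s)
{
    return anchored_match("^[[:alpha:]_][[:alnum:]_]*$", s);
}

/* One or more characters drawn from the list: ']' (placed first so it
 * is literal), 'a', and '-' (placed last so it is literal). */
int only_rbracket_a_dash(const char *s)
{
    return anchored_match("^[]a-]+$", s);
}
```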
.PP
There are two special cases\(dg of bracket expressions:
the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
the beginning and end of a word respectively.
A word is defined as a sequence of
word characters
which is neither preceded nor followed by
word characters.
A word character is an
.I alnum
character (as defined by
.IR ctype (3))
or an underscore.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
`bb*' matches the three middle characters of `abbbc',
`(wee|week)(knights|nights)' matches all ten characters of `weeknights',
when `(.*).*' is matched against `abc' the parenthesized subexpression
matches all three characters, and
when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
subexpression match the null string.
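.PP
The earliest-then-longest rule for the overall match can be observed
through the offsets that regexec() reports.
The sketch below assumes a POSIX regex(3) implementation and a single-byte locale,
so that byte offsets and character counts coincide;
\fImatch_span\fR is a hypothetical helper.
It reports only the whole match, since subexpression offsets are the area
where implementations are most likely to differ.

```c
#include <regex.h>

/* Find the overall match of the modern RE `pattern` in `text`.
 * On success store the offsets of the match in *start and *end
 * (*end is one past the last matched byte) and return 0.
 * Return 1 if there is no match and -1 if the pattern does not compile.
 * Hypothetical helper, shown for illustration only. */
int match_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 1;
    *start = (int)m[0].rm_so;
    *end = (int)m[0].rm_eo;
    return 0;
}
```

.PP
For `bb*' against `abbbc' the reported span should be offsets 1 through 4,
and for `(a*)*' against `bc' an empty match at offset 0.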
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g. `x' becomes `[xX]'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.) `[x]'
becomes `[xX]' and `[^x]' becomes `[^xX]'.
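.PP
In the regex(3) interface, case-independent matching is requested with the
REG_ICASE compile flag.
The following sketch, which assumes a POSIX regex(3) implementation and uses the
hypothetical helper name \fIicase_matches\fR, shows the effect on ordinary
characters and on negated bracket expressions.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the modern RE `pattern` matches `text`, 0 if not, -1 on a
 * compile error.  A non-zero `ignore_case` adds REG_ICASE.
 * Hypothetical helper, shown for illustration only. */
int icase_matches(const char *pattern, const char *text, int ignore_case)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    if (ignore_case)
        cflags |= REG_ICASE;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
With REG_ICASE set, ``^[^x]$'' behaves like ``^[^xX]$'' and so should reject ``X'';
without the flag the same pattern accepts it.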
.PP
No particular limit is imposed on the length of REs\(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete (``basic'') regular expressions differ in several respects.
`|', `+', and `?' are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are `\e{' and `\e}',
with `{' and `}' by themselves ordinary characters.
The parentheses for nested subexpressions are `\e(' and `\e)',
with `(' and `)' by themselves ordinary characters.
`^' is an ordinary character except at the beginning of the
RE or\(dg the beginning of a parenthesized subexpression,
`$' is an ordinary character except at the end of the
RE or\(dg the end of a parenthesized subexpression,
and `*' is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `^').
Finally, there is one new type of atom, a \fIback reference\fR:
`\e' followed by a non-zero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
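.PP
Obsolete REs are what regcomp() compiles when REG_EXTENDED is omitted.
The sketch below assumes a POSIX regex(3) implementation;
\fIbre_matches\fR is a hypothetical helper name.
Note that each backslash of the RE must be doubled inside a C string literal.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the obsolete (basic) RE `pattern` matches `text`, 0 if
 * not, -1 on a compile error.  Passing 0 as cflags (no REG_EXTENDED)
 * selects obsolete REs.
 * Hypothetical helper, shown for illustration only. */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
Written as the C string "^\e\e([bc]\e\e)\e\e1$", the back-reference example should
accept ``bb'' and ``cc'' and reject ``bc''.
The C string "^a|b$" should match only the literal text ``a|b'',
since `|' is an ordinary character here.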
.SH SEE ALSO
regex(3)
.PP
POSIX 1003.2, section 2.8 (Regular Expression Notation).
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current 1003.2 spec says that `)' is an ordinary character in
the absence of an unmatched `(';
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
Avoid using them.
.PP
1003.2's specification of case-independent matching is vague.
The ``one case implies all cases'' definition given above
is current consensus among implementors as to the right interpretation.
.PP
The syntax for word boundaries is incredibly ugly.
