mirror of
https://github.com/postgres/postgres.git
synced 2025-10-15 05:46:52 +03:00
Make some column widths more pleasing.
Note: Some of this relies on the reduced body indents introduced by
commit 37e06ba6e8
.
Author: Noboru Saito <noborusai@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAM3qnLyMUD79XF+SqAVwWCwURCF3hyuFY9Ki9Csbqs-zMwwnw@mail.gmail.com
2488 lines
98 KiB
Plaintext
2488 lines
98 KiB
Plaintext
<sect1 id="functions-matching">
|
|
<title>Pattern Matching</title>
|
|
|
|
<indexterm zone="functions-matching">
|
|
<primary>pattern matching</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
There are three separate approaches to pattern matching provided
|
|
by <productname>PostgreSQL</productname>: the traditional
|
|
<acronym>SQL</acronym> <function>LIKE</function> operator, the
|
|
more recent <function>SIMILAR TO</function> operator (added in
|
|
SQL:1999), and <acronym>POSIX</acronym>-style regular
|
|
expressions. Aside from the basic <quote>does this string match
|
|
this pattern?</quote> operators, functions are available to extract
|
|
or replace matching substrings and to split a string at matching
|
|
locations.
|
|
</para>
|
|
|
|
<tip>
|
|
<para>
|
|
If you have pattern matching needs that go beyond this,
|
|
consider writing a user-defined function in Perl or Tcl.
|
|
</para>
|
|
</tip>
|
|
|
|
<caution>
|
|
<para>
|
|
While most regular-expression searches can be executed very quickly,
|
|
regular expressions can be contrived that take arbitrary amounts of
|
|
time and memory to process. Be wary of accepting regular-expression
|
|
search patterns from hostile sources. If you must do so, it is
|
|
advisable to impose a statement timeout.
|
|
</para>
|
|
|
|
<para>
|
|
Searches using <function>SIMILAR TO</function> patterns have the same
|
|
security hazards, since <function>SIMILAR TO</function> provides many
|
|
of the same capabilities as <acronym>POSIX</acronym>-style regular
|
|
expressions.
|
|
</para>
|
|
|
|
<para>
|
|
<function>LIKE</function> searches, being much simpler than the other
|
|
two options, are safer to use with possibly-hostile pattern sources.
|
|
</para>
|
|
</caution>
|
|
|
|
<para>
|
|
<function>SIMILAR TO</function> and <acronym>POSIX</acronym>-style regular
|
|
expressions do not support nondeterministic collations. If required, use
|
|
<function>LIKE</function> or apply a different collation to the expression
|
|
to work around this limitation.
|
|
</para>
|
|
|
|
<sect2 id="functions-like">
|
|
<title><function>LIKE</function></title>
|
|
|
|
<indexterm>
|
|
<primary>LIKE</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
<replaceable>string</replaceable> LIKE <replaceable>pattern</replaceable> <optional>ESCAPE <replaceable>escape-character</replaceable></optional>
|
|
<replaceable>string</replaceable> NOT LIKE <replaceable>pattern</replaceable> <optional>ESCAPE <replaceable>escape-character</replaceable></optional>
|
|
</synopsis>
|
|
|
|
<para>
|
|
The <function>LIKE</function> expression returns true if the
|
|
<replaceable>string</replaceable> matches the supplied
|
|
<replaceable>pattern</replaceable>. (As
|
|
expected, the <function>NOT LIKE</function> expression returns
|
|
false if <function>LIKE</function> returns true, and vice versa.
|
|
An equivalent expression is
|
|
<literal>NOT (<replaceable>string</replaceable> LIKE
|
|
<replaceable>pattern</replaceable>)</literal>.)
|
|
</para>
|
|
|
|
<para>
|
|
If <replaceable>pattern</replaceable> does not contain percent
|
|
signs or underscores, then the pattern only represents the string
|
|
itself; in that case <function>LIKE</function> acts like the
|
|
equals operator. An underscore (<literal>_</literal>) in
|
|
<replaceable>pattern</replaceable> stands for (matches) any single
|
|
character; a percent sign (<literal>%</literal>) matches any sequence
|
|
of zero or more characters.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
'abc' LIKE 'abc' <lineannotation>true</lineannotation>
|
|
'abc' LIKE 'a%' <lineannotation>true</lineannotation>
|
|
'abc' LIKE '_b_' <lineannotation>true</lineannotation>
|
|
'abc' LIKE 'c' <lineannotation>false</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
<function>LIKE</function> pattern matching supports nondeterministic
|
|
collations (see <xref linkend="collation-nondeterministic"/>), such as
|
|
case-insensitive collations or collations that, say, ignore punctuation.
|
|
So with a case-insensitive collation, one could have:
|
|
<programlisting>
|
|
'AbC' LIKE 'abc' COLLATE case_insensitive <lineannotation>true</lineannotation>
|
|
'AbC' LIKE 'a%' COLLATE case_insensitive <lineannotation>true</lineannotation>
|
|
</programlisting>
|
|
With collations that ignore certain characters or in general that consider
|
|
strings of different lengths equal, the semantics can become a bit more
|
|
complicated. Consider these examples:
|
|
<programlisting>
|
|
'.foo.' LIKE 'foo' COLLATE ign_punct <lineannotation>true</lineannotation>
|
|
'.foo.' LIKE 'f_o' COLLATE ign_punct <lineannotation>true</lineannotation>
|
|
'.foo.' LIKE '_oo' COLLATE ign_punct <lineannotation>false</lineannotation>
|
|
</programlisting>
|
|
The way the matching works is that the pattern is partitioned into
|
|
sequences of wildcards and non-wildcard strings (wildcards being
|
|
<literal>_</literal> and <literal>%</literal>). For example, the pattern
|
|
<literal>f_o</literal> is partitioned into <literal>f, _, o</literal>, the
|
|
pattern <literal>_oo</literal> is partitioned into <literal>_,
|
|
oo</literal>. The input string matches the pattern if it can be
|
|
partitioned in such a way that the wildcards match one character or any
|
|
number of characters respectively and the non-wildcard partitions are
|
|
equal under the applicable collation. So for example, <literal>'.foo.'
|
|
LIKE 'f_o' COLLATE ign_punct</literal> is true because one can partition
|
|
<literal>.foo.</literal> into <literal>.f, o, o.</literal>, and then
|
|
<literal>'.f' = 'f' COLLATE ign_punct</literal>, <literal>'o'</literal>
|
|
matches the <literal>_</literal> wildcard, and <literal>'o.' = 'o' COLLATE
|
|
ign_punct</literal>. But <literal>'.foo.' LIKE '_oo' COLLATE
|
|
ign_punct</literal> is false because <literal>.foo.</literal> cannot be
|
|
partitioned in a way that the first character is any character and the
|
|
rest of the string compares equal to <literal>oo</literal>. (Note that
|
|
the single-character wildcard always matches exactly one character,
|
|
independent of the collation. So in this example, the
|
|
<literal>_</literal> would match <literal>.</literal>, but then the rest
|
|
of the input string won't match the rest of the pattern.)
|
|
</para>
|
|
|
|
<para>
|
|
<function>LIKE</function> pattern matching always covers the entire
|
|
string. Therefore, if it's desired to match a sequence anywhere within
|
|
a string, the pattern must start and end with a percent sign.
|
|
</para>
|
|
|
|
<para>
|
|
To match a literal underscore or percent sign without matching
|
|
other characters, the respective character in
|
|
<replaceable>pattern</replaceable> must be
|
|
preceded by the escape character. The default escape
|
|
character is the backslash but a different one can be selected by
|
|
using the <literal>ESCAPE</literal> clause. To match the escape
|
|
character itself, write two escape characters.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
If you have <xref linkend="guc-standard-conforming-strings"/> turned off,
|
|
any backslashes you write in literal string constants will need to be
|
|
doubled. See <xref linkend="sql-syntax-strings"/> for more information.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
It's also possible to select no escape character by writing
|
|
<literal>ESCAPE ''</literal>. This effectively disables the
|
|
escape mechanism, which makes it impossible to turn off the
|
|
special meaning of underscore and percent signs in the pattern.
|
|
</para>
|
|
|
|
<para>
|
|
According to the SQL standard, omitting <literal>ESCAPE</literal>
|
|
means there is no escape character (rather than defaulting to a
|
|
backslash), and a zero-length <literal>ESCAPE</literal> value is
|
|
disallowed. <productname>PostgreSQL</productname>'s behavior in
|
|
this regard is therefore slightly nonstandard.
|
|
</para>
|
|
|
|
<para>
|
|
The key word <token>ILIKE</token> can be used instead of
|
|
<token>LIKE</token> to make the match case-insensitive according to the
|
|
active locale. (But this does not support nondeterministic collations.)
|
|
This is not in the <acronym>SQL</acronym> standard but is a
|
|
<productname>PostgreSQL</productname> extension.
|
|
</para>
|
|
|
|
<para>
|
|
The operator <literal>~~</literal> is equivalent to
|
|
<function>LIKE</function>, and <literal>~~*</literal> corresponds to
|
|
<function>ILIKE</function>. There are also
|
|
<literal>!~~</literal> and <literal>!~~*</literal> operators that
|
|
represent <function>NOT LIKE</function> and <function>NOT
|
|
ILIKE</function>, respectively. All of these operators are
|
|
<productname>PostgreSQL</productname>-specific. You may see these
|
|
operator names in <command>EXPLAIN</command> output and similar
|
|
places, since the parser actually translates <function>LIKE</function>
|
|
et al. to these operators.
|
|
</para>
|
|
|
|
<para>
|
|
The phrases <function>LIKE</function>, <function>ILIKE</function>,
|
|
<function>NOT LIKE</function>, and <function>NOT ILIKE</function> are
|
|
generally treated as operators
|
|
in <productname>PostgreSQL</productname> syntax; for example they can
|
|
be used in <replaceable>expression</replaceable>
|
|
<replaceable>operator</replaceable> ANY
|
|
(<replaceable>subquery</replaceable>) constructs, although
|
|
an <literal>ESCAPE</literal> clause cannot be included there. In some
|
|
obscure cases it may be necessary to use the underlying operator names
|
|
instead.
|
|
</para>
|
|
|
|
<para>
|
|
Also see the starts-with operator <literal>^@</literal> and the
|
|
corresponding <function>starts_with()</function> function, which are
|
|
useful in cases where simply matching the beginning of a string is
|
|
needed.
|
|
</para>
|
|
</sect2>
|
|
|
|
|
|
<sect2 id="functions-similarto-regexp">
|
|
<title><function>SIMILAR TO</function> Regular Expressions</title>
|
|
|
|
<indexterm>
|
|
<primary>regular expression</primary>
|
|
<!-- <seealso>pattern matching</seealso> breaks index build -->
|
|
</indexterm>
|
|
|
|
<indexterm>
|
|
<primary>SIMILAR TO</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>substring</primary>
|
|
</indexterm>
|
|
|
|
<synopsis>
|
|
<replaceable>string</replaceable> SIMILAR TO <replaceable>pattern</replaceable> <optional>ESCAPE <replaceable>escape-character</replaceable></optional>
|
|
<replaceable>string</replaceable> NOT SIMILAR TO <replaceable>pattern</replaceable> <optional>ESCAPE <replaceable>escape-character</replaceable></optional>
|
|
</synopsis>
|
|
|
|
<para>
|
|
The <function>SIMILAR TO</function> operator returns true or
|
|
false depending on whether its pattern matches the given string.
|
|
It is similar to <function>LIKE</function>, except that it
|
|
interprets the pattern using the SQL standard's definition of a
|
|
regular expression. SQL regular expressions are a curious cross
|
|
between <function>LIKE</function> notation and common (POSIX) regular
|
|
expression notation.
|
|
</para>
|
|
|
|
<para>
|
|
Like <function>LIKE</function>, the <function>SIMILAR TO</function>
|
|
operator succeeds only if its pattern matches the entire string;
|
|
this is unlike common regular expression behavior where the pattern
|
|
can match any part of the string.
|
|
Also like
|
|
<function>LIKE</function>, <function>SIMILAR TO</function> uses
|
|
<literal>_</literal> and <literal>%</literal> as wildcard characters denoting
|
|
any single character and any string, respectively (these are
|
|
comparable to <literal>.</literal> and <literal>.*</literal> in POSIX regular
|
|
expressions).
|
|
</para>
|
|
|
|
<para>
|
|
In addition to these facilities borrowed from <function>LIKE</function>,
|
|
<function>SIMILAR TO</function> supports these pattern-matching
|
|
metacharacters borrowed from POSIX regular expressions:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
<literal>|</literal> denotes alternation (either of two alternatives).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>*</literal> denotes repetition of the previous item zero
|
|
or more times.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>+</literal> denotes repetition of the previous item one
|
|
or more times.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>?</literal> denotes repetition of the previous item zero
|
|
or one time.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>{</literal><replaceable>m</replaceable><literal>}</literal> denotes repetition
|
|
of the previous item exactly <replaceable>m</replaceable> times.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,}</literal> denotes repetition
|
|
of the previous item <replaceable>m</replaceable> or more times.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}</literal>
|
|
denotes repetition of the previous item at least <replaceable>m</replaceable> and
|
|
not more than <replaceable>n</replaceable> times.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Parentheses <literal>()</literal> can be used to group items into
|
|
a single logical item.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A bracket expression <literal>[...]</literal> specifies a character
|
|
class, just as in POSIX regular expressions.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
Notice that the period (<literal>.</literal>) is not a metacharacter
|
|
for <function>SIMILAR TO</function>.
|
|
</para>
|
|
|
|
<para>
|
|
As with <function>LIKE</function>, a backslash disables the special
|
|
meaning of any of these metacharacters. A different escape character
|
|
can be specified with <literal>ESCAPE</literal>, or the escape
|
|
capability can be disabled by writing <literal>ESCAPE ''</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
According to the SQL standard, omitting <literal>ESCAPE</literal>
|
|
means there is no escape character (rather than defaulting to a
|
|
backslash), and a zero-length <literal>ESCAPE</literal> value is
|
|
disallowed. <productname>PostgreSQL</productname>'s behavior in
|
|
this regard is therefore slightly nonstandard.
|
|
</para>
|
|
|
|
<para>
|
|
Another nonstandard extension is that following the escape character
|
|
with a letter or digit provides access to the escape sequences
|
|
defined for POSIX regular expressions; see
|
|
<xref linkend="posix-character-entry-escapes-table"/>,
|
|
<xref linkend="posix-class-shorthand-escapes-table"/>, and
|
|
<xref linkend="posix-constraint-escapes-table"/> below.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
'abc' SIMILAR TO 'abc' <lineannotation>true</lineannotation>
|
|
'abc' SIMILAR TO 'a' <lineannotation>false</lineannotation>
|
|
'abc' SIMILAR TO '%(b|d)%' <lineannotation>true</lineannotation>
|
|
'abc' SIMILAR TO '(b|c)%' <lineannotation>false</lineannotation>
|
|
'-abc-' SIMILAR TO '%\mabc\M%' <lineannotation>true</lineannotation>
|
|
'xabcy' SIMILAR TO '%\mabc\M%' <lineannotation>false</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>substring</function> function with three parameters
|
|
provides extraction of a substring that matches an SQL
|
|
regular expression pattern. The function can be written according
|
|
to standard SQL syntax:
|
|
<synopsis>
|
|
substring(<replaceable>string</replaceable> similar <replaceable>pattern</replaceable> escape <replaceable>escape-character</replaceable>)
|
|
</synopsis>
|
|
or using the now obsolete SQL:1999 syntax:
|
|
<synopsis>
|
|
substring(<replaceable>string</replaceable> from <replaceable>pattern</replaceable> for <replaceable>escape-character</replaceable>)
|
|
</synopsis>
|
|
or as a plain three-argument function:
|
|
<synopsis>
|
|
substring(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>, <replaceable>escape-character</replaceable>)
|
|
</synopsis>
|
|
As with <literal>SIMILAR TO</literal>, the
|
|
specified pattern must match the entire data string, or else the
|
|
function fails and returns null. To indicate the part of the
|
|
pattern for which the matching data sub-string is of interest,
|
|
the pattern should contain
|
|
two occurrences of the escape character followed by a double quote
|
|
(<literal>"</literal>). <!-- " font-lock sanity -->
|
|
The text matching the portion of the pattern
|
|
between these separators is returned when the match is successful.
|
|
</para>
|
|
|
|
<para>
|
|
The escape-double-quote separators actually
|
|
divide <function>substring</function>'s pattern into three independent
|
|
regular expressions; for example, a vertical bar (<literal>|</literal>)
|
|
in any of the three sections affects only that section. Also, the first
|
|
and third of these regular expressions are defined to match the smallest
|
|
possible amount of text, not the largest, when there is any ambiguity
|
|
about how much of the data string matches which pattern. (In POSIX
|
|
parlance, the first and third regular expressions are forced to be
|
|
non-greedy.)
|
|
</para>
|
|
|
|
<para>
|
|
As an extension to the SQL standard, <productname>PostgreSQL</productname>
|
|
allows there to be just one escape-double-quote separator, in which case
|
|
the third regular expression is taken as empty; or no separators, in which
|
|
case the first and third regular expressions are taken as empty.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples, with <literal>#"</literal> delimiting the return string:
|
|
<programlisting>
|
|
substring('foobar' similar '%#"o_b#"%' escape '#') <lineannotation>oob</lineannotation>
|
|
substring('foobar' similar '#"o_b#"%' escape '#') <lineannotation>NULL</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
</sect2>
|
|
|
|
<sect2 id="functions-posix-regexp">
|
|
<title><acronym>POSIX</acronym> Regular Expressions</title>
|
|
|
|
<indexterm zone="functions-posix-regexp">
|
|
<primary>regular expression</primary>
|
|
<seealso>pattern matching</seealso>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>substring</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_count</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_instr</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_like</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_match</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_matches</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_replace</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_split_to_table</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_split_to_array</primary>
|
|
</indexterm>
|
|
<indexterm>
|
|
<primary>regexp_substr</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
<xref linkend="functions-posix-table"/> lists the available
|
|
operators for pattern matching using POSIX regular expressions.
|
|
</para>
|
|
|
|
<table id="functions-posix-table">
|
|
<title>Regular Expression Match Operators</title>
|
|
|
|
<tgroup cols="1">
|
|
<thead>
|
|
<row>
|
|
<entry role="func_table_entry"><para role="func_signature">
|
|
Operator
|
|
</para>
|
|
<para>
|
|
Description
|
|
</para>
|
|
<para>
|
|
Example(s)
|
|
</para></entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry role="func_table_entry"><para role="func_signature">
|
|
<type>text</type> <literal>~</literal> <type>text</type>
|
|
<returnvalue>boolean</returnvalue>
|
|
</para>
|
|
<para>
|
|
String matches regular expression, case sensitively
|
|
</para>
|
|
<para>
|
|
<literal>'thomas' ~ 't.*ma'</literal>
|
|
<returnvalue>t</returnvalue>
|
|
</para></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry role="func_table_entry"><para role="func_signature">
|
|
<type>text</type> <literal>~*</literal> <type>text</type>
|
|
<returnvalue>boolean</returnvalue>
|
|
</para>
|
|
<para>
|
|
String matches regular expression, case-insensitively
|
|
</para>
|
|
<para>
|
|
<literal>'thomas' ~* 'T.*ma'</literal>
|
|
<returnvalue>t</returnvalue>
|
|
</para></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry role="func_table_entry"><para role="func_signature">
|
|
<type>text</type> <literal>!~</literal> <type>text</type>
|
|
<returnvalue>boolean</returnvalue>
|
|
</para>
|
|
<para>
|
|
String does not match regular expression, case sensitively
|
|
</para>
|
|
<para>
|
|
<literal>'thomas' !~ 't.*max'</literal>
|
|
<returnvalue>t</returnvalue>
|
|
</para></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry role="func_table_entry"><para role="func_signature">
|
|
<type>text</type> <literal>!~*</literal> <type>text</type>
|
|
<returnvalue>boolean</returnvalue>
|
|
</para>
|
|
<para>
|
|
String does not match regular expression, case-insensitively
|
|
</para>
|
|
<para>
|
|
<literal>'thomas' !~* 'T.*ma'</literal>
|
|
<returnvalue>f</returnvalue>
|
|
</para></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
<acronym>POSIX</acronym> regular expressions provide a more
|
|
powerful means for pattern matching than the <function>LIKE</function> and
|
|
<function>SIMILAR TO</function> operators.
|
|
Many Unix tools such as <command>egrep</command>,
|
|
<command>sed</command>, or <command>awk</command> use a pattern
|
|
matching language that is similar to the one described here.
|
|
</para>
|
|
|
|
<para>
|
|
A regular expression is a character sequence that is an
|
|
abbreviated definition of a set of strings (a <firstterm>regular
|
|
set</firstterm>). A string is said to match a regular expression
|
|
if it is a member of the regular set described by the regular
|
|
expression. As with <function>LIKE</function>, pattern characters
|
|
match string characters exactly unless they are special characters
|
|
in the regular expression language — but regular expressions use
|
|
different special characters than <function>LIKE</function> does.
|
|
Unlike <function>LIKE</function> patterns, a
|
|
regular expression is allowed to match anywhere within a string, unless
|
|
the regular expression is explicitly anchored to the beginning or
|
|
end of the string.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
'abcd' ~ 'bc' <lineannotation>true</lineannotation>
|
|
'abcd' ~ 'a.c' <lineannotation>true — dot matches any character</lineannotation>
|
|
'abcd' ~ 'a.*d' <lineannotation>true — <literal>*</literal> repeats the preceding pattern item</lineannotation>
|
|
'abcd' ~ '(b|x)' <lineannotation>true — <literal>|</literal> means OR, parentheses group</lineannotation>
|
|
'abcd' ~ '^a' <lineannotation>true — <literal>^</literal> anchors to start of string</lineannotation>
|
|
'abcd' ~ '^(b|c)' <lineannotation>false — would match except for anchoring</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <acronym>POSIX</acronym> pattern language is described in much
|
|
greater detail below.
|
|
</para>
|
|
|
|
<para>
|
|
The <function>substring</function> function with two parameters,
|
|
<function>substring(<replaceable>string</replaceable> from
|
|
<replaceable>pattern</replaceable>)</function>, provides extraction of a
|
|
substring
|
|
that matches a POSIX regular expression pattern. It returns null if
|
|
there is no match, otherwise the first portion of the text that matched the
|
|
pattern. But if the pattern contains any parentheses, the portion
|
|
of the text that matched the first parenthesized subexpression (the
|
|
one whose left parenthesis comes first) is
|
|
returned. You can put parentheses around the whole expression
|
|
if you want to use parentheses within it without triggering this
|
|
exception. If you need parentheses in the pattern before the
|
|
subexpression you want to extract, see the non-capturing parentheses
|
|
described below.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
substring('foobar' from 'o.b') <lineannotation>oob</lineannotation>
|
|
substring('foobar' from 'o(.)b') <lineannotation>o</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_count</function> function counts the number of
|
|
places where a POSIX regular expression pattern matches a string.
|
|
It has the syntax
|
|
<function>regexp_count</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>start</replaceable>
|
|
<optional>, <replaceable>flags</replaceable>
|
|
</optional></optional>).
|
|
<replaceable>pattern</replaceable> is searched for
|
|
in <replaceable>string</replaceable>, normally from the beginning of
|
|
the string, but if the <replaceable>start</replaceable> parameter is
|
|
provided then beginning from that character index.
|
|
The <replaceable>flags</replaceable> parameter is an optional text
|
|
string containing zero or more single-letter flags that change the
|
|
function's behavior. For example, including <literal>i</literal> in
|
|
<replaceable>flags</replaceable> specifies case-insensitive matching.
|
|
Supported flags are described in
|
|
<xref linkend="posix-embedded-options-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
regexp_count('ABCABCAXYaxy', 'A.') <lineannotation>3</lineannotation>
|
|
regexp_count('ABCABCAXYaxy', 'A.', 1, 'i') <lineannotation>4</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_instr</function> function returns the starting or
|
|
ending position of the <replaceable>N</replaceable>'th match of a
|
|
POSIX regular expression pattern to a string, or zero if there is no
|
|
such match. It has the syntax
|
|
<function>regexp_instr</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>start</replaceable>
|
|
<optional>, <replaceable>N</replaceable>
|
|
<optional>, <replaceable>endoption</replaceable>
|
|
<optional>, <replaceable>flags</replaceable>
|
|
<optional>, <replaceable>subexpr</replaceable>
|
|
</optional></optional></optional></optional></optional>).
|
|
<replaceable>pattern</replaceable> is searched for
|
|
in <replaceable>string</replaceable>, normally from the beginning of
|
|
the string, but if the <replaceable>start</replaceable> parameter is
|
|
provided then beginning from that character index.
|
|
If <replaceable>N</replaceable> is specified
|
|
then the <replaceable>N</replaceable>'th match of the pattern
|
|
is located, otherwise the first match is located.
|
|
If the <replaceable>endoption</replaceable> parameter is omitted or
|
|
specified as zero, the function returns the position of the first
|
|
character of the match. Otherwise, <replaceable>endoption</replaceable>
|
|
must be one, and the function returns the position of the character
|
|
following the match.
|
|
The <replaceable>flags</replaceable> parameter is an optional text
|
|
string containing zero or more single-letter flags that change the
|
|
function's behavior. Supported flags are described
|
|
in <xref linkend="posix-embedded-options-table"/>.
|
|
For a pattern containing parenthesized
|
|
subexpressions, <replaceable>subexpr</replaceable> is an integer
|
|
indicating which subexpression is of interest: the result identifies
|
|
the position of the substring matching that subexpression.
|
|
Subexpressions are numbered in the order of their leading parentheses.
|
|
When <replaceable>subexpr</replaceable> is omitted or zero, the result
|
|
identifies the position of the whole match regardless of
|
|
parenthesized subexpressions.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
regexp_instr('number of your street, town zip, FR', '[^,]+', 1, 2)
|
|
<lineannotation>23</lineannotation>
|
|
regexp_instr(string=>'ABCDEFGHI', pattern=>'(c..)(...)', start=>1, "N"=>1, endoption=>0, flags=>'i', subexpr=>2)
|
|
<lineannotation>6</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_like</function> function checks whether a match
|
|
of a POSIX regular expression pattern occurs within a string,
|
|
returning boolean true or false. It has the syntax
|
|
<function>regexp_like</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>flags</replaceable> </optional>).
|
|
The <replaceable>flags</replaceable> parameter is an optional text
|
|
string containing zero or more single-letter flags that change the
|
|
function's behavior. Supported flags are described
|
|
in <xref linkend="posix-embedded-options-table"/>.
|
|
This function has the same results as the <literal>~</literal>
|
|
operator if no flags are specified. If only the <literal>i</literal>
|
|
flag is specified, it has the same results as
|
|
the <literal>~*</literal> operator.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
regexp_like('Hello World', 'world') <lineannotation>false</lineannotation>
|
|
regexp_like('Hello World', 'world', 'i') <lineannotation>true</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_match</function> function returns a text array of
|
|
matching substring(s) within the first match of a POSIX
|
|
regular expression pattern to a string. It has the syntax
|
|
<function>regexp_match</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable> <optional>, <replaceable>flags</replaceable> </optional>).
|
|
If there is no match, the result is <literal>NULL</literal>.
|
|
If a match is found, and the <replaceable>pattern</replaceable> contains no
|
|
parenthesized subexpressions, then the result is a single-element text
|
|
array containing the substring matching the whole pattern.
|
|
If a match is found, and the <replaceable>pattern</replaceable> contains
|
|
parenthesized subexpressions, then the result is a text array
|
|
whose <replaceable>n</replaceable>'th element is the substring matching
|
|
the <replaceable>n</replaceable>'th parenthesized subexpression of
|
|
the <replaceable>pattern</replaceable> (not counting <quote>non-capturing</quote>
|
|
parentheses; see below for details).
|
|
The <replaceable>flags</replaceable> parameter is an optional text string
|
|
containing zero or more single-letter flags that change the function's
|
|
behavior. Supported flags are described
|
|
in <xref linkend="posix-embedded-options-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
SELECT regexp_match('foobarbequebaz', 'bar.*que');
|
|
regexp_match
|
|
--------------
|
|
{barbeque}
|
|
(1 row)
|
|
|
|
SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
|
|
regexp_match
|
|
--------------
|
|
{bar,beque}
|
|
(1 row)
|
|
</programlisting>
|
|
</para>
|
|
|
|
<tip>
|
|
<para>
|
|
In the common case where you just want the whole matching substring
|
|
or <literal>NULL</literal> for no match, the best solution is to
|
|
use <function>regexp_substr()</function>.
|
|
However, <function>regexp_substr()</function> only exists
|
|
in <productname>PostgreSQL</productname> version 15 and up. When
|
|
working in older versions, you can extract the first element
|
|
of <function>regexp_match()</function>'s result, for example:
|
|
<programlisting>
|
|
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
|
|
regexp_match
|
|
--------------
|
|
barbeque
|
|
(1 row)
|
|
</programlisting>
|
|
</para>
|
|
</tip>
|
|
|
|
<para>
|
|
The <function>regexp_matches</function> function returns a set of text arrays
|
|
of matching substring(s) within matches of a POSIX regular
|
|
expression pattern to a string. It has the same syntax as
|
|
<function>regexp_match</function>.
|
|
This function returns no rows if there is no match, one row if there is
|
|
a match and the <literal>g</literal> flag is not given, or <replaceable>N</replaceable>
|
|
rows if there are <replaceable>N</replaceable> matches and the <literal>g</literal> flag
|
|
is given. Each returned row is a text array containing the whole
|
|
matched substring or the substrings matching parenthesized
|
|
subexpressions of the <replaceable>pattern</replaceable>, just as described above
|
|
for <function>regexp_match</function>.
|
|
<function>regexp_matches</function> accepts all the flags shown
|
|
in <xref linkend="posix-embedded-options-table"/>, plus
|
|
the <literal>g</literal> flag which commands it to return all matches, not
|
|
just the first one.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
SELECT regexp_matches('foo', 'not there');
|
|
regexp_matches
|
|
----------------
|
|
(0 rows)
|
|
|
|
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
|
|
regexp_matches
|
|
----------------
|
|
{bar,beque}
|
|
{bazil,barf}
|
|
(2 rows)
|
|
</programlisting>
|
|
</para>
|
|
|
|
<tip>
|
|
<para>
|
|
In most cases <function>regexp_matches()</function> should be used with
|
|
the <literal>g</literal> flag, since if you only want the first match, it's
|
|
easier and more efficient to use <function>regexp_match()</function>.
|
|
However, <function>regexp_match()</function> only exists
|
|
in <productname>PostgreSQL</productname> version 10 and up. When working in older
|
|
versions, a common trick is to place a <function>regexp_matches()</function>
|
|
call in a sub-select, for example:
|
|
<programlisting>
|
|
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
|
|
</programlisting>
|
|
This produces a text array if there's a match, or <literal>NULL</literal> if
|
|
not, the same as <function>regexp_match()</function> would do. Without the
|
|
sub-select, this query would produce no output at all for table rows
|
|
without a match, which is typically not the desired behavior.
|
|
</para>
|
|
</tip>
|
|
|
|
<para>
|
|
The <function>regexp_replace</function> function provides substitution of
|
|
new text for substrings that match POSIX regular expression patterns.
|
|
It has the syntax
|
|
<function>regexp_replace</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>, <replaceable>replacement</replaceable>
|
|
<optional>, <replaceable>flags</replaceable> </optional>)
|
|
or
|
|
<function>regexp_replace</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>, <replaceable>replacement</replaceable>,
|
|
<replaceable>start</replaceable>
|
|
<optional>, <replaceable>N</replaceable>
|
|
<optional>, <replaceable>flags</replaceable> </optional></optional>).
|
|
The source <replaceable>string</replaceable> is returned unchanged if
|
|
there is no match to the <replaceable>pattern</replaceable>. If there is a
|
|
match, the <replaceable>string</replaceable> is returned with the
|
|
<replaceable>replacement</replaceable> string substituted for the matching
|
|
substring. The <replaceable>replacement</replaceable> string can contain
|
|
<literal>\</literal><replaceable>n</replaceable>, where <replaceable>n</replaceable> is 1
|
|
through 9, to indicate that the source substring matching the
|
|
<replaceable>n</replaceable>'th parenthesized subexpression of the pattern should be
|
|
inserted, and it can contain <literal>\&</literal> to indicate that the
|
|
substring matching the entire pattern should be inserted. Write
|
|
<literal>\\</literal> if you need to put a literal backslash in the replacement
|
|
text.
|
|
<replaceable>pattern</replaceable> is searched for
|
|
in <replaceable>string</replaceable>, normally from the beginning of
|
|
the string, but if the <replaceable>start</replaceable> parameter is
|
|
provided then beginning from that character index.
|
|
By default, only the first match of the pattern is replaced.
|
|
If <replaceable>N</replaceable> is specified and is greater than zero,
|
|
then the <replaceable>N</replaceable>'th match of the pattern
|
|
is replaced.
|
|
If the <literal>g</literal> flag is given, or
|
|
if <replaceable>N</replaceable> is specified and is zero, then all
|
|
matches at or after the <replaceable>start</replaceable> position are
|
|
replaced. (The <literal>g</literal> flag is ignored
|
|
when <replaceable>N</replaceable> is specified.)
|
|
The <replaceable>flags</replaceable> parameter is an optional text
|
|
string containing zero or more single-letter flags that change the
|
|
function's behavior. Supported flags (though
|
|
not <literal>g</literal>) are
|
|
described in <xref linkend="posix-embedded-options-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
regexp_replace('foobarbaz', 'b..', 'X')
|
|
<lineannotation>fooXbaz</lineannotation>
|
|
regexp_replace('foobarbaz', 'b..', 'X', 'g')
|
|
<lineannotation>fooXX</lineannotation>
|
|
regexp_replace('foobarbaz', 'b(..)', 'X\1Y', 'g')
|
|
<lineannotation>fooXarYXazY</lineannotation>
|
|
regexp_replace('A PostgreSQL function', 'a|e|i|o|u', 'X', 1, 0, 'i')
|
|
<lineannotation>X PXstgrXSQL fXnctXXn</lineannotation>
|
|
regexp_replace(string=>'A PostgreSQL function', pattern=>'a|e|i|o|u', replacement=>'X', start=>1, "N"=>3, flags=>'i')
|
|
<lineannotation>A PostgrXSQL function</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_split_to_table</function> function splits a string using a POSIX
|
|
regular expression pattern as a delimiter. It has the syntax
|
|
<function>regexp_split_to_table</function>(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>flags</replaceable> </optional>).
|
|
If there is no match to the <replaceable>pattern</replaceable>, the function returns the
|
|
<replaceable>string</replaceable>. If there is at least one match, for each match it returns
|
|
the text from the end of the last match (or the beginning of the string)
|
|
to the beginning of the match. When there are no more matches, it
|
|
returns the text from the end of the last match to the end of the string.
|
|
The <replaceable>flags</replaceable> parameter is an optional text string containing
|
|
zero or more single-letter flags that change the function's behavior.
|
|
<function>regexp_split_to_table</function> supports the flags described in
|
|
<xref linkend="posix-embedded-options-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_split_to_array</function> function behaves the same as
|
|
<function>regexp_split_to_table</function>, except that <function>regexp_split_to_array</function>
|
|
returns its result as an array of <type>text</type>. It has the syntax
|
|
<function>regexp_split_to_array</function>(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>flags</replaceable> </optional>).
|
|
The parameters are the same as for <function>regexp_split_to_table</function>.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
SELECT foo FROM regexp_split_to_table('the quick brown fox jumps over the lazy dog', '\s+') AS foo;
|
|
foo
|
|
-------
|
|
the
|
|
quick
|
|
brown
|
|
fox
|
|
jumps
|
|
over
|
|
the
|
|
lazy
|
|
dog
|
|
(9 rows)
|
|
|
|
SELECT regexp_split_to_array('the quick brown fox jumps over the lazy dog', '\s+');
|
|
regexp_split_to_array
|
|
-----------------------------------------------
|
|
{the,quick,brown,fox,jumps,over,the,lazy,dog}
|
|
(1 row)
|
|
|
|
SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
|
|
foo
|
|
-----
|
|
t
|
|
h
|
|
e
|
|
q
|
|
u
|
|
i
|
|
c
|
|
k
|
|
b
|
|
r
|
|
o
|
|
w
|
|
n
|
|
f
|
|
o
|
|
x
|
|
(16 rows)
|
|
</programlisting>
|
|
</para>
|
|
|
|
<para>
|
|
As the last example demonstrates, the regexp split functions ignore
|
|
zero-length matches that occur at the start or end of the string
|
|
or immediately after a previous match. This is contrary to the strict
|
|
definition of regexp matching that is implemented by
|
|
the other regexp functions, but is usually the most convenient behavior
|
|
in practice. Other software systems such as Perl use similar definitions.
|
|
</para>
|
|
|
|
<para>
|
|
The <function>regexp_substr</function> function returns the substring
|
|
that matches a POSIX regular expression pattern,
|
|
or <literal>NULL</literal> if there is no match. It has the syntax
|
|
<function>regexp_substr</function>(<replaceable>string</replaceable>,
|
|
<replaceable>pattern</replaceable>
|
|
<optional>, <replaceable>start</replaceable>
|
|
<optional>, <replaceable>N</replaceable>
|
|
<optional>, <replaceable>flags</replaceable>
|
|
<optional>, <replaceable>subexpr</replaceable>
|
|
</optional></optional></optional></optional>).
|
|
<replaceable>pattern</replaceable> is searched for
|
|
in <replaceable>string</replaceable>, normally from the beginning of
|
|
the string, but if the <replaceable>start</replaceable> parameter is
|
|
provided then beginning from that character index.
|
|
If <replaceable>N</replaceable> is specified
|
|
then the <replaceable>N</replaceable>'th match of the pattern
|
|
is returned, otherwise the first match is returned.
|
|
The <replaceable>flags</replaceable> parameter is an optional text
|
|
string containing zero or more single-letter flags that change the
|
|
function's behavior. Supported flags are described
|
|
in <xref linkend="posix-embedded-options-table"/>.
|
|
For a pattern containing parenthesized
|
|
subexpressions, <replaceable>subexpr</replaceable> is an integer
|
|
indicating which subexpression is of interest: the result is the
|
|
substring matching that subexpression.
|
|
Subexpressions are numbered in the order of their leading parentheses.
|
|
When <replaceable>subexpr</replaceable> is omitted or zero, the result
|
|
is the whole match regardless of parenthesized subexpressions.
|
|
</para>
|
|
|
|
<para>
|
|
Some examples:
|
|
<programlisting>
|
|
regexp_substr('number of your street, town zip, FR', '[^,]+', 1, 2)
|
|
<lineannotation> town zip</lineannotation>
|
|
regexp_substr('ABCDEFGHI', '(c..)(...)', 1, 1, 'i', 2)
|
|
<lineannotation>FGH</lineannotation>
|
|
</programlisting>
|
|
</para>
|
|
|
|
<!-- derived from the re_syntax.n man page -->
|
|
|
|
<sect3 id="posix-syntax-details">
|
|
<title>Regular Expression Details</title>
|
|
|
|
<para>
|
|
<productname>PostgreSQL</productname>'s regular expressions are implemented
|
|
using a software package written by Henry Spencer. Much of
|
|
the description of regular expressions below is copied verbatim from his
|
|
manual.
|
|
</para>
|
|
|
|
<para>
|
|
Regular expressions (<acronym>RE</acronym>s), as defined in
|
|
<acronym>POSIX</acronym> 1003.2, come in two forms:
|
|
<firstterm>extended</firstterm> <acronym>RE</acronym>s or <acronym>ERE</acronym>s
|
|
(roughly those of <command>egrep</command>), and
|
|
<firstterm>basic</firstterm> <acronym>RE</acronym>s or <acronym>BRE</acronym>s
|
|
(roughly those of <command>ed</command>).
|
|
<productname>PostgreSQL</productname> supports both forms, and
|
|
also implements some extensions
|
|
that are not in the POSIX standard, but have become widely used
|
|
due to their availability in programming languages such as Perl and Tcl.
|
|
<acronym>RE</acronym>s using these non-POSIX extensions are called
|
|
<firstterm>advanced</firstterm> <acronym>RE</acronym>s or <acronym>ARE</acronym>s
|
|
in this documentation. AREs are almost an exact superset of EREs,
|
|
but BREs have several notational incompatibilities (as well as being
|
|
much more limited).
|
|
We first describe the ARE and ERE forms, noting features that apply
|
|
only to AREs, and then describe how BREs differ.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
<productname>PostgreSQL</productname> always initially presumes that a regular
|
|
expression follows the ARE rules. However, the more limited ERE or
|
|
BRE rules can be chosen by prepending an <firstterm>embedded option</firstterm>
|
|
to the RE pattern, as described in <xref linkend="posix-metasyntax"/>.
|
|
This can be useful for compatibility with applications that expect
|
|
exactly the <acronym>POSIX</acronym> 1003.2 rules.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
A regular expression is defined as one or more
|
|
<firstterm>branches</firstterm>, separated by
|
|
<literal>|</literal>. It matches anything that matches one of the
|
|
branches.
|
|
</para>
|
|
|
|
<para>
|
|
A branch is zero or more <firstterm>quantified atoms</firstterm> or
|
|
<firstterm>constraints</firstterm>, concatenated.
|
|
It matches a match for the first, followed by a match for the second, etc.;
|
|
an empty branch matches the empty string.
|
|
</para>
|
|
|
|
<para>
|
|
A quantified atom is an <firstterm>atom</firstterm> possibly followed
|
|
by a single <firstterm>quantifier</firstterm>.
|
|
Without a quantifier, it matches a match for the atom.
|
|
With a quantifier, it can match some number of matches of the atom.
|
|
An <firstterm>atom</firstterm> can be any of the possibilities
|
|
shown in <xref linkend="posix-atoms-table"/>.
|
|
The possible quantifiers and their meanings are shown in
|
|
<xref linkend="posix-quantifiers-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
A <firstterm>constraint</firstterm> matches an empty string, but matches only when
|
|
specific conditions are met. A constraint can be used where an atom
|
|
could be used, except it cannot be followed by a quantifier.
|
|
The simple constraints are shown in
|
|
<xref linkend="posix-constraints-table"/>;
|
|
some more constraints are described later.
|
|
</para>
|
|
|
|
|
|
<table id="posix-atoms-table">
|
|
<title>Regular Expression Atoms</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Atom</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>(</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> (where <replaceable>re</replaceable> is any regular expression)
|
|
matches a match for
|
|
<replaceable>re</replaceable>, with the match noted for possible reporting </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>(?:</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> as above, but the match is not noted for reporting
|
|
(a <quote>non-capturing</quote> set of parentheses)
|
|
(AREs only) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>.</literal> </entry>
|
|
<entry> matches any single character </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>[</literal><replaceable>chars</replaceable><literal>]</literal> </entry>
|
|
<entry> a <firstterm>bracket expression</firstterm>,
|
|
matching any one of the <replaceable>chars</replaceable> (see
|
|
<xref linkend="posix-bracket-expressions"/> for more detail) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>k</replaceable> </entry>
|
|
<entry> (where <replaceable>k</replaceable> is a non-alphanumeric character)
|
|
matches that character taken as an ordinary character,
|
|
e.g., <literal>\\</literal> matches a backslash character </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>c</replaceable> </entry>
|
|
<entry> where <replaceable>c</replaceable> is alphanumeric
|
|
(possibly followed by other characters)
|
|
is an <firstterm>escape</firstterm>, see <xref linkend="posix-escape-sequences"/>
|
|
(AREs only; in EREs and BREs, this matches <replaceable>c</replaceable>) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>{</literal> </entry>
|
|
<entry> when followed by a character other than a digit,
|
|
matches the left-brace character <literal>{</literal>;
|
|
when followed by a digit, it is the beginning of a
|
|
<replaceable>bound</replaceable> (see below) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <replaceable>x</replaceable> </entry>
|
|
<entry> where <replaceable>x</replaceable> is a single character with no other
|
|
significance, matches that character </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
An RE cannot end with a backslash (<literal>\</literal>).
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
If you have <xref linkend="guc-standard-conforming-strings"/> turned off,
|
|
any backslashes you write in literal string constants will need to be
|
|
doubled. See <xref linkend="sql-syntax-strings"/> for more information.
|
|
</para>
|
|
</note>
|
|
|
|
<table id="posix-quantifiers-table">
|
|
<title>Regular Expression Quantifiers</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Quantifier</entry>
|
|
<entry>Matches</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>*</literal> </entry>
|
|
<entry> a sequence of 0 or more matches of the atom </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>+</literal> </entry>
|
|
<entry> a sequence of 1 or more matches of the atom </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>?</literal> </entry>
|
|
<entry> a sequence of 0 or 1 matches of the atom </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>{</literal><replaceable>m</replaceable><literal>}</literal> </entry>
|
|
<entry> a sequence of exactly <replaceable>m</replaceable> matches of the atom </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>{</literal><replaceable>m</replaceable><literal>,}</literal> </entry>
|
|
<entry> a sequence of <replaceable>m</replaceable> or more matches of the atom </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}</literal> </entry>
|
|
<entry> a sequence of <replaceable>m</replaceable> through <replaceable>n</replaceable>
|
|
(inclusive) matches of the atom; <replaceable>m</replaceable> cannot exceed
|
|
<replaceable>n</replaceable> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>*?</literal> </entry>
|
|
<entry> non-greedy version of <literal>*</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>+?</literal> </entry>
|
|
<entry> non-greedy version of <literal>+</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>??</literal> </entry>
|
|
<entry> non-greedy version of <literal>?</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>{</literal><replaceable>m</replaceable><literal>}?</literal> </entry>
|
|
<entry> non-greedy version of <literal>{</literal><replaceable>m</replaceable><literal>}</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>{</literal><replaceable>m</replaceable><literal>,}?</literal> </entry>
|
|
<entry> non-greedy version of <literal>{</literal><replaceable>m</replaceable><literal>,}</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry>
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}?</literal> </entry>
|
|
<entry> non-greedy version of <literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}</literal> </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
The forms using <literal>{</literal><replaceable>...</replaceable><literal>}</literal>
|
|
are known as <firstterm>bounds</firstterm>.
|
|
The numbers <replaceable>m</replaceable> and <replaceable>n</replaceable> within a bound are
|
|
unsigned decimal integers with permissible values from 0 to 255 inclusive.
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Non-greedy</firstterm> quantifiers (available in AREs only) match the
|
|
same possibilities as their corresponding normal (<firstterm>greedy</firstterm>)
|
|
counterparts, but prefer the smallest number rather than the largest
|
|
number of matches.
|
|
See <xref linkend="posix-matching-rules"/> for more detail.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
A quantifier cannot immediately follow another quantifier, e.g.,
|
|
<literal>**</literal> is invalid.
|
|
A quantifier cannot
|
|
begin an expression or subexpression or follow
|
|
<literal>^</literal> or <literal>|</literal>.
|
|
</para>
|
|
</note>
|
|
|
|
<table id="posix-constraints-table">
|
|
<title>Regular Expression Constraints</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Constraint</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>^</literal> </entry>
|
|
<entry> matches at the beginning of the string </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>$</literal> </entry>
|
|
<entry> matches at the end of the string </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>(?=</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> <firstterm>positive lookahead</firstterm> matches at any point
|
|
where a substring matching <replaceable>re</replaceable> begins
|
|
(AREs only) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>(?!</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> <firstterm>negative lookahead</firstterm> matches at any point
|
|
where no substring matching <replaceable>re</replaceable> begins
|
|
(AREs only) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>(?<=</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> <firstterm>positive lookbehind</firstterm> matches at any point
|
|
where a substring matching <replaceable>re</replaceable> ends
|
|
(AREs only) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>(?<!</literal><replaceable>re</replaceable><literal>)</literal> </entry>
|
|
<entry> <firstterm>negative lookbehind</firstterm> matches at any point
|
|
where no substring matching <replaceable>re</replaceable> ends
|
|
(AREs only) </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
Lookahead and lookbehind constraints cannot contain <firstterm>back
|
|
references</firstterm> (see <xref linkend="posix-escape-sequences"/>),
|
|
and all parentheses within them are considered non-capturing.
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-bracket-expressions">
|
|
<title>Bracket Expressions</title>
|
|
|
|
<para>
|
|
A <firstterm>bracket expression</firstterm> is a list of
|
|
characters enclosed in <literal>[]</literal>. It normally matches
|
|
any single character from the list (but see below). If the list
|
|
begins with <literal>^</literal>, it matches any single character
|
|
<emphasis>not</emphasis> from the rest of the list.
|
|
If two characters
|
|
in the list are separated by <literal>-</literal>, this is
|
|
shorthand for the full range of characters between those two
|
|
(inclusive) in the collating sequence,
|
|
e.g., <literal>[0-9]</literal> in <acronym>ASCII</acronym> matches
|
|
any decimal digit. It is illegal for two ranges to share an
|
|
endpoint, e.g., <literal>a-c-e</literal>. Ranges are very
|
|
collating-sequence-dependent, so portable programs should avoid
|
|
relying on them.
|
|
</para>
|
|
|
|
<para>
|
|
To include a literal <literal>]</literal> in the list, make it the
|
|
first character (after <literal>^</literal>, if that is used). To
|
|
include a literal <literal>-</literal>, make it the first or last
|
|
character, or the second endpoint of a range. To use a literal
|
|
<literal>-</literal> as the first endpoint of a range, enclose it
|
|
in <literal>[.</literal> and <literal>.]</literal> to make it a
|
|
collating element (see below). With the exception of these characters,
|
|
some combinations using <literal>[</literal>
|
|
(see next paragraphs), and escapes (AREs only), all other special
|
|
characters lose their special significance within a bracket expression.
|
|
In particular, <literal>\</literal> is not special when following
|
|
ERE or BRE rules, though it is special (as introducing an escape)
|
|
in AREs.
|
|
</para>
|
|
|
|
<para>
|
|
Within a bracket expression, a collating element (a character, a
|
|
multiple-character sequence that collates as if it were a single
|
|
character, or a collating-sequence name for either) enclosed in
|
|
<literal>[.</literal> and <literal>.]</literal> stands for the
|
|
sequence of characters of that collating element. The sequence is
|
|
treated as a single element of the bracket expression's list. This
|
|
allows a bracket
|
|
expression containing a multiple-character collating element to
|
|
match more than one character, e.g., if the collating sequence
|
|
includes a <literal>ch</literal> collating element, then the RE
|
|
<literal>[[.ch.]]*c</literal> matches the first five characters of
|
|
<literal>chchcc</literal>.
|
|
</para>
|
|
|
|
<note>
|
|
<para>
|
|
<productname>PostgreSQL</productname> currently does not support multi-character collating
|
|
elements. This information describes possible future behavior.
|
|
</para>
|
|
</note>
|
|
|
|
<para>
|
|
Within a bracket expression, a collating element enclosed in
|
|
<literal>[=</literal> and <literal>=]</literal> is an <firstterm>equivalence
|
|
class</firstterm>, standing for the sequences of characters of all collating
|
|
elements equivalent to that one, including itself. (If there are
|
|
no other equivalent collating elements, the treatment is as if the
|
|
enclosing delimiters were <literal>[.</literal> and
|
|
<literal>.]</literal>.) For example, if <literal>o</literal> and
|
|
<literal>^</literal> are the members of an equivalence class, then
|
|
<literal>[[=o=]]</literal>, <literal>[[=^=]]</literal>, and
|
|
<literal>[o^]</literal> are all synonymous. An equivalence class
|
|
cannot be an endpoint of a range.
|
|
</para>
|
|
|
|
<para>
|
|
Within a bracket expression, the name of a character class
|
|
enclosed in <literal>[:</literal> and <literal>:]</literal> stands
|
|
for the list of all characters belonging to that class. A character
|
|
class cannot be used as an endpoint of a range.
|
|
The <acronym>POSIX</acronym> standard defines these character class
|
|
names:
|
|
<literal>alnum</literal> (letters and numeric digits),
|
|
<literal>alpha</literal> (letters),
|
|
<literal>blank</literal> (space and tab),
|
|
<literal>cntrl</literal> (control characters),
|
|
<literal>digit</literal> (numeric digits),
|
|
<literal>graph</literal> (printable characters except space),
|
|
<literal>lower</literal> (lower-case letters),
|
|
<literal>print</literal> (printable characters including space),
|
|
<literal>punct</literal> (punctuation),
|
|
<literal>space</literal> (any white space),
|
|
<literal>upper</literal> (upper-case letters),
|
|
and <literal>xdigit</literal> (hexadecimal digits).
|
|
The behavior of these standard character classes is generally
|
|
consistent across platforms for characters in the 7-bit ASCII set.
|
|
Whether a given non-ASCII character is considered to belong to one
|
|
of these classes depends on the <firstterm>collation</firstterm>
|
|
that is used for the regular-expression function or operator
|
|
(see <xref linkend="collation"/>), or by default on the
|
|
database's <envar>LC_CTYPE</envar> locale setting (see
|
|
<xref linkend="locale"/>). The classification of non-ASCII
|
|
characters can vary across platforms even in similarly-named
|
|
locales. (But the <literal>C</literal> locale never considers any
|
|
non-ASCII characters to belong to any of these classes.)
|
|
In addition to these standard character
|
|
classes, <productname>PostgreSQL</productname> defines
|
|
the <literal>word</literal> character class, which is the same as
|
|
<literal>alnum</literal> plus the underscore (<literal>_</literal>)
|
|
character, and
|
|
the <literal>ascii</literal> character class, which contains exactly
|
|
the 7-bit ASCII set.
|
|
</para>
|
|
|
|
<para>
|
|
There are two special cases of bracket expressions: the bracket
|
|
expressions <literal>[[:<:]]</literal> and
|
|
<literal>[[:>:]]</literal> are constraints,
|
|
matching empty strings at the beginning
|
|
and end of a word respectively. A word is defined as a sequence
|
|
of word characters that is neither preceded nor followed by word
|
|
characters. A word character is any character belonging to the
|
|
<literal>word</literal> character class, that is, any letter, digit,
|
|
or underscore. This is an extension, compatible with but not
|
|
specified by <acronym>POSIX</acronym> 1003.2, and should be used with
|
|
caution in software intended to be portable to other systems.
|
|
The constraint escapes described below are usually preferable; they
|
|
are no more standard, but are easier to type.
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-escape-sequences">
|
|
<title>Regular Expression Escapes</title>
|
|
|
|
<para>
|
|
<firstterm>Escapes</firstterm> are special sequences beginning with <literal>\</literal>
|
|
followed by an alphanumeric character. Escapes come in several varieties:
|
|
character entry, class shorthands, constraint escapes, and back references.
|
|
A <literal>\</literal> followed by an alphanumeric character but not constituting
|
|
a valid escape is illegal in AREs.
|
|
In EREs, there are no escapes: outside a bracket expression,
|
|
a <literal>\</literal> followed by an alphanumeric character merely stands for
|
|
that character as an ordinary character, and inside a bracket expression,
|
|
<literal>\</literal> is an ordinary character.
|
|
(The latter is the one actual incompatibility between EREs and AREs.)
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Character-entry escapes</firstterm> exist to make it easier to specify
|
|
non-printing and other inconvenient characters in REs. They are
|
|
shown in <xref linkend="posix-character-entry-escapes-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
<firstterm>Class-shorthand escapes</firstterm> provide shorthands for certain
|
|
commonly-used character classes. They are
|
|
shown in <xref linkend="posix-class-shorthand-escapes-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
A <firstterm>constraint escape</firstterm> is a constraint,
|
|
matching the empty string if specific conditions are met,
|
|
written as an escape. They are
|
|
shown in <xref linkend="posix-constraint-escapes-table"/>.
|
|
</para>
|
|
|
|
<para>
|
|
A <firstterm>back reference</firstterm> (<literal>\</literal><replaceable>n</replaceable>) matches the
|
|
same string matched by the previous parenthesized subexpression specified
|
|
by the number <replaceable>n</replaceable>
|
|
(see <xref linkend="posix-constraint-backref-table"/>). For example,
|
|
<literal>([bc])\1</literal> matches <literal>bb</literal> or <literal>cc</literal>
|
|
but not <literal>bc</literal> or <literal>cb</literal>.
|
|
The subexpression must entirely precede the back reference in the RE.
|
|
Subexpressions are numbered in the order of their leading parentheses.
|
|
Non-capturing parentheses do not define subexpressions.
|
|
The back reference considers only the string characters matched by the
|
|
referenced subexpression, not any constraints contained in it. For
|
|
example, <literal>(^\d)\1</literal> will match <literal>22</literal>.
|
|
</para>
|
|
|
|
<table id="posix-character-entry-escapes-table">
|
|
<title>Regular Expression Character-Entry Escapes</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Escape</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>\a</literal> </entry>
|
|
<entry> alert (bell) character, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\b</literal> </entry>
|
|
<entry> backspace, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\B</literal> </entry>
|
|
<entry> synonym for backslash (<literal>\</literal>) to help reduce the need for backslash
|
|
doubling </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\c</literal><replaceable>X</replaceable> </entry>
|
|
<entry> (where <replaceable>X</replaceable> is any character) the character whose
|
|
low-order 5 bits are the same as those of
|
|
<replaceable>X</replaceable>, and whose other bits are all zero </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\e</literal> </entry>
|
|
<entry> the character whose collating-sequence name
|
|
is <literal>ESC</literal>,
|
|
or failing that, the character with octal value <literal>033</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\f</literal> </entry>
|
|
<entry> form feed, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\n</literal> </entry>
|
|
<entry> newline, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\r</literal> </entry>
|
|
<entry> carriage return, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\t</literal> </entry>
|
|
<entry> horizontal tab, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\u</literal><replaceable>wxyz</replaceable> </entry>
|
|
<entry> (where <replaceable>wxyz</replaceable> is exactly four hexadecimal digits)
|
|
the character whose hexadecimal value is
|
|
<literal>0x</literal><replaceable>wxyz</replaceable>
|
|
</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\U</literal><replaceable>stuvwxyz</replaceable> </entry>
|
|
<entry> (where <replaceable>stuvwxyz</replaceable> is exactly eight hexadecimal
|
|
digits)
|
|
the character whose hexadecimal value is
|
|
<literal>0x</literal><replaceable>stuvwxyz</replaceable>
|
|
</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\v</literal> </entry>
|
|
<entry> vertical tab, as in C </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\x</literal><replaceable>hhh</replaceable> </entry>
|
|
<entry> (where <replaceable>hhh</replaceable> is any sequence of hexadecimal
|
|
digits)
|
|
the character whose hexadecimal value is
|
|
<literal>0x</literal><replaceable>hhh</replaceable>
|
|
(a single character no matter how many hexadecimal digits are used)
|
|
</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\0</literal> </entry>
|
|
<entry> the character whose value is <literal>0</literal> (the null byte)</entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>xy</replaceable> </entry>
|
|
<entry> (where <replaceable>xy</replaceable> is exactly two octal digits,
|
|
and is not a <firstterm>back reference</firstterm>)
|
|
the character whose octal value is
|
|
<literal>0</literal><replaceable>xy</replaceable> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>xyz</replaceable> </entry>
|
|
<entry> (where <replaceable>xyz</replaceable> is exactly three octal digits,
|
|
and is not a <firstterm>back reference</firstterm>)
|
|
the character whose octal value is
|
|
<literal>0</literal><replaceable>xyz</replaceable> </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
Hexadecimal digits are <literal>0</literal>-<literal>9</literal>,
|
|
<literal>a</literal>-<literal>f</literal>, and <literal>A</literal>-<literal>F</literal>.
|
|
Octal digits are <literal>0</literal>-<literal>7</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
Numeric character-entry escapes specifying values outside the ASCII range
|
|
(0–127) have meanings dependent on the database encoding. When the
|
|
encoding is UTF-8, escape values are equivalent to Unicode code points,
|
|
for example <literal>\u1234</literal> means the character <literal>U+1234</literal>.
|
|
For other multibyte encodings, character-entry escapes usually just
|
|
specify the concatenation of the byte values for the character. If the
|
|
escape value does not correspond to any legal character in the database
|
|
encoding, no error will be raised, but it will never match any data.
|
|
</para>
|
|
|
|
<para>
|
|
The character-entry escapes are always taken as ordinary characters.
|
|
For example, <literal>\135</literal> is <literal>]</literal> in ASCII, but
|
|
<literal>\135</literal> does not terminate a bracket expression.
|
|
</para>
|
|
|
|
<table id="posix-class-shorthand-escapes-table">
|
|
<title>Regular Expression Class-Shorthand Escapes</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Escape</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>\d</literal> </entry>
|
|
<entry> matches any digit, like
|
|
<literal>[[:digit:]]</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\s</literal> </entry>
|
|
<entry> matches any whitespace character, like
|
|
<literal>[[:space:]]</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\w</literal> </entry>
|
|
<entry> matches any word character, like
|
|
<literal>[[:word:]]</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\D</literal> </entry>
|
|
<entry> matches any non-digit, like
|
|
<literal>[^[:digit:]]</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\S</literal> </entry>
|
|
<entry> matches any non-whitespace character, like
|
|
<literal>[^[:space:]]</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\W</literal> </entry>
|
|
<entry> matches any non-word character, like
|
|
<literal>[^[:word:]]</literal> </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
The class-shorthand escapes also work within bracket expressions,
|
|
although the definitions shown above are not quite syntactically
|
|
valid in that context.
|
|
For example, <literal>[a-c\d]</literal> is equivalent to
|
|
<literal>[a-c[:digit:]]</literal>.
|
|
</para>
|
|
|
|
<table id="posix-constraint-escapes-table">
|
|
<title>Regular Expression Constraint Escapes</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Escape</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>\A</literal> </entry>
|
|
<entry> matches only at the beginning of the string
|
|
(see <xref linkend="posix-matching-rules"/> for how this differs from
|
|
<literal>^</literal>) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\m</literal> </entry>
|
|
<entry> matches only at the beginning of a word </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\M</literal> </entry>
|
|
<entry> matches only at the end of a word </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\y</literal> </entry>
|
|
<entry> matches only at the beginning or end of a word </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\Y</literal> </entry>
|
|
<entry> matches only at a point that is not the beginning or end of a
|
|
word </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\Z</literal> </entry>
|
|
<entry> matches only at the end of the string
|
|
(see <xref linkend="posix-matching-rules"/> for how this differs from
|
|
<literal>$</literal>) </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
A word is defined as in the specification of
|
|
<literal>[[:<:]]</literal> and <literal>[[:>:]]</literal> above.
|
|
Constraint escapes are illegal within bracket expressions.
|
|
</para>
|
|
|
|
<table id="posix-constraint-backref-table">
|
|
<title>Regular Expression Back References</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Escape</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>m</replaceable> </entry>
|
|
<entry> (where <replaceable>m</replaceable> is a nonzero digit)
|
|
a back reference to the <replaceable>m</replaceable>'th subexpression </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>\</literal><replaceable>mnn</replaceable> </entry>
|
|
<entry> (where <replaceable>m</replaceable> is a nonzero digit, and
|
|
<replaceable>nn</replaceable> is some more digits, and the decimal value
|
|
<replaceable>mnn</replaceable> is not greater than the number of closing capturing
|
|
parentheses seen so far)
|
|
a back reference to the <replaceable>mnn</replaceable>'th subexpression </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<note>
|
|
<para>
|
|
There is an inherent ambiguity between octal character-entry
|
|
escapes and back references, which is resolved by the following heuristics,
|
|
as hinted at above.
|
|
A leading zero always indicates an octal escape.
|
|
A single non-zero digit, not followed by another digit,
|
|
is always taken as a back reference.
|
|
A multi-digit sequence not starting with a zero is taken as a back
|
|
reference if it comes after a suitable subexpression
|
|
(i.e., the number is in the legal range for a back reference),
|
|
and otherwise is taken as octal.
|
|
</para>
|
|
</note>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-metasyntax">
|
|
<title>Regular Expression Metasyntax</title>
|
|
|
|
<para>
|
|
In addition to the main syntax described above, there are some special
|
|
forms and miscellaneous syntactic facilities available.
|
|
</para>
|
|
|
|
<para>
|
|
An RE can begin with one of two special <firstterm>director</firstterm> prefixes.
|
|
If an RE begins with <literal>***:</literal>,
|
|
the rest of the RE is taken as an ARE. (This normally has no effect in
|
|
<productname>PostgreSQL</productname>, since REs are assumed to be AREs;
|
|
but it does have an effect if ERE or BRE mode had been specified by
|
|
the <replaceable>flags</replaceable> parameter to a regex function.)
|
|
If an RE begins with <literal>***=</literal>,
|
|
the rest of the RE is taken to be a literal string,
|
|
with all characters considered ordinary characters.
|
|
</para>
|
|
|
|
<para>
|
|
An ARE can begin with <firstterm>embedded options</firstterm>:
|
|
a sequence <literal>(?</literal><replaceable>xyz</replaceable><literal>)</literal>
|
|
(where <replaceable>xyz</replaceable> is one or more alphabetic characters)
|
|
specifies options affecting the rest of the RE.
|
|
These options override any previously determined options —
|
|
in particular, they can override the case-sensitivity behavior implied by
|
|
a regex operator, or the <replaceable>flags</replaceable> parameter to a regex
|
|
function.
|
|
The available option letters are
|
|
shown in <xref linkend="posix-embedded-options-table"/>.
|
|
Note that these same option letters are used in the <replaceable>flags</replaceable>
|
|
parameters of regex functions.
|
|
</para>
|
|
|
|
<table id="posix-embedded-options-table">
|
|
<title>ARE Embedded-Option Letters</title>
|
|
|
|
<tgroup cols="2">
|
|
<colspec colname="col1" colwidth="1*"/>
|
|
<colspec colname="col2" colwidth="3*"/>
|
|
<thead>
|
|
<row>
|
|
<entry>Option</entry>
|
|
<entry>Description</entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry> <literal>b</literal> </entry>
|
|
<entry> rest of RE is a BRE </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>c</literal> </entry>
|
|
<entry> case-sensitive matching (overrides operator type) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>e</literal> </entry>
|
|
<entry> rest of RE is an ERE </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>i</literal> </entry>
|
|
<entry> case-insensitive matching (see
|
|
<xref linkend="posix-matching-rules"/>) (overrides operator type) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>m</literal> </entry>
|
|
<entry> historical synonym for <literal>n</literal> </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>n</literal> </entry>
|
|
<entry> newline-sensitive matching (see
|
|
<xref linkend="posix-matching-rules"/>) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>p</literal> </entry>
|
|
<entry> partial newline-sensitive matching (see
|
|
<xref linkend="posix-matching-rules"/>) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>q</literal> </entry>
|
|
<entry> rest of RE is a literal (<quote>quoted</quote>) string, all ordinary
|
|
characters </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>s</literal> </entry>
|
|
<entry> non-newline-sensitive matching (default) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>t</literal> </entry>
|
|
<entry> tight syntax (default; see below) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>w</literal> </entry>
|
|
<entry> inverse partial newline-sensitive (<quote>weird</quote>) matching
|
|
(see <xref linkend="posix-matching-rules"/>) </entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry> <literal>x</literal> </entry>
|
|
<entry> expanded syntax (see below) </entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
Embedded options take effect at the <literal>)</literal> terminating the sequence.
|
|
They can appear only at the start of an ARE (after the
|
|
<literal>***:</literal> director if any).
|
|
</para>
|
|
|
|
<para>
|
|
In addition to the usual (<firstterm>tight</firstterm>) RE syntax, in which all
|
|
characters are significant, there is an <firstterm>expanded</firstterm> syntax,
|
|
available by specifying the embedded <literal>x</literal> option.
|
|
In the expanded syntax,
|
|
white-space characters in the RE are ignored, as are
|
|
all characters between a <literal>#</literal>
|
|
and the following newline (or the end of the RE). This
|
|
permits paragraphing and commenting a complex RE.
|
|
There are three exceptions to that basic rule:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
a white-space character or <literal>#</literal> preceded by <literal>\</literal> is
|
|
retained
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
white space or <literal>#</literal> within a bracket expression is retained
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
white space and comments cannot appear within multi-character symbols,
|
|
such as <literal>(?:</literal>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
|
|
For this purpose, white-space characters are blank, tab, newline, and
|
|
any character that belongs to the <replaceable>space</replaceable> character class.
|
|
</para>
|
|
|
|
<para>
|
|
Finally, in an ARE, outside bracket expressions, the sequence
|
|
<literal>(?#</literal><replaceable>ttt</replaceable><literal>)</literal>
|
|
(where <replaceable>ttt</replaceable> is any text not containing a <literal>)</literal>)
|
|
is a comment, completely ignored.
|
|
Again, this is not allowed between the characters of
|
|
multi-character symbols, like <literal>(?:</literal>.
|
|
Such comments are more a historical artifact than a useful facility,
|
|
and their use is deprecated; use the expanded syntax instead.
|
|
</para>
|
|
|
|
<para>
|
|
<emphasis>None</emphasis> of these metasyntax extensions is available if
|
|
an initial <literal>***=</literal> director
|
|
has specified that the user's input be treated as a literal string
|
|
rather than as an RE.
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-matching-rules">
|
|
<title>Regular Expression Matching Rules</title>
|
|
|
|
<para>
|
|
In the event that an RE could match more than one substring of a given
|
|
string, the RE matches the one starting earliest in the string.
|
|
If the RE could match more than one substring starting at that point,
|
|
either the longest possible match or the shortest possible match will
|
|
be taken, depending on whether the RE is <firstterm>greedy</firstterm> or
|
|
<firstterm>non-greedy</firstterm>.
|
|
</para>
|
|
|
|
<para>
|
|
Whether an RE is greedy or not is determined by the following rules:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Most atoms, and all constraints, have no greediness attribute (because
|
|
they cannot match variable amounts of text anyway).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Adding parentheses around an RE does not change its greediness.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A quantified atom with a fixed-repetition quantifier
|
|
(<literal>{</literal><replaceable>m</replaceable><literal>}</literal>
|
|
or
|
|
<literal>{</literal><replaceable>m</replaceable><literal>}?</literal>)
|
|
has the same greediness (possibly none) as the atom itself.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A quantified atom with other normal quantifiers (including
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}</literal>
|
|
with <replaceable>m</replaceable> equal to <replaceable>n</replaceable>)
|
|
is greedy (prefers longest match).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A quantified atom with a non-greedy quantifier (including
|
|
<literal>{</literal><replaceable>m</replaceable><literal>,</literal><replaceable>n</replaceable><literal>}?</literal>
|
|
with <replaceable>m</replaceable> equal to <replaceable>n</replaceable>)
|
|
is non-greedy (prefers shortest match).
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
A branch — that is, an RE that has no top-level
|
|
<literal>|</literal> operator — has the same greediness as the first
|
|
quantified atom in it that has a greediness attribute.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
An RE consisting of two or more branches connected by the
|
|
<literal>|</literal> operator is always greedy.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
<para>
|
|
The above rules associate greediness attributes not only with individual
|
|
quantified atoms, but with branches and entire REs that contain quantified
|
|
atoms. What that means is that the matching is done in such a way that
|
|
the branch, or whole RE, matches the longest or shortest possible
|
|
substring <emphasis>as a whole</emphasis>. Once the length of the entire match
|
|
is determined, the part of it that matches any particular subexpression
|
|
is determined on the basis of the greediness attribute of that
|
|
subexpression, with subexpressions starting earlier in the RE taking
|
|
priority over ones starting later.
|
|
</para>
|
|
|
|
<para>
|
|
An example of what this means:
|
|
<screen>
|
|
SELECT SUBSTRING('XY1234Z', 'Y*([0-9]{1,3})');
|
|
<lineannotation>Result: </lineannotation><computeroutput>123</computeroutput>
|
|
SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
|
|
<lineannotation>Result: </lineannotation><computeroutput>1</computeroutput>
|
|
</screen>
|
|
In the first case, the RE as a whole is greedy because <literal>Y*</literal>
|
|
is greedy. It can match beginning at the <literal>Y</literal>, and it matches
|
|
the longest possible string starting there, i.e., <literal>Y123</literal>.
|
|
The output is the parenthesized part of that, or <literal>123</literal>.
|
|
In the second case, the RE as a whole is non-greedy because <literal>Y*?</literal>
|
|
is non-greedy. It can match beginning at the <literal>Y</literal>, and it matches
|
|
the shortest possible string starting there, i.e., <literal>Y1</literal>.
|
|
The subexpression <literal>[0-9]{1,3}</literal> is greedy but it cannot change
|
|
the decision as to the overall match length; so it is forced to match
|
|
just <literal>1</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
In short, when an RE contains both greedy and non-greedy subexpressions,
|
|
the total match length is either as long as possible or as short as
|
|
possible, according to the attribute assigned to the whole RE. The
|
|
attributes assigned to the subexpressions only affect how much of that
|
|
match they are allowed to <quote>eat</quote> relative to each other.
|
|
</para>
|
|
|
|
<para>
|
|
The quantifiers <literal>{1,1}</literal> and <literal>{1,1}?</literal>
|
|
can be used to force greediness or non-greediness, respectively,
|
|
on a subexpression or a whole RE.
|
|
This is useful when you need the whole RE to have a greediness attribute
|
|
different from what's deduced from its elements. As an example,
|
|
suppose that we are trying to separate a string containing some digits
|
|
into the digits and the parts before and after them. We might try to
|
|
do that like this:
|
|
<screen>
|
|
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
|
|
<lineannotation>Result: </lineannotation><computeroutput>{abc0123,4,xyz}</computeroutput>
|
|
</screen>
|
|
That didn't work: the first <literal>.*</literal> is greedy so
|
|
it <quote>eats</quote> as much as it can, leaving the <literal>\d+</literal> to
|
|
match at the last possible place, the last digit. We might try to fix
|
|
that by making it non-greedy:
|
|
<screen>
|
|
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
|
|
<lineannotation>Result: </lineannotation><computeroutput>{abc,0,""}</computeroutput>
|
|
</screen>
|
|
That didn't work either, because now the RE as a whole is non-greedy
|
|
and so it ends the overall match as soon as possible. We can get what
|
|
we want by forcing the RE as a whole to be greedy:
|
|
<screen>
|
|
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
|
|
<lineannotation>Result: </lineannotation><computeroutput>{abc,01234,xyz}</computeroutput>
|
|
</screen>
|
|
Controlling the RE's overall greediness separately from its components'
|
|
greediness allows great flexibility in handling variable-length patterns.
|
|
</para>
|
|
|
|
<para>
|
|
When deciding what is a longer or shorter match,
|
|
match lengths are measured in characters, not collating elements.
|
|
An empty string is considered longer than no match at all.
|
|
For example:
|
|
<literal>bb*</literal>
|
|
matches the three middle characters of <literal>abbbc</literal>;
|
|
<literal>(week|wee)(night|knights)</literal>
|
|
matches all ten characters of <literal>weeknights</literal>;
|
|
when <literal>(.*).*</literal>
|
|
is matched against <literal>abc</literal> the parenthesized subexpression
|
|
matches all three characters; and when
|
|
<literal>(a*)*</literal> is matched against <literal>bc</literal>
|
|
both the whole RE and the parenthesized
|
|
subexpression match an empty string.
|
|
</para>
|
|
|
|
<para>
|
|
If case-independent matching is specified,
|
|
the effect is much as if all case distinctions had vanished from the
|
|
alphabet.
|
|
When an alphabetic that exists in multiple cases appears as an
|
|
ordinary character outside a bracket expression, it is effectively
|
|
transformed into a bracket expression containing both cases,
|
|
e.g., <literal>x</literal> becomes <literal>[xX]</literal>.
|
|
When it appears inside a bracket expression, all case counterparts
|
|
of it are added to the bracket expression, e.g.,
|
|
<literal>[x]</literal> becomes <literal>[xX]</literal>
|
|
and <literal>[^x]</literal> becomes <literal>[^xX]</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
If newline-sensitive matching is specified, <literal>.</literal>
|
|
and bracket expressions using <literal>^</literal>
|
|
will never match the newline character
|
|
(so that matches will not cross lines unless the RE
|
|
explicitly includes a newline)
|
|
and <literal>^</literal> and <literal>$</literal>
|
|
will match the empty string after and before a newline
|
|
respectively, in addition to matching at beginning and end of string
|
|
respectively.
|
|
But the ARE escapes <literal>\A</literal> and <literal>\Z</literal>
|
|
continue to match beginning or end of string <emphasis>only</emphasis>.
|
|
Also, the character class shorthands <literal>\D</literal>
|
|
and <literal>\W</literal> will match a newline regardless of this mode.
|
|
(Before <productname>PostgreSQL</productname> 14, they did not match
|
|
newlines when in newline-sensitive mode.
|
|
Write <literal>[^[:digit:]]</literal>
|
|
or <literal>[^[:word:]]</literal> to get the old behavior.)
|
|
</para>
|
|
|
|
<para>
|
|
If partial newline-sensitive matching is specified,
|
|
this affects <literal>.</literal> and bracket expressions
|
|
as with newline-sensitive matching, but not <literal>^</literal>
|
|
and <literal>$</literal>.
|
|
</para>
|
|
|
|
<para>
|
|
If inverse partial newline-sensitive matching is specified,
|
|
this affects <literal>^</literal> and <literal>$</literal>
|
|
as with newline-sensitive matching, but not <literal>.</literal>
|
|
and bracket expressions.
|
|
This isn't very useful but is provided for symmetry.
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-limits-compatibility">
|
|
<title>Limits and Compatibility</title>
|
|
|
|
<para>
|
|
No particular limit is imposed on the length of REs in this
|
|
implementation. However,
|
|
programs intended to be highly portable should not employ REs longer
|
|
than 256 bytes,
|
|
as a POSIX-compliant implementation can refuse to accept such REs.
|
|
</para>
|
|
|
|
<para>
|
|
The only feature of AREs that is actually incompatible with
|
|
POSIX EREs is that <literal>\</literal> does not lose its special
|
|
significance inside bracket expressions.
|
|
All other ARE features use syntax which is illegal or has
|
|
undefined or unspecified effects in POSIX EREs;
|
|
the <literal>***</literal> syntax of directors likewise is outside the POSIX
|
|
syntax for both BREs and EREs.
|
|
</para>
|
|
|
|
<para>
|
|
Many of the ARE extensions are borrowed from Perl, but some have
|
|
been changed to clean them up, and a few Perl extensions are not present.
|
|
Incompatibilities of note include <literal>\b</literal>, <literal>\B</literal>,
|
|
the lack of special treatment for a trailing newline,
|
|
the addition of complemented bracket expressions to the things
|
|
affected by newline-sensitive matching,
|
|
the restrictions on parentheses and back references in lookahead/lookbehind
|
|
constraints, and the longest/shortest-match (rather than first-match)
|
|
matching semantics.
|
|
</para>
|
|
</sect3>
|
|
|
|
<sect3 id="posix-basic-regexes">
|
|
<title>Basic Regular Expressions</title>
|
|
|
|
<para>
|
|
BREs differ from EREs in several respects.
|
|
In BREs, <literal>|</literal>, <literal>+</literal>, and <literal>?</literal>
|
|
are ordinary characters and there is no equivalent
|
|
for their functionality.
|
|
The delimiters for bounds are
|
|
<literal>\{</literal> and <literal>\}</literal>,
|
|
with <literal>{</literal> and <literal>}</literal>
|
|
by themselves ordinary characters.
|
|
The parentheses for nested subexpressions are
|
|
<literal>\(</literal> and <literal>\)</literal>,
|
|
with <literal>(</literal> and <literal>)</literal> by themselves ordinary characters.
|
|
<literal>^</literal> is an ordinary character except at the beginning of the
|
|
RE or the beginning of a parenthesized subexpression,
|
|
<literal>$</literal> is an ordinary character except at the end of the
|
|
RE or the end of a parenthesized subexpression,
|
|
and <literal>*</literal> is an ordinary character if it appears at the beginning
|
|
of the RE or the beginning of a parenthesized subexpression
|
|
(after a possible leading <literal>^</literal>).
|
|
Finally, single-digit back references are available, and
|
|
<literal>\<</literal> and <literal>\></literal>
|
|
are synonyms for
|
|
<literal>[[:<:]]</literal> and <literal>[[:>:]]</literal>
|
|
respectively; no other escapes are available in BREs.
|
|
</para>
|
|
</sect3>
|
|
|
|
<!-- end re_syntax.n man page -->
|
|
|
|
<sect3 id="posix-vs-xquery">
|
|
<title>Differences from SQL Standard and XQuery</title>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>LIKE_REGEX</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>OCCURRENCES_REGEX</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>POSITION_REGEX</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>SUBSTRING_REGEX</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>TRANSLATE_REGEX</primary>
|
|
</indexterm>
|
|
|
|
<indexterm zone="posix-vs-xquery">
|
|
<primary>XQuery regular expressions</primary>
|
|
</indexterm>
|
|
|
|
<para>
|
|
Since SQL:2008, the SQL standard includes regular expression operators
|
|
and functions that performs pattern
|
|
matching according to the XQuery regular expression
|
|
standard:
|
|
<itemizedlist>
|
|
<listitem><para><literal>LIKE_REGEX</literal></para></listitem>
|
|
<listitem><para><literal>OCCURRENCES_REGEX</literal></para></listitem>
|
|
<listitem><para><literal>POSITION_REGEX</literal></para></listitem>
|
|
<listitem><para><literal>SUBSTRING_REGEX</literal></para></listitem>
|
|
<listitem><para><literal>TRANSLATE_REGEX</literal></para></listitem>
|
|
</itemizedlist>
|
|
<productname>PostgreSQL</productname> does not currently implement these
|
|
operators and functions. You can get approximately equivalent
|
|
functionality in each case as shown in <xref
|
|
linkend="functions-regexp-sql-table"/>. (Various optional clauses on
|
|
both sides have been omitted in this table.)
|
|
</para>
|
|
|
|
<table id="functions-regexp-sql-table">
|
|
<title>Regular Expression Functions Equivalencies</title>
|
|
|
|
<tgroup cols="2">
|
|
<thead>
|
|
<row>
|
|
<entry>SQL standard</entry>
|
|
<entry><productname>PostgreSQL</productname></entry>
|
|
</row>
|
|
</thead>
|
|
|
|
<tbody>
|
|
<row>
|
|
<entry><literal><replaceable>string</replaceable> LIKE_REGEX <replaceable>pattern</replaceable></literal></entry>
|
|
<entry><literal>regexp_like(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>)</literal> or <literal><replaceable>string</replaceable> ~ <replaceable>pattern</replaceable></literal></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>OCCURRENCES_REGEX(<replaceable>pattern</replaceable> IN <replaceable>string</replaceable>)</literal></entry>
|
|
<entry><literal>regexp_count(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>)</literal></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>POSITION_REGEX(<replaceable>pattern</replaceable> IN <replaceable>string</replaceable>)</literal></entry>
|
|
<entry><literal>regexp_instr(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>)</literal></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>SUBSTRING_REGEX(<replaceable>pattern</replaceable> IN <replaceable>string</replaceable>)</literal></entry>
|
|
<entry><literal>regexp_substr(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>)</literal></entry>
|
|
</row>
|
|
|
|
<row>
|
|
<entry><literal>TRANSLATE_REGEX(<replaceable>pattern</replaceable> IN <replaceable>string</replaceable> WITH <replaceable>replacement</replaceable>)</literal></entry>
|
|
<entry><literal>regexp_replace(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>, <replaceable>replacement</replaceable>)</literal></entry>
|
|
</row>
|
|
</tbody>
|
|
</tgroup>
|
|
</table>
|
|
|
|
<para>
|
|
Regular expression functions similar to those provided by PostgreSQL are
|
|
also available in a number of other SQL implementations, whereas the
|
|
SQL-standard functions are not as widely implemented. Some of the
|
|
details of the regular expression syntax will likely differ in each
|
|
implementation.
|
|
</para>
|
|
|
|
<para>
|
|
The SQL-standard operators and functions use XQuery regular expressions,
|
|
which are quite close to the ARE syntax described above.
|
|
Notable differences between the existing POSIX-based
|
|
regular-expression feature and XQuery regular expressions include:
|
|
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
XQuery character class subtraction is not supported. An example of
|
|
this feature is using the following to match only English
|
|
consonants: <literal>[a-z-[aeiou]]</literal>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
XQuery character class shorthands <literal>\c</literal>,
|
|
<literal>\C</literal>, <literal>\i</literal>,
|
|
and <literal>\I</literal> are not supported.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
XQuery character class elements
|
|
using <literal>\p{UnicodeProperty}</literal> or the
|
|
inverse <literal>\P{UnicodeProperty}</literal> are not supported.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
POSIX interprets character classes such as <literal>\w</literal>
|
|
(see <xref linkend="posix-class-shorthand-escapes-table"/>)
|
|
according to the prevailing locale (which you can control by
|
|
attaching a <literal>COLLATE</literal> clause to the operator or
|
|
function). XQuery specifies these classes by reference to Unicode
|
|
character properties, so equivalent behavior is obtained only with
|
|
a locale that follows the Unicode rules.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
The SQL standard (not XQuery itself) attempts to cater for more
|
|
variants of <quote>newline</quote> than POSIX does. The
|
|
newline-sensitive matching options described above consider only
|
|
ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
|
|
us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
|
|
(a Windows-style newline), and some Unicode-only characters like
|
|
LINE SEPARATOR (U+2028) as newlines as well.
|
|
Notably, <literal>.</literal> and <literal>\s</literal> should
|
|
count <literal>\r\n</literal> as one character not two according to
|
|
SQL.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Of the character-entry escapes described in
|
|
<xref linkend="posix-character-entry-escapes-table"/>,
|
|
XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
|
|
and <literal>\t</literal>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
XQuery does not support
|
|
the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
|
|
for character classes within bracket expressions.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
XQuery does not have lookahead or lookbehind constraints,
|
|
nor any of the constraint escapes described in
|
|
<xref linkend="posix-constraint-escapes-table"/>.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
The metasyntax forms described in <xref linkend="posix-metasyntax"/>
|
|
do not exist in XQuery.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
The regular expression flag letters defined by XQuery are
|
|
related to but not the same as the option letters for POSIX
|
|
(<xref linkend="posix-embedded-options-table"/>). While the
|
|
<literal>i</literal> and <literal>q</literal> options behave the
|
|
same, others do not:
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
XQuery's <literal>s</literal> (allow dot to match newline)
|
|
and <literal>m</literal> (allow <literal>^</literal>
|
|
and <literal>$</literal> to match at newlines) flags provide
|
|
access to the same behaviors as
|
|
POSIX's <literal>n</literal>, <literal>p</literal>
|
|
and <literal>w</literal> flags, but they
|
|
do <emphasis>not</emphasis> match the behavior of
|
|
POSIX's <literal>s</literal> and <literal>m</literal> flags.
|
|
Note in particular that dot-matches-newline is the default
|
|
behavior in POSIX but not XQuery.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
XQuery's <literal>x</literal> (ignore whitespace in pattern) flag
|
|
is noticeably different from POSIX's expanded-mode flag.
|
|
POSIX's <literal>x</literal> flag also
|
|
allows <literal>#</literal> to begin a comment in the pattern,
|
|
and POSIX will not ignore a whitespace character after a
|
|
backslash.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
</para>
|
|
|
|
</sect3>
|
|
</sect2>
|
|
</sect1>
|