mirror of https://github.com/postgres/postgres.git synced 2025-06-11 20:28:21 +03:00

Implement regexp_match(), a simplified alternative to regexp_matches().

regexp_match() is like regexp_matches(), but it disallows the 'g' flag
and in consequence does not need to return a set.  Instead, it returns
a simple text array value, or NULL if there's no match.  Previously people
usually got that behavior with a sub-select, but this way is considerably
more efficient.
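
As a rough illustration of the difference in return shape described above,
here is a small Python model. It is a sketch only: it uses Python's re module,
which is not the PostgreSQL regex engine and differs from it in syntax and
greediness rules. The two helper functions merely borrow the SQL function
names and are hypothetical stand-ins, with None playing the role of SQL NULL
and a list of lists playing the role of a set of text[] rows.

```python
import re
from typing import List, Optional

Row = List[Optional[str]]


def _row(m: "re.Match[str]") -> Row:
    # No capturing groups: a one-element array holding the whole match.
    # Otherwise: one element per capturing group, in order.
    return [m.group(0)] if m.re.groups == 0 else list(m.groups())


def regexp_match(string: str, pattern: str) -> Optional[Row]:
    """Model of regexp_match(): first match only, or None (SQL NULL)."""
    m = re.search(pattern, string)
    return None if m is None else _row(m)


def regexp_matches(string: str, pattern: str, flags: str = "") -> List[Row]:
    """Model of regexp_matches(): zero or more rows; 'g' returns all matches."""
    rows: List[Row] = []
    for m in re.finditer(pattern, string):
        rows.append(_row(m))
        if "g" not in flags:
            break  # without 'g', stop after the first match
    return rows


if __name__ == "__main__":
    print(regexp_match("foobarbequebaz", "(bar)(beque)"))
    print(regexp_match("foo", "not there"))
    print(regexp_matches("foo", "not there"))
    print(regexp_matches("foobarbequebaz", "ba.", "g"))
```

The point of the model is that a non-matching input yields a single NULL-like
value from regexp_match() but an empty set from regexp_matches(), which is why
the older idiom needed a sub-select to keep non-matching rows in the output.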

Documentation adjusted so that regexp_match() is presented first and then
regexp_matches() is introduced as a more complicated version.  This is
a bit historically revisionist but seems pedagogically better.

Still TODO: extend contrib/citext to support this function.

Emre Hasegeli, reviewed by David Johnston

Discussion: <CAE2gYzy42sna2ME_e3y1KLQ-4UBrB-eVF0SWn8QG39sQSeVhEw@mail.gmail.com>
Tom Lane
2016-08-17 18:32:56 -04:00
parent 2d7e591007
commit cf9b0fea5f
9 changed files with 253 additions and 94 deletions


@@ -2036,6 +2036,23 @@
<entry><literal>'42.5'</literal></entry>
</row>
<row>
<entry>
<indexterm>
<primary>regexp_match</primary>
</indexterm>
<literal><function>regexp_match(<parameter>string</parameter> <type>text</type>, <parameter>pattern</parameter> <type>text</type> [, <parameter>flags</parameter> <type>text</type>])</function></literal>
</entry>
<entry><type>text[]</type></entry>
<entry>
Return captured substring(s) resulting from the first match of a POSIX
regular expression to the <parameter>string</parameter>. See
<xref linkend="functions-posix-regexp"> for more information.
</entry>
<entry><literal>regexp_match('foobarbequebaz', '(bar)(beque)')</literal></entry>
<entry><literal>{bar,beque}</literal></entry>
</row>
<row>
<entry>
<indexterm>
@@ -2045,12 +2062,12 @@
</entry>
<entry><type>setof text[]</type></entry>
<entry>
Return all captured substrings resulting from matching a POSIX regular
expression against the <parameter>string</parameter>. See
Return captured substring(s) resulting from matching a POSIX regular
expression to the <parameter>string</parameter>. See
<xref linkend="functions-posix-regexp"> for more information.
</entry>
<entry><literal>regexp_matches('foobarbequebaz', '(bar)(beque)')</literal></entry>
<entry><literal>{bar,beque}</literal></entry>
<entry><literal>regexp_matches('foobarbequebaz', 'ba.', 'g')</literal></entry>
<entry><literal>{bar}</literal><para><literal>{baz}</literal></para> (2 rows)</entry>
</row>
<row>
@ -4112,6 +4129,9 @@ substring('foobar' from '#"o_b#"%' for '#') <lineannotation>NULL</lineannotat
<indexterm>
<primary>regexp_replace</primary>
</indexterm>
<indexterm>
<primary>regexp_match</primary>
</indexterm>
<indexterm>
<primary>regexp_matches</primary>
</indexterm>
@@ -4272,64 +4292,106 @@ regexp_replace('foobarbaz', 'b(..)', E'X\\1Y', 'g')
</para>
<para>
The <function>regexp_matches</> function returns a text array of
all of the captured substrings resulting from matching a POSIX
regular expression pattern. It has the syntax
<function>regexp_matches</function>(<replaceable>string</>, <replaceable>pattern</>
<optional>, <replaceable>flags</> </optional>).
The function can return no rows, one row, or multiple rows (see
the <literal>g</> flag below). If the <replaceable>pattern</>
does not match, the function returns no rows. If the pattern
contains no parenthesized subexpressions, then each row
returned is a single-element text array containing the substring
matching the whole pattern. If the pattern contains parenthesized
subexpressions, the function returns a text array whose
<replaceable>n</>'th element is the substring matching the
<replaceable>n</>'th parenthesized subexpression of the pattern
(not counting <quote>non-capturing</> parentheses; see below for
details).
The <replaceable>flags</> parameter is an optional text
string containing zero or more single-letter flags that change the
function's behavior. Flag <literal>g</> causes the function to find
each match in the string, not only the first one, and return a row for
each such match. Supported flags (though
not <literal>g</>)
are described in <xref linkend="posix-embedded-options-table">.
The <function>regexp_match</> function returns a text array of
captured substring(s) resulting from the first match of a POSIX
regular expression pattern to a string. It has the syntax
<function>regexp_match</function>(<replaceable>string</>,
<replaceable>pattern</> <optional>, <replaceable>flags</> </optional>).
If there is no match, the result is <literal>NULL</>.
If a match is found, and the <replaceable>pattern</> contains no
parenthesized subexpressions, then the result is a single-element text
array containing the substring matching the whole pattern.
If a match is found, and the <replaceable>pattern</> contains
parenthesized subexpressions, then the result is a text array
whose <replaceable>n</>'th element is the substring matching
the <replaceable>n</>'th parenthesized subexpression of
the <replaceable>pattern</> (not counting <quote>non-capturing</>
parentheses; see below for details).
The <replaceable>flags</> parameter is an optional text string
containing zero or more single-letter flags that change the function's
behavior. Supported flags are described
in <xref linkend="posix-embedded-options-table">.
</para>
<para>
Some examples:
<programlisting>
SELECT regexp_matches('foobarbequebaz', '(bar)(beque)');
regexp_matches
----------------
{bar,beque}
SELECT regexp_match('foobarbequebaz', 'bar.*que');
regexp_match
--------------
{barbeque}
(1 row)
SELECT regexp_match('foobarbequebaz', '(bar)(beque)');
regexp_match
--------------
{bar,beque}
(1 row)
</programlisting>
In the common case where you just want the whole matching substring
or <literal>NULL</> for no match, write something like
<programlisting>
SELECT (regexp_match('foobarbequebaz', 'bar.*que'))[1];
regexp_match
--------------
barbeque
(1 row)
</programlisting>
</para>
<para>
The <function>regexp_matches</> function returns a set of text arrays
of captured substring(s) resulting from matching a POSIX regular
expression pattern to a string. It has the same syntax as
<function>regexp_match</function>.
This function returns no rows if there is no match, one row if there is
a match and the <literal>g</> flag is not given, or <replaceable>N</>
rows if there are <replaceable>N</> matches and the <literal>g</> flag
is given. Each returned row is a text array containing the whole
matched substring or the substrings matching parenthesized
subexpressions of the <replaceable>pattern</>, just as described above
for <function>regexp_match</function>.
<function>regexp_matches</> accepts all the flags shown
in <xref linkend="posix-embedded-options-table">, plus
the <literal>g</> flag which commands it to return all matches, not
just the first one.
</para>
<para>
Some examples:
<programlisting>
SELECT regexp_matches('foo', 'not there');
regexp_matches
----------------
(0 rows)
SELECT regexp_matches('foobarbequebazilbarfbonk', '(b[^b]+)(b[^b]+)', 'g');
regexp_matches
regexp_matches
----------------
{bar,beque}
{bazil,barf}
(2 rows)
SELECT regexp_matches('foobarbequebaz', 'barbeque');
regexp_matches
----------------
{barbeque}
(1 row)
</programlisting>
</para>
<para>
It is possible to force <function>regexp_matches()</> to always
return one row by using a sub-select; this is particularly useful
in a <literal>SELECT</> target list when you want all rows
returned, even non-matching ones:
<tip>
<para>
In most cases <function>regexp_matches()</> should be used with
the <literal>g</> flag, since if you only want the first match, it's
easier and more efficient to use <function>regexp_match()</>.
However, <function>regexp_match()</> only exists
in <productname>PostgreSQL</> version 10 and up. When working in older
versions, a common trick is to place a <function>regexp_matches()</>
call in a sub-select, for example:
<programlisting>
SELECT col1, (SELECT regexp_matches(col2, '(bar)(beque)')) FROM tab;
</programlisting>
</para>
This produces a text array if there's a match, or <literal>NULL</> if
not, the same as <function>regexp_match()</> would do. Without the
sub-select, this query would produce no output at all for table rows
without a match, which is typically not the desired behavior.
</para>
</tip>
<para>
The <function>regexp_split_to_table</> function splits a string using a POSIX
@@ -4408,6 +4470,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', E'\\s*') AS foo;
zero-length matches that occur at the start or end of the string
or immediately after a previous match. This is contrary to the strict
definition of regexp matching that is implemented by
<function>regexp_match</> and
<function>regexp_matches</>, but is usually the most convenient behavior
in practice. Other software systems such as Perl use similar definitions.
</para>
@@ -5482,7 +5545,7 @@ SELECT SUBSTRING('XY1234Z', 'Y*?([0-9]{1,3})');
into the digits and the parts before and after them. We might try to
do that like this:
<screen>
SELECT regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
SELECT regexp_match('abc01234xyz', '(.*)(\d+)(.*)');
<lineannotation>Result: </lineannotation><computeroutput>{abc0123,4,xyz}</computeroutput>
</screen>
That didn't work: the first <literal>.*</> is greedy so
@@ -5490,14 +5553,14 @@ SELECT regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
match at the last possible place, the last digit. We might try to fix
that by making it non-greedy:
<screen>
SELECT regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');
SELECT regexp_match('abc01234xyz', '(.*?)(\d+)(.*)');
<lineannotation>Result: </lineannotation><computeroutput>{abc,0,""}</computeroutput>
</screen>
That didn't work either, because now the RE as a whole is non-greedy
and so it ends the overall match as soon as possible. We can get what
we want by forcing the RE as a whole to be greedy:
<screen>
SELECT regexp_matches('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
<lineannotation>Result: </lineannotation><computeroutput>{abc,01234,xyz}</computeroutput>
</screen>
Controlling the RE's overall greediness separately from its components'