mirror of
https://github.com/postgres/postgres.git
synced 2025-05-01 01:04:50 +03:00
Doc: improve documentation around jsonpath regular expressions.
Provide some documentation about the differences between XQuery regular expressions and those supported by Spencer's regex engine. Since SQL now exposes XQuery regexps with the LIKE_REGEX operator, I made this a standalone section designed to help somebody who has to translate a LIKE_REGEX query to Postgres. (Eventually we might extend Spencer's engine to allow precise implementation of XQuery, but not today.) Reference that in the jsonpath docs, provide definitions of the XQuery flag letters, and add a description of the JavaScript-inspired string literal syntax used within jsonpath. Also point out explicitly that backslashes used within like_regex patterns will need to be doubled. This also syncs the docs with the decision implemented in commit d5b90cd64 to desupport XQuery's 'x' flag for now. Jonathan Katz and Tom Lane Discussion: https://postgr.es/m/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg@mail.gmail.com
This commit is contained in:
parent
e1c8743e6c
commit
0a97edb12e
@ -5970,6 +5970,145 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
|
||||
|
||||
<!-- end re_syntax.n man page -->
|
||||
|
||||
<sect3 id="posix-vs-xquery">
|
||||
<title>Differences From XQuery (<literal>LIKE_REGEX</literal>)</title>
|
||||
|
||||
<indexterm zone="posix-vs-xquery">
|
||||
<primary><literal>LIKE_REGEX</literal></primary>
|
||||
</indexterm>
|
||||
|
||||
<indexterm zone="posix-vs-xquery">
|
||||
<primary>XQuery regular expressions</primary>
|
||||
</indexterm>
|
||||
|
||||
<para>
|
||||
Since SQL:2008, the SQL standard includes
|
||||
a <literal>LIKE_REGEX</literal> operator that performs pattern
|
||||
matching according to the XQuery regular expression
|
||||
standard. <productname>PostgreSQL</productname> does not yet
|
||||
implement this operator, but you can get very similar behavior using
|
||||
the <function>regexp_match()</function> function, since XQuery
|
||||
regular expressions are quite close to the ARE syntax described above.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Notable differences between the existing POSIX-based
|
||||
regular-expression feature and XQuery regular expressions include:
|
||||
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery character class subtraction is not supported. An example of
|
||||
this feature is using the following to match only English
|
||||
consonants: <literal>[a-z-[aeiou]]</literal>.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery character class shorthands <literal>\c</literal>,
|
||||
<literal>\C</literal>, <literal>\i</literal>,
|
||||
and <literal>\I</literal> are not supported.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery character class elements
|
||||
using <literal>\p{UnicodeProperty}</literal> or the
|
||||
inverse <literal>\P{UnicodeProperty}</literal> are not supported.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
POSIX interprets character classes such as <literal>\w</literal>
|
||||
(see <xref linkend="posix-class-shorthand-escapes-table"/>)
|
||||
according to the prevailing locale (which you can control by
|
||||
attaching a <literal>COLLATE</literal> clause to the operator or
|
||||
function). XQuery specifies these classes by reference to Unicode
|
||||
character properties, so equivalent behavior is obtained only with
|
||||
a locale that follows the Unicode rules.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
The SQL standard (not XQuery itself) attempts to cater for more
|
||||
variants of <quote>newline</quote> than POSIX does. The
|
||||
newline-sensitive matching options described above consider only
|
||||
ASCII NL (<literal>\n</literal>) to be a newline, but SQL would have
|
||||
us treat CR (<literal>\r</literal>), CRLF (<literal>\r\n</literal>)
|
||||
(a Windows-style newline), and some Unicode-only characters like
|
||||
LINE SEPARATOR (U+2028) as newlines as well.
|
||||
Notably, <literal>.</literal> and <literal>\s</literal> should
|
||||
count <literal>\r\n</literal> as one character not two according to
|
||||
SQL.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
Of the character-entry escapes described in
|
||||
<xref linkend="posix-character-entry-escapes-table"/>,
|
||||
XQuery supports only <literal>\n</literal>, <literal>\r</literal>,
|
||||
and <literal>\t</literal>.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery does not support
|
||||
the <literal>[:<replaceable>name</replaceable>:]</literal> syntax
|
||||
for character classes within bracket expressions.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery does not have lookahead or lookbehind constraints,
|
||||
nor any of the constraint escapes described in
|
||||
<xref linkend="posix-constraint-escapes-table"/>.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
The metasyntax forms described in <xref linkend="posix-metasyntax"/>
|
||||
do not exist in XQuery.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
The regular expression flag letters defined by XQuery are
|
||||
related to but not the same as the option letters for POSIX
|
||||
(<xref linkend="posix-embedded-options-table"/>). While the
|
||||
<literal>i</literal> and <literal>q</literal> options behave the
|
||||
same, others do not:
|
||||
<itemizedlist>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery's <literal>s</literal> (allow dot to match newline)
|
||||
and <literal>m</literal> (allow <literal>^</literal>
|
||||
and <literal>$</literal> to match at newlines) flags provide
|
||||
access to the same behaviors as
|
||||
POSIX's <literal>n</literal>, <literal>p</literal>
|
||||
and <literal>w</literal> flags, but they
|
||||
do <emphasis>not</emphasis> match the behavior of
|
||||
POSIX's <literal>s</literal> and <literal>m</literal> flags.
|
||||
Note in particular that dot-matches-newline is the default
|
||||
behavior in POSIX but not XQuery.
|
||||
</para>
|
||||
</listitem>
|
||||
<listitem>
|
||||
<para>
|
||||
XQuery's <literal>x</literal> (ignore whitespace in pattern) flag
|
||||
is noticeably different from POSIX's expanded-mode flag.
|
||||
POSIX's <literal>x</literal> flag also
|
||||
allows <literal>#</literal> to begin a comment in the pattern,
|
||||
and POSIX will not ignore a whitespace character after a
|
||||
backslash.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
</para>
|
||||
|
||||
</sect3>
|
||||
</sect2>
|
||||
</sect1>
|
||||
|
||||
@ -11793,6 +11932,14 @@ table2-mapping
|
||||
</programlisting>
|
||||
</para>
|
||||
</listitem>
|
||||
|
||||
<listitem>
|
||||
<para>
|
||||
There are minor differences in the interpretation of regular
|
||||
expression patterns used in <literal>like_regex</literal> filters, as
|
||||
described in <xref linkend="jsonpath-regular-expressions"/>.
|
||||
</para>
|
||||
</listitem>
|
||||
</itemizedlist>
|
||||
|
||||
<sect3 id="strict-and-lax-modes">
|
||||
@ -11872,6 +12019,63 @@ table2-mapping
|
||||
|
||||
</sect3>
|
||||
|
||||
<sect3 id="jsonpath-regular-expressions">
|
||||
<title>Regular Expressions</title>
|
||||
|
||||
<indexterm zone="jsonpath-regular-expressions">
|
||||
<primary><literal>LIKE_REGEX</literal></primary>
|
||||
<secondary>in SQL/JSON</secondary>
|
||||
</indexterm>
|
||||
|
||||
<para>
|
||||
SQL/JSON path expressions allow matching text to a regular expression
|
||||
with the <literal>like_regex</literal> filter. For example, the
|
||||
following SQL/JSON path query would case-insensitively match all
|
||||
strings in an array that start with an English vowel:
|
||||
<programlisting>
|
||||
'$[*] ? (@ like_regex "^[aeiou]" flag "i")'
|
||||
</programlisting>
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The optional <literal>flag</literal> string may include one or more of
|
||||
the characters
|
||||
<literal>i</literal> for case-insensitive match,
|
||||
<literal>m</literal> to allow <literal>^</literal>
|
||||
and <literal>$</literal> to match at newlines,
|
||||
<literal>s</literal> to allow <literal>.</literal> to match a newline,
|
||||
and <literal>q</literal> to quote the whole pattern (reducing the
|
||||
behavior to a simple substring match).
|
||||
</para>
|
||||
|
||||
<para>
|
||||
The SQL/JSON standard borrows its definition for regular expressions
|
||||
from the <literal>LIKE_REGEX</literal> operator, which in turn uses the
|
||||
XQuery standard. PostgreSQL does not currently support the
|
||||
<literal>LIKE_REGEX</literal> operator. Therefore,
|
||||
the <literal>like_regex</literal> filter is implemented using the
|
||||
POSIX regular expression engine described in
|
||||
<xref linkend="functions-posix-regexp"/>. This leads to various minor
|
||||
discrepancies from standard SQL/JSON behavior, which are cataloged in
|
||||
<xref linkend="posix-vs-xquery"/>.
|
||||
Note, however, that the flag-letter incompatibilities described there
|
||||
do not apply to SQL/JSON, as it translates the XQuery flag letters to
|
||||
match what the POSIX engine expects.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
Keep in mind that the pattern argument of <literal>like_regex</literal>
|
||||
is a JSON path string literal, written according to the rules given in
|
||||
<xref linkend="datatype-jsonpath"/>. This means in particular that any
|
||||
backslashes you want to use in the regular expression must be doubled.
|
||||
For example, to match strings that contain only digits:
|
||||
<programlisting>
|
||||
'$ ? (@ like_regex "^\\d+$")'
|
||||
</programlisting>
|
||||
</para>
|
||||
|
||||
</sect3>
|
||||
|
||||
<sect3 id="functions-sqljson-path-operators">
|
||||
<title>SQL/JSON Path Operators and Methods</title>
|
||||
|
||||
@ -12113,10 +12317,11 @@ table2-mapping
|
||||
<row>
|
||||
<entry><literal>like_regex</literal></entry>
|
||||
<entry>
|
||||
Tests pattern matching with POSIX regular expressions
|
||||
(see <xref linkend="functions-posix-regexp"/>). Supported flags
|
||||
are <literal>i</literal>, <literal>s</literal>, <literal>m</literal>,
|
||||
<literal>x</literal>, and <literal>q</literal>.</entry>
|
||||
Tests whether the first operand matches the regular expression
|
||||
given by the second operand, optionally with modifications
|
||||
described by a string of <literal>flag</literal> characters (see
|
||||
<xref linkend="jsonpath-regular-expressions"/>)
|
||||
</entry>
|
||||
<entry><literal>["abc", "abd", "aBdC", "abdacb", "babc"]</literal></entry>
|
||||
<entry><literal>$[*] ? (@ like_regex "^ab.*c" flag "i")</literal></entry>
|
||||
<entry><literal>"abc", "aBdC", "abdacb"</literal></entry>
|
||||
|
@ -666,13 +666,32 @@ SELECT jdoc->'guid', jdoc->'name' FROM api WHERE jdoc @> '{"tags": ["qu
|
||||
</itemizedlist>
|
||||
|
||||
<para>
|
||||
An SQL/JSON path expression is an SQL character string literal,
|
||||
so it must be enclosed in single quotes when passed to an SQL/JSON
|
||||
query function. Following the JavaScript
|
||||
conventions, character string literals within the path expression
|
||||
must be enclosed in double quotes. Any single quotes within this
|
||||
character string literal must be escaped with a single quote
|
||||
by the SQL convention.
|
||||
An SQL/JSON path expression is typically written in an SQL query as an
|
||||
SQL character string literal, so it must be enclosed in single quotes,
|
||||
and any single quotes desired within the value must be doubled
|
||||
(see <xref linkend="sql-syntax-strings"/>).
|
||||
Some forms of path expressions require string literals within them.
|
||||
These embedded string literals follow JavaScript/ECMAScript conventions:
|
||||
they must be surrounded by double quotes, and backslash escapes may be
|
||||
used within them to represent otherwise-hard-to-type characters.
|
||||
In particular, the way to write a double quote within an embedded string
|
||||
literal is <literal>\"</literal>, and to write a backslash itself, you
|
||||
must write <literal>\\</literal>. Other special backslash sequences
|
||||
include those recognized in JSON strings:
|
||||
<literal>\b</literal>,
|
||||
<literal>\f</literal>,
|
||||
<literal>\n</literal>,
|
||||
<literal>\r</literal>,
|
||||
<literal>\t</literal>,
|
||||
<literal>\v</literal>
|
||||
for various ASCII control characters, and
|
||||
<literal>\u<replaceable>NNNN</replaceable></literal> for a Unicode
|
||||
character identified by its 4-hex-digit code point. The backslash
|
||||
syntax also includes two cases not allowed by JSON:
|
||||
<literal>\x<replaceable>NN</replaceable></literal> for a character code
|
||||
written with only two hex digits, and
|
||||
<literal>\u{<replaceable>N...</replaceable>}</literal> for a character
|
||||
code written with 1 to 6 hex digits.
|
||||
</para>
|
||||
|
||||
<para>
|
||||
|
Loading…
x
Reference in New Issue
Block a user