Mirror of https://github.com/postgres/postgres.git (synced 2025-05-03 22:24:49 +03:00)
Miscellaneous cleanup of regular-expression compiler.

Revert our previous addition of "all" flags to copyins() and copyouts(); they're no longer needed, and were never anything but an unsightly hack.

Improve a couple of infelicities in the REG_DEBUG code for dumping the NFA data structure, including adding code to count the total number of states and arcs.

Add a couple of missed error checks.

Add some more documentation in the README file, and some regression tests illustrating cases that exceeded the state-count limit and/or took unreasonable amounts of time before this set of patches.

Back-patch to all supported branches.
parent 4e4610a8a1
commit 2419ab8aa9
@@ -71,11 +71,10 @@ relates to what you'll see in the code. Here's what really happens:
 of states approximately proportional to the length of the regexp.

 * The NFA is then optimized into a "compact NFA" representation, which is
-basically the same data but without fields that are not going to be needed
-at runtime. We do a little bit of cleanup too, such as removing
-unreachable states that might be created as a result of the rather naive
-transformation done by initial parsing. The cNFA representation is what
-is passed from regcomp to regexec.
+basically the same idea but without fields that are not going to be needed
+at runtime. It is simplified too: the compact format only allows "plain"
+and "LACON" arc types. The cNFA representation is what is passed from
+regcomp to regexec.

 * Unlike traditional NFA-based regex engines, we do not execute directly
 from the NFA representation, as that would require backtracking and so be
@@ -134,12 +133,13 @@ a possible division of the input string that allows its two child nodes to
 each match their part of the string (and although this specific case can
 only succeed when the division is at the middle, the code does not know
 that, nor would it be true in general). However, we can first run the DFA
-and quickly reject any input that doesn't contain two a's and some number
-of b's and c's. If the DFA doesn't match, there is no need to recurse to
-the two child nodes for each possible string division point. In many
-cases, this prefiltering makes the search run much faster than a pure NFA
-engine could do. It is this behavior that justifies using the phrase
-"hybrid DFA/NFA engine" to describe Spencer's library.
+and quickly reject any input that doesn't start with an "a" and contain
+one more "a" plus some number of b's and c's. If the DFA doesn't match,
+there is no need to recurse to the two child nodes for each possible
+string division point. In many cases, this prefiltering makes the search
+run much faster than a pure NFA engine could do. It is this behavior that
+justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
+library.


 Colors and colormapping
@@ -291,3 +291,76 @@ character classes are somehow processed "symbolically" without making a
 full expansion of their contents at parse time. This would mean that we'd
 have to be ready to call iswalpha() at runtime, but if that only happens
 for high-code-value characters, it shouldn't be a big performance hit.
+
+
+Detailed semantics of an NFA
+----------------------------
+
+When trying to read dumped-out NFAs, it's helpful to know these facts:
+
+State 0 (additionally marked with "@" in dumpnfa's output) is always the
+goal state, and state 1 (additionally marked with ">") is the start state.
+(The code refers to these as the post state and pre state respectively.)
+
+The possible arc types are:
+
+    PLAIN arcs, which specify matching of any character of a given "color"
+    (see above). These are dumped as "[color_number]->to_state".
+
+    EMPTY arcs, which specify a no-op transition to another state. These
+    are dumped as "->to_state".
+
+    AHEAD constraints, which represent a "next character must be of this
+    color" constraint. AHEAD differs from a PLAIN arc in that the input
+    character is not consumed when crossing the arc. These are dumped as
+    ">color_number>->to_state".
+
+    BEHIND constraints, which represent a "previous character must be of
+    this color" constraint, which likewise consumes no input. These are
+    dumped as "<color_number<->to_state".
+
+    '^' arcs, which specify a beginning-of-input constraint. These are
+    dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
+    beginning-of-line constraints respectively.
+
+    '$' arcs, which specify an end-of-input constraint. These are dumped
+    as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
+    constraints respectively.
+
+    LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
+    i.e. the input starting at this point must match (or not match) a
+    given sub-RE, but the matching input is not consumed. These are
+    dumped as ":subtree_number:->to_state".
+
+If you see anything else (especially any question marks) in the display of
+an arc, it's dumpnfa() trying to tell you that there's something fishy
+about the arc; see the source code.
+
+The regex executor can only handle PLAIN and LACON transitions. The regex
+optimize() function is responsible for transforming the parser's output
+to get rid of all the other arc types. In particular, ^ and $ arcs that
+are not dropped as impossible will always end up adjacent to the pre or
+post state respectively, and then will be converted into PLAIN arcs that
+mention the special "colors" for BOS, BOL, EOS, or EOL.
+
+To decide whether a thus-transformed NFA matches a given substring of the
+input string, the executor essentially follows these rules:
+1. Start the NFA "looking at" the character *before* the given substring,
+or if the substring is at the start of the input, prepend an imaginary BOS
+character instead.
+2. Run the NFA until it has consumed the character *after* the given
+substring, or an imaginary following EOS character if the substring is at
+the end of the input.
+3. If the NFA is (or can be) in the goal state at this point, it matches.
+
+So one can mentally execute an untransformed NFA by taking ^ and $ as
+ordinary constraints that match at start and end of input; but plain
+arcs out of the start state should be taken as matches for the character
+before the target substring, and similarly, plain arcs leading to the
+post state are matches for the character after the target substring.
+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any. NFAs for simple unanchored patterns will usually have
+pre-state outarcs for all possible character colors as well as BOS and
+BOL, and post-state inarcs for all possible character colors as well as
+EOS and EOL, so that the executor's behavior will work.
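Editorial illustration (not part of the commit): the three executor rules quoted in the README text above can be sketched as a small bitmask simulation over PLAIN arcs only. Everything named here (plain_arc, step, nfa_matches, COLOR_BOS, COLOR_EOS) is hypothetical and chosen for the example; the real executor works on the compact NFA through a lazily built DFA, and LACON arcs and the BOL/EOL colors are left out of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names only; none of these are identifiers from the regex code. */
enum { POST_STATE = 0, PRE_STATE = 1 };     /* goal and start state numbers */
enum { COLOR_BOS = -1, COLOR_EOS = -2 };    /* stand-ins for the BOS/EOS pseudo-colors */

struct plain_arc
{
    int from;   /* source state number, 0..31 in this sketch */
    int color;  /* color that is consumed when the arc is crossed */
    int to;     /* destination state number, 0..31 */
};

/* Advance the set of active states (one bit per state) over one color. */
static unsigned
step(const struct plain_arc *arcs, size_t narcs, unsigned active, int color)
{
    unsigned next = 0;
    size_t   i;

    for (i = 0; i < narcs; i++)
        if ((active & (1u << arcs[i].from)) != 0 && arcs[i].color == color)
            next |= 1u << arcs[i].to;
    return next;
}

/*
 * Return 1 if the NFA matches the substring [start, end) of an input whose
 * characters have already been mapped to colors[0..len-1], else 0.
 */
int
nfa_matches(const struct plain_arc *arcs, size_t narcs,
            const int *colors, size_t len, size_t start, size_t end)
{
    unsigned active = 1u << PRE_STATE;
    size_t   i;

    assert(start <= end && end <= len);

    /* Rule 1: consume the character before the substring, or BOS. */
    active = step(arcs, narcs, active,
                  start == 0 ? COLOR_BOS : colors[start - 1]);

    /* Consume the substring itself. */
    for (i = start; i < end; i++)
        active = step(arcs, narcs, active, colors[i]);

    /* Rule 2: consume the character after the substring, or EOS. */
    active = step(arcs, narcs, active,
                  end == len ? COLOR_EOS : colors[end]);

    /* Rule 3: it is a match if the goal state is reachable now. */
    return (int) ((active >> POST_STATE) & 1u);
}
```

In this model an unanchored pattern gives the pre state out-arcs for every color plus BOS, while a pattern starting with ^ keeps only the BOS out-arc, so a candidate substring that does not begin at the start of the input is rejected by rule 1.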
@@ -823,14 +823,11 @@ moveins(struct nfa * nfa,

 /*
  * copyins - copy in arcs of a state to another state
- *
- * Either all arcs, or only non-empty ones as determined by all value.
  */
 static void
 copyins(struct nfa * nfa,
         struct state * oldState,
-        struct state * newState,
-        int all)
+        struct state * newState)
 {
     assert(oldState != newState);

@@ -840,8 +837,7 @@ copyins(struct nfa * nfa,
         struct arc *a;

         for (a = oldState->ins; a != NULL; a = a->inchain)
-            if (all || a->type != EMPTY)
-                cparc(nfa, a, a->from, newState);
+            cparc(nfa, a, a->from, newState);
     }
     else
     {
@@ -873,12 +869,6 @@ copyins(struct nfa * nfa,
         {
             struct arc *a = oa;

-            if (!all && a->type == EMPTY)
-            {
-                oa = oa->inchain;
-                continue;
-            }
-
             switch (sortins_cmp(&oa, &na))
             {
                 case -1:
@@ -904,12 +894,6 @@ copyins(struct nfa * nfa,
             /* newState does not have anything matching oa */
             struct arc *a = oa;

-            if (!all && a->type == EMPTY)
-            {
-                oa = oa->inchain;
-                continue;
-            }
-
             oa = oa->inchain;
             createarc(nfa, a->type, a->co, a->from, newState);
         }
@@ -1107,14 +1091,11 @@ moveouts(struct nfa * nfa,

 /*
  * copyouts - copy out arcs of a state to another state
- *
- * Either all arcs, or only non-empty ones as determined by all value.
  */
 static void
 copyouts(struct nfa * nfa,
          struct state * oldState,
-         struct state * newState,
-         int all)
+         struct state * newState)
 {
     assert(oldState != newState);

@@ -1124,8 +1105,7 @@ copyouts(struct nfa * nfa,
         struct arc *a;

         for (a = oldState->outs; a != NULL; a = a->outchain)
-            if (all || a->type != EMPTY)
-                cparc(nfa, a, newState, a->to);
+            cparc(nfa, a, newState, a->to);
     }
     else
     {
@@ -1157,12 +1137,6 @@ copyouts(struct nfa * nfa,
         {
             struct arc *a = oa;

-            if (!all && a->type == EMPTY)
-            {
-                oa = oa->outchain;
-                continue;
-            }
-
             switch (sortouts_cmp(&oa, &na))
             {
                 case -1:
@@ -1188,12 +1162,6 @@ copyouts(struct nfa * nfa,
             /* newState does not have anything matching oa */
             struct arc *a = oa;

-            if (!all && a->type == EMPTY)
-            {
-                oa = oa->outchain;
-                continue;
-            }
-
             oa = oa->outchain;
             createarc(nfa, a->type, a->co, newState, a->to);
         }
@@ -1452,6 +1420,10 @@ optimize(struct nfa * nfa,
         fprintf(f, "\nfinal cleanup:\n");
 #endif
     cleanup(nfa); /* final tidying */
+#ifdef REG_DEBUG
+    if (verbose)
+        dumpnfa(nfa, f);
+#endif
     return analyze(nfa); /* and analysis */
 }

@@ -1568,7 +1540,7 @@ pull(struct nfa * nfa,
         s = newstate(nfa);
         if (NISERR())
             return 0;
-        copyins(nfa, from, s, 1); /* duplicate inarcs */
+        copyins(nfa, from, s); /* duplicate inarcs */
         cparc(nfa, con, s, to); /* move constraint arc */
         freearc(nfa, con);
         if (NISERR())
@@ -1735,7 +1707,7 @@ push(struct nfa * nfa,
         s = newstate(nfa);
         if (NISERR())
             return 0;
-        copyouts(nfa, to, s, 1); /* duplicate outarcs */
+        copyouts(nfa, to, s); /* duplicate outarcs */
         cparc(nfa, con, from, s); /* move constraint arc */
         freearc(nfa, con);
         if (NISERR())
@@ -2952,6 +2924,8 @@ dumpnfa(struct nfa * nfa,
 {
 #ifdef REG_DEBUG
     struct state *s;
+    int nstates = 0;
+    int narcs = 0;

     fprintf(f, "pre %d, post %d", nfa->pre->no, nfa->post->no);
     if (nfa->bos[0] != COLORLESS)
@@ -2964,7 +2938,12 @@ dumpnfa(struct nfa * nfa,
         fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
     fprintf(f, "\n");
     for (s = nfa->states; s != NULL; s = s->next)
+    {
         dumpstate(s, f);
+        nstates++;
+        narcs += s->nouts;
+    }
+    fprintf(f, "total of %d states, %d arcs\n", nstates, narcs);
     if (nfa->parent == NULL)
         dumpcolors(nfa->cm, f);
     fflush(f);
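Editorial illustration (not part of the commit): the totals added to dumpnfa() rely on every arc being an out-arc of exactly one state, so summing each state's out-arc count while walking the state list counts each arc once. A minimal standalone sketch of that walk, using hypothetical demo_state and demo_totals types rather than the library's struct state and struct nfa:

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's state list; only two fields are modeled. */
struct demo_state
{
    int                nouts;   /* number of out-arcs leaving this state */
    struct demo_state *next;    /* next state in the list, or NULL */
};

struct demo_totals
{
    int nstates;
    int narcs;
};

/* Walk the list once, counting states and summing their out-arc counts. */
struct demo_totals
count_states_and_arcs(const struct demo_state *states)
{
    struct demo_totals       totals = {0, 0};
    const struct demo_state *s;

    for (s = states; s != NULL; s = s->next)
    {
        totals.nstates++;
        totals.narcs += s->nouts;
    }
    return totals;
}
```

Summing in-arc counts instead would give the same arc total, since each arc also enters exactly one state.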
@@ -136,10 +136,10 @@ static int sortins_cmp(const void *, const void *);
 static void sortouts(struct nfa *, struct state *);
 static int sortouts_cmp(const void *, const void *);
 static void moveins(struct nfa *, struct state *, struct state *);
-static void copyins(struct nfa *, struct state *, struct state *, int);
+static void copyins(struct nfa *, struct state *, struct state *);
 static void mergeins(struct nfa *, struct state *, struct arc **, int);
 static void moveouts(struct nfa *, struct state *, struct state *);
-static void copyouts(struct nfa *, struct state *, struct state *, int);
+static void copyouts(struct nfa *, struct state *, struct state *);
 static void cloneouts(struct nfa *, struct state *, struct state *, struct state *, int);
 static void delsub(struct nfa *, struct state *, struct state *);
 static void deltraverse(struct nfa *, struct state *, struct state *);
@@ -181,7 +181,6 @@ static void dumpnfa(struct nfa *, FILE *);
 #ifdef REG_DEBUG
 static void dumpstate(struct state *, FILE *);
 static void dumparcs(struct state *, FILE *);
-static int dumprarcs(struct arc *, struct state *, FILE *, int);
 static void dumparc(struct arc *, struct state *, FILE *);
 static void dumpcnfa(struct cnfa *, FILE *);
 static void dumpcstate(int, struct cnfa *, FILE *);
@@ -614,7 +613,9 @@ makesearch(struct vars * v,
     for (s = slist; s != NULL; s = s2)
     {
         s2 = newstate(nfa);
-        copyouts(nfa, s, s2, 1);
+        NOERR();
+        copyouts(nfa, s, s2);
+        NOERR();
         for (a = s->ins; a != NULL; a = b)
         {
             b = a->inchain;
@@ -2014,7 +2015,7 @@ dump(regex_t *re,
     dumpcolors(&g->cmap, f);
     if (!NULLCNFA(g->search))
     {
-        printf("\nsearch:\n");
+        fprintf(f, "\nsearch:\n");
         dumpcnfa(&g->search, f);
     }
     for (i = 1; i < g->nlacons; i++)
@@ -229,6 +229,41 @@ select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
  t
 (1 row)

+-- These cases used to give too-many-states failures
+select 'x' ~ 'abcd(\m)+xyz';
+ ?column?
+----------
+ f
+(1 row)
+
+select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
+ ?column?
+----------
+ f
+(1 row)
+
+select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
+ ?column?
+----------
+ f
+(1 row)
+
+select 'x' ~ 'xyz(\Y\Y)+';
+ ?column?
+----------
+ f
+(1 row)
+
+select 'x' ~ 'x|(?:\M)+';
+ ?column?
+----------
+ t
+(1 row)
+
+-- This generates O(N) states but O(N^2) arcs, so it causes problems
+-- if arc count is not constrained
+select 'x' ~ repeat('x*y*z*', 1000);
+ERROR: invalid regular expression: regular expression is too complex
 -- Test backref in combination with non-greedy quantifier
 -- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
 select 'Programmer' ~ '(\w).*?\1' as t;
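Editorial illustration (not part of the commit): the "O(N) states but O(N^2) arcs" comment in the test above can be made concrete with an idealized model. Assume a pattern of k starred atoms is turned into an automaton with no EMPTY arcs, with one start state plus one state per atom, where state i has an arc to every state j that can follow it after skipping zero or more of the optional atoms in between. This model and the function names are assumptions for the example; the state and arc counts PostgreSQL actually builds will differ in detail.

```c
/* Idealized model only: k starred atoms, states 0..k, no EMPTY arcs. */

/* One start state plus one state per starred atom. */
long
model_state_count(int k)
{
    return (long) k + 1;
}

/*
 * From the start state (0), any atom 1..k may come first.  From state i,
 * atom i may repeat or any later atom may follow, so arcs go to i..k.
 */
long
model_arc_count(int k)
{
    long arcs = 0;
    int  i;
    int  j;

    for (i = 0; i <= k; i++)
        for (j = (i == 0 ? 1 : i); j <= k; j++)
            arcs++;
    return arcs;
}
```

Under this model the arc count is k + k(k+1)/2. For repeat('x*y*z*', 1000), k is 3000, which gives 3001 states but roughly 4.5 million arcs, which is consistent with the test comment's point that limiting only the state count would not catch this pattern.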
@@ -55,6 +55,17 @@ select 'dd x' ~ '(^(?!aa)(?!bb)(?!cc))+';
 select 'a' ~ '((((((a)*)*)*)*)*)*';
 select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';

+-- These cases used to give too-many-states failures
+select 'x' ~ 'abcd(\m)+xyz';
+select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
+select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
+select 'x' ~ 'xyz(\Y\Y)+';
+select 'x' ~ 'x|(?:\M)+';
+
+-- This generates O(N) states but O(N^2) arcs, so it causes problems
+-- if arc count is not constrained
+select 'x' ~ repeat('x*y*z*', 1000);
+
 -- Test backref in combination with non-greedy quantifier
 -- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
 select 'Programmer' ~ '(\w).*?\1' as t;