Miscellaneous cleanup of regular-expression compiler.

Revert our previous addition of "all" flags to copyins() and copyouts(); they're no longer needed, and were never anything but an unsightly hack.

Improve a couple of infelicities in the REG_DEBUG code for dumping the NFA data structure, including adding code to count the total number of states and arcs.

Add a couple of missed error checks.

Add some more documentation in the README file, and some regression tests illustrating cases that exceeded the state-count limit and/or took unreasonable amounts of time before this set of patches.

Back-patch to all supported branches.
2025-07-07 00:36:50 +03:00 · 2015-10-16 15:52:12 -04:00
parent 4e4610a8a1
commit 2419ab8aa9
5 changed files with 153 additions and 54 deletions
--- a/src/backend/regex/README
+++ b/src/backend/regex/README
@@ -71,11 +71,10 @@ relates to what you'll see in the code.  Here's what really happens:
 of states approximately proportional to the length of the regexp.

 * The NFA is then optimized into a "compact NFA" representation, which is
-basically the same data but without fields that are not going to be needed
-at runtime.  We do a little bit of cleanup too, such as removing
-unreachable states that might be created as a result of the rather naive
-transformation done by initial parsing.  The cNFA representation is what
-is passed from regcomp to regexec.
+basically the same idea but without fields that are not going to be needed
+at runtime.  It is simplified too: the compact format only allows "plain"
+and "LACON" arc types.  The cNFA representation is what is passed from
+regcomp to regexec.

 * Unlike traditional NFA-based regex engines, we do not execute directly
 from the NFA representation, as that would require backtracking and so be
@@ -134,12 +133,13 @@ a possible division of the input string that allows its two child nodes to
 each match their part of the string (and although this specific case can
 only succeed when the division is at the middle, the code does not know
 that, nor would it be true in general).  However, we can first run the DFA
-and quickly reject any input that doesn't contain two a's and some number
-of b's and c's.  If the DFA doesn't match, there is no need to recurse to
-the two child nodes for each possible string division point.  In many
-cases, this prefiltering makes the search run much faster than a pure NFA
-engine could do.  It is this behavior that justifies using the phrase
-"hybrid DFA/NFA engine" to describe Spencer's library.
+and quickly reject any input that doesn't start with an "a" and contain
+one more "a" plus some number of b's and c's.  If the DFA doesn't match,
+there is no need to recurse to the two child nodes for each possible
+string division point.  In many cases, this prefiltering makes the search
+run much faster than a pure NFA engine could do.  It is this behavior that
+justifies using the phrase "hybrid DFA/NFA engine" to describe Spencer's
+library.


 Colors and colormapping
@@ -291,3 +291,76 @@ character classes are somehow processed "symbolically" without making a
 full expansion of their contents at parse time.  This would mean that we'd
 have to be ready to call iswalpha() at runtime, but if that only happens
 for high-code-value characters, it shouldn't be a big performance hit.
+
+
+Detailed semantics of an NFA
+----------------------------
+
+When trying to read dumped-out NFAs, it's helpful to know these facts:
+
+State 0 (additionally marked with "@" in dumpnfa's output) is always the
+goal state, and state 1 (additionally marked with ">") is the start state.
+(The code refers to these as the post state and pre state respectively.)
+
+The possible arc types are:
+
+    PLAIN arcs, which specify matching of any character of a given "color"
+    (see above).  These are dumped as "[color_number]->to_state".
+
+    EMPTY arcs, which specify a no-op transition to another state.  These
+    are dumped as "->to_state".
+
+    AHEAD constraints, which represent a "next character must be of this
+    color" constraint.  AHEAD differs from a PLAIN arc in that the input
+    character is not consumed when crossing the arc.  These are dumped as
+    ">color_number>->to_state".
+
+    BEHIND constraints, which represent a "previous character must be of
+    this color" constraint, which likewise consumes no input.  These are
+    dumped as "<color_number<->to_state".
+
+    '^' arcs, which specify a beginning-of-input constraint.  These are
+    dumped as "^0->to_state" or "^1->to_state" for beginning-of-string and
+    beginning-of-line constraints respectively.
+
+    '$' arcs, which specify an end-of-input constraint.  These are dumped
+    as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line
+    constraints respectively.
+
+    LACON constraints, which represent "(?=re)" and "(?!re)" constraints,
+    i.e. the input starting at this point must match (or not match) a
+    given sub-RE, but the matching input is not consumed.  These are
+    dumped as ":subtree_number:->to_state".
+
+If you see anything else (especially any question marks) in the display of
+an arc, it's dumpnfa() trying to tell you that there's something fishy
+about the arc; see the source code.
+
+The regex executor can only handle PLAIN and LACON transitions.  The regex
+optimize() function is responsible for transforming the parser's output
+to get rid of all the other arc types.  In particular, ^ and $ arcs that
+are not dropped as impossible will always end up adjacent to the pre or
+post state respectively, and then will be converted into PLAIN arcs that
+mention the special "colors" for BOS, BOL, EOS, or EOL.
+
+To decide whether a thus-transformed NFA matches a given substring of the
+input string, the executor essentially follows these rules:
+1. Start the NFA "looking at" the character *before* the given substring,
+or if the substring is at the start of the input, prepend an imaginary BOS
+character instead.
+2. Run the NFA until it has consumed the character *after* the given
+substring, or an imaginary following EOS character if the substring is at
+the end of the input.
+3. If the NFA is (or can be) in the goal state at this point, it matches.
+
+So one can mentally execute an untransformed NFA by taking ^ and $ as
+ordinary constraints that match at start and end of input; but plain
+arcs out of the start state should be taken as matches for the character
+before the target substring, and similarly, plain arcs leading to the
+post state are matches for the character after the target substring.
+This definition is necessary to support regexes that begin or end with
+constraints such as \m and \M, which imply requirements on the adjacent
+character if any.  NFAs for simple unanchored patterns will usually have
+pre-state outarcs for all possible character colors as well as BOS and
+BOL, and post-state inarcs for all possible character colors as well as
+EOS and EOL, so that the executor's behavior will work.
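
Reviewer note: the three executor rules in the README hunk above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the toy_* names, the five-color palette, and the hand-built arc table for "^ab" are hypothetical and are not taken from regexec or the cNFA format. BOL/EOL pseudo-colors and LACON arcs are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical color palette: three character colors plus BOS/EOS. */
enum { COL_A, COL_B, COL_OTHER, COL_BOS, COL_EOS };

/* As in the README: state 0 is the post (goal) state, state 1 the pre state. */
enum { POST = 0, PRE = 1, NSTATES = 5 };

struct toy_arc { int from; int color; int to; };

/* Hand-built, already-optimized toy NFA for "^ab": PLAIN arcs only. */
static const struct toy_arc toy_arcs[] = {
    {PRE, COL_BOS, 2},          /* "^" became a BOS arc out of the pre state */
    {2, COL_A, 3},
    {3, COL_B, 4},
    {4, COL_A, POST},           /* any following character, or EOS, is fine */
    {4, COL_B, POST},
    {4, COL_OTHER, POST},
    {4, COL_EOS, POST},
};

static int toy_color_of(char c)
{
    return c == 'a' ? COL_A : c == 'b' ? COL_B : COL_OTHER;
}

/* Advance the set of active states over one (real or imaginary) character. */
static void toy_step(bool *active, int color)
{
    bool next[NSTATES] = {false};

    for (size_t i = 0; i < sizeof(toy_arcs) / sizeof(toy_arcs[0]); i++)
        if (active[toy_arcs[i].from] && toy_arcs[i].color == color)
            next[toy_arcs[i].to] = true;
    memcpy(active, next, sizeof(next));
}

/* Does the toy NFA match input[start..end)? */
static bool toy_matches(const char *input, size_t start, size_t end)
{
    size_t len = strlen(input);
    bool active[NSTATES] = {false};

    assert(start <= end && end <= len);
    active[PRE] = true;

    /* Rule 1: consume the character before the substring, or BOS. */
    toy_step(active, start == 0 ? COL_BOS : toy_color_of(input[start - 1]));

    for (size_t i = start; i < end; i++)
        toy_step(active, toy_color_of(input[i]));

    /* Rule 2: consume the character after the substring, or EOS. */
    toy_step(active, end == len ? COL_EOS : toy_color_of(input[end]));

    /* Rule 3: a match means the goal state is reachable at this point. */
    return active[POST];
}
```

With this table, toy_matches("abc", 0, 2) is true, while toy_matches("cab", 1, 3) is false because the pre state has no outarc for the preceding 'c'. An unanchored pattern would instead give the pre state outarcs for every color as well as BOS, as the README text describes.
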
--- a/src/backend/regex/regc_nfa.c
+++ b/src/backend/regex/regc_nfa.c
@@ -823,14 +823,11 @@ moveins(struct nfa * nfa,

 /*
 * copyins - copy in arcs of a state to another state
- *
- * Either all arcs, or only non-empty ones as determined by all value.
 */
 static void
 copyins(struct nfa * nfa,
 		struct state * oldState,
-		struct state * newState,
-		int all)
+		struct state * newState)
 {
 	assert(oldState != newState);

@@ -840,8 +837,7 @@ copyins(struct nfa * nfa,
 		struct arc *a;

 		for (a = oldState->ins; a != NULL; a = a->inchain)
-			if (all || a->type != EMPTY)
-				cparc(nfa, a, a->from, newState);
+			cparc(nfa, a, a->from, newState);
 	}
 	else
 	{
@@ -873,12 +869,6 @@ copyins(struct nfa * nfa,
 		{
 			struct arc *a = oa;

-			if (!all && a->type == EMPTY)
-			{
-				oa = oa->inchain;
-				continue;
-			}
-
 			switch (sortins_cmp(&oa, &na))
 			{
 				case -1:
@@ -904,12 +894,6 @@ copyins(struct nfa * nfa,
 			/* newState does not have anything matching oa */
 			struct arc *a = oa;

-			if (!all && a->type == EMPTY)
-			{
-				oa = oa->inchain;
-				continue;
-			}
-
 			oa = oa->inchain;
 			createarc(nfa, a->type, a->co, a->from, newState);
 		}
@@ -1107,14 +1091,11 @@ moveouts(struct nfa * nfa,

 /*
 * copyouts - copy out arcs of a state to another state
- *
- * Either all arcs, or only non-empty ones as determined by all value.
 */
 static void
 copyouts(struct nfa * nfa,
 		 struct state * oldState,
-		 struct state * newState,
-		 int all)
+		 struct state * newState)
 {
 	assert(oldState != newState);

@@ -1124,8 +1105,7 @@ copyouts(struct nfa * nfa,
 		struct arc *a;

 		for (a = oldState->outs; a != NULL; a = a->outchain)
-			if (all || a->type != EMPTY)
-				cparc(nfa, a, newState, a->to);
+			cparc(nfa, a, newState, a->to);
 	}
 	else
 	{
@@ -1157,12 +1137,6 @@ copyouts(struct nfa * nfa,
 		{
 			struct arc *a = oa;

-			if (!all && a->type == EMPTY)
-			{
-				oa = oa->outchain;
-				continue;
-			}
-
 			switch (sortouts_cmp(&oa, &na))
 			{
 				case -1:
@@ -1188,12 +1162,6 @@ copyouts(struct nfa * nfa,
 			/* newState does not have anything matching oa */
 			struct arc *a = oa;

-			if (!all && a->type == EMPTY)
-			{
-				oa = oa->outchain;
-				continue;
-			}
-
 			oa = oa->outchain;
 			createarc(nfa, a->type, a->co, newState, a->to);
 		}
@@ -1452,6 +1420,10 @@ optimize(struct nfa * nfa,
 		fprintf(f, "\nfinal cleanup:\n");
 #endif
 	cleanup(nfa);				/* final tidying */
+#ifdef REG_DEBUG
+	if (verbose)
+		dumpnfa(nfa, f);
+#endif
 	return analyze(nfa);		/* and analysis */
 }

@@ -1568,7 +1540,7 @@ pull(struct nfa * nfa,
 		s = newstate(nfa);
 		if (NISERR())
 			return 0;
-		copyins(nfa, from, s, 1);		/* duplicate inarcs */
+		copyins(nfa, from, s);	/* duplicate inarcs */
 		cparc(nfa, con, s, to); /* move constraint arc */
 		freearc(nfa, con);
 		if (NISERR())
@@ -1735,7 +1707,7 @@ push(struct nfa * nfa,
 		s = newstate(nfa);
 		if (NISERR())
 			return 0;
-		copyouts(nfa, to, s, 1);	/* duplicate outarcs */
+		copyouts(nfa, to, s);	/* duplicate outarcs */
 		cparc(nfa, con, from, s);		/* move constraint arc */
 		freearc(nfa, con);
 		if (NISERR())
@@ -2952,6 +2924,8 @@ dumpnfa(struct nfa * nfa,
 {
 #ifdef REG_DEBUG
 	struct state *s;
+	int			nstates = 0;
+	int			narcs = 0;

 	fprintf(f, "pre %d, post %d", nfa->pre->no, nfa->post->no);
 	if (nfa->bos[0] != COLORLESS)
@@ -2964,7 +2938,12 @@ dumpnfa(struct nfa * nfa,
 		fprintf(f, ", eol [%ld]", (long) nfa->eos[1]);
 	fprintf(f, "\n");
 	for (s = nfa->states; s != NULL; s = s->next)
+	{
 		dumpstate(s, f);
+		nstates++;
+		narcs += s->nouts;
+	}
+	fprintf(f, "total of %d states, %d arcs\n", nstates, narcs);
 	if (nfa->parent == NULL)
 		dumpcolors(nfa->cm, f);
 	fflush(f);
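
Reviewer note: the new totals in dumpnfa() rely on each arc sitting on exactly one state's out-chain, so summing the per-state out-arc counts counts every arc once. A minimal sketch of that tally follows, assuming a simplified state record; toy_state, toy_totals, and toy_count are hypothetical names that only mirror the next and nouts fields used in the hunk above.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct state: only the fields the tally needs. */
struct toy_state
{
    int nouts;                  /* number of arcs leaving this state */
    struct toy_state *next;     /* next state in the NFA's state list */
};

struct toy_totals
{
    int nstates;
    int narcs;
};

/* Walk the state list once, counting states and summing out-arc counts. */
static struct toy_totals toy_count(const struct toy_state *states)
{
    struct toy_totals t = {0, 0};

    for (const struct toy_state *s = states; s != NULL; s = s->next)
    {
        t.nstates++;
        t.narcs += s->nouts;    /* each arc belongs to exactly one out-chain */
    }
    return t;
}
```

Summing in-arc counts instead would give the same arc total, since every arc also has exactly one destination state.
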
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -136,10 +136,10 @@ static int	sortins_cmp(const void *, const void *);
 static void sortouts(struct nfa *, struct state *);
 static int	sortouts_cmp(const void *, const void *);
 static void moveins(struct nfa *, struct state *, struct state *);
-static void copyins(struct nfa *, struct state *, struct state *, int);
+static void copyins(struct nfa *, struct state *, struct state *);
 static void mergeins(struct nfa *, struct state *, struct arc **, int);
 static void moveouts(struct nfa *, struct state *, struct state *);
-static void copyouts(struct nfa *, struct state *, struct state *, int);
+static void copyouts(struct nfa *, struct state *, struct state *);
 static void cloneouts(struct nfa *, struct state *, struct state *, struct state *, int);
 static void delsub(struct nfa *, struct state *, struct state *);
 static void deltraverse(struct nfa *, struct state *, struct state *);
@@ -181,7 +181,6 @@ static void dumpnfa(struct nfa *, FILE *);
 #ifdef REG_DEBUG
 static void dumpstate(struct state *, FILE *);
 static void dumparcs(struct state *, FILE *);
-static int	dumprarcs(struct arc *, struct state *, FILE *, int);
 static void dumparc(struct arc *, struct state *, FILE *);
 static void dumpcnfa(struct cnfa *, FILE *);
 static void dumpcstate(int, struct cnfa *, FILE *);
@@ -614,7 +613,9 @@ makesearch(struct vars * v,
 	for (s = slist; s != NULL; s = s2)
 	{
 		s2 = newstate(nfa);
-		copyouts(nfa, s, s2, 1);
+		NOERR();
+		copyouts(nfa, s, s2);
+		NOERR();
 		for (a = s->ins; a != NULL; a = b)
 		{
 			b = a->inchain;
@@ -2014,7 +2015,7 @@ dump(regex_t *re,
 	dumpcolors(&g->cmap, f);
 	if (!NULLCNFA(g->search))
 	{
-		printf("\nsearch:\n");
+		fprintf(f, "\nsearch:\n");
 		dumpcnfa(&g->search, f);
 	}
 	for (i = 1; i < g->nlacons; i++)
--- a/src/test/regress/expected/regex.out
+++ b/src/test/regress/expected/regex.out
@@ -229,6 +229,41 @@ select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';
 t
 (1 row)

+-- These cases used to give too-many-states failures
+select 'x' ~ 'abcd(\m)+xyz';
+ ?column? 
+----------
+ f
+(1 row)
+
+select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
+ ?column? 
+----------
+ f
+(1 row)
+
+select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
+ ?column? 
+----------
+ f
+(1 row)
+
+select 'x' ~ 'xyz(\Y\Y)+';
+ ?column? 
+----------
+ f
+(1 row)
+
+select 'x' ~ 'x|(?:\M)+';
+ ?column? 
+----------
+ t
+(1 row)
+
+-- This generates O(N) states but O(N^2) arcs, so it causes problems
+-- if arc count is not constrained
+select 'x' ~ repeat('x*y*z*', 1000);
+ERROR:  invalid regular expression: regular expression is too complex
 -- Test backref in combination with non-greedy quantifier
 -- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
 select 'Programmer' ~ '(\w).*?\1' as t;
--- a/src/test/regress/sql/regex.sql
+++ b/src/test/regress/sql/regex.sql
@@ -55,6 +55,17 @@ select 'dd x' ~ '(^(?!aa)(?!bb)(?!cc))+';
 select 'a' ~ '((((((a)*)*)*)*)*)*';
 select 'a' ~ '((((((a+|)+|)+|)+|)+|)+|)';

+-- These cases used to give too-many-states failures
+select 'x' ~ 'abcd(\m)+xyz';
+select 'a' ~ '^abcd*(((((^(a c(e?d)a+|)+|)+|)+|)+|a)+|)';
+select 'x' ~ 'a^(^)bcd*xy(((((($a+|)+|)+|)+$|)+|)+|)^$';
+select 'x' ~ 'xyz(\Y\Y)+';
+select 'x' ~ 'x|(?:\M)+';
+
+-- This generates O(N) states but O(N^2) arcs, so it causes problems
+-- if arc count is not constrained
+select 'x' ~ repeat('x*y*z*', 1000);
+
 -- Test backref in combination with non-greedy quantifier
 -- https://core.tcl.tk/tcl/tktview/6585b21ca8fa6f3678d442b97241fdd43dba2ec0
 select 'Programmer' ~ '(\w).*?\1' as t;
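
Reviewer note: the "O(N) states but O(N^2) arcs" comment on the repeat('x*y*z*', 1000) test can be made concrete with a back-of-envelope model. This is an assumption, not a measurement of the real compiler: suppose that once EMPTY arcs are eliminated, each starred atom keeps one state, and that state gains one arc to itself and to every later atom (because all the atoms in between are optional). The function name toy_arc_estimate is hypothetical.

```c
#include <assert.h>

/*
 * Rough arc count for a chain of natoms optional, self-looping atoms:
 * atom i can be followed by any atom j >= i, giving
 * natoms + (natoms - 1) + ... + 1 arcs in total.
 */
static long long toy_arc_estimate(long long natoms)
{
    assert(natoms >= 0);
    return natoms * (natoms + 1) / 2;
}
```

Under this model the 3 * 1000 atoms in the test pattern need about 3000 states but toy_arc_estimate(3000) = 4501500 arcs, which suggests why a limit on states alone would not catch this case.
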