It's possible to construct regular expressions that contain loops of
constraint arcs (that is, ^ $ AHEAD BEHIND or LACON arcs). There's no use
in fully traversing such a loop at execution, since you'd just end up in
the same NFA state without having consumed any input. Worse, such a loop
leads to infinite looping in the pullback/pushfwd stage of compilation,
because we keep pushing or pulling the same constraints around the loop
in a vain attempt to move them to the pre or post state. Such looping was
previously recognized in CVE-2007-4772; but the fix only handled the case
of trivial single-state loops (that is, a constraint arc leading back to
its source state) ... and not only that, it was incorrect even for that
case, because it broke the admittedly-not-very-clearly-stated API contract
of the pull() and push() subroutines. The first two regression test cases
added by this commit exhibit patterns that result in assertion failures
because of that (though there seem to be no ill effects in non-assert
builds). The other new test cases exhibit multi-state constraint loops;
in an unpatched build they will run until the NFA state-count limit is
exceeded.
To fix, remove the code added for CVE-2007-4772, and instead create a
general-purpose constraint-loop-breaking phase of regex compilation that
executes before we do pullback/pushfwd. Since we never need to traverse
a constraint loop fully, we can just break the loop at any chosen spot,
if we add clone states that can replicate any sequence of arc transitions
that would've traversed just part of the loop.
Also add some commentary clarifying why we have to have all these
machinations in the first place.
This class of problems has been known for some time --- we had a report
from Marc Mamin about two years ago, for example, and there are related
complaints in the Tcl bug tracker. I had discussed a fix of this kind
off-list with Henry Spencer, but didn't get around to doing something
about it until the issue was rediscovered by Greg Stark recently.
Back-patch to all supported branches.
Some of the functions in regex compilation and execution recurse, and
therefore could in principle be driven to stack overflow. The Tcl crew
has seen this happen in practice in duptraverse(), though their fix was
to put in a hard-wired limit on the number of recursive levels, which is
not too appetizing --- fortunately, we have enough infrastructure to check
the actually available stack. Greg Stark has also seen it in other places
while fuzz testing on a machine with limited stack space. Let's put guards
in to prevent crashes in all these places.
Since the regex code would leak memory if we simply threw elog(ERROR),
we have to introduce an API that checks for stack depth without throwing
such an error. Fortunately that's not difficult.
Commit 9662143f0c added infrastructure to
allow regular-expression operations to be terminated early in the event
of SIGINT etc. However, fuzz testing by Greg Stark disclosed that there
are still cases where regex compilation could run for a long time without
noticing a cancel request. Specifically, the fixempties() phase never
adds new states, only new arcs, so it doesn't hit the cancel check I'd put
in newstate(). Add one to newarc() as well to cover that.
Some experimentation of my own found that regex execution could also run
for a long time despite a pending cancel. We'd put a high-level cancel
check into cdissect(), but there was none inside the core text-matching
routines longest() and shortest(). Ordinarily those inner loops are very
very fast ... but in the presence of lookahead constraints, not so much.
As a compromise, stick a cancel check into the stateset cache-miss
function, which is enough to guarantee a cancel check at least once per
lookahead constraint test.
Making this work required more attention to error handling throughout the
regex executor. Henry Spencer had apparently originally intended longest()
and shortest() to be incapable of incurring errors while running, so
neither they nor their subroutines had well-defined error reporting
behaviors. However, that was already broken by the lookahead constraint
feature, since lacon() can surely suffer an out-of-memory failure ---
which, in the code as it stood, might never be reported to the user at all,
but just silently be treated as a non-match of the lookahead constraint.
Normalize all that by inserting explicit error tests as needed. I took the
opportunity to add some more comments to the code, too.
Back-patch to all supported branches, like the previous patch.
newnfa() failed to set the regex error state when malloc() fails.
Several places in regcomp.c failed to check for an error after calling
subre(). Each of these mistakes could lead to null-pointer-dereference
crashes in memory-starved backends.
Report and patch by Andreas Seltenreich. Back-patch to all branches.
This was not changed in HEAD, but will be done later as part of a
pgindent run. Future pgindent runs will also do this.
Report by Tom Lane
Backpatch through all supported branches, but not HEAD
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time. Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.
If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library. But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.
A bigger issue is that there may be more places than these two where
we need to check for cancels. We can always add more checks later,
now that the infrastructure is in place.
Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
The previous coding of this function could get into situations where it
would never terminate, because successive passes would re-add EMPTY arcs
that had been removed by the previous pass. Rewrite the function
completely using a new algorithm that is guaranteed to terminate, and
also seems to be usually faster than the old one. Per Tcl bugs 3604074
and 3606683.
Tom Lane and Don Porter
was Tcl 8.4.8. The main changes are to remove the never-fully-implemented
code for multi-character collating elements, and to const-ify some stuff a
bit more fully. In combination with the recent security patch, this commit
brings us into line with Tcl 8.5.0.
Note that I didn't make any effort to duplicate a lot of cosmetic changes
that they made to bring their copy into line with their own style
guidelines, such as adding braces around single-line IF bodies. Most of
those we either had done already (such as ANSI-fication of function headers)
or there is no point because pgindent would undo the change anyway.
are shared with Tcl, since it's their code to begin with, and the patches
have been copied from Tcl 8.5.0. Problems:
CVE-2007-4769: Inadequate check on the range of backref numbers allows
crash due to out-of-bounds read.
CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'.
CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA
representation, as well as crash if we encounter an out-of-memory condition
during NFA construction.
Part of the response to CVE-2007-6067 is to put a limit on the number of
states in the NFA representation of a regex. This seems needed even though
the within-the-code problems have been corrected, since otherwise the code
could try to use very large amounts of memory for a suitably-crafted regex,
leading to potential DOS by driving the system into swap, activating a kernel
OOM killer, etc.
Although there are certainly plenty of ways to drive the system into effective
DOS with poorly-written SQL queries, these problems seem worth treating as
security issues because many applications might accept regex search patterns
from untrustworthy sources.
Thanks to Will Drewry of Google for reporting these problems. Patches by Will
Drewry and Tom Lane.
Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
(extracted from Tcl 8.4.1 release, as Henry still hasn't got round to
making it a separate library). This solves a performance problem for
multibyte, as well as upgrading our regexp support to match recent Tcl
and nearly match recent Perl.