Since %c only passes a C "char" to printf, it's incapable of dealing
with multibyte characters. Passing just the first byte of such a
character leads to an output string that is visibly not correctly
encoded, resulting in undesirable behavior such as encoding conversion
failures while sending error messages to clients.
We've lived with this issue for a long time because it was inconvenient
to avoid in a portable fashion. However, now that we always use our own
snprintf code, it's reasonable to use the %.*s format to print just one
possibly-multibyte character in a string. (We previously avoided that
obvious-looking answer in order to work around glibc's bug #6530, cf
commits 54cd4f045 and ed437e2b2.)
Hence, run around and fix a bunch of places that used %c to report
a character found in a user-supplied string. For simplicity, I did
not touch places that were emitting non-user-facing debug messages,
or reporting catalog data that should always be ASCII. (It's also
unclear how useful this approach could be in frontend code, where
it's less certain that we know what encoding we're dealing with.)
In passing, improve a couple of poorly-written error messages in
pageinspect/heapfuncs.c.
This is a longstanding issue, but I'm hesitant to back-patch because
of the impact on translatable message strings. In any case this fix
would not work reliably before v12.
Tom Lane and Quan Zongliang
Discussion: https://postgr.es/m/a120087c-4c88-d9d4-1ec5-808d7a7f133d@gmail.com
Our usual practice for "poor man's enum" catalog columns is to define
macros for the possible values and use those, not literal constants,
in C code. But for some reason lost in the mists of time, this was
never done for typalign/attalign or typstorage/attstorage. It's never
too late to make it better though, so let's do that.
The reason I got interested in this right now is the need to duplicate
some uses of the TYPSTORAGE constants in an upcoming ALTER TYPE patch.
But in general, this sort of change aids greppability and readability,
so it's a good idea even without any specific motivation.
I may have missed a few places that could be converted, and it's even
more likely that pending patches will re-introduce some hard-coded
references. But that's not fatal --- there's no expectation that
we'd actually change any of these values. We can clean up stragglers
over time.
Discussion: https://postgr.es/m/16457.1583189537@sss.pgh.pa.us
As a result of some long-ago quick hacks, the SIMILAR TO operator
and the corresponding flavor of substring() interpreted "ESCAPE NULL"
as selecting the default escape character '\'. This is both
surprising and not per spec: the standard is clear that these
functions should return NULL for NULL input.
Additionally, because of inconsistency of the strictness markings
of 3-argument substring() and similar_escape(), the planner could not
inline the SQL definition of substring(), resulting in a substantial
performance penalty compared to the underlying POSIX substring()
function.
The simplest fix for this would be to change the strictness marking
of similar_escape(), but if we do that we risk breaking existing views
that depend on that function. Hence, leave similar_escape() as-is
as a compatibility function, and instead invent a new function
similar_to_escape() that comes in two strict variants.
There are a couple of other behaviors in this area that are also
not per spec, but they are documented and seem generally at least
as sane as the spec's definition, so leave them alone. But improve
the documentation to describe them fully.
Patch by me; thanks to Álvaro Herrera and Andrew Gierth for review
and discussion.
Discussion: https://postgr.es/m/14047.1557708214@sss.pgh.pa.us
This is still using the 2.0 version of pg_bsd_indent.
I thought it would be good to commit this separately,
so as to document the differences between 2.0 and 2.1 behavior.
Discussion: https://postgr.es/m/16296.1558103386@sss.pgh.pa.us
SQL's regular-expression substring() function is defined to have a
pattern argument that's separated into three subpatterns by escape-
double-quote markers; the function result is the part of the input
matching the second subpattern. The standard makes it clear that
if there is ambiguity about how to match the input to the subpatterns,
the first and third subpatterns should be taken to match the smallest
possible amount of text (i.e., they're "non greedy", in the terms of
our regex code). We were not doing it that way: the first subpattern
would eat the largest possible amount of text, causing the function
result to be shorter than what the spec requires.
Fix that by attaching explicit greediness quantifiers to the
subpatterns. (This depends on the regex fix in commit 8a29ed053;
before that, this didn't reliably change the regex engine's behavior.)
Also, by adding parentheses around each subpattern, we ensure that
"|" (OR) in the subpatterns behave sanely. Previously, "|" in the
first or third subpatterns didn't work.
This patch also makes the function throw error if you write more than
two escape-double-quote markers, and do something sane if you write
just one, and document that behavior. Previously, an odd number of
markers led to a confusing complaint about unbalanced parentheses,
while extra pairs of markers were just ignored. (Note that the spec
requires exactly two markers, but we've historically allowed there
to be none, and this patch preserves the old behavior for that case.)
In passing, adjust some substring() test cases that didn't really
prove what they said they were testing for: they used patterns
that didn't match the data string, so that the output would be
NULL whether or not the function was really strict.
Although this is certainly a bug fix, changing the behavior in back
branches seems undesirable: applications could perhaps be depending on
the old behavior, since it's not obviously wrong unless you read the
spec very closely. Hence, no back-patch.
Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net
SQL 2016 standards among other things contains set of SQL/JSON features for
JSON processing inside of relational database. The core of SQL/JSON is JSON
path language, allowing access parts of JSON documents and make computations
over them. This commit implements partial support JSON path language as
separate datatype called "jsonpath". The implementation is partial because
it's lacking datetime support and suppression of numeric errors. Missing
features will be added later by separate commits.
Support of SQL/JSON features requires implementation of separate nodes, and it
will be considered in subsequent patches. This commit includes following
set of plain functions, allowing to execute jsonpath over jsonb values:
* jsonb_path_exists(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_match(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_query(jsonb, jsonpath[, jsonb, bool]),
* jsonb_path_query_array(jsonb, jsonpath[, jsonb, bool]).
* jsonb_path_query_first(jsonb, jsonpath[, jsonb, bool]).
This commit also implements "jsonb @? jsonpath" and "jsonb @@ jsonpath", which
are wrappers over jsonpath_exists(jsonb, jsonpath) and jsonpath_predicate(jsonb,
jsonpath) correspondingly. These operators will have an index support
(implemented in subsequent patches).
Catversion bumped, to add new functions and operators.
Code was written by Nikita Glukhov and Teodor Sigaev, revised by me.
Documentation was written by Oleg Bartunov and Liudmila Mantrova. The work
was inspired by Oleg Bartunov.
Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com
Author: Nikita Glukhov, Teodor Sigaev, Alexander Korotkov, Oleg Bartunov, Liudmila Mantrova
Reviewed-by: Tomas Vondra, Andrew Dunstan, Pavel Stehule, Alexander Korotkov
Commit c8ea87e4b introduced a temporary conversion buffer for
substrings extracted during regexp splits. Unfortunately the code that
sized it was failing to ignore the effects of ignored degenerate
regexp matches, so for regexp_split_* calls it could under-size the
buffer in such cases.
Fix, and add some regression test cases (though those will only catch
the bug if run in a multibyte encoding).
Backpatch to 9.3 as the faulty code was.
Thanks to the PostGIS project, Regina Obe and Paul Ramsey for the
report (via IRC) and assistance in analysis. Patch by me.
regexp_matches, regexp_split_to_table and regexp_split_to_array all
work by compiling a list of match positions as character offsets (NOT
byte positions) in the source string.
Formerly, they then used text_substr to extract the matched text; but
in a multi-byte encoding, that counts the characters in the string,
and the characters needed to reach the starting byte position, on
every call. Accordingly, the performance degraded as the product of
the input string length and the number of match positions, such that
splitting a string of a few hundred kbytes could take many minutes.
Repair by keeping the wide-character copy of the input string
available (only in the case where encoding_max_length is not 1) after
performing the match operation, and extracting substrings from that
instead. This reduces the complexity to being linear in the number of
result bytes, discounting the actual regexp match itself (which is not
affected by this patch).
In passing, remove cleanup using retail pfree() which was obsoleted by
commit ff428cded (Feb 2008) which made cleanup of SRF multi-call
contexts automatic. Also increase (to ~134 million) the maximum number
of matches and provide an error message when it is reached.
Backpatch all the way because this has been wrong forever.
Analysis and patch by me; review by Kaiting Chen.
Discussion: https://postgr.es/m/87pnyn55qh.fsf@news-spur.riddles.org.uk
see also https://postgr.es/m/87lg996g4r.fsf@news-spur.riddles.org.uk
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
regexp_match() is like regexp_matches(), but it disallows the 'g' flag
and in consequence does not need to return a set. Instead, it returns
a simple text array value, or NULL if there's no match. Previously people
usually got that behavior with a sub-select, but this way is considerably
more efficient.
Documentation adjusted so that regexp_match() is presented first and then
regexp_matches() is introduced as a more complicated version. This is
a bit historically revisionist but seems pedagogically better.
Still TODO: extend contrib/citext to support this function.
Emre Hasegeli, reviewed by David Johnston
Discussion: <CAE2gYzy42sna2ME_e3y1KLQ-4UBrB-eVF0SWn8QG39sQSeVhEw@mail.gmail.com>
Previously, only a single-byte character was allowed as an
escape. This patch allows it to be a multi-byte character, though it
still must be a single character.
Reviewed by Heikki Linnakangas and Tom Lane.
The regex code didn't have any provision for query cancel; which is
unsurprising given its non-Postgres origin, but still problematic since
some operations can take a long time. Introduce a callback function to
check for a pending query cancel or session termination request, and
call it in a couple of strategic spots where we can make the regex code
exit with an error indicator.
If we ever actually split out the regex code as a standalone library,
some additional work will be needed to let the cancel callback function
be specified externally to the library. But that's straightforward
(certainly so by comparison to putting the locale-dependent character
classification logic on a similar arms-length basis), and there seems
no need to do it right now.
A bigger issue is that there may be more places than these two where
we need to check for cancels. We can always add more checks later,
now that the infrastructure is in place.
Since there are known examples of not-terribly-long regexes that can
lock up a backend for a long time, back-patch to all supported branches.
I have hopes of fixing the known performance problems later, but adding
query cancel ability seems like a good idea even if they were all fixed.
We'd find the same match twice if it was of zero length and not immediately
adjacent to the previous match. replace_text_regexp() got similar cases
right, so adjust this search logic to match that. Note that even though
the regexp_split_to_xxx() functions share this code, they did not display
equivalent misbehavior, because the second match would be considered
degenerate and ignored.
Jeevan Chalke, with some cosmetic changes by me.
rfree() failed to cope with the case that pg_regcomp() had initialized the
regex_t struct but then failed to allocate any memory for re->re_guts (ie,
the first malloc call in pg_regcomp() failed). It would try to touch the
guts struct anyway, and thus dump core. This is a sufficiently narrow
corner case that it's not surprising it's never been seen in the field;
but still a bug is a bug, so patch all active branches.
Noted while investigating whether we need to call pg_regfree after a
failure return from pg_regcomp. Other than this bug, it turns out we
don't, so adjust comments appropriately.
To generate btree-indexable conditions from regex WHERE conditions (such as
WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed
prefix that a regex might have; that is, find any string that must be a
prefix of all strings satisfying the regex. We used to do that with
entirely ad-hoc code that looked at the source text of the regex. It
didn't know very much about regex syntax, which mostly meant that it would
fail to identify some optimizable cases; but Viktor Rosenfeld reported that
it would produce actively wrong answers for quantified parenthesized
subexpressions, such as '^(foo)?bar'. Rather than trying to extend the
ad-hoc code to cover this, let's get rid of it altogether in favor of
identifying prefixes by examining the compiled form of a regex.
To do this, I've added a new entry point "pg_regprefix" to the regex library;
hopefully it is defined in a sufficiently general fashion that it can remain
in the library when/if that code gets split out as a standalone project.
Since this bug has been there for a very long time, this fix needs to get
back-patched. However it depends on some other recent commits (particularly
the addition of wchar-to-database-encoding conversion), so I'll commit this
separately and then go to work on back-porting the necessary fixes.
walsender.h should depend on xlog.h, not vice versa. (Actually, the
inclusion was circular until a couple hours ago, which was even sillier;
but Bruce broke it in the expedient rather than logically correct
direction.) Because of that poor decision, plus blind application of
pgrminclude, we had a situation where half the system was depending on
xlog.h to include such unrelated stuff as array.h and guc.h. Clean up
the header inclusion, and manually revert a lot of what pgrminclude had
done so things build again.
This episode reinforces my feeling that pgrminclude should not be run
without adult supervision. Inclusion changes in header files in particular
need to be reviewed with great care. More generally, it'd be good if we
had a clearer notion of module layering to dictate which headers can sanely
include which others ... but that's a big task for another day.
This involves getting the character classification and case-folding
functions in the regex library to use the collations infrastructure.
Most of this work had been done already in connection with the upper/lower
and LIKE logic, so it was a simple matter of transposition.
While at it, split out these functions into a separate source file
regc_pg_locale.c, so that they can be correctly labeled with the Postgres
project's license rather than the Scriptics license. These functions are
100% Postgres-written code whereas what remains in regc_locale.c is still
mostly not ours, so lumping them both under the same copyright notice was
getting more and more misleading.
This is needed to avoid unwanted interference with SUBSTRING behavior,
as per bug #5257 from Roman Kononov. Also, add some basic intelligence
about character classes (bracket expressions) since we now have several
behaviors that aren't appropriate inside a character class.
As with the previous patch in this area, I'm reluctant to back-patch
since it might affect applications that are relying on the prior
behavior.
style by default. Per discussion, there seems to be hardly anything that
really relies on being able to change the regex flavor, so the ability to
select it via embedded options ought to be enough for any stragglers.
Also, if we didn't remove the GUC, we'd really be morally obligated to
mark the regex functions non-immutable, which'd possibly create performance
issues.
* Stop escaping ? and {. As of SQL:2008, SIMILAR TO is defined to have
POSIX-compatible interpretation of ? as well as {m,n} and related constructs,
so we should allow these things through to our regex engine.
* Escape ^ and $. It appears that our regex engine will treat ^^ at the
beginning of the string the same as ^, and similarly for $$ at the end of
the string, which meant that SIMILAR TO was effectively ignoring ^ at the
start of the pattern and $ at the end. Since these are not supposed to be
metacharacters, this is a bug.
The second part of this is arguably a back-patchable bug fix, but I'm
hesitant to do that because it might break applications that are expecting
something like "col SIMILAR TO '^foo$'" to work like a POSIX pattern.
Seems safer to only change it at a major version boundary.
Per discussion of an example from Doug Gorley.
case where there is a match to the pattern overall but the user has specified
a parenthesized subexpression and that subexpression hasn't got a match.
An example is substring('foo' from 'foo(bar)?'). This should return NULL,
since (bar) isn't matched, but it was mistakenly returning the whole-pattern
match instead (ie, 'foo'). Per bug #4044 from Rui Martins.
This has been broken since the beginning; patch in all supported versions.
The old behavior was sufficiently inconsistent that it's impossible to believe
anyone is depending on it.
to not cause needless copying of text datums that have 1-byte headers.
Greg Stark, in response to performance gripe from Guillaume Smet and
ITAGAKI Takahiro.
regexp_split_to_table() within a single query. This is only a partial
solution, as it turns out that with enough matches per string these
functions can also tickle a repalloc() misbehavior. But fixing that
is a topic for a separate patch.