Since %c only passes a C "char" to printf, it's incapable of dealing
with multibyte characters. Passing just the first byte of such a
character leads to an output string that is visibly not correctly
encoded, resulting in undesirable behavior such as encoding conversion
failures while sending error messages to clients.
We've lived with this issue for a long time because it was inconvenient
to avoid in a portable fashion. However, now that we always use our own
snprintf code, it's reasonable to use the %.*s format to print just one
possibly-multibyte character in a string. (We previously avoided that
obvious-looking answer in order to work around glibc's bug #6530, cf
commits 54cd4f045 and ed437e2b2.)
Hence, run around and fix a bunch of places that used %c to report
a character found in a user-supplied string. For simplicity, I did
not touch places that were emitting non-user-facing debug messages,
or reporting catalog data that should always be ASCII. (It's also
unclear how useful this approach could be in frontend code, where
it's less certain that we know what encoding we're dealing with.)
In passing, improve a couple of poorly-written error messages in
pageinspect/heapfuncs.c.
This is a longstanding issue, but I'm hesitant to back-patch because
of the impact on translatable message strings. In any case this fix
would not work reliably before v12.
Tom Lane and Quan Zongliang
Discussion: https://postgr.es/m/a120087c-4c88-d9d4-1ec5-808d7a7f133d@gmail.com
Since the existing bit number argument can't exceed INT32_MAX, it's
not possible for these functions to manipulate bits beyond the first
256MB of a bytea value. Lift that restriction by redeclaring the
bit number arguments as int8 (which requires a catversion bump,
hence is not back-patchable).
The similarly-named functions for bit/varbit don't really have a
problem because we restrict those types to at most VARBITMAXLEN bits;
hence leave them alone.
While here, extend the encode/decode functions in utils/adt/encode.c
to allow dealing with values wider than 1GB. This is not a live bug
or restriction in current usage, because no input could be more than
1GB, and since none of the encoders can expand a string more than 4X,
the result size couldn't overflow uint32. But it might be desirable
to support more in future, so make the input length values size_t
and the potential-output-length values uint64.
Also add some test cases to improve the miserable code coverage
of these functions.
Movead Li, editorialized some by me; also reviewed by Ashutosh Bapat
Discussion: https://postgr.es/m/20200312115135445367128@highgo.ca
Solaris 11.4 has built-in functions named b64_encode and b64_decode.
Rename ours to something else to avoid the conflict (fortunately,
ours are static so the impact is limited).
One could wish for less duplication of code in this area, but that
would be a larger patch and not very suitable for back-patching.
Since this is a portability fix, we want to put it into all supported
branches.
Report and initial patch by Rainer Orth, reviewed and adjusted a bit
by Michael Paquier
Discussion: https://postgr.es/m/ydd372wk28h.fsf@CeBiTec.Uni-Bielefeld.DE
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
This makes almost all core code follow the policy introduced in the
previous commit. Specific decisions:
- Text search support functions with char* and length arguments, such as
prsstart and lexize, may receive unaligned strings. I doubt
maintainers of non-core text search code will notice.
- Use plain VARDATA() on values detoasted or synthesized earlier in the
same function. Use VARDATA_ANY() on varlenas sourced outside the
function, even if they happen to always have four-byte headers. As an
exception, retain the universal practice of using VARDATA() on return
values of SendFunctionCall().
- Retain PG_GETARG_BYTEA_P() in pageinspect. (Page images are too large
for a one-byte header, so this misses no optimization.) Sites that do
not call get_page_from_raw() typically need the four-byte alignment.
- For now, do not change btree_gist. Its use of four-byte headers in
memory is partly entangled with storage of 4-byte headers inside
GBT_VARKEY, on disk.
- For now, do not change gtrgm_consistent() or gtrgm_distance(). They
incorporate the varlena header into a cache, and there are multiple
credible implementation strategies to consider.
This avoids additional translatable strings for each distinct type, as
well as making our quoting style around type names more consistent
(namely, that we don't quote type names). This continues what started
as f402b9950120.
Discussion: https://postgr.es/m/20160401170642.GA57509@alvherre.pgsql
Both hex format and the traditional "escape" format are automatically
handled on input. The output format is selected by the new GUC variable
bytea_output.
As committed, bytea_output defaults to HEX, which is an *incompatible
change*. We will keep it this way for awhile for testing purposes, but
should consider whether to switch to the more backwards-compatible
default of ESCAPE before 8.5 is released.
Peter Eisentraut
strings. This patch introduces four support functions cstring_to_text,
cstring_to_text_with_len, text_to_cstring, and text_to_cstring_buffer, and
two macros CStringGetTextDatum and TextDatumGetCString. A number of
existing macros that provided variants on these themes were removed.
Most of the places that need to make such conversions now require just one
function or macro call, in place of the multiple notational layers that used
to be needed. There are no longer any direct calls of textout or textin,
and we got most of the places that were using handmade conversions via
memcpy (there may be a few still lurking, though).
This commit doesn't make any serious effort to eliminate transient memory
leaks caused by detoasting toasted text objects before they reach
text_to_cstring. We changed PG_GETARG_TEXT_P to PG_GETARG_TEXT_PP in a few
places where it was easy, but much more could be done.
Brendan Jurd and Tom Lane
values into \nnn octal escape sequences. When the database encoding is
multibyte this is *necessary* to avoid generating invalidly encoded text.
Even in a single-byte encoding, the old behavior seems very hazardous ---
consider for example what happens if the text is transferred to another
database with a different encoding. Decoding would then yield some other
bytea value than what was encoded, which is surely undesirable. Per gripe
from Hernan Gonzalez.
Backpatch to 8.3, but not further. This is a bit of a judgment call, but I
make it on these grounds: pre-8.3 we don't really have much encoding safety
anyway because of the convert() function family, and we would also have much
higher risk of breaking existing apps that may not be expecting this behavior.
8.3 is still new enough that we can probably get away with making this change
in the function's behavior.
Get rid of VARATT_SIZE and VARATT_DATA, which were simply redundant with
VARSIZE and VARDATA, and as a consequence almost no code was using the
longer names. Rename the length fields of struct varlena and various
derived structures to catch anyplace that was accessing them directly;
and clean up various places so caught. In itself this patch doesn't
change any behavior at all, but it is necessary infrastructure if we hope
to play any games with the representation of varlena headers.
Greg Stark and Tom Lane
conversion of basic ASCII letters. Remove all uses of strcasecmp and
strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp;
remove most but not all direct uses of toupper and tolower in favor of
pg_toupper and pg_tolower. These functions use the same notions of
case folding already developed for identifier case conversion. I left
the straight locale-based folding in place for situations where we are
just manipulating user data and not trying to match it to built-in
strings --- for example, the SQL upper() function is still locale
dependent. Perhaps this will prove not to be what's wanted, but at
the moment we can initdb and pass regression tests in Turkish locale.
>
> 1. Now outputs '\\' instead of '\134' when using encode(bytea, 'escape')
> Note that I ended up leaving \0 as \000 so that there are no ambiguities
> when decoding something like, for example, \0123.
>
> 2. Fixed bug in byteain which allowed input values which were not valid
> octals (e.g. \789), to be parsed as if they were octals.
>
> Joe
>
Here's rev 2 of the bytea string support patch. Changes:
1. Added missing declaration for MatchBytea function
2. Added PQescapeBytea to fe-exec.c
3. Applies cleanly on cvs tip from this afternoon
I'm hoping that someone can review/approve/apply this before beta starts, so
I guess I'd vote (not that it counts for much) to delay beta a few days :-)
Joe Conway