The BKI file's string quoting conventions were previously quite weird,
perhaps as a result of repurposing a function built to scan
single-quoted strings to scan double-quoted ones. Change to use the
same rules as we use in GUC files, allowing some simplifications in
genbki.pl and initdb.c.
While at it, completely remove the backend's scanstr() function, which
was essentially a duplicate of the string dequoting code in guc-file.l.
Instead export that one (under a less generic name than it had) and let
bootscanner.l use it. Now we can clarify that scansup.c exists only to
support the main lexer. We could alternatively have removed GUC_scanstr,
but this way seems better since the previous arrangement could mislead
a reader into thinking that scanstr() had something to do with the main
lexer's handling of string literals. Maybe it did once, but if so it
was a long time ago.
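For reference, the GUC-file convention boils down to a single-quoted string in which '' stands for a literal quote and a handful of backslash escapes are honored. Below is a minimal self-contained sketch of that dequoting, assuming well-formed input; it is not the exported backend function, and it omits octal escapes and error handling:

#include <stdlib.h>
#include <string.h>

/*
 * Sketch only: dequote a single-quoted string using GUC-file-style rules,
 * where '' stands for a literal quote and backslash escapes such as \n
 * and \t are honored.  The input is assumed to include the surrounding
 * quotes and to be well formed.
 */
static char *
dequote_guc_style(const char *s)
{
    size_t      len = strlen(s);
    char       *result = malloc(len + 1);   /* output can only shrink */
    char       *d = result;

    /* skip the opening quote; stop before the closing quote */
    for (size_t i = 1; i + 1 < len; i++)
    {
        char        c = s[i];

        if (c == '\\' && i + 2 < len)
        {
            switch (s[++i])
            {
                case 'n': c = '\n'; break;
                case 't': c = '\t'; break;
                case 'r': c = '\r'; break;
                case 'b': c = '\b'; break;
                case 'f': c = '\f'; break;
                default:  c = s[i]; break;  /* \', \\, and so on */
            }
        }
        else if (c == '\'' && s[i + 1] == '\'')
            i++;                /* '' collapses to a single quote */

        *d++ = c;
    }
    *d = '\0';
    return result;
}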
This patch does not bump catversion, since the initially-installed
catalog contents don't change. Note however that successful initdb
after applying this patch will require up-to-date postgres.bki as well
as postgres and initdb executables.
In passing, remove a bunch of very-long-obsolete #include's in
bootparse.y and bootscanner.l.
John Naylor
Discussion: https://postgr.es/m/CACPNZCtDpd18T0KATTmCggO2GdVC4ow86ypiq5ENff1VnauL8g@mail.gmail.com
Commit 54cd4f045 added some kluges to work around an old glibc bug,
namely that %.*s could misbehave if glibc thought any characters in
the supplied string were incorrectly encoded. Now that we use our
own snprintf.c implementation, we need not worry about that bug (even
if it still exists in the wild). Revert a couple of particularly
ugly hacks, and remove or improve assorted comments.
Note that there can still be encoding-related hazards here: blindly
clipping at a fixed length risks producing wrongly-encoded output
if the clip splits a multibyte character. However, code that's
doing correct multibyte-aware clipping doesn't really need a comment
about that, while code that isn't needs an explanation why not,
rather than a red-herring comment about an obsolete bug.
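To illustrate what "correct multibyte-aware clipping" means here, a UTF-8-only sketch follows; the backend uses its own encoding-aware routines, so this is an illustration rather than the patched code:

#include <string.h>

/*
 * Sketch only: clip a UTF-8 string to at most "limit" bytes without
 * splitting a multibyte character.  Assumes the input is valid UTF-8,
 * so every iteration starts on a lead byte.  Returns the number of
 * bytes to keep.
 */
static size_t
utf8_clip_len(const char *s, size_t limit)
{
    size_t      len = strlen(s);
    size_t      result = 0;

    if (len <= limit)
        return len;             /* nothing to clip */

    while (result < limit)
    {
        unsigned char c = (unsigned char) s[result];
        size_t      chlen;

        if (c < 0x80)
            chlen = 1;          /* ASCII */
        else if (c < 0xE0)
            chlen = 2;          /* 2-byte sequence */
        else if (c < 0xF0)
            chlen = 3;          /* 3-byte sequence */
        else
            chlen = 4;          /* 4-byte sequence */

        if (result + chlen > limit)
            break;              /* next character would not fit */
        result += chlen;
    }
    return result;
}

A caller can copy that many bytes (or pass the count to a plain-format print) and know the output is still validly encoded.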
Discussion: https://postgr.es/m/279428.1593373684@sss.pgh.pa.us
Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order
of header file inclusion consistent for backend modules.
In passing, remove a couple of duplicate inclusions.
Author: Vignesh C
Reviewed-by: Kuntal Ghosh and Amit Kapila
Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
The lower-case spellings are the C and C++ standard forms and are used in
most parts of the PostgreSQL sources. The upper-case spellings are only
used in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
Long-standing code has called tolower() on identifier character bytes
with the high bit set. This is clearly an error and produces junk output
when the encoding is multi-byte. This patch therefore restricts this
activity to cases where there is a character with the high bit set AND
the encoding is single-byte.
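In sketch form, the restricted rule amounts to something like the following; the encoding_is_single_byte flag stands in for the backend's own encoding test, and this is an illustration rather than the patch itself:

#include <ctype.h>

/*
 * Sketch only: downcase one identifier byte.  Plain ASCII letters are
 * always safe to fold; bytes with the high bit set are passed to
 * tolower() only when the database encoding is single-byte, since in a
 * multi-byte encoding they are fragments of a larger character.
 */
static char
downcase_identifier_char(unsigned char ch, int encoding_is_single_byte)
{
    if (ch >= 'A' && ch <= 'Z')
        return (char) (ch + 'a' - 'A');
    if (ch >= 0x80 && encoding_is_single_byte)
        return (char) tolower(ch);
    return (char) ch;           /* leave multibyte data untouched */
}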
There have been numerous gripes about this, most recently from Martin
Schäfer.
Backpatch to all live releases.
My initial impression that glibc was measuring the precision in characters
(which is what the Linux man page says it does) was incorrect. It does take
the precision to be in bytes, but it also tries to truncate the string at a
character boundary. The bottom line remains the same: it will mess up
if the string is not in the encoding it expects, so we need to avoid %.*s
anytime there's a significant risk of that. Previous code changes are still
good, but adjust the comments to reflect this knowledge. Per research by
Hernan Gonzalez.
Depending on which spec you read, field widths and precisions in %s may be
counted either in bytes or characters. Our code was assuming bytes, which
is wrong at least for glibc's implementation, and in any case libc might
have a different idea of the prevailing encoding than we do. Hence, for
portable results we must avoid using anything more complex than just "%s"
unless the string to be printed is known to be all-ASCII.
This patch fixes the cases I could find, including the psql formatting
failure reported by Hernan Gonzalez. In HEAD only, I also added comments
to some places where it appears safe to continue using "%.*s".
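Where a clipped print is still wanted, the portable pattern is to copy the desired bytes into a NUL-terminated buffer and hand that to plain "%s". A sketch, with print_prefix being a made-up helper rather than code from the patch:

#include <stdio.h>

/*
 * Sketch only: print at most nbytes bytes of str using plain "%s",
 * avoiding any reliance on how the platform interprets a precision
 * in "%.*s".  Stops early at a NUL or when the local buffer fills.
 */
static void
print_prefix(const char *str, size_t nbytes)
{
    char        buf[256];
    size_t      i;

    for (i = 0; i < nbytes && i < sizeof(buf) - 1 && str[i] != '\0'; i++)
        buf[i] = str[i];
    buf[i] = '\0';
    printf("%s", buf);
}

(Whether such byte-count clipping is encoding-safe at all is a separate question; see the multibyte caveats above.)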
directly. This was a lot of trouble, but should be worth it in terms of
not having to keep the plpgsql lexer in step with core anymore. In addition
the handling of keywords is significantly better-structured, allowing us to
de-reserve a number of words that plpgsql formerly treated as reserved.
return true for exactly the characters treated as whitespace by their flex
scanners. Per report from Victor Snezhko and subsequent investigation.
Also fix a passel of unsafe usages of <ctype.h> functions, that is, ye olde
char-vs-unsigned-char issue. I won't miss <ctype.h> when we are finally
able to stop using it.
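The char-vs-unsigned-char issue is that the <ctype.h> functions take an int that must be representable as an unsigned char (or be EOF), so passing a plain char with the high bit set is undefined behavior on platforms where char is signed. A small sketch of the safe cast, together with a locale-independent whitespace test of the kind a flex rule like [ \t\n\r\f] implies (neither function is the backend's actual code):

#include <ctype.h>
#include <stdbool.h>

/* The cast is what makes this safe for high-bit-set bytes. */
static bool
safe_isspace(char ch)
{
    return isspace((unsigned char) ch) != 0;
}

/* Whitespace exactly as a typical flex scanner sees it, locale-independent. */
static bool
scanner_style_isspace(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\r' || ch == '\f';
}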
#define HIGHBIT (0x80)
#define IS_HIGHBIT_SET(ch) ((unsigned char)(ch) & HIGHBIT)
and removed CSIGNBIT, mapping its uses to HIGHBIT. I have also added
uses of IS_HIGHBIT_SET where appropriate. This change is
purely for code clarity.
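A small usage example of the macros (count_highbit_bytes is a hypothetical helper, shown only to illustrate replacing open-coded tests such as (ch & 0x80)):

/* Macros as defined above. */
#define HIGHBIT             (0x80)
#define IS_HIGHBIT_SET(ch)  ((unsigned char) (ch) & HIGHBIT)

/* Count the bytes in a string that have the high bit set. */
static int
count_highbit_bytes(const char *s)
{
    int         n = 0;

    for (; *s; s++)
    {
        if (IS_HIGHBIT_SET(*s))
            n++;
    }
    return n;
}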
Also performed an initial pass at updating our Copyright dates to extend
to 2005. This first pass was very simple: change everything that greps
for both '1996-2004' and the word 'Copyright'. I scanned through the
generated list with 'less' before and after the change, to make sure
that I only picked up the right entries.
problem, per previous discussion. Make some additional changes to
centralize the knowledge of just how identifier downcasing is done,
in hopes of simplifying any future tweaking in this area.