Make escaping functions retain trailing bytes of an invalid character.

Instead of dropping the trailing byte(s) of an invalid or incomplete multibyte character, replace only the first byte with a known-invalid sequence, and process the rest normally. This seems less likely to confuse incautious callers than the behavior adopted in 5dc1e42b4.

While we're at it, adjust PQescapeStringInternal to produce at most one bleat about invalid multibyte characters per string. This matches the behavior of PQescapeInternal, and avoids the risk of producing tons of repetitive junk if a long string is simply given in the wrong encoding.

This is a followup to the fixes for CVE-2025-1094, and should be included if cherry-picking those fixes.

Author: Andres Freund <andres@anarazel.de>
Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
Reported-by: Jeff Davis <pgsql@j-davis.com>
Discussion: https://postgr.es/m/20250215012712.45@rfd.leadboat.com
Backpatch-through: 13
2025-11-26 23:43:30 +03:00 · 2025-02-15 16:20:21 -05:00
parent 2a8a00674e
commit 9f45e6a91d
2 changed files with 65 additions and 95 deletions
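
The behavior described in the commit message can be illustrated with a small standalone sketch. This is not the libpq code: `toy_mblen`, `toy_valid`, and `toy_escape` are hypothetical names, the validity check is a deliberately simplified UTF-8 test, and the two marker bytes are chosen for this sketch only. It shows the two points the message makes: an invalid or incomplete character has only its first byte replaced while the following bytes are still escaped normally, and at most one complaint is counted per string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pg_encoding_mblen(): length implied by a UTF-8 lead byte. */
static int
toy_mblen(unsigned char b)
{
	if (b < 0x80)
		return 1;
	if ((b & 0xE0) == 0xC0)
		return 2;
	if ((b & 0xF0) == 0xE0)
		return 3;
	if ((b & 0xF8) == 0xF0)
		return 4;
	return 1;					/* stray continuation or impossible lead byte */
}

/* Hypothetical, simplified stand-in for pg_encoding_verifymbchar(). */
static bool
toy_valid(const unsigned char *s, int len)
{
	if (len == 1)
		return s[0] < 0x80;
	for (int i = 1; i < len; i++)
		if ((s[i] & 0xC0) != 0x80)
			return false;		/* trailing byte is not a continuation byte */
	return true;
}

/*
 * Escape "from" (len bytes) into "to", doubling single quotes.
 * "to" must hold at least 2 * len + 1 bytes.  Returns the output length
 * and stores the number of complaints (0 or 1) in *complaints.
 */
static size_t
toy_escape(char *to, const char *from, size_t len, int *complaints)
{
	const unsigned char *src = (const unsigned char *) from;
	char	   *dst = to;
	size_t		remaining = len;
	bool		already_complained = false;

	assert(to != NULL && from != NULL && complaints != NULL);
	*complaints = 0;

	while (remaining > 0)
	{
		int			charlen;

		if (*src < 0x80)
		{
			/* Fast path for ASCII: double quotes, copy everything else */
			if (*src == '\'')
				*dst++ = '\'';
			*dst++ = (char) *src++;
			remaining--;
			continue;
		}

		charlen = toy_mblen(*src);
		if (remaining < (size_t) charlen || !toy_valid(src, charlen))
		{
			/* Complain only once per string */
			if (!already_complained)
			{
				(*complaints)++;
				already_complained = true;
			}
			/* Marker bytes for this sketch: never valid UTF-8 together */
			*dst++ = (char) 0xC0;
			*dst++ = ' ';
			/* Skip only the first byte; the rest is processed normally */
			src++;
			remaining--;
		}
		else
		{
			memcpy(dst, src, (size_t) charlen);
			dst += charlen;
			src += charlen;
			remaining -= (size_t) charlen;
		}
	}
	*dst = '\0';
	return (size_t) (dst - to);
}
```

For the four input bytes `a`, `0xE4`, `'`, `b`, the sketch emits `a`, the two marker bytes, a doubled quote, and `b`. Skipping the full claimed character length instead would swallow the quote and the `b`, which is the behavior this commit moves away from.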
--- a/src/fe_utils/string_utils.c
+++ b/src/fe_utils/string_utils.c
@@ -180,40 +180,25 @@ fmtIdEnc(const char *rawid, int encoding)
 			/* Slow path for possible multibyte characters */
 			charlen = pg_encoding_mblen(encoding, cp);
-			if (remaining < charlen)
-			{
-				/*
-				 * If the character is longer than the available input,
-				 * replace the string with an invalid sequence. The invalid
-				 * sequence ensures that the escaped string will trigger an
-				 * error on the server-side, even if we can't directly report
-				 * an error here.
-				 */
-				enlargePQExpBuffer(id_return, 2);
-				pg_encoding_set_invalid(encoding,
-										id_return->data + id_return->len);
-				id_return->len += 2;
-				id_return->data[id_return->len] = '\0';
-				/* there's no more input data, so we can stop */
-				break;
-			}
-			else if (pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
+			if (remaining < charlen ||
+				pg_encoding_verifymbchar(encoding, cp, charlen) == -1)
 			{
 				/*
 				 * Multibyte character is invalid.  It's important to verify
-				 * that as invalid multi-byte characters could e.g. be used to
+				 * that as invalid multibyte characters could e.g. be used to
 				 * "skip" over quote characters, e.g. when parsing
 				 * character-by-character.
 				 *
-				 * Replace the bytes corresponding to the invalid character
+				 * Replace the character's first byte with an invalid
-				 * with an invalid sequence, for the same reason as above.
+				 * sequence. The invalid sequence ensures that the escaped
 				 * string will trigger an error on the server-side, even if we
 				 * can't directly report an error here.
 				 *
 				 * It would be a bit faster to verify the whole string the
 				 * first time we encounter a set highbit, but this way we can
-				 * replace just the invalid characters, which probably makes
+				 * replace just the invalid data, which probably makes it
-				 * it easier for users to find the invalidly encoded portion
+				 * easier for users to find the invalidly encoded portion of a
-				 * of a larger string.
+				 * larger string.
 				 */
 				enlargePQExpBuffer(id_return, 2);
 				pg_encoding_set_invalid(encoding,
@@ -222,11 +207,13 @@ fmtIdEnc(const char *rawid, int encoding)
 				id_return->data[id_return->len] = '\0';
 				/*
-				 * Copy the rest of the string after the invalid multi-byte
+				 * Handle the following bytes as if this byte didn't exist.
-				 * character.
+				 * That's safer in case the subsequent bytes contain
 				 * characters that are significant for the caller (e.g. '>' in
 				 * html).
 				 */
-				remaining -= charlen;
+				remaining--;
-				cp += charlen;
+				cp++;
 			}
 			else
 			{
@@ -395,49 +382,39 @@ appendStringLiteral(PQExpBuffer buf, const char *str,
 		/* Slow path for possible multibyte characters */
 		charlen = PQmblen(source, encoding);
-		if (remaining < charlen)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
-			 * If the character is longer than the available input, replace
+			 * Multibyte character is invalid.  It's important to verify that
-			 * the string with an invalid sequence. The invalid sequence
+			 * as invalid multibyte characters could e.g. be used to "skip"
-			 * ensures that the escaped string will trigger an error on the
+			 * over quote characters, e.g. when parsing
-			 * server-side, even if we can't directly report an error here.
+			 * character-by-character.
 			 *
 			 * Replace the character's first byte with an invalid sequence.
 			 * The invalid sequence ensures that the escaped string will
 			 * trigger an error on the server-side, even if we can't directly
 			 * report an error here.
 			 *
 			 * We know there's enough space for the invalid sequence because
 			 * the "target" buffer is 2 * length + 2 long, and at worst we're
 			 * replacing a single input byte with two invalid bytes.
 			 */
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-			/* there's no more valid input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
-		{
 			/*
 			 * Multibyte character is invalid.  It's important to verify that
 			 * as invalid multi-byte characters could e.g. be used to "skip"
 			 * over quote characters, e.g. when parsing
 			 * character-by-character.
 			 *
 			 * Replace the bytes corresponding to the invalid character with
 			 * an invalid sequence, for the same reason as above.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
+			 * just the invalid data, which probably makes it easier for users
-			 * users to find the invalidly encoded portion of a larger string.
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
+			 * Handle the following bytes as if this byte didn't exist. That's
-			 * character.
+			 * safer in case the subsequent bytes contain important characters
 			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -4076,6 +4076,7 @@ PQescapeStringInternal(PGconn *conn,
 	const char *source = from;
 	char	   *target = to;
 	size_t		remaining = strnlen(from, length);
+	bool		already_complained = false;
 	if (error)
 		*error = 0;
@@ -4102,65 +4103,57 @@ PQescapeStringInternal(PGconn *conn,
 		/* Slow path for possible multibyte characters */
 		charlen = pg_encoding_mblen(encoding, source);
-		if (remaining < charlen)
+		if (remaining < charlen ||
+			pg_encoding_verifymbchar(encoding, source, charlen) == -1)
 		{
 			/*
-			 * If the character is longer than the available input, report an
+			 * Multibyte character is invalid.  It's important to verify that
-			 * error if possible, and replace the string with an invalid
+			 * as invalid multibyte characters could e.g. be used to "skip"
-			 * sequence. The invalid sequence ensures that the escaped string
+			 * over quote characters, e.g. when parsing
-			 * will trigger an error on the server-side, even if we can't
+			 * character-by-character.
-			 * directly report an error here.
+			 *
 			 * Report an error if possible, and replace the character's first
 			 * byte with an invalid sequence. The invalid sequence ensures
 			 * that the escaped string will trigger an error on the
 			 * server-side, even if we can't directly report an error here.
 			 *
 			 * This isn't *that* crucial when we can report an error to the
-			 * caller, but if we can't, the caller will use this string
+			 * caller; but if we can't or the caller ignores it, the caller
-			 * unmodified and it needs to be safe for parsing.
+			 * will use this string unmodified and it needs to be safe for
 			 * parsing.
 			 *
 			 * We know there's enough space for the invalid sequence because
 			 * the "to" buffer needs to be at least 2 * length + 1 long, and
 			 * at worst we're replacing a single input byte with two invalid
 			 * bytes.
 			 */
-			if (error)
-				*error = 1;
-			if (conn)
-				libpq_append_conn_error(conn, "incomplete multibyte character");
-			pg_encoding_set_invalid(encoding, target);
-			target += 2;
-			/* there's no more input data, so we can stop */
-			break;
-		}
-		else if (pg_encoding_verifymbchar(encoding, source, charlen) == -1)
-		{
 			/*
 			 * Multibyte character is invalid.  It's important to verify that
 			 * as invalid multi-byte characters could e.g. be used to "skip"
 			 * over quote characters, e.g. when parsing
 			 * character-by-character.
 			 *
 			 * Replace the bytes corresponding to the invalid character with
 			 * an invalid sequence, for the same reason as above.
 			 *
 			 * It would be a bit faster to verify the whole string the first
 			 * time we encounter a set highbit, but this way we can replace
-			 * just the invalid characters, which probably makes it easier for
+			 * just the invalid data, which probably makes it easier for users
-			 * users to find the invalidly encoded portion of a larger string.
+			 * to find the invalidly encoded portion of a larger string.
 			 */
 			if (error)
 				*error = 1;
-			if (conn)
-				libpq_append_conn_error(conn, "invalid multibyte character");
+			if (conn && !already_complained)
+			{
+				if (remaining < charlen)
+					libpq_append_conn_error(conn, "incomplete multibyte character");
+				else
+					libpq_append_conn_error(conn, "invalid multibyte character");
+				/* Issue a complaint only once per string */
+				already_complained = true;
+			}
 			pg_encoding_set_invalid(encoding, target);
 			target += 2;
-			remaining -= charlen;
 			/*
-			 * Copy the rest of the string after the invalid multi-byte
+			 * Handle the following bytes as if this byte didn't exist. That's
-			 * character.
+			 * safer in case the subsequent bytes contain important characters
 			 * for the caller (e.g. '>' in html).
 			 */
-			source += charlen;
+			source++;
+			remaining--;
 		}
 		else
 		{