Fix comments that claimed that mblen() only looks at first byte.

GB18030's mblen() function looks at the first and the second byte of the multibyte character to determine its length. copy.c had made the assumption that mblen() only looks at the first byte, but it turns out to work out fine, because of the way the GB18030 encoding works. COPY will see a 4-byte encoded character as two 2-byte encoded characters, which is enough for COPY's purposes. It cannot mix those up with delimiter or escaping characters, because only single-byte ASCII characters are supported as delimiters or escape characters.

Discussion: https://www.postgresql.org/message-id/7704d099-9643-2a55-fb0e-becd64400dcb%40iki.fi
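
The argument in the commit message can be checked with a small standalone sketch. This is not the PostgreSQL source: `gb18030_mblen_sketch`, `first_byte_only_mblen` and `count_units_first_byte_only` are hypothetical names, and the sketch assumes that the caller passes a two-byte buffer holding the first byte followed by a NUL, which is how the `mblen_str[0] = c` assignment in the diff appears to be used. It also assumes the usual GB18030 layout, in which the second byte of a 4-byte character is in the range 0x30-0x39.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a GB18030 length function that inspects the
 * first two bytes, as described in the commit message.
 */
static int
gb18030_mblen_sketch(const unsigned char *s)
{
    if ((s[0] & 0x80) == 0)
        return 1;               /* single-byte ASCII */
    if (s[1] >= 0x30 && s[1] <= 0x39)
        return 4;               /* second byte is a digit: 4-byte form */
    return 2;                   /* otherwise the 2-byte form */
}

/*
 * Length as a scanner sees it when it only supplies the first byte.
 * Assumption: the second slot holds NUL, so the 4-byte test never matches.
 */
static int
first_byte_only_mblen(unsigned char c)
{
    unsigned char tmp[2];

    tmp[0] = c;
    tmp[1] = '\0';
    return gb18030_mblen_sketch(tmp);
}

/*
 * Walk a buffer using first-byte-only lengths.  Returns the number of
 * units skipped over, or -1 if a step would run past the end.
 */
static int
count_units_first_byte_only(const unsigned char *buf, size_t len)
{
    size_t      pos = 0;
    int         units = 0;

    assert(buf != NULL);
    while (pos < len)
    {
        int         step = first_byte_only_mblen(buf[pos]);

        if (pos + (size_t) step > len)
            return -1;
        pos += (size_t) step;
        units++;
    }
    return units;
}
```

With the 4-byte sequence 0x81 0x30 0x81 0x30, the full function reports 4, while the first-byte-only walk takes two steps of 2 and ends on the same boundary. The ASCII digits in positions two and four are skipped as trailing bytes, so they are never compared against a delimiter or escape character.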
2025-12-21 05:21:08 +03:00 · 2019-01-25 14:54:38 +02:00
parent 7c079d7417
commit a5be6e9a1d
2 changed files with 31 additions and 8 deletions
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -4121,9 +4121,14 @@ not_end_of_copy:
 		{
 			int			mblen;

+			/*
+			 * It is enough to look at the first byte in all our encodings, to
+			 * get the length.  (GB18030 is a bit special, but still works for
+			 * our purposes; see comment in pg_gb18030_mblen())
+			 */
 			mblen_str[0] = c;
-			/* All our encodings only read the first byte to get the length */
 			mblen = pg_encoding_mblen(cstate->file_encoding, mblen_str);
+
 			IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(mblen - 1);
 			IF_NEED_REFILL_AND_EOF_BREAK(mblen - 1);
 			raw_buf_ptr += mblen - 1;