Fix two issues in TOAST decompression.

pglz_maximum_compressed_size() potentially underestimated the amount of compressed data required to produce N bytes of decompressed data; this is a fault in commit 11a078cf8. Separately from that, pglz_decompress() failed to protect itself against corrupt compressed data, particularly off == 0 in a match tag. Commit c60e520f6 turned such a situation into an infinite loop, where before it'd just have resulted in garbage output. The combination of these two bugs seems like it may explain bug #16694 from Tom Vijlbrief, though it's impossible to be quite sure without direct inspection of the failing session. (One needs to assume that the pglz_maximum_compressed_size() bug caused us to fail to fetch the second byte of a match tag, and what happened to be there instead was a zero. The reported infinite loop is hard to explain without off == 0, though.) Aside from fixing the bugs, rewrite associated comments for more clarity. Back-patch to v13 where both these commits landed. Discussion: https://postgr.es/m/16694-f107871e499ec114@postgresql.org
2025-05-08 07:21:33 +03:00 · 2020-11-01 18:38:42 -05:00 · 2020-11-01 18:38:42 -05:00 · dfc797730f
commit dfc797730f
parent 7f4235032f
1 changed files with 66 additions and 35 deletions
--- a/src/common/pg_lzcompress.c
+++ b/src/common/pg_lzcompress.c
@ -710,7 +710,6 @@ pglz_decompress(const char *source, int32 slen, char *dest,
 		for (ctrlc = 0; ctrlc < 8 && sp < srcend && dp < destend; ctrlc++)
 		{
 			if (ctrl & 1)
 			{
 				/*
@ -732,41 +731,62 @@ pglz_decompress(const char *source, int32 slen, char *dest,
 					len += *sp++;
 				/*
-				 * Now we copy the bytes specified by the tag from OUTPUT to
+				 * Check for corrupt data: if we fell off the end of the
-				 * OUTPUT (copy len bytes from dp - off to dp). The copied
+				 * source, or if we obtained off = 0, we have problems.  (We
-				 * areas could overlap, to preven possible uncertainty, we
+				 * must check this, else we risk an infinite loop below in the
-				 * copy only non-overlapping regions.
+				 * face of corrupt data.)
 				 */
 				if (sp > srcend || off == 0)
 					break;
 				/*
 				 * Don't emit more data than requested.
 				 */
 				len = Min(len, destend - dp);
 				/*
 				 * Now we copy the bytes specified by the tag from OUTPUT to
 				 * OUTPUT (copy len bytes from dp - off to dp).  The copied
 				 * areas could overlap, so to avoid undefined behavior in
 				 * memcpy(), be careful to copy only non-overlapping regions.
 				 *
 				 * Note that we cannot use memmove() instead, since while its
 				 * behavior is well-defined, it's also not what we want.
 				 */
 				while (off < len)
 				{
-					/*---------
+					/*
-					 * When offset is smaller than length - source and
+					 * We can safely copy "off" bytes since that clearly
-					 * destination regions overlap. memmove() is resolving
+					 * results in non-overlapping source and destination.
 					 * this overlap in an incompatible way with pglz. Thus we
 					 * resort to memcpy()-ing non-overlapping regions.
 					 *
 					 * Consider input: 112341234123412341234
 					 * At byte 5       here ^ we have match with length 16 and
 					 * offset 4.       11234M(len=16, off=4)
 					 * We are decoding first period of match and rewrite match
 					 *                 112341234M(len=12, off=8)
 					 *
 					 * The same match is now at position 9, it points to the
 					 * same start byte of output, but from another position:
 					 * the offset is doubled.
 					 *
 					 * We iterate through this offset growth until we can
 					 * proceed to usual memcpy(). If we would try to decode
 					 * the match at byte 5 (len=16, off=4) by memmove() we
 					 * would issue memmove(5, 1, 16) which would produce
 					 * 112341234XXXXXXXXXXXX, where series of X is 12
 					 * undefined bytes, that were at bytes [5:17].
 					 *---------
 					 */
 					memcpy(dp, dp - off, off);
 					len -= off;
 					dp += off;
 					/*----------
 					 * This bit is less obvious: we can double "off" after
 					 * each such step.  Consider this raw input:
 					 *		112341234123412341234
 					 * This will be encoded as 5 literal bytes "11234" and
 					 * then a match tag with length 16 and offset 4.  After
 					 * memcpy'ing the first 4 bytes, we will have emitted
 					 *		112341234
 					 * so we can double "off" to 8, then after the next step
 					 * we have emitted
 					 *		11234123412341234
 					 * Then we can double "off" again, after which it is more
 					 * than the remaining "len" so we fall out of this loop
 					 * and finish with a non-overlapping copy of the
 					 * remainder.  In general, a match tag with off < len
 					 * implies that the decoded data has a repeat length of
 					 * "off".  We can handle 1, 2, 4, etc repetitions of the
 					 * repeated string per memcpy until we get to a situation
 					 * where the final copy step is non-overlapping.
 					 *
 					 * (Another way to understand this is that we are keeping
 					 * the copy source point dp - off the same throughout.)
 					 *----------
 					 */
 					off += off;
 				}
 				memcpy(dp, dp - off, len);
@ -820,21 +840,32 @@ pglz_decompress(const char *source, int32 slen, char *dest,
 int32
 pglz_maximum_compressed_size(int32 rawsize, int32 total_compressed_size)
 {
-	int32		compressed_size;
+	int64		compressed_size;
 	/*
-	 * pglz uses one control bit per byte, so we need (rawsize * 9) bits. We
+	 * pglz uses one control bit per byte, so if the entire desired prefix is
-	 * care about bytes though, so we add 7 to make sure we include the last
+	 * represented as literal bytes, we'll need (rawsize * 9) bits.  We care
-	 * incomplete byte (integer division rounds down).
+	 * about bytes though, so be sure to round up not down.
 	 *
-	 * XXX Use int64 to prevent overflow during calculation.
+	 * Use int64 here to prevent overflow during calculation.
 	 */
-	compressed_size = (int32) ((int64) rawsize * 9 + 7) / 8;
+	compressed_size = ((int64) rawsize * 9 + 7) / 8;
 	/*
 	 * The above fails to account for a corner case: we could have compressed
 	 * data that starts with N-1 or N-2 literal bytes and then has a match tag
 	 * of 2 or 3 bytes.  It's therefore possible that we need to fetch 1 or 2
 	 * more bytes in order to have the whole match tag.  (Match tags earlier
 	 * in the compressed data don't cause a problem, since they should
 	 * represent more decompressed bytes than they occupy themselves.)
 	 */
 	compressed_size += 2;
 	/*
 	 * Maximum compressed size can't be larger than total compressed size.
 	 * (This also ensures that our result fits in int32.)
 	 */
 	compressed_size = Min(compressed_size, total_compressed_size);
-	return compressed_size;
+	return (int32) compressed_size;
 }