mirror of
				https://github.com/facebook/zstd.git
				synced 2025-11-03 20:33:11 +03:00 
			
		
		
		
	updated format spec
This commit is contained in:
		@@ -76,9 +76,9 @@ allowing streaming operations.
 | 
				
			|||||||
General Structure of Zstandard Frame format
 | 
					General Structure of Zstandard Frame format
 | 
				
			||||||
-------------------------------------------
 | 
					-------------------------------------------
 | 
				
			||||||
 | 
					
 | 
				
			||||||
| MagicNb |  F. Header | Block | (More blocks) | EndMark |
 | 
					| MagicNb |  Frame Header | Block | (More blocks) | EndMark |
 | 
				
			||||||
|:-------:|:----------:| ----- | ------------- | ------- |
 | 
					|:-------:|:-------------:| ----- | ------------- | ------- |
 | 
				
			||||||
| 4 bytes | 2-14 bytes |       |               | 3 bytes |
 | 
					| 4 bytes |  2-14 bytes   |       |               | 3 bytes |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Magic Number__
 | 
					__Magic Number__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -87,11 +87,11 @@ Value : 0xFD2FB527
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
__Frame Header__
 | 
					__Frame Header__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
2 to 14 Bytes, to be detailed in the next part.
 | 
					2 to 14 Bytes, detailed in [next part](#frame-header).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Data Blocks__
 | 
					__Data Blocks__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
To be detailed later on.
 | 
					Detailed in [next chapter](#data-blocks).
 | 
				
			||||||
That’s where compressed data is stored.
 | 
					That’s where compressed data is stored.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__EndMark__
 | 
					__EndMark__
 | 
				
			||||||
@@ -101,12 +101,12 @@ This last block header may optionally host a __Content Checksum__ .
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
##### __Content Checksum__
 | 
					##### __Content Checksum__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Content Checksum verify that frame content has been regenrated correctly.
 | 
					Content Checksum verify that frame content has been regenerated correctly.
 | 
				
			||||||
The content checksum is the result
 | 
					The content checksum is the result
 | 
				
			||||||
of [xxh64() hash function](https://www.xxHash.com)
 | 
					of [xxh64() hash function](https://www.xxHash.com)
 | 
				
			||||||
digesting the original (decoded) data as input, and a seed of zero.
 | 
					digesting the original (decoded) data as input, and a seed of zero.
 | 
				
			||||||
Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
 | 
					Bits from 11 to 32 (included) are extracted to form a 22 bits checksum
 | 
				
			||||||
stored into the last block header.
 | 
					stored into the endmark body.
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
mask22bits = (1<<22)-1;
 | 
					mask22bits = (1<<22)-1;
 | 
				
			||||||
contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
 | 
					contentChecksum = (XXH64(content, size, 0) >> 11) & mask22bits;
 | 
				
			||||||
@@ -224,6 +224,10 @@ Provides guarantees on maximum back-reference distance
 | 
				
			|||||||
that will be present within compressed data.
 | 
					that will be present within compressed data.
 | 
				
			||||||
This information is useful for decoders to allocate enough memory.
 | 
					This information is useful for decoders to allocate enough memory.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					`WD` byte is optional. It's not present in `single segment` mode.
 | 
				
			||||||
 | 
					In which case, the maximum back-reference distance is the content size itself,
 | 
				
			||||||
 | 
					which can be any value from 1 to 2^64-1 bytes (16 EB).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
|   BitNb   |    7-3   |    0-2   |
 | 
					|   BitNb   |    7-3   |    0-2   |
 | 
				
			||||||
| --------- | -------- | -------- |
 | 
					| --------- | -------- | -------- |
 | 
				
			||||||
| FieldName | Exponent | Mantissa |
 | 
					| FieldName | Exponent | Mantissa |
 | 
				
			||||||
@@ -236,20 +240,16 @@ windowAdd = (windowBase / 8) * Mantissa;
 | 
				
			|||||||
windowSize = windowBase + windowAdd;
 | 
					windowSize = windowBase + windowAdd;
 | 
				
			||||||
```
 | 
					```
 | 
				
			||||||
The minimum window size is 1 KB.
 | 
					The minimum window size is 1 KB.
 | 
				
			||||||
The maximum size is (15*(2^38))-1 bytes, which is almost 1.875 TB.
 | 
					The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
To properly decode compressed data,
 | 
					To properly decode compressed data,
 | 
				
			||||||
a decoder will need to allocate a buffer of at least `windowSize` bytes.
 | 
					a decoder will need to allocate a buffer of at least `windowSize` bytes.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Note that `WD` byte is optional. It's not present in `single segment` mode.
 | 
					 | 
				
			||||||
In which case, the maximum back-reference distance is the content size itself,
 | 
					 | 
				
			||||||
which can be any value from 1 to 2^64-1 bytes (16 EB).
 | 
					 | 
				
			||||||
 | 
					 | 
				
			||||||
In order to preserve decoder from unreasonable memory requirements,
 | 
					In order to preserve decoder from unreasonable memory requirements,
 | 
				
			||||||
a decoder can refuse a compressed frame
 | 
					a decoder can refuse a compressed frame
 | 
				
			||||||
which requests a memory size beyond decoder's authorized range.
 | 
					which requests a memory size beyond decoder's authorized range.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
For better interoperability,
 | 
					For improved interoperability,
 | 
				
			||||||
decoders are recommended to be compatible with window sizes of 8 MB.
 | 
					decoders are recommended to be compatible with window sizes of 8 MB.
 | 
				
			||||||
Encoders are recommended to not request more than 8 MB.
 | 
					Encoders are recommended to not request more than 8 MB.
 | 
				
			||||||
It's merely a recommendation though,
 | 
					It's merely a recommendation though,
 | 
				
			||||||
@@ -266,10 +266,10 @@ it's up to the caller to make sure it uses the correct dictionary.
 | 
				
			|||||||
Field size depends on __Dictionary ID flag__.
 | 
					Field size depends on __Dictionary ID flag__.
 | 
				
			||||||
1 byte can represent an ID 0-255.
 | 
					1 byte can represent an ID 0-255.
 | 
				
			||||||
2 bytes can represent an ID 0-65535.
 | 
					2 bytes can represent an ID 0-65535.
 | 
				
			||||||
4 bytes can represent an ID 0-(2^32-1).
 | 
					4 bytes can represent an ID 0-4294967295.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
It's allowed to represent a small ID (for example `13`)
 | 
					It's allowed to represent a small ID (for example `13`)
 | 
				
			||||||
with a large 4-bytes dictionary ID, losing some efficiency in the process.
 | 
					with a large 4-bytes dictionary ID, losing some compacity in the process.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Frame Content Size__
 | 
					__Frame Content Size__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -331,7 +331,7 @@ There are 4 block types :
 | 
				
			|||||||
  while the "compressed" block is just 1 byte (the byte to repeat).
 | 
					  while the "compressed" block is just 1 byte (the byte to repeat).
 | 
				
			||||||
- EndMark : this is not a block. Signal the end of the frame.
 | 
					- EndMark : this is not a block. Signal the end of the frame.
 | 
				
			||||||
  The rest of the field may be optionally filled by a checksum
 | 
					  The rest of the field may be optionally filled by a checksum
 | 
				
			||||||
  (see [__Content Checksum__]).
 | 
					  (see [Content Checksum](#content-checksum)).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Block sizes must respect a few rules :
 | 
					Block sizes must respect a few rules :
 | 
				
			||||||
- In compressed mode, compressed size if always strictly `< decompressed size`.
 | 
					- In compressed mode, compressed size if always strictly `< decompressed size`.
 | 
				
			||||||
@@ -345,9 +345,9 @@ Where the actual data to decode stands.
 | 
				
			|||||||
It might be compressed or not, depending on previous field indications.
 | 
					It might be compressed or not, depending on previous field indications.
 | 
				
			||||||
A data block is not necessarily "full" :
 | 
					A data block is not necessarily "full" :
 | 
				
			||||||
since an arbitrary “flush” may happen anytime,
 | 
					since an arbitrary “flush” may happen anytime,
 | 
				
			||||||
block decompressed content can be any size, up to Block Maximum Size.
 | 
					block decompressed content can be any size,
 | 
				
			||||||
Block Maximum Decompressed Size is the smallest of :
 | 
					up to Block Maximum Decompressed Size, which is the smallest of :
 | 
				
			||||||
- Max back-reference distance
 | 
					- Maximum back-reference distance
 | 
				
			||||||
- 128 KB
 | 
					- 128 KB
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -364,7 +364,9 @@ Its design is pretty straightforward,
 | 
				
			|||||||
with the sole objective to allow the decoder to quickly skip
 | 
					with the sole objective to allow the decoder to quickly skip
 | 
				
			||||||
over user-defined data and continue decoding.
 | 
					over user-defined data and continue decoding.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Skippable frames defined in this specification are compatible with LZ4 ones.
 | 
					Skippable frames defined in this specification are compatible with [LZ4] ones.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					[LZ4]:http://www.lz4.org
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Magic Number__ :
 | 
					__Magic Number__ :
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -393,8 +395,8 @@ in order to properly allocate destination buffer.
 | 
				
			|||||||
See [Data Blocks](#data-blocks) for more details.
 | 
					See [Data Blocks](#data-blocks) for more details.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
A compressed block consists of 2 sections :
 | 
					A compressed block consists of 2 sections :
 | 
				
			||||||
- Literals section
 | 
					- [Literals section](#literals-section)
 | 
				
			||||||
- Sequences section
 | 
					- [Sequences section](#sequences-section)
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Prerequisites
 | 
					### Prerequisites
 | 
				
			||||||
To decode a compressed block, the following elements are necessary :
 | 
					To decode a compressed block, the following elements are necessary :
 | 
				
			||||||
@@ -439,12 +441,11 @@ This is a 2-bits field, describing 4 different block types :
 | 
				
			|||||||
| Block Type | Compressed | Repeat | Raw |   RLE   |
 | 
					| Block Type | Compressed | Repeat | Raw |   RLE   |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
- Compressed : This is a standard huffman-compressed block,
 | 
					- Compressed : This is a standard huffman-compressed block,
 | 
				
			||||||
               starting with a huffman tree description.
 | 
					        starting with a huffman tree description.
 | 
				
			||||||
               See details below.
 | 
					        See details below.
 | 
				
			||||||
- Repeat Stats : This is a huffman-compressed block,
 | 
					- Repeat Stats : This is a huffman-compressed block,
 | 
				
			||||||
               using huffman tree from previous huffman-compressed block.
 | 
					        using huffman tree _from previous huffman-compressed literals block_.
 | 
				
			||||||
               Huffman tree description will be skipped.
 | 
					        Huffman tree description will be skipped.
 | 
				
			||||||
               Compressed stream is equivalent to "compressed" block type.
 | 
					 | 
				
			||||||
- Raw : Literals are stored uncompressed.
 | 
					- Raw : Literals are stored uncompressed.
 | 
				
			||||||
- RLE : Literals consist of a single byte value repeated N times.
 | 
					- RLE : Literals consist of a single byte value repeated N times.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -476,24 +477,24 @@ using a long format, accepting the reduced compacity.
 | 
				
			|||||||
__Sizes format for Compressed literals block__ :
 | 
					__Sizes format for Compressed literals block__ :
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Note : also applicable to "repeat-stats" blocks.
 | 
					Note : also applicable to "repeat-stats" blocks.
 | 
				
			||||||
- Value : 00 : 4 streams
 | 
					- Value : 00 : 4 streams.
 | 
				
			||||||
               Compressed and regenerated sizes use 10 bits (0-1023)
 | 
					               Compressed and regenerated sizes use 10 bits (0-1023).
 | 
				
			||||||
               Total literal header size is 3 bytes
 | 
					               Total literal header size is 3 bytes.
 | 
				
			||||||
- Value : 01 : _Single stream_
 | 
					- Value : 01 : _Single stream_.
 | 
				
			||||||
               Compressed and regenerated sizes use 10 bits (0-1023)
 | 
					               Compressed and regenerated sizes use 10 bits (0-1023).
 | 
				
			||||||
               Total literal header size is 3 bytes
 | 
					               Total literal header size is 3 bytes.
 | 
				
			||||||
- Value : 10 : 4 streams
 | 
					- Value : 10 : 4 streams.
 | 
				
			||||||
               Compressed and regenerated sizes use 14 bits (0-16383)
 | 
					               Compressed and regenerated sizes use 14 bits (0-16383).
 | 
				
			||||||
               Total literal header size is 4 bytes
 | 
					               Total literal header size is 4 bytes.
 | 
				
			||||||
- Value : 10 : 4 streams
 | 
					- Value : 10 : 4 streams.
 | 
				
			||||||
               Compressed and regenerated sizes use 18 bits (0-262143)
 | 
					               Compressed and regenerated sizes use 18 bits (0-262143).
 | 
				
			||||||
               Total literal header size is 5 bytes
 | 
					               Total literal header size is 5 bytes.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Compressed and regenerated size fields follow big endian convention.
 | 
					Compressed and regenerated size fields follow big endian convention.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
#### Huffman Tree description
 | 
					#### Huffman Tree description
 | 
				
			||||||
 | 
					
 | 
				
			||||||
This section is only present when block type is `Compressed` (`0`).
 | 
					This section is only present when literals block type is `Compressed` (`0`).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Prefix coding represents symbols from an a priori known alphabet
 | 
					Prefix coding represents symbols from an a priori known alphabet
 | 
				
			||||||
by bit sequences (codes), one code for each symbol,
 | 
					by bit sequences (codes), one code for each symbol,
 | 
				
			||||||
@@ -546,7 +547,7 @@ It gives the following serie of weights :
 | 
				
			|||||||
The decoder will do the inverse operation :
 | 
					The decoder will do the inverse operation :
 | 
				
			||||||
having collected weights of literals from `0` to `4`,
 | 
					having collected weights of literals from `0` to `4`,
 | 
				
			||||||
it knows the last literal, `5`, is present with a non-zero weight.
 | 
					it knows the last literal, `5`, is present with a non-zero weight.
 | 
				
			||||||
The weight of `5` can be deduced by joining to the nearest power of 2.
 | 
					The weight of `5` can be deducted by joining to the nearest power of 2.
 | 
				
			||||||
Sum of 2^(weight-1) (excluding 0) is :
 | 
					Sum of 2^(weight-1) (excluding 0) is :
 | 
				
			||||||
`8 + 4 + 2 + 0 + 1 = 15`
 | 
					`8 + 4 + 2 + 0 + 1 = 15`
 | 
				
			||||||
Nearest power of 2 is 16.
 | 
					Nearest power of 2 is 16.
 | 
				
			||||||
@@ -558,24 +559,17 @@ This is a single byte value (0-255),
 | 
				
			|||||||
which tells how to decode the list of weights.
 | 
					which tells how to decode the list of weights.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
 | 
					- if headerByte >= 242 : this is one of 14 pre-defined weight distributions :
 | 
				
			||||||
  + 242 :  1x1 (+ 1x1)
 | 
					
 | 
				
			||||||
  + 243 :  2x1 (+ 1x2)
 | 
					| value    |242|243|244|245|246|247|248|249|250|251|252|253|254|255|
 | 
				
			||||||
  + 244 :  3x1 (+ 1x1)
 | 
					| -------- |
 | 
				
			||||||
  + 245 :  4x1 (+ 1x4)
 | 
					| Nb of 1s | 1 | 2 | 3 | 4 | 7 | 8 | 15| 16| 31| 32| 63| 64|127|128|
 | 
				
			||||||
  + 246 :  7x1 (+ 1x1)
 | 
					|Complement| 1 | 2 | 1 | 4 | 1 | 8 | 1 | 16| 1 | 32| 1 | 64| 1 |128|
 | 
				
			||||||
  + 247 :  8x1 (+ 1x8)
 | 
					
 | 
				
			||||||
  + 248 : 15x1 (+ 1x1)
 | 
					_Note_ : complement is by using the "join to nearest power of 2" rule.
 | 
				
			||||||
  + 249 : 16x1 (+ 1x16)
 | 
					 | 
				
			||||||
  + 250 : 31x1 (+ 1x1)
 | 
					 | 
				
			||||||
  + 251 : 32x1 (+ 1x32)
 | 
					 | 
				
			||||||
  + 252 : 63x1 (+ 1x1)
 | 
					 | 
				
			||||||
  + 253 : 64x1 (+ 1x64)
 | 
					 | 
				
			||||||
  + 254 :127x1 (+ 1x1)
 | 
					 | 
				
			||||||
  + 255 :128x1 (+ 1x128)
 | 
					 | 
				
			||||||
 | 
					
 | 
				
			||||||
- if headerByte >= 128 : this is a direct representation,
 | 
					- if headerByte >= 128 : this is a direct representation,
 | 
				
			||||||
  where each weight is written directly as a 4 bits field (0-15).
 | 
					  where each weight is written directly as a 4 bits field (0-15).
 | 
				
			||||||
  The full representation occupies ((nbSymbols+1)/2) bytes,
 | 
					  The full representation occupies `((nbSymbols+1)/2)` bytes,
 | 
				
			||||||
  meaning it uses a last full byte even if nbSymbols is odd.
 | 
					  meaning it uses a last full byte even if nbSymbols is odd.
 | 
				
			||||||
  `nbSymbols = headerByte - 127;`
 | 
					  `nbSymbols = headerByte - 127;`
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -593,13 +587,13 @@ To decode an FSE bitstream, it is necessary to know its compressed size.
 | 
				
			|||||||
Compressed size is provided by `headerByte`.
 | 
					Compressed size is provided by `headerByte`.
 | 
				
			||||||
It's also necessary to know its maximum decompressed size.
 | 
					It's also necessary to know its maximum decompressed size.
 | 
				
			||||||
In this case, it's `255`, since literal values range from `0` to `255`,
 | 
					In this case, it's `255`, since literal values range from `0` to `255`,
 | 
				
			||||||
and the last symbol value is not represented.
 | 
					and last symbol value is not represented.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
An FSE bitstream starts by a header, describing probabilities distribution.
 | 
					An FSE bitstream starts by a header, describing probabilities distribution.
 | 
				
			||||||
It will create a Decoding Table.
 | 
					It will create a Decoding Table.
 | 
				
			||||||
It is necessary to know the maximum accuracy of distribution
 | 
					It is necessary to know the maximum accuracy of distribution
 | 
				
			||||||
to properly allocate space for the Table.
 | 
					to properly allocate space for the Table.
 | 
				
			||||||
For a list of huffman weights, this maximum is 8 bits.
 | 
					For a list of huffman weights, this maximum is 7 bits.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
FSE header and bitstreams are described in a separated chapter.
 | 
					FSE header and bitstreams are described in a separated chapter.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -652,13 +646,12 @@ presuming the CPU has enough parallelism available.
 | 
				
			|||||||
For single stream, header provides both the compressed and regenerated size.
 | 
					For single stream, header provides both the compressed and regenerated size.
 | 
				
			||||||
For 4-streams though,
 | 
					For 4-streams though,
 | 
				
			||||||
header only provides compressed and regenerated size of all 4 streams combined.
 | 
					header only provides compressed and regenerated size of all 4 streams combined.
 | 
				
			||||||
 | 
					 | 
				
			||||||
In order to properly decode the 4 streams,
 | 
					In order to properly decode the 4 streams,
 | 
				
			||||||
it's necessary to know the compressed and regenerated size of each stream.
 | 
					it's necessary to know the compressed and regenerated size of each stream.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Regenerated size is easiest :
 | 
					Regenerated size is easiest :
 | 
				
			||||||
each stream has a size of `(totalSize+3)/4`,
 | 
					each stream has a size of `(totalSize+3)/4`,
 | 
				
			||||||
except the last one, which is up to 3 bytes smaller, to reach totalSize.
 | 
					except the last one, which is up to 3 bytes smaller, to reach `totalSize`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Compressed size must be provided explicitly : in the 4-streams variant,
 | 
					Compressed size must be provided explicitly : in the 4-streams variant,
 | 
				
			||||||
bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
 | 
					bitstreams are preceded by 3 unsigned Little Endian 16-bits values.
 | 
				
			||||||
@@ -704,11 +697,11 @@ A match copy command specifies an offset and a length.
 | 
				
			|||||||
The offset gives the position to copy from,
 | 
					The offset gives the position to copy from,
 | 
				
			||||||
which can stand within a previous block.
 | 
					which can stand within a previous block.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
These are 3 symbol types, `literalLength`, `matchLength` and `offset`,
 | 
					There are 3 symbol types, `literalLength`, `matchLength` and `offset`,
 | 
				
			||||||
which are encoded together, interleaved in a single _bitstream_.
 | 
					which are encoded together, interleaved in a single _bitstream_.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Each symbol decoding consists of a _code_,
 | 
					Each symbol is a _code_ in its own context,
 | 
				
			||||||
which specifies a baseline and a number of additional bits.
 | 
					which specifies a baseline and a number of bits to add.
 | 
				
			||||||
_Codes_ are FSE compressed,
 | 
					_Codes_ are FSE compressed,
 | 
				
			||||||
and interleaved with raw additional bits in the same bitstream.
 | 
					and interleaved with raw additional bits in the same bitstream.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -717,7 +710,7 @@ followed by optional Probability tables for each symbol type,
 | 
				
			|||||||
followed by the bitstream.
 | 
					followed by the bitstream.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
To decode the Sequence section, it's required to know its size.
 | 
					To decode the Sequence section, it's required to know its size.
 | 
				
			||||||
This size is deducted from "blockSize - literalSectionSize".
 | 
					This size is deducted from `blockSize - literalSectionSize`.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
#### Sequences section header
 | 
					#### Sequences section header
 | 
				
			||||||
@@ -733,9 +726,9 @@ Let's call its first byte `byte0`.
 | 
				
			|||||||
- `if (byte0 == 0)` : there are no sequences.
 | 
					- `if (byte0 == 0)` : there are no sequences.
 | 
				
			||||||
            The sequence section stops there.
 | 
					            The sequence section stops there.
 | 
				
			||||||
            Regenerated content is defined entirely by literals section.
 | 
					            Regenerated content is defined entirely by literals section.
 | 
				
			||||||
- `if (byte0 < 128)` : nbSeqs = byte0 . Uses 1 byte.
 | 
					- `if (byte0 < 128)` : `nbSeqs = byte0;` . Uses 1 byte.
 | 
				
			||||||
- `if (byte0 < 255)` : nbSeqs = ((byte0-128) << 8) + byte1 . Uses 2 bytes.
 | 
					- `if (byte0 < 255)` : `nbSeqs = ((byte0-128) << 8) + byte1;` . Uses 2 bytes.
 | 
				
			||||||
- `if (byte0 == 255)`: nbSeqs = byte1 + (byte2<<8) + 0x7F00 . Uses 3 bytes.
 | 
					- `if (byte0 == 255)`: `nbSeqs = byte1 + (byte2<<8) + 0x7F00;` . Uses 3 bytes.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Symbol compression modes__
 | 
					__Symbol compression modes__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -760,8 +753,8 @@ They follow the same enumeration :
 | 
				
			|||||||
- "RLE" : it's a single code, repeated `nbSeqs` times.
 | 
					- "RLE" : it's a single code, repeated `nbSeqs` times.
 | 
				
			||||||
- "Repeat" : re-use distribution table from previous compressed block.
 | 
					- "Repeat" : re-use distribution table from previous compressed block.
 | 
				
			||||||
- "FSE" : standard FSE compression.
 | 
					- "FSE" : standard FSE compression.
 | 
				
			||||||
          Symbol type requires a distribution table,
 | 
					          A distribution table will be present.
 | 
				
			||||||
          which will be described in next part.
 | 
					          It will be described in [next part](#distribution-tables).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
#### Symbols decoding
 | 
					#### Symbols decoding
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -772,8 +765,9 @@ They define lengths from 0 to 131071 bytes.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
|  Code  | 0-15 |
 | 
					|  Code  | 0-15 |
 | 
				
			||||||
| ------ | ---- |
 | 
					| ------ | ---- |
 | 
				
			||||||
| nbBits |   0  |
 | 
					 | 
				
			||||||
| value  | Code |
 | 
					| value  | Code |
 | 
				
			||||||
 | 
					| nbBits |   0  |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					
 | 
				
			||||||
|   Code   |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
 | 
					|   Code   |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
 | 
				
			||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 
					| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 
				
			||||||
@@ -792,7 +786,7 @@ They define lengths from 0 to 131071 bytes.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
__Default distribution__
 | 
					__Default distribution__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
When "compression mode" is defined as "default distribution",
 | 
					When "compression mode" is "predef"",
 | 
				
			||||||
a pre-defined distribution is used for FSE compression.
 | 
					a pre-defined distribution is used for FSE compression.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Here is its definition. It uses an accuracy of 6 bits (64 states).
 | 
					Here is its definition. It uses an accuracy of 6 bits (64 states).
 | 
				
			||||||
@@ -810,8 +804,8 @@ They define lengths from 3 to 131074 bytes.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
|  Code  |   0-31   |
 | 
					|  Code  |   0-31   |
 | 
				
			||||||
| ------ | -------- |
 | 
					| ------ | -------- |
 | 
				
			||||||
| nbBits |     0    |
 | 
					 | 
				
			||||||
| value  | Code + 3 |
 | 
					| value  | Code + 3 |
 | 
				
			||||||
 | 
					| nbBits |     0    |
 | 
				
			||||||
 | 
					
 | 
				
			||||||
|   Code   |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
 | 
					|   Code   |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
 | 
				
			||||||
| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 
					| -------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
 | 
				
			||||||
@@ -847,8 +841,8 @@ short matchLengths_defaultDistribution[53] =
 | 
				
			|||||||
Offset codes are values ranging from `0` to `N`,
 | 
					Offset codes are values ranging from `0` to `N`,
 | 
				
			||||||
with `N` being limited by maximum backreference distance.
 | 
					with `N` being limited by maximum backreference distance.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
A decoder is free to limit its maximum `N` supported,
 | 
					A decoder is free to limit its maximum `N` supported.
 | 
				
			||||||
although the recommendation is to support at least up to `22`.
 | 
					Recommendation is to support at least up to `22`.
 | 
				
			||||||
For information, at the time of this writing.
 | 
					For information, at the time of this writing.
 | 
				
			||||||
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
 | 
					the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
@@ -863,14 +857,15 @@ if (OFValue > 3) offset = OFValue - 3;
 | 
				
			|||||||
OFValue from 1 to 3 are special : they define "repeat codes",
 | 
					OFValue from 1 to 3 are special : they define "repeat codes",
 | 
				
			||||||
which means one of the previous offsets will be repeated.
 | 
					which means one of the previous offsets will be repeated.
 | 
				
			||||||
They are sorted in recency order, with 1 meaning the most recent one.
 | 
					They are sorted in recency order, with 1 meaning the most recent one.
 | 
				
			||||||
 | 
					See [Repeat offsets](#repeat-offsets) paragraph.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Default distribution__
 | 
					__Default distribution__
 | 
				
			||||||
 | 
					
 | 
				
			||||||
When "compression mode" is defined as "default distribution",
 | 
					When "compression mode" is defined as "predef",
 | 
				
			||||||
a pre-defined distribution is used for FSE compression.
 | 
					a pre-defined distribution is used for FSE compression.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Here is its definition. It uses an accuracy of 5 bits (32 states),
 | 
					Here is its definition. It uses an accuracy of 5 bits (32 states),
 | 
				
			||||||
and support a maximum `N` of 28, allowing offset values up to 536,870,908 .
 | 
					and supports a maximum `N` of 28, allowing offset values up to 536,870,908 .
 | 
				
			||||||
 | 
					
 | 
				
			||||||
If any sequence in the compressed block requires an offset larger than this,
 | 
					If any sequence in the compressed block requires an offset larger than this,
 | 
				
			||||||
it's not possible to use the default distribution to represent it.
 | 
					it's not possible to use the default distribution to represent it.
 | 
				
			||||||
@@ -920,7 +915,7 @@ It depends on :
 | 
				
			|||||||
  __example__ :
 | 
					  __example__ :
 | 
				
			||||||
  Presuming an AccuracyLog of 8,
 | 
					  Presuming an AccuracyLog of 8,
 | 
				
			||||||
  and presuming 100 probabilities points have already been distributed,
 | 
					  and presuming 100 probabilities points have already been distributed,
 | 
				
			||||||
  the decoder may discover value from `0` to `255 - 100 + 1 == 156` (included).
 | 
					  the decoder may read any value from `0` to `255 - 100 + 1 == 156` (included).
 | 
				
			||||||
  Therefore, it must read `log2sup(156) == 8` bits.
 | 
					  Therefore, it must read `log2sup(156) == 8` bits.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
- Value decoded : small values use 1 less bit :
 | 
					- Value decoded : small values use 1 less bit :
 | 
				
			||||||
@@ -946,7 +941,7 @@ Probability is obtained from Value decoded by following formulae :
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
It means value `0` becomes negative probability `-1`.
 | 
					It means value `0` becomes negative probability `-1`.
 | 
				
			||||||
`-1` is a special probability, which means `less than 1`.
 | 
					`-1` is a special probability, which means `less than 1`.
 | 
				
			||||||
Its effect on distribution table is described in a later paragraph.
 | 
					Its effect on distribution table is described in next paragraph.
 | 
				
			||||||
For the purpose of calculating cumulated distribution, it counts as one.
 | 
					For the purpose of calculating cumulated distribution, it counts as one.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
When a symbol has a probability of `zero`,
 | 
					When a symbol has a probability of `zero`,
 | 
				
			||||||
@@ -988,8 +983,10 @@ each symbol gets attributed as many cells as its probability.
 | 
				
			|||||||
Cell allocation is spreaded, not linear :
 | 
					Cell allocation is spreaded, not linear :
 | 
				
			||||||
each successor position follow this rule :
 | 
					each successor position follow this rule :
 | 
				
			||||||
 | 
					
 | 
				
			||||||
`position += (tableSize>>1) + (tableSize>>3) + 3);
 | 
					```
 | 
				
			||||||
position &= tableSize-1;`
 | 
					position += (tableSize>>1) + (tableSize>>3) + 3;
 | 
				
			||||||
 | 
					position &= tableSize-1;
 | 
				
			||||||
 | 
					```
 | 
				
			||||||
 | 
					
 | 
				
			||||||
A position is skipped if already occupied,
 | 
					A position is skipped if already occupied,
 | 
				
			||||||
typically by a "less than 1" probability symbol.
 | 
					typically by a "less than 1" probability symbol.
 | 
				
			||||||
@@ -999,11 +996,11 @@ Each state will decode the current symbol.
 | 
				
			|||||||
 | 
					
 | 
				
			||||||
To get the Number of bits and baseline required for next state,
 | 
					To get the Number of bits and baseline required for next state,
 | 
				
			||||||
it's first necessary to sort all states in their natural order.
 | 
					it's first necessary to sort all states in their natural order.
 | 
				
			||||||
The lower states will need 1 bit more than higher ones.
 | 
					The lower states will need 1 more bit than higher ones.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
__Example__ :
 | 
					__Example__ :
 | 
				
			||||||
Presuming a symbol has a probability of 5.
 | 
					Presuming a symbol has a probability of 5.
 | 
				
			||||||
It receives 5 state values.
 | 
					It receives 5 state values. States are sorted in natural order.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
Next power of 2 is 8.
 | 
					Next power of 2 is 8.
 | 
				
			||||||
Space of probabilities is divided into 8 equal parts.
 | 
					Space of probabilities is divided into 8 equal parts.
 | 
				
			||||||
@@ -1034,8 +1031,8 @@ All sequences are stored in a single bitstream, read _backward_.
 | 
				
			|||||||
It is therefore necessary to know the bitstream size,
 | 
					It is therefore necessary to know the bitstream size,
 | 
				
			||||||
which is deducted from compressed block size.
 | 
					which is deducted from compressed block size.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
The exact last bit of the stream is followed by a set-bit-flag.
 | 
					The bit of the stream is followed by a set-bit-flag.
 | 
				
			||||||
The highest bit of last byte is this flag.
 | 
					Highest bit of last byte is this flag.
 | 
				
			||||||
It does not belong to the useful part of the bitstream.
 | 
					It does not belong to the useful part of the bitstream.
 | 
				
			||||||
Therefore, last byte has 0-7 useful bits.
 | 
					Therefore, last byte has 0-7 useful bits.
 | 
				
			||||||
Note that it also means that last byte cannot be `0`.
 | 
					Note that it also means that last byte cannot be `0`.
 | 
				
			||||||
 
 | 
				
			|||||||
		Reference in New Issue
	
	Block a user