
numerous typos and clarifications in format specification

fix limit values of Window_Size
bump version to 0.2.5
Yann Collet
2017-03-31 10:54:45 -07:00
parent 6476c51b86
commit 14433ca1ad


@@ -16,7 +16,8 @@ Distribution of this document is unlimited.
### Version
0.2.4 (17/02/17)
0.2.5 (31/03/17)
Introduction
------------
@@ -109,7 +110,7 @@ The structure of a single Zstandard frame is following:
__`Magic_Number`__
4 Bytes, little-endian format.
4 Bytes, __little-endian__ format.
Value : 0xFD2FB528
__`Frame_Header`__
@@ -127,7 +128,7 @@ An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
The content checksum is the result
of [xxh64() hash function](http://www.xxhash.org)
digesting the original (decoded) data as input, and a seed of zero.
The low 4 bytes of the checksum are stored in little endian format.
The low 4 bytes of the checksum are stored in __little-endian__ format.
### `Frame_Header`
@@ -154,41 +155,42 @@ Decoding this byte is enough to tell the size of `Frame_Header`.
| 2 | `Content_Checksum_flag` |
| 1-0 | `Dictionary_ID_flag` |
In this table, bit 7 the is highest bit, while bit 0 the is lowest.
In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
__`Frame_Content_Size_flag`__
This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
specifying if decompressed data size is provided within the header.
The `Flag_Value` can be converted into `Field_Size`,
specifying if `Frame_Content_Size` (the decompressed data size)
is provided within the header.
`Flag_Value` provides `FCS_Field_Size`,
which is the number of bytes used by `Frame_Content_Size`
according to the following table:
| `Flag_Value` | 0 | 1 | 2 | 3 |
| ---------- | ------ | --- | --- | --- |
|`Field_Size`| 0 or 1 | 2 | 4 | 8 |
| -------------- | ------ | --- | --- | --- |
|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 |
When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
if `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
Otherwise, `Field_Size` is 0 (content size not provided).
Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
__`Single_Segment_flag`__
If this flag is set,
data must be regenerated within a single continuous memory segment.
In this case, `Frame_Content_Size` is necessarily present,
but `Window_Descriptor` byte is skipped.
In this case, `Window_Descriptor` byte is skipped,
but `Frame_Content_Size` is necessarily present.
As a consequence, the decoder must allocate a memory segment
of size equal or bigger than `Frame_Content_Size`.
In order to preserve the decoder from unreasonable memory requirements,
a decoder can reject a compressed frame
a decoder is allowed to reject a compressed frame
which requests a memory size beyond decoder's authorized range.
For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is just a recommendation,
This is only a recommendation,
each decoder is free to support higher or lower limits,
depending on local limitations.
@@ -224,37 +226,38 @@ It also specifies the size of this field as `Field_Size`.
#### `Window_Descriptor`
Provides guarantees on maximum back-reference distance
that will be used within compressed data.
Provides guarantees on minimum memory buffer required to decompress a frame.
This information is important for decoders to allocate enough memory.
The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
In this case, the maximum back-reference distance is the content size itself,
which can be any value from 1 to 2^64-1 bytes (16 EB).
The `Window_Descriptor` byte is optional.
When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
In this case, the required buffer size is the frame content size itself,
which can be any value from 0 to 2^64-1 bytes (16 EB).
| Bit numbers | 7-3 | 2-0 |
| ----------- | ---------- | ---------- |
| Field name | `Exponent` | `Mantissa` |
Maximum distance is given by the following formulas :
The minimum memory buffer size is called `Window_Size`.
It is described by the following formulas :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
Window_Size = windowBase + windowAdd;
```
The minimum window size is 1 KB.
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
The minimum `Window_Size` is 1 KB.
The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
To properly decode compressed data,
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
In order to preserve decoder from unreasonable memory requirements,
a decoder can refuse a compressed frame
a decoder is allowed to reject a compressed frame
which requests a memory size beyond decoder's authorized range.
For improved interoperability,
decoders are recommended to be compatible with window sizes of 8 MB,
decoders are recommended to be compatible with `Window_Size >= 8 MB`,
and encoders are recommended to not request more than 8 MB.
It's merely a recommendation though,
decoders are free to support larger or lower limits,
@@ -264,47 +267,50 @@ depending on local limitations.
This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
Note that this field is optional. When it's not present,
`Dictionary_ID` field is optional. When it's not present,
it's up to the decoder to make sure it uses the correct dictionary.
Format is little-endian.
Field size depends on `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.
Format is __little-endian__.
It's allowed to represent a small ID (for example `13`)
with a large 4-bytes dictionary ID, losing some compacity in the process.
with a large 4-bytes dictionary ID, even if it is less efficient.
_Reserved ranges :_
If the frame is going to be distributed in a private environment,
any dictionary ID can be used.
However, for public distribution of compressed frames using a dictionary,
the following ranges are reserved for future use and should not be used :
- low range : 1 - 32767
- high range : >= (2^31)
the following ranges are reserved and shall not be used :
- low range : `<= 32767`
- high range : `>= (1 << 31)`
#### `Frame_Content_Size`
This is the original (uncompressed) size. This information is optional.
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
Format is little-endian.
`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
| `Field_Size` | Range |
| ------------ | ---------- |
| `FCS_Field_Size` | Range |
| ---------------- | ---------- |
| 0 | unknown |
| 1 | 0 - 255 |
| 2 | 256 - 65791|
| 4 | 0 - 2^32-1 |
| 8 | 0 - 2^64-1 |
When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `Field_Size` is 2, _the offset of 256 is added_.
`Frame_Content_Size` format is __little-endian__.
When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
When `FCS_Field_Size` is 2, _the offset of 256 is added_.
It's allowed to represent a small size (for example `18`) using any compatible variant.
Blocks
-------
After the magic number and frame header of each frame,
there are some number of blocks.
Each frame must have at least one block but there is no upper limit
@@ -312,25 +318,27 @@ on the number of blocks per frame.
The structure of a block is as follows:
| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
|:------------:|:------------:|:------------:|:---------------:|
| 1 bit | 2 bits | 21 bits | n bytes |
| `Block_Header` | `Block_Content` |
|:--------------:|:---------------:|
| 3 bytes | n bytes |
The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
`Block_Header` uses 3 bytes, written using __little-endian__ convention.
It contains 3 fields :
| `Block_Size` | `Block_Type` | `Last_Block` |
|:------------:|:------------:|:------------:|
| bits 3-23 | bits 1-2 | bit 0 |
__`Last_Block`__
The lowest bit signals if this block is the last one.
The frame will end after this one.
The frame will end after this last block.
It may be followed by an optional `Content_Checksum`
(see [Zstandard Frames](#zstandard-frames)).
__`Block_Type` and `Block_Size`__
The next 2 bits represent the `Block_Type`,
while the remaining 21 bits represent the `Block_Size`.
Format is __little-endian__.
__`Block_Type`__
The next 2 bits represent the `Block_Type`.
There are 4 block types :
| Value | 0 | 1 | 2 | 3 |
@@ -338,38 +346,40 @@ There are 4 block types :
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
- `Raw_Block` - this is an uncompressed block.
`Block_Content` contains `Block_Size` bytes to read and copy
as decoded data.
`Block_Content` contains `Block_Size` bytes.
- `RLE_Block` - this is a single byte, repeated N times.
`Block_Content` consists of a single byte,
and `Block_Size` is the number of times this byte should be repeated.
- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
`Block_Content` consists of a single byte.
On the decompression side, this byte must be repeated `Block_Size` times.
- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
explained later on.
`Block_Size` is the length of `Block_Content`, the compressed data.
The decompressed size is unknown,
The decompressed size is not known,
but its maximum possible value is guaranteed (see below).
- `Reserved` - this is not a block.
This value cannot be used with current version of this specification.
__`Block_Size`__
The upper 21 bits of `Block_Header` represent the `Block_Size`.
Block sizes must respect a few rules :
- In compressed mode, compressed size is always strictly less than decompressed size.
- Block decompressed size is always <= maximum back-reference distance.
- For `Compressed_Block`, `Block_Size` is always strictly less than decompressed size.
- Block decompressed size is always <= `Window_Size`
- Block decompressed size is always <= 128 KB.
A data block is not necessarily "full" :
since an arbitrary “flush” may happen anytime,
block decompressed content can be any size (even empty),
A block can contain any number of bytes (even empty),
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
- Maximum back-reference distance
- `Window_Size`
- 128 KB
Compressed Blocks
-----------------
To decompress a compressed block, the compressed size must be provided from
`Block_Size` field in the block header.
To decompress a compressed block, the compressed size must be provided
from `Block_Size` field within `Block_Header`.
A compressed block consists of 2 sections :
- [Literals Section](#literals-section)
@@ -381,35 +391,33 @@ data in [Sequence Execution](#sequence-execution)
#### Prerequisites
To decode a compressed block, the following elements are necessary :
- Previous decoded data, up to a distance of `Window_Size`,
or all previous data when `Single_Segment_flag` is set.
- List of "recent offsets" from the previous compressed block.
- Decoding tables of the previous compressed block for each symbol type
or all previously decoded data when `Single_Segment_flag` is set.
- List of "recent offsets" from previous `Compressed_Block`.
- Decoding tables of previous `Compressed_Block` for each symbol type
(literals, literals lengths, match lengths, offsets).
Literals Section
----------------
During sequence execution, symbols from the literals section
During the sequence phase, literals are interleaved with match copy operations.
All literals are regrouped in the first part of the block.
They can be decoded first, and then copied during sequence operations,
or they can be decoded on the flow, as needed by sequence commands.
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
They can be decoded first, and then copied during [Sequence Execution],
or they can be decoded on the fly during [Sequence Execution].
Literals can be stored uncompressed or compressed using Huffman prefix codes.
When compressed, an optional tree description can be present,
followed by 1 or 4 streams.
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
#### `Literals_Section_Header`
Header is in charge of describing how literals are packed.
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
using little-endian convention.
using __little-endian__ convention.
| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
| --------------------- | ------------- | ------------------ | ----------------- |
| --------------------- | ------------- | ------------------ | ------------------- |
| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
In this representation, bits on the left are the lowest bits.
@@ -419,32 +427,37 @@ __`Literals_Block_Type`__
This field uses 2 lowest bits of first byte, describing 4 different block types :
| `Literals_Block_Type` | Value |
| ----------------------------- | ----- |
| --------------------------- | ----- |
| `Raw_Literals_Block` | 0 |
| `RLE_Literals_Block` | 1 |
| `Compressed_Literals_Block` | 2 |
| `Repeat_Stats_Literals_Block` | 3 |
| `Treeless_Literals_Block` | 3 |
- `Raw_Literals_Block` - Literals are stored uncompressed.
- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
- `RLE_Literals_Block` - Literals consist of a single byte value
repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
starting with a Huffman tree description.
See details below.
- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
using Huffman tree _from previous Huffman-compressed literals block_.
Huffman tree description will be skipped.
Note: If this mode is used without any previous Huffman-table in the frame
(or [dictionary](#dictionary-format)), this should be treated as corruption.
`Huffman_Tree_Description` will be skipped.
Note: If this mode is triggered without any previous Huffman-table in the frame
(or [dictionary](#dictionary-format)), this should be treated as data corruption.
__`Size_Format`__
`Size_Format` is divided into 2 families :
- For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`.
- For `Compressed_Block`, its required to decode both `Compressed_Size`
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
- For `Raw_Literals_Block` and `RLE_Literals_Block`,
it's only necessary to decode `Regenerated_Size`.
There is no `Compressed_Size` field.
- For `Compressed_Block` and `Treeless_Literals_Block`,
it's required to decode both `Compressed_Size`
and `Regenerated_Size` (the decompressed size).
It's also necessary to decode the number of streams (1 or 4).
For values spanning several bytes, convention is little-endian.
For values spanning several bytes, convention is __little-endian__.
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
@@ -463,9 +476,9 @@ __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
Only Stream1 is present for these cases.
Note : it's allowed to represent a short value (for example `13`)
using a long format, accepting the increased compressed data size.
using a long format, even if it's less efficient.
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
- Value 00 : _A single stream_.
Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
@@ -480,63 +493,64 @@ __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block
Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
`Literals_Section_Header` has 5 bytes.
Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description if it
is present.
Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
_when_ it is present.
### Raw Literals Block
The data in Stream1 is `Regenerated_Size` bytes long, and contains the raw literals data
to be used in sequence execution.
The data in Stream1 is `Regenerated_Size` bytes long;
it contains the raw literals data to be used during [Sequence Execution].
### RLE Literals Block
Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
to generate the decoded literals.
### Compressed Literals Block and Repeat Stats Literals Block
Both of these modes contain Huffman encoded data
### Compressed Literals Block and Treeless Literals Block
Both of these modes contain Huffman encoded data.
`Treeless_Literals_Block` does not have a `Huffman_Tree_Description`.
#### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size Huffman Tree description will be determined during the decoding process,
and must be used to determine where the compressed Huffman streams begin.
The size of `Huffman_Tree_Description` is determined during the decoding process;
it must be used to determine where streams begin.
`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
If repeat stats mode is used, the Huffman table used in the previous compressed block will
be used to decompress this block as well.
For `Treeless_Literals_Block`,
the Huffman table comes from previously compressed literals block.
Huffman compressed data consists either 1 or 4 Huffman-coded streams.
Huffman compressed data consists of either 1 or 4 Huffman-coded streams.
If only one stream is present, it is a single bitstream occupying the entire
remaining portion of the literals block, encoded as described at
remaining portion of the literals block, encoded as described within
[Huffman-Coded Streams](#huffman-coded-streams).
If there are four streams, the literals section header only provides enough
information to know the regenerated and compressed sizes of all four streams combined.
The regenerated size of each stream is equal to `(totalSize+3)/4`, except for the last stream,
which may be up to 3 bytes smaller, to reach a total decompressed size match that described
in the literals header.
information to know the decompressed and compressed sizes of all four streams _combined_.
The decompressed size of each stream is equal to `(totalSize+3)/4`,
except for the last stream which may be up to 3 bytes smaller,
to reach a total decompressed size as specified in `Regenerated_Size`.
The compressed size of each stream is provided explicitly: the first 6 bytes of the compressed
data consist of three 2-byte little endian fields, describing the compressed sizes
of the first three streams.
The last streams size is computed from the total compressed size and the size of the other
three streams.
The compressed size of each stream is provided explicitly:
the first 6 bytes of the compressed data consist of three 2-byte __little-endian__ fields,
describing the compressed sizes of the first three streams.
Stream4 size is computed from `Total_Streams_Size` minus the sizes of the other streams.
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
Note: remember that totalCSize may be smaller than the `Compressed_Size` found in the literals
block header as `Compressed_Size` also contains the size of the Huffman Tree description if it
is present.
Note: remember that `Total_Streams_Size` can be smaller than `Compressed_Size` in header,
because `Compressed_Size` also contains `Huffman_Tree_Description_Size` when it is present.
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
as described at [Huffman-Coded Streams](#huffman-coded-streams).
Sequences Section
-----------------
A compressed block is a succession of _sequences_.
A sequence is a literal copy command, followed by a match copy command.
A literal copy command specifies a length.
It is the number of bytes to be copied (or extracted) from the literal section.
It is the number of bytes to be copied (or extracted) from the [Literals Section].
A match copy command specifies an offset and a length.
When all _sequences_ are decoded,
@@ -557,7 +571,7 @@ followed by the bitstream.
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
To decode the `Sequences_Section`, it's required to know its size.
This size is deduced from `blockSize - literalSectionSize`.
This size is deduced from `Block_Size - Literals_Section_Size`.
#### `Sequences_Section_Header`
@@ -572,7 +586,7 @@ This is a variable size field using between 1 and 3 bytes.
Let's call its first byte `byte0`.
- `if (byte0 == 0)` : there are no sequences.
The sequence section stops there.
Regenerated content is defined entirely by literals section.
Decompressed content is defined entirely as [Literals Section] content.
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
@@ -588,7 +602,7 @@ This is a single byte, defining the compression mode of each symbol type.
The last field, `Reserved`, must be all-zeroes.
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
literals lengths, offsets, and match lengths respectively.
literals lengths, offsets, and match lengths symbols respectively.
They follow the same enumeration :
@@ -598,12 +612,12 @@ They follow the same enumeration :
- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
[default distributions](#default-distributions).
The table takes no space in the compressed data.
No distribution table will be present.
- `RLE_Mode` : The table description consists of a single byte.
This code will be repeated for every sequence.
This code will be repeated for all sequences.
- `Repeat_Mode` : The table used in the previous compressed block will be used again.
No distribution table will be present.
Note: this includes RLE mode, so if repeat_mode follows rle_mode the same symbol will be repeated.
Note: this includes RLE mode, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
If this mode is used without any previous sequence table in the frame
(or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
- `FSE_Compressed_Mode` : standard FSE compression.
@@ -625,7 +639,7 @@ Literals length codes are values ranging from `0` to `35` included.
They define lengths from 0 to 131071 bytes.
The literals length is equal to the decoded `Baseline` plus
the result of reading `Number_of_Bits` bits from the bitstream,
as a little-endian value.
as a __little-endian__ value.
| `Literals_Length_Code` | 0-15 |
| ---------------------- | ---------------------- |
@@ -654,7 +668,7 @@ Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.
The match length is equal to the decoded `Baseline` plus
the result of reading `Number_of_Bits` bits from the bitstream,
as a little-endian value.
as a __little-endian__ value.
| `Match_Length_Code` | 0-31 |
| ------------------- | ----------------------- |
@@ -685,7 +699,7 @@ Recommendation is to support at least up to `22`.
For information, at the time of this writing,
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
An offset code is also the number of additional bits to read in little-endian fashion,
An offset code is also the number of additional bits to read in __little-endian__ fashion,
and can be translated into an `Offset_Value` using the following formulas :
```
@@ -720,8 +734,8 @@ begins.
FSE decoding requires a 'state' to be carried from symbol to symbol.
For more explanation on FSE decoding, see the [FSE section](#fse).
For sequence decoding, a separate state must be kept track of for each of
literal lengths, offsets, and match lengths.
For sequence decoding, a separate state is kept for each symbol type:
literals lengths, offsets, and match lengths.
Some FSE primitives are also used.
For more details on the operation of these primitives, see the [FSE section](#fse).
@@ -753,8 +767,7 @@ See the [description of the codes] for how to determine these values.
[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
It then does the same for `Match_Length`,
and then for `Literals_Length`.
It then does the same for `Match_Length`, and then for `Literals_Length`.
This sequence is then used for [sequence execution](#sequence-execution).
If it is not the last sequence in the block,
@@ -807,6 +820,7 @@ short offsetCodes_defaultDistribution[29] =
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
```
Sequence Execution
------------------
Once literals and sequences have been decoded,
@@ -826,7 +840,8 @@ in this case.
The offset is defined as from the current position, so an offset of 6
and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
Note that all offsets must be at most equal to the window size defined by the frame header.
Note that all offsets leading to previously decoded data
must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
#### Repeat offsets
As seen in [Sequence Execution](#sequence-execution),
@@ -842,11 +857,10 @@ so an `offset_value` of 1 means `Repeated_Offset2`,
an `offset_value` of 2 means `Repeated_Offset3`,
and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
In the first block, the offset history is populated with the following values : 1, 4 and 8 (in order).
For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order).
Then each block gets its starting offset history from the ending values of the most recent compressed block.
Note that non-compressed blocks are skipped,
they do not contribute to offset history.
Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
[Offset Codes]: #offset-codes
@@ -859,6 +873,7 @@ This means that when `Repeated_Offset1` (most recent) is used, history is unmodi
When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
If any other offset is used, it becomes `Repeated_Offset1` and the rest are shifted back by one.
Skippable Frames
----------------
@@ -878,7 +893,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.
__`Magic_Number`__
4 Bytes, little-endian format.
4 Bytes, __little-endian__ format.
Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
All 16 values are valid to identify a skippable frame.
@@ -886,13 +901,14 @@ __`Frame_Size`__
This is the size, in bytes, of the following `User_Data`
(without including the magic number nor the size field itself).
This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
This means `User_Data` can't be bigger than (2^32-1) bytes.
__`User_Data`__
The `User_Data` can be anything. Data will just be skipped by the decoder.
Entropy Encoding
----------------
Two types of entropy encoding are used by the Zstandard format:
@@ -900,7 +916,7 @@ FSE, and Huffman coding.
FSE
---
FSE, or FiniteStateEntropy is an entropy coding based on [ANS].
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
FSE encoding/decoding involves a state that is carried over between symbols,
so decoding must be done in the opposite direction as encoding.
Therefore, all FSE bitstreams are read from end to beginning.
@@ -909,15 +925,15 @@ For additional details on FSE, see [Finite State Entropy].
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
FSE decoding involves a decoding table which has a power of 2 size and three elements:
FSE decoding involves a decoding table which has a power of 2 size, and contains three elements:
`Symbol`, `Num_Bits`, and `Baseline`.
The `log2` of the table size is its `Accuracy_Log`.
The FSE state represents an index in this table.
The next symbol in the stream is the symbol indicated by the table value for that state.
To obtain the next state value,
the decoder should consume `Num_Bits` bits from the stream as a little endian value and add it to baseline.
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a little endian value.
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
The next symbol in the stream is the `Symbol` indicated in the table for that state.
To obtain the next state value,
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
@@ -929,7 +945,7 @@ An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log` .
It's a bitstream which is read forward, in little-endian fashion.
It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know its exact size,
since it will be discovered and reported by the decoding process.
@@ -1064,7 +1080,7 @@ Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream it is therefore necessary to
To find the start of the bitstream, it is therefore necessary to
know the offset of the last byte of the Huffman-coded stream.
After writing the last bit containing information, the compressor
@@ -1077,7 +1093,7 @@ byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit that occurs. Afterwards, the useful part of the bitstream
begins.
The bitstream contains Huffman-coded symbols in little-endian order,
The bitstream contains Huffman-coded symbols in __little-endian__ order,
with the codes defined by the method below.
### Huffman Tree Description
@@ -1182,14 +1198,14 @@ The Huffman header compression uses 2 states,
which share the same FSE distribution table.
The first state (`State1`) encodes the even indexed symbols,
and the second (`State2`) encodes the odd indexes.
State1 is initialized first, and then State2, and they take turns decoding
a single symbol and updating their state.
`State1` is initialized first, and then `State2`, and they take turns
decoding a single symbol and updating their state.
For more details on these FSE operations, see the [FSE section](#fse).
The number of symbols to decode is determined
by tracking bitStream overflow condition:
If updating state after decoding a symbol would require more bits than
remain in the stream, it is assumed the extra bits are 0. Then,
remain in the stream, it is assumed that the extra bits are 0. Then,
the symbols for each of the final states are decoded and the process is complete.
##### Conversion from weights to Huffman prefix codes
@@ -1245,7 +1261,7 @@ it would be encoded as:
|Encoding|`0000`|`0001`|`01`|`1`| `10000` |
Starting from the end,
it's possible to read the bitstream in a little-endian fashion,
it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse
order, by starting at the end the symbols can be read in forward order.
@@ -1258,13 +1274,14 @@ If a bitstream is not entirely and exactly consumed,
hence reaching exactly its beginning position with _all_ bits consumed,
the decoding process is considered faulty.
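The prefix-code construction used throughout this section can be sketched as follows (a non-normative illustration; the function name is assumed): each present symbol gets `Number_of_Bits = Max_Number_of_Bits + 1 - Weight`, and numeric code values are handed out in ascending order starting from the lowest weight, i.e. the longest codes.

```python
def huffman_codes_from_weights(weights):
    # The sum of 2**(Weight-1) over present symbols is a power of two;
    # its log2 gives Max_Number_of_Bits.
    total = sum(1 << (w - 1) for w in weights if w > 0)
    max_bits = total.bit_length() - 1
    nbits = [max_bits + 1 - w if w > 0 else 0 for w in weights]
    # Assign codes from the longest length upward, symbols in natural
    # order within a length; moving to a shorter length drops low bits.
    order = sorted((s for s in range(len(weights)) if nbits[s]),
                   key=lambda s: (-nbits[s], s))
    codes, code, prev = {}, 0, None
    for s in order:
        if prev is not None:
            code = (code + 1) >> (prev - nbits[s])
        codes[s] = format(code, "0{}b".format(nbits[s]))
        prev = nbits[s]
    return codes
```

For instance, weights `[4, 3, 2, 0, 1, 1]` give symbols 0-5 the codes `1`, `01`, `001`, (absent), `0000`, `0001`.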
Dictionary Format
-----------------
Zstandard is compatible with "raw content" dictionaries, free of any format restriction,
except that they must be at least 8 bytes.
These dictionaries function as if they were just the `Content` block of a formatted
dictionary.
Zstandard is compatible with "raw content" dictionaries,
free of any format restriction, except that they must be at least 8 bytes.
These dictionaries function as if they were just the `Content` part
of a formatted dictionary.
But dictionaries created by `zstd --train` follow a format, described here.
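A minimal sketch of telling the two dictionary kinds apart (function name assumed; the magic value is the one defined in this section):

```python
import struct

ZSTD_DICT_MAGIC = 0xEC30A437  # stored little-endian on the wire

def dictionary_kind(data):
    # Raw-content dictionaries carry no header at all,
    # but must still be at least 8 bytes long.
    if len(data) < 8:
        raise ValueError("dictionary too small")
    (magic,) = struct.unpack("<I", data[:4])
    return "formatted" if magic == ZSTD_DICT_MAGIC else "raw content"
```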
@@ -1274,9 +1291,9 @@ __Pre-requisites__ : a dictionary has a size,
| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
| -------------- | --------------- | ---------------- | --------- |
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
It's used by decoders to check if they use the correct dictionary.
@@ -1284,9 +1301,9 @@ _Reserved ranges :_
If the frame is going to be distributed in a private environment,
any `Dictionary_ID` can be used.
However, for public distribution of compressed frames,
the following ranges are reserved for future use and should not be used :
the following ranges are reserved and shall not be used :
- low range : 1 - 32767
- low range : <= 32767
- high range : >= (2^31)
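The reserved ranges above amount to the following check (an illustrative helper, not part of the reference implementation):

```python
def dictionary_id_usable_publicly(dict_id):
    # 0 means "no Dictionary_ID"; IDs <= 32767 and >= 2**31 are
    # reserved and shall not be used for public distribution.
    return 32767 < dict_id < 2 ** 31
```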
__`Entropy_Tables`__ : following the same format as the tables in compressed blocks.
@@ -1298,26 +1315,30 @@ __`Entropy_Tables`__ : following the same format as the tables in compressed blo
These tables populate the Repeat Stats literals mode and
Repeat distribution mode for sequence decoding.
It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
Each recent offset must have a value < dictionary size.
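The 12-byte tail could be parsed as in this sketch (names assumed; the zero lower bound on offsets is an assumption, since an offset of zero is never meaningful):

```python
import struct

def read_repeat_offsets(entropy_tail, content_size):
    # Three 4-byte little-endian values replace the default {1, 4, 8}.
    offsets = list(struct.unpack("<3I", entropy_tail[:12]))
    if not all(0 < o < content_size for o in offsets):
        raise ValueError("recent offset out of range")
    return offsets
```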
__`Content`__ : The rest of the dictionary is its content.
The content acts as a "past" in front of the data to compress or decompress,
so it can be referenced in sequence commands.
As long as the amount of data decoded from this frame is less than or
equal to the window-size, sequence commands may specify offsets longer
than the lenght of total decoded output so far to reference back to the
dictionary. After the total output has surpassed the window size however,
equal to `Window_Size`, sequence commands may specify offsets longer
than the total length of decoded output so far to reference back to the
dictionary. After the total output has surpassed `Window_Size` however,
this is no longer allowed and the dictionary is no longer accessible.
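The reachability rule in the paragraph above can be sketched as an illustrative check (not the reference decoder's logic):

```python
def offset_is_reachable(offset, decoded_so_far, window_size, dict_size):
    # While total decoded output has not surpassed Window_Size, matches
    # may reach back through the output into the dictionary; afterwards
    # only Window_Size bytes of history remain addressable.
    if decoded_so_far <= window_size:
        history = decoded_so_far + dict_size
    else:
        history = window_size
    return 0 < offset <= history
```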
[compressed blocks]: #the-format-of-compressed_block
Appendix A - Decoding tables for predefined codes
-------------------------------------------------
This appendix contains FSE decoding tables for the predefined literal length, match length, and offset
codes. The tables have been constructed using the algorithm as given above in the
"from normalized distribution to decoding tables" chapter. The tables here can be used as examples
to crosscheck that an implementation implements the decoding table generation algorithm correctly.
This appendix contains FSE decoding tables
for the predefined literal length, match length, and offset codes.
The tables have been constructed using the algorithm given above in the chapter
"from normalized distribution to decoding tables".
The tables here can be used as examples
to crosscheck that an implementation builds its decoding tables correctly.
#### Literal Length Code:
@@ -1496,6 +1517,7 @@ to crosscheck that an implementation implements the decoding table generation al
Version changes
---------------
- 0.2.5 : minor typos and clarifications
- 0.2.4 : section restructuring, by Sean Purcell
- 0.2.3 : clarified several details, by Sean Purcell
- 0.2.2 : added predefined codes, by Johannes Rudolph