mirror of
https://github.com/facebook/zstd.git
synced 2025-07-30 22:23:13 +03:00
numerous typos and clarifications in format specification
fix limit values of Window_Size bump version to 0.2.5
This commit is contained in:
@ -16,7 +16,8 @@ Distribution of this document is unlimited.
|
|||||||
|
|
||||||
### Version
|
### Version
|
||||||
|
|
||||||
0.2.4 (17/02/17)
|
0.2.5 (31/03/17)
|
||||||
|
|
||||||
|
|
||||||
Introduction
|
Introduction
|
||||||
------------
|
------------
|
||||||
@ -109,7 +110,7 @@ The structure of a single Zstandard frame is following:
|
|||||||
|
|
||||||
__`Magic_Number`__
|
__`Magic_Number`__
|
||||||
|
|
||||||
4 Bytes, little-endian format.
|
4 Bytes, __little-endian__ format.
|
||||||
Value : 0xFD2FB528
|
Value : 0xFD2FB528
|
||||||
|
|
||||||
__`Frame_Header`__
|
__`Frame_Header`__
|
||||||
@ -127,7 +128,7 @@ An optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
|
|||||||
The content checksum is the result
|
The content checksum is the result
|
||||||
of [xxh64() hash function](http://www.xxhash.org)
|
of [xxh64() hash function](http://www.xxhash.org)
|
||||||
digesting the original (decoded) data as input, and a seed of zero.
|
digesting the original (decoded) data as input, and a seed of zero.
|
||||||
The low 4 bytes of the checksum are stored in little endian format.
|
The low 4 bytes of the checksum are stored in __little-endian__ format.
|
||||||
|
|
||||||
### `Frame_Header`
|
### `Frame_Header`
|
||||||
|
|
||||||
@ -154,41 +155,42 @@ Decoding this byte is enough to tell the size of `Frame_Header`.
|
|||||||
| 2 | `Content_Checksum_flag` |
|
| 2 | `Content_Checksum_flag` |
|
||||||
| 1-0 | `Dictionary_ID_flag` |
|
| 1-0 | `Dictionary_ID_flag` |
|
||||||
|
|
||||||
In this table, bit 7 the is highest bit, while bit 0 the is lowest.
|
In this table, bit 7 is the highest bit, while bit 0 is the lowest one.
|
||||||
|
|
||||||
__`Frame_Content_Size_flag`__
|
__`Frame_Content_Size_flag`__
|
||||||
|
|
||||||
This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
|
This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
|
||||||
specifying if decompressed data size is provided within the header.
|
specifying if `Frame_Content_Size` (the decompressed data size)
|
||||||
The `Flag_Value` can be converted into `Field_Size`,
|
is provided within the header.
|
||||||
|
`Flag_Value` provides `FCS_Field_Size`,
|
||||||
which is the number of bytes used by `Frame_Content_Size`
|
which is the number of bytes used by `Frame_Content_Size`
|
||||||
according to the following table:
|
according to the following table:
|
||||||
|
|
||||||
|`Flag_Value`| 0 | 1 | 2 | 3 |
|
| `Flag_Value` | 0 | 1 | 2 | 3 |
|
||||||
| ---------- | ------ | --- | --- | --- |
|
| -------------- | ------ | --- | --- | --- |
|
||||||
|`Field_Size`| 0 or 1 | 2 | 4 | 8 |
|
|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 |
|
||||||
|
|
||||||
When `Flag_Value` is `0`, `Field_Size` depends on `Single_Segment_flag` :
|
When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
|
||||||
if `Single_Segment_flag` is set, `Field_Size` is 1.
|
if `Single_Segment_flag` is set, `Field_Size` is 1.
|
||||||
Otherwise, `Field_Size` is 0 (content size not provided).
|
Otherwise, `Field_Size` is 0 : `Frame_Content_Size` is not provided.
|
||||||
|
|
||||||
__`Single_Segment_flag`__
|
__`Single_Segment_flag`__
|
||||||
|
|
||||||
If this flag is set,
|
If this flag is set,
|
||||||
data must be regenerated within a single continuous memory segment.
|
data must be regenerated within a single continuous memory segment.
|
||||||
|
|
||||||
In this case, `Frame_Content_Size` is necessarily present,
|
In this case, `Window_Descriptor` byte is skipped,
|
||||||
but `Window_Descriptor` byte is skipped.
|
but `Frame_Content_Size` is necessarily present.
|
||||||
As a consequence, the decoder must allocate a memory segment
|
As a consequence, the decoder must allocate a memory segment
|
||||||
of size equal or bigger than `Frame_Content_Size`.
|
of size equal or bigger than `Frame_Content_Size`.
|
||||||
|
|
||||||
In order to preserve the decoder from unreasonable memory requirements,
|
In order to preserve the decoder from unreasonable memory requirements,
|
||||||
a decoder can reject a compressed frame
|
a decoder is allowed to reject a compressed frame
|
||||||
which requests a memory size beyond decoder's authorized range.
|
which requests a memory size beyond decoder's authorized range.
|
||||||
|
|
||||||
For broader compatibility, decoders are recommended to support
|
For broader compatibility, decoders are recommended to support
|
||||||
memory sizes of at least 8 MB.
|
memory sizes of at least 8 MB.
|
||||||
This is just a recommendation,
|
This is only a recommendation,
|
||||||
each decoder is free to support higher or lower limits,
|
each decoder is free to support higher or lower limits,
|
||||||
depending on local limitations.
|
depending on local limitations.
|
||||||
|
|
||||||
@ -224,37 +226,38 @@ It also specifies the size of this field as `Field_Size`.
|
|||||||
|
|
||||||
#### `Window_Descriptor`
|
#### `Window_Descriptor`
|
||||||
|
|
||||||
Provides guarantees on maximum back-reference distance
|
Provides guarantees on minimum memory buffer required to decompress a frame.
|
||||||
that will be used within compressed data.
|
|
||||||
This information is important for decoders to allocate enough memory.
|
This information is important for decoders to allocate enough memory.
|
||||||
|
|
||||||
The `Window_Descriptor` byte is optional. It is absent when `Single_Segment_flag` is set.
|
The `Window_Descriptor` byte is optional.
|
||||||
In this case, the maximum back-reference distance is the content size itself,
|
When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
|
||||||
which can be any value from 1 to 2^64-1 bytes (16 EB).
|
In this case, the required buffer size is the frame content size itself,
|
||||||
|
which can be any value from 0 to 2^64-1 bytes (16 EB).
|
||||||
|
|
||||||
| Bit numbers | 7-3 | 2-0 |
|
| Bit numbers | 7-3 | 2-0 |
|
||||||
| ----------- | ---------- | ---------- |
|
| ----------- | ---------- | ---------- |
|
||||||
| Field name | `Exponent` | `Mantissa` |
|
| Field name | `Exponent` | `Mantissa` |
|
||||||
|
|
||||||
Maximum distance is given by the following formulas :
|
The minimum memory buffer size is called `Window_Size`.
|
||||||
|
It is described by the following formulas :
|
||||||
```
|
```
|
||||||
windowLog = 10 + Exponent;
|
windowLog = 10 + Exponent;
|
||||||
windowBase = 1 << windowLog;
|
windowBase = 1 << windowLog;
|
||||||
windowAdd = (windowBase / 8) * Mantissa;
|
windowAdd = (windowBase / 8) * Mantissa;
|
||||||
Window_Size = windowBase + windowAdd;
|
Window_Size = windowBase + windowAdd;
|
||||||
```
|
```
|
||||||
The minimum window size is 1 KB.
|
The minimum `Window_Size` is 1 KB.
|
||||||
The maximum size is `15*(1<<38)` bytes, which is 1.875 TB.
|
The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
|
||||||
|
|
||||||
To properly decode compressed data,
|
To properly decode compressed data,
|
||||||
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
|
a decoder will need to allocate a buffer of at least `Window_Size` bytes.
|
||||||
|
|
||||||
In order to preserve decoder from unreasonable memory requirements,
|
In order to preserve decoder from unreasonable memory requirements,
|
||||||
a decoder can refuse a compressed frame
|
a decoder is allowed to reject a compressed frame
|
||||||
which requests a memory size beyond decoder's authorized range.
|
which requests a memory size beyond decoder's authorized range.
|
||||||
|
|
||||||
For improved interoperability,
|
For improved interoperability,
|
||||||
decoders are recommended to be compatible with window sizes of 8 MB,
|
decoders are recommended to be compatible with `Window_Size >= 8 MB`,
|
||||||
and encoders are recommended to not request more than 8 MB.
|
and encoders are recommended to not request more than 8 MB.
|
||||||
It's merely a recommendation though,
|
It's merely a recommendation though,
|
||||||
decoders are free to support larger or lower limits,
|
decoders are free to support larger or lower limits,
|
||||||
@ -264,47 +267,50 @@ depending on local limitations.
|
|||||||
|
|
||||||
This is a variable size field, which contains
|
This is a variable size field, which contains
|
||||||
the ID of the dictionary required to properly decode the frame.
|
the ID of the dictionary required to properly decode the frame.
|
||||||
Note that this field is optional. When it's not present,
|
`Dictionary_ID` field is optional. When it's not present,
|
||||||
it's up to the decoder to make sure it uses the correct dictionary.
|
it's up to the decoder to make sure it uses the correct dictionary.
|
||||||
Format is little-endian.
|
|
||||||
|
|
||||||
Field size depends on `Dictionary_ID_flag`.
|
Field size depends on `Dictionary_ID_flag`.
|
||||||
1 byte can represent an ID 0-255.
|
1 byte can represent an ID 0-255.
|
||||||
2 bytes can represent an ID 0-65535.
|
2 bytes can represent an ID 0-65535.
|
||||||
4 bytes can represent an ID 0-4294967295.
|
4 bytes can represent an ID 0-4294967295.
|
||||||
|
Format is __little-endian__.
|
||||||
|
|
||||||
It's allowed to represent a small ID (for example `13`)
|
It's allowed to represent a small ID (for example `13`)
|
||||||
with a large 4-bytes dictionary ID, losing some compacity in the process.
|
with a large 4-bytes dictionary ID, even if it is less efficient.
|
||||||
|
|
||||||
_Reserved ranges :_
|
_Reserved ranges :_
|
||||||
If the frame is going to be distributed in a private environment,
|
If the frame is going to be distributed in a private environment,
|
||||||
any dictionary ID can be used.
|
any dictionary ID can be used.
|
||||||
However, for public distribution of compressed frames using a dictionary,
|
However, for public distribution of compressed frames using a dictionary,
|
||||||
the following ranges are reserved for future use and should not be used :
|
the following ranges are reserved and shall not be used :
|
||||||
- low range : 1 - 32767
|
- low range : `<= 32767`
|
||||||
- high range : >= (2^31)
|
- high range : `>= (1 << 31)`
|
||||||
|
|
||||||
|
|
||||||
#### `Frame_Content_Size`
|
#### `Frame_Content_Size`
|
||||||
|
|
||||||
This is the original (uncompressed) size. This information is optional.
|
This is the original (uncompressed) size. This information is optional.
|
||||||
The `Field_Size` is provided according to value of `Frame_Content_Size_flag`.
|
`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
|
||||||
The `Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
|
`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
|
||||||
Format is little-endian.
|
`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
|
||||||
|
|
||||||
| `Field_Size` | Range |
|
| `FCS_Field_Size` | Range |
|
||||||
| ------------ | ---------- |
|
| ---------------- | ---------- |
|
||||||
| 1 | 0 - 255 |
|
| 0 | unknown |
|
||||||
| 2 | 256 - 65791|
|
| 1 | 0 - 255 |
|
||||||
| 4 | 0 - 2^32-1 |
|
| 2 | 256 - 65791|
|
||||||
| 8 | 0 - 2^64-1 |
|
| 4 | 0 - 2^32-1 |
|
||||||
|
| 8 | 0 - 2^64-1 |
|
||||||
|
|
||||||
When `Field_Size` is 1, 4 or 8 bytes, the value is read directly.
|
`Frame_Content_Size` format is __little-endian__.
|
||||||
When `Field_Size` is 2, _the offset of 256 is added_.
|
When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
|
||||||
|
When `FCS_Field_Size` is 2, _the offset of 256 is added_.
|
||||||
It's allowed to represent a small size (for example `18`) using any compatible variant.
|
It's allowed to represent a small size (for example `18`) using any compatible variant.
|
||||||
|
|
||||||
|
|
||||||
Blocks
|
Blocks
|
||||||
-------
|
-------
|
||||||
|
|
||||||
After the magic number and header of each block,
|
After the magic number and header of each block,
|
||||||
there are some number of blocks.
|
there are some number of blocks.
|
||||||
Each frame must have at least one block but there is no upper limit
|
Each frame must have at least one block but there is no upper limit
|
||||||
@ -312,64 +318,68 @@ on the number of blocks per frame.
|
|||||||
|
|
||||||
The structure of a block is as follows:
|
The structure of a block is as follows:
|
||||||
|
|
||||||
| `Last_Block` | `Block_Type` | `Block_Size` | `Block_Content` |
|
| `Block_Header` | `Block_Content` |
|
||||||
|:------------:|:------------:|:------------:|:---------------:|
|
|:--------------:|:---------------:|
|
||||||
| 1 bit | 2 bits | 21 bits | n bytes |
|
| 3 bytes | n bytes |
|
||||||
|
|
||||||
The block header (`Last_Block`, `Block_Type`, and `Block_Size`) uses 3-bytes.
|
`Block_Header` uses 3 bytes, written using __little-endian__ convention.
|
||||||
|
It contains 3 fields :
|
||||||
|
|
||||||
|
| `Block_Size` | `Block_Type` | `Last_Block` |
|
||||||
|
|:------------:|:------------:|:------------:|
|
||||||
|
| bits 3-23 | bits 1-2 | bit 0 |
|
||||||
|
|
||||||
__`Last_Block`__
|
__`Last_Block`__
|
||||||
|
|
||||||
The lowest bit signals if this block is the last one.
|
The lowest bit signals if this block is the last one.
|
||||||
The frame will end after this one.
|
The frame will end after this last block.
|
||||||
It may be followed by an optional `Content_Checksum`
|
It may be followed by an optional `Content_Checksum`
|
||||||
(see [Zstandard Frames](#zstandard-frames)).
|
(see [Zstandard Frames](#zstandard-frames)).
|
||||||
|
|
||||||
__`Block_Type` and `Block_Size`__
|
__`Block_Type`__
|
||||||
|
|
||||||
The next 2 bits represent the `Block_Type`,
|
|
||||||
while the remaining 21 bits represent the `Block_Size`.
|
|
||||||
Format is __little-endian__.
|
|
||||||
|
|
||||||
|
The next 2 bits represent the `Block_Type`.
|
||||||
There are 4 block types :
|
There are 4 block types :
|
||||||
|
|
||||||
| Value | 0 | 1 | 2 | 3 |
|
| Value | 0 | 1 | 2 | 3 |
|
||||||
| ------------ | ----------- | ----------- | ------------------ | --------- |
|
| ------------ | ----------- | ----------- | ------------------ | --------- |
|
||||||
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
|
| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
|
||||||
|
|
||||||
- `Raw_Block` - this is an uncompressed block.
|
- `Raw_Block` - this is an uncompressed block.
|
||||||
`Block_Content` contains `Block_Size` bytes to read and copy
|
`Block_Content` contains `Block_Size` bytes.
|
||||||
as decoded data.
|
|
||||||
|
|
||||||
- `RLE_Block` - this is a single byte, repeated N times.
|
- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
|
||||||
`Block_Content` consists of a single byte,
|
`Block_Content` consists of a single byte.
|
||||||
and `Block_Size` is the number of times this byte should be repeated.
|
On the decompression side, this byte must be repeated `Block_Size` times.
|
||||||
|
|
||||||
- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
|
- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
|
||||||
explained later on.
|
explained later on.
|
||||||
`Block_Size` is the length of `Block_Content`, the compressed data.
|
`Block_Size` is the length of `Block_Content`, the compressed data.
|
||||||
The decompressed size is unknown,
|
The decompressed size is not known,
|
||||||
but its maximum possible value is guaranteed (see below)
|
but its maximum possible value is guaranteed (see below)
|
||||||
|
|
||||||
- `Reserved` - this is not a block.
|
- `Reserved` - this is not a block.
|
||||||
This value cannot be used with current version of this specification.
|
This value cannot be used with current version of this specification.
|
||||||
|
|
||||||
|
__`Block_Size`__
|
||||||
|
|
||||||
|
The upper 21 bits of `Block_Header` represent the `Block_Size`.
|
||||||
|
|
||||||
Block sizes must respect a few rules :
|
Block sizes must respect a few rules :
|
||||||
- In compressed mode, compressed size is always strictly less than decompressed size.
|
- For `Compressed_Block`, `Block_Size` is always strictly less than decompressed size.
|
||||||
- Block decompressed size is always <= maximum back-reference distance.
|
- Block decompressed size is always <= `Window_Size`
|
||||||
- Block decompressed size is always <= 128 KB.
|
- Block decompressed size is always <= 128 KB.
|
||||||
|
|
||||||
A data block is not necessarily "full" :
|
A block can contain any number of bytes (even empty),
|
||||||
since an arbitrary “flush” may happen anytime,
|
|
||||||
block decompressed content can be any size (even empty),
|
|
||||||
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
|
up to `Block_Maximum_Decompressed_Size`, which is the smallest of :
|
||||||
- Maximum back-reference distance
|
- `Window_Size`
|
||||||
- 128 KB
|
- 128 KB
|
||||||
|
|
||||||
|
|
||||||
Compressed Blocks
|
Compressed Blocks
|
||||||
-----------------
|
-----------------
|
||||||
To decompress a compressed block, the compressed size must be provided from
|
To decompress a compressed block, the compressed size must be provided
|
||||||
`Block_Size` field in the block header.
|
from `Block_Size` field within `Block_Header`.
|
||||||
|
|
||||||
A compressed block consists of 2 sections :
|
A compressed block consists of 2 sections :
|
||||||
- [Literals Section](#literals-section)
|
- [Literals Section](#literals-section)
|
||||||
@ -381,36 +391,34 @@ data in [Sequence Execution](#sequence-execution)
|
|||||||
#### Prerequisites
|
#### Prerequisites
|
||||||
To decode a compressed block, the following elements are necessary :
|
To decode a compressed block, the following elements are necessary :
|
||||||
- Previous decoded data, up to a distance of `Window_Size`,
|
- Previous decoded data, up to a distance of `Window_Size`,
|
||||||
or all previous data when `Single_Segment_flag` is set.
|
or all previously decoded data when `Single_Segment_flag` is set.
|
||||||
- List of "recent offsets" from the previous compressed block.
|
- List of "recent offsets" from previous `Compressed_Block`.
|
||||||
- Decoding tables of the previous compressed block for each symbol type
|
- Decoding tables of previous `Compressed_Block` for each symbol type
|
||||||
(literals, literals lengths, match lengths, offsets).
|
(literals, literals lengths, match lengths, offsets).
|
||||||
|
|
||||||
Literals Section
|
Literals Section
|
||||||
----------------
|
----------------
|
||||||
During sequence execution, symbols from the literals section
|
|
||||||
During sequence phase, literals will be entangled with match copy operations.
|
|
||||||
All literals are regrouped in the first part of the block.
|
All literals are regrouped in the first part of the block.
|
||||||
They can be decoded first, and then copied during sequence operations,
|
They can be decoded first, and then copied during [Sequence Execution],
|
||||||
or they can be decoded on the flow, as needed by sequence commands.
|
or they can be decoded on the flow during [Sequence Execution].
|
||||||
|
|
||||||
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
|
|
||||||
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
|
|
||||||
|
|
||||||
Literals can be stored uncompressed or compressed using Huffman prefix codes.
|
Literals can be stored uncompressed or compressed using Huffman prefix codes.
|
||||||
When compressed, an optional tree description can be present,
|
When compressed, an optional tree description can be present,
|
||||||
followed by 1 or 4 streams.
|
followed by 1 or 4 streams.
|
||||||
|
|
||||||
|
| `Literals_Section_Header` | [`Huffman_Tree_Description`] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
|
||||||
|
| ------------------------- | ---------------------------- | ------- | --------- | --------- | --------- |
|
||||||
|
|
||||||
|
|
||||||
#### `Literals_Section_Header`
|
#### `Literals_Section_Header`
|
||||||
|
|
||||||
Header is in charge of describing how literals are packed.
|
Header is in charge of describing how literals are packed.
|
||||||
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
|
It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
|
||||||
using little-endian convention.
|
using __little-endian__ convention.
|
||||||
|
|
||||||
| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
|
| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
|
||||||
| --------------------- | ------------- | ------------------ | ----------------- |
|
| --------------------- | ------------- | ------------------ | ------------------- |
|
||||||
| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
|
| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits |
|
||||||
|
|
||||||
In this representation, bits on the left are the lowest bits.
|
In this representation, bits on the left are the lowest bits.
|
||||||
|
|
||||||
@ -418,33 +426,38 @@ __`Literals_Block_Type`__
|
|||||||
|
|
||||||
This field uses 2 lowest bits of first byte, describing 4 different block types :
|
This field uses 2 lowest bits of first byte, describing 4 different block types :
|
||||||
|
|
||||||
| `Literals_Block_Type` | Value |
|
| `Literals_Block_Type` | Value |
|
||||||
| ----------------------------- | ----- |
|
| --------------------------- | ----- |
|
||||||
| `Raw_Literals_Block` | 0 |
|
| `Raw_Literals_Block` | 0 |
|
||||||
| `RLE_Literals_Block` | 1 |
|
| `RLE_Literals_Block` | 1 |
|
||||||
| `Compressed_Literals_Block` | 2 |
|
| `Compressed_Literals_Block` | 2 |
|
||||||
| `Repeat_Stats_Literals_Block` | 3 |
|
| `Treeless_Literals_Block` | 3 |
|
||||||
|
|
||||||
- `Raw_Literals_Block` - Literals are stored uncompressed.
|
- `Raw_Literals_Block` - Literals are stored uncompressed.
|
||||||
- `RLE_Literals_Block` - Literals consist of a single byte value repeated N times.
|
- `RLE_Literals_Block` - Literals consist of a single byte value
|
||||||
|
repeated `Regenerated_Size` times.
|
||||||
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
|
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
|
||||||
starting with a Huffman tree description.
|
starting with a Huffman tree description.
|
||||||
See details below.
|
See details below.
|
||||||
- `Repeat_Stats_Literals_Block` - This is a Huffman-compressed block,
|
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
|
||||||
using Huffman tree _from previous Huffman-compressed literals block_.
|
using Huffman tree _from previous Huffman-compressed literals block_.
|
||||||
Huffman tree description will be skipped.
|
`Huffman_Tree_Description` will be skipped.
|
||||||
Note: If this mode is used without any previous Huffman-table in the frame
|
Note: If this mode is triggering without any previous Huffman-table in the frame
|
||||||
(or [dictionary](#dictionary-format)), this should be treated as corruption.
|
(or [dictionary](#dictionary-format)), this should be treated as data corruption.
|
||||||
|
|
||||||
__`Size_Format`__
|
__`Size_Format`__
|
||||||
|
|
||||||
`Size_Format` is divided into 2 families :
|
`Size_Format` is divided into 2 families :
|
||||||
|
|
||||||
- For `Raw_Literals_Block` and `RLE_Literals_Block` it's enough to decode `Regenerated_Size`.
|
- For `Raw_Literals_Block` and `RLE_Literals_Block`,
|
||||||
- For `Compressed_Block`, its required to decode both `Compressed_Size`
|
it's only necessary to decode `Regenerated_Size`.
|
||||||
and `Regenerated_Size` (the decompressed size). It will also decode the number of streams.
|
There is no `Compressed_Size` field.
|
||||||
|
- For `Compressed_Block` and `Treeless_Literals_Block`,
|
||||||
|
it's required to decode both `Compressed_Size`
|
||||||
|
and `Regenerated_Size` (the decompressed size).
|
||||||
|
It's also necessary to decode the number of streams (1 or 4).
|
||||||
|
|
||||||
For values spanning several bytes, convention is little-endian.
|
For values spanning several bytes, convention is __little-endian__.
|
||||||
|
|
||||||
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
|
__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
|
||||||
|
|
||||||
@ -463,9 +476,9 @@ __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
|
|||||||
|
|
||||||
Only Stream1 is present for these cases.
|
Only Stream1 is present for these cases.
|
||||||
Note : it's allowed to represent a short value (for example `13`)
|
Note : it's allowed to represent a short value (for example `13`)
|
||||||
using a long format, accepting the increased compressed data size.
|
using a long format, even if it's less efficient.
|
||||||
|
|
||||||
__`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block`__ :
|
__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
|
||||||
|
|
||||||
- Value 00 : _A single stream_.
|
- Value 00 : _A single stream_.
|
||||||
Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
|
Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
|
||||||
@ -480,63 +493,64 @@ __`Size_Format` for `Compressed_Literals_Block` and `Repeat_Stats_Literals_Block
|
|||||||
Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
|
Both `Regenerated_Size` and `Compressed_Size` use 18 bits (0-262143).
|
||||||
`Literals_Section_Header` has 5 bytes.
|
`Literals_Section_Header` has 5 bytes.
|
||||||
|
|
||||||
Both `Compressed_Size` and `Regenerated_Size` fields follow little-endian convention.
|
Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
|
||||||
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description if it
|
Note: `Compressed_Size` __includes__ the size of the Huffman Tree description
|
||||||
is present.
|
_when_ it is present.
|
||||||
|
|
||||||
### Raw Literals Block
|
### Raw Literals Block
|
||||||
The data in Stream1 is `Regenerated_Size` bytes long, and contains the raw literals data
|
The data in Stream1 is `Regenerated_Size` bytes long,
|
||||||
to be used in sequence execution.
|
it contains the raw literals data to be used during [Sequence Execution].
|
||||||
|
|
||||||
### RLE Literals Block
|
### RLE Literals Block
|
||||||
Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
|
Stream1 consists of a single byte which should be repeated `Regenerated_Size` times
|
||||||
to generate the decoded literals.
|
to generate the decoded literals.
|
||||||
|
|
||||||
### Compressed Literals Block and Repeat Stats Literals Block
|
### Compressed Literals Block and Treeless Literals Block
|
||||||
Both of these modes contain Huffman encoded data
|
Both of these modes contain Huffman encoded data.
|
||||||
|
`Treeless_Literals_Block` does not have a `Huffman_Tree_Description`.
|
||||||
|
|
||||||
#### `Huffman_Tree_Description`
|
#### `Huffman_Tree_Description`
|
||||||
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
|
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
|
||||||
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
|
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
|
||||||
The size Huffman Tree description will be determined during the decoding process,
|
The size of `Huffman_Tree_Description` is determined during decoding process,
|
||||||
and must be used to determine where the compressed Huffman streams begin.
|
it must be used to determine where streams begin.
|
||||||
|
`Total_Streams_Size = Compress_Size - Huffman_Tree_Description_Size`.
|
||||||
|
|
||||||
If repeat stats mode is used, the Huffman table used in the previous compressed block will
|
For `Treeless_Literals_Block`,
|
||||||
be used to decompress this block as well.
|
the Huffman table comes from previously compressed literals block.
|
||||||
|
|
||||||
Huffman compressed data consists either 1 or 4 Huffman-coded streams.
|
Huffman compressed data consists of either 1 or 4 Huffman-coded streams.
|
||||||
|
|
||||||
If only one stream is present, it is a single bitstream occupying the entire
|
If only one stream is present, it is a single bitstream occupying the entire
|
||||||
remaining portion of the literals block, encoded as described at
|
remaining portion of the literals block, encoded as described within
|
||||||
[Huffman-Coded Streams](#huffman-coded-streams).
|
[Huffman-Coded Streams](#huffman-coded-streams).
|
||||||
|
|
||||||
If there are four streams, the literals section header only provides enough
|
If there are four streams, the literals section header only provides enough
|
||||||
information to know the regenerated and compressed sizes of all four streams combined.
|
information to know the decompressed and compressed sizes of all four streams _combined_.
|
||||||
The regenerated size of each stream is equal to `(totalSize+3)/4`, except for the last stream,
|
The decompressed size of each stream is equal to `(totalSize+3)/4`,
|
||||||
which may be up to 3 bytes smaller, to reach a total decompressed size match that described
|
except for the last stream which may be up to 3 bytes smaller,
|
||||||
in the literals header.
|
to reach a total decompressed size as specified in `Regenerated_Size`.
|
||||||
|
|
||||||
The compressed size of each stream is provided explicitly: the first 6 bytes of the compressed
|
The compressed size of each stream is provided explicitly:
|
||||||
data consist of three 2-byte little endian fields, describing the compressed sizes
|
the first 6 bytes of the compressed data consist of three 2-byte __little-endian__ fields,
|
||||||
of the first three streams.
|
describing the compressed sizes of the first three streams.
|
||||||
The last streams size is computed from the total compressed size and the size of the other
|
Stream4 size is computed from total `Total_Streams_Size` minus sizes of other streams.
|
||||||
three streams.
|
|
||||||
|
|
||||||
`stream4CSize = totalCSize - 6 - stream1CSize - stream2CSize - stream3CSize`.
|
`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
|
||||||
|
|
||||||
Note: remember that totalCSize may be smaller than the `Compressed_Size` found in the literals
|
Note: remember that `Total_Streams_Size` can be smaller than `Compressed_Size` in header,
|
||||||
block header as `Compressed_Size` also contains the size of the Huffman Tree description if it
|
because `Compressed_Size` also contains `Huffman_Tree_Description_Size` when it is present.
|
||||||
is present.
|
|
||||||
|
|
||||||
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
|
Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
|
||||||
as described at [Huffman-Coded Streams](#huffman-coded-streams)
|
as described at [Huffman-Coded Streams](#huffman-coded-streams)
|
||||||
|
|
||||||
|
|
||||||
Sequences Section
|
Sequences Section
|
||||||
-----------------
|
-----------------
|
||||||
A compressed block is a succession of _sequences_ .
|
A compressed block is a succession of _sequences_ .
|
||||||
A sequence is a literal copy command, followed by a match copy command.
|
A sequence is a literal copy command, followed by a match copy command.
|
||||||
A literal copy command specifies a length.
|
A literal copy command specifies a length.
|
||||||
It is the number of bytes to be copied (or extracted) from the literal section.
|
It is the number of bytes to be copied (or extracted) from the [Literals Section].
|
||||||
A match copy command specifies an offset and a length.
|
A match copy command specifies an offset and a length.
|
||||||
|
|
||||||
When all _sequences_ are decoded,
|
When all _sequences_ are decoded,
|
||||||
@ -557,7 +571,7 @@ followed by the bitstream.
|
|||||||
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
|
| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
|
||||||
|
|
||||||
To decode the `Sequences_Section`, it's required to know its size.
|
To decode the `Sequences_Section`, it's required to know its size.
|
||||||
This size is deduced from `blockSize - literalSectionSize`.
|
This size is deduced from `Block_Size - Literals_Section_Size`.
|
||||||
|
|
||||||
|
|
||||||
#### `Sequences_Section_Header`
|
#### `Sequences_Section_Header`
|
||||||
@ -572,7 +586,7 @@ This is a variable size field using between 1 and 3 bytes.
|
|||||||
Let's call its first byte `byte0`.
|
Let's call its first byte `byte0`.
|
||||||
- `if (byte0 == 0)` : there are no sequences.
|
- `if (byte0 == 0)` : there are no sequences.
|
||||||
The sequence section stops there.
|
The sequence section stops there.
|
||||||
Regenerated content is defined entirely by literals section.
|
Decompressed content is defined entirely as [Literals Section] content.
|
||||||
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
|
- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
|
||||||
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
|
- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes.
|
||||||
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
|
- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes.
|
||||||
@ -581,14 +595,14 @@ __Symbol compression modes__
|
|||||||
|
|
||||||
This is a single byte, defining the compression mode of each symbol type.
|
This is a single byte, defining the compression mode of each symbol type.
|
||||||
|
|
||||||
|Bit number| 7-6 | 5-4 | 3-2 | 1-0 |
|
|Bit number| 7-6 | 5-4 | 3-2 | 1-0 |
|
||||||
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|
| -------- | ----------------------- | -------------- | -------------------- | ---------- |
|
||||||
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
|
|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
|
||||||
|
|
||||||
The last field, `Reserved`, must be all-zeroes.
|
The last field, `Reserved`, must be all-zeroes.
|
||||||
|
|
||||||
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
|
`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
|
||||||
literals lengths, offsets, and match lengths respectively.
|
literals lengths, offsets, and match lengths symbols respectively.
|
||||||
|
|
||||||
They follow the same enumeration :
|
They follow the same enumeration :
|
||||||
|
|
||||||
@ -598,12 +612,12 @@ They follow the same enumeration :
|
|||||||
|
|
||||||
- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
|
- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
|
||||||
[default distributions](#default-distributions).
|
[default distributions](#default-distributions).
|
||||||
The table takes no space in the compressed data.
|
No distribution table will be present.
|
||||||
- `RLE_Mode` : The table description consists of a single byte.
|
- `RLE_Mode` : The table description consists of a single byte.
|
||||||
This code will be repeated for every sequence.
|
This code will be repeated for all sequences.
|
||||||
- `Repeat_Mode` : The table used in the previous compressed block will be used again.
|
- `Repeat_Mode` : The table used in the previous compressed block will be used again.
|
||||||
No distribution table will be present.
|
No distribution table will be present.
|
||||||
Note: this includes RLE mode, so if repeat_mode follows rle_mode the same symbol will be repeated.
|
Note: this includes RLE mode, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
|
||||||
If this mode is used without any previous sequence table in the frame
|
If this mode is used without any previous sequence table in the frame
|
||||||
(or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
|
(or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
|
||||||
- `FSE_Compressed_Mode` : standard FSE compression.
|
- `FSE_Compressed_Mode` : standard FSE compression.
|
||||||
@ -625,7 +639,7 @@ Literals length codes are values ranging from `0` to `35` included.
|
|||||||
They define lengths from 0 to 131071 bytes.
|
They define lengths from 0 to 131071 bytes.
|
||||||
The literals length is equal to the decoded `Baseline` plus
|
The literals length is equal to the decoded `Baseline` plus
|
||||||
the result of reading `Number_of_Bits` bits from the bitstream,
|
the result of reading `Number_of_Bits` bits from the bitstream,
|
||||||
as a little-endian value.
|
as a __little-endian__ value.
|
||||||
|
|
||||||
| `Literals_Length_Code` | 0-15 |
|
| `Literals_Length_Code` | 0-15 |
|
||||||
| ---------------------- | ---------------------- |
|
| ---------------------- | ---------------------- |
|
||||||
@ -654,7 +668,7 @@ Match length codes are values ranging from `0` to `52` included.
|
|||||||
They define lengths from 3 to 131074 bytes.
|
They define lengths from 3 to 131074 bytes.
|
||||||
The match length is equal to the decoded `Baseline` plus
|
The match length is equal to the decoded `Baseline` plus
|
||||||
the result of reading `Number_of_Bits` bits from the bitstream,
|
the result of reading `Number_of_Bits` bits from the bitstream,
|
||||||
as a little-endian value.
|
as a __little-endian__ value.
|
||||||
|
|
||||||
| `Match_Length_Code` | 0-31 |
|
| `Match_Length_Code` | 0-31 |
|
||||||
| ------------------- | ----------------------- |
|
| ------------------- | ----------------------- |
|
||||||
@ -685,7 +699,7 @@ Recommendation is to support at least up to `22`.
|
|||||||
For information, at the time of this writing.
|
For information, at the time of this writing.
|
||||||
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
|
the reference decoder supports a maximum `N` value of `28` in 64-bits mode.
|
||||||
|
|
||||||
An offset code is also the number of additional bits to read in little-endian fashion,
|
An offset code is also the number of additional bits to read in __little-endian__ fashion,
|
||||||
and can be translated into an `Offset_Value` using the following formulas :
|
and can be translated into an `Offset_Value` using the following formulas :
|
||||||
|
|
||||||
```
|
```
|
||||||
@ -720,8 +734,8 @@ begins.
|
|||||||
FSE decoding requires a 'state' to be carried from symbol to symbol.
|
FSE decoding requires a 'state' to be carried from symbol to symbol.
|
||||||
For more explanation on FSE decoding, see the [FSE section](#fse).
|
For more explanation on FSE decoding, see the [FSE section](#fse).
|
||||||
|
|
||||||
For sequence decoding, a separate state must be kept track of for each of
|
For sequence decoding, a separate state keeps track of each
|
||||||
literal lengths, offsets, and match lengths.
|
literal lengths, offsets, and match lengths symbols.
|
||||||
Some FSE primitives are also used.
|
Some FSE primitives are also used.
|
||||||
For more details on the operation of these primitives, see the [FSE section](#fse).
|
For more details on the operation of these primitives, see the [FSE section](#fse).
|
||||||
|
|
||||||
@ -753,8 +767,7 @@ See the [description of the codes] for how to determine these values.
|
|||||||
[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
|
[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
|
||||||
|
|
||||||
Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
|
Decoding starts by reading the `Number_of_Bits` required to decode `Offset`.
|
||||||
It then does the same for `Match_Length`,
|
It then does the same for `Match_Length`, and then for `Literals_Length`.
|
||||||
and then for `Literals_Length`.
|
|
||||||
This sequence is then used for [sequence execution](#sequence-execution).
|
This sequence is then used for [sequence execution](#sequence-execution).
|
||||||
|
|
||||||
If it is not the last sequence in the block,
|
If it is not the last sequence in the block,
|
||||||
@ -807,6 +820,7 @@ short offsetCodes_defaultDistribution[29] =
|
|||||||
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
|
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
Sequence Execution
|
Sequence Execution
|
||||||
------------------
|
------------------
|
||||||
Once literals and sequences have been decoded,
|
Once literals and sequences have been decoded,
|
||||||
@ -826,7 +840,8 @@ in this case.
|
|||||||
|
|
||||||
The offset is defined as from the current position, so an offset of 6
|
The offset is defined as from the current position, so an offset of 6
|
||||||
and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
|
and a match length of 3 means that 3 bytes should be copied from 6 bytes back.
|
||||||
Note that all offsets must be at most equal to the window size defined by the frame header.
|
Note that all offsets leading to previously decoded data
|
||||||
|
must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
|
||||||
|
|
||||||
#### Repeat offsets
|
#### Repeat offsets
|
||||||
As seen in [Sequence Execution](#sequence-execution),
|
As seen in [Sequence Execution](#sequence-execution),
|
||||||
@ -842,11 +857,10 @@ so an `offset_value` of 1 means `Repeated_Offset2`,
|
|||||||
an `offset_value` of 2 means `Repeated_Offset3`,
|
an `offset_value` of 2 means `Repeated_Offset3`,
|
||||||
and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
|
and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`.
|
||||||
|
|
||||||
In the first block, the offset history is populated with the following values : 1, 4 and 8 (in order).
|
For the first block, the starting offset history is populated with the following values : 1, 4 and 8 (in order).
|
||||||
|
|
||||||
Then each block gets its starting offset history from the ending values of the most recent compressed block.
|
Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
|
||||||
Note that non-compressed blocks are skipped,
|
Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
|
||||||
they do not contribute to offset history.
|
|
||||||
|
|
||||||
[Offset Codes]: #offset-codes
|
[Offset Codes]: #offset-codes
|
||||||
|
|
||||||
@ -859,6 +873,7 @@ This means that when `Repeated_Offset1` (most recent) is used, history is unmodi
|
|||||||
When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
|
When `Repeated_Offset2` is used, it's swapped with `Repeated_Offset1`.
|
||||||
If any other offset is used, it becomes `Repeated_Offset1` and the rest are shift back by one.
|
If any other offset is used, it becomes `Repeated_Offset1` and the rest are shift back by one.
|
||||||
|
|
||||||
|
|
||||||
Skippable Frames
|
Skippable Frames
|
||||||
----------------
|
----------------
|
||||||
|
|
||||||
@ -878,7 +893,7 @@ Skippable frames defined in this specification are compatible with [LZ4] ones.
|
|||||||
|
|
||||||
__`Magic_Number`__
|
__`Magic_Number`__
|
||||||
|
|
||||||
4 Bytes, little-endian format.
|
4 Bytes, __little-endian__ format.
|
||||||
Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
|
Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
|
||||||
All 16 values are valid to identify a skippable frame.
|
All 16 values are valid to identify a skippable frame.
|
||||||
|
|
||||||
@ -886,13 +901,14 @@ __`Frame_Size`__
|
|||||||
|
|
||||||
This is the size, in bytes, of the following `User_Data`
|
This is the size, in bytes, of the following `User_Data`
|
||||||
(without including the magic number nor the size field itself).
|
(without including the magic number nor the size field itself).
|
||||||
This field is represented using 4 Bytes, little-endian format, unsigned 32-bits.
|
This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
|
||||||
This means `User_Data` can’t be bigger than (2^32-1) bytes.
|
This means `User_Data` can’t be bigger than (2^32-1) bytes.
|
||||||
|
|
||||||
__`User_Data`__
|
__`User_Data`__
|
||||||
|
|
||||||
The `User_Data` can be anything. Data will just be skipped by the decoder.
|
The `User_Data` can be anything. Data will just be skipped by the decoder.
|
||||||
|
|
||||||
|
|
||||||
Entropy Encoding
|
Entropy Encoding
|
||||||
----------------
|
----------------
|
||||||
Two types of entropy encoding are used by the Zstandard format:
|
Two types of entropy encoding are used by the Zstandard format:
|
||||||
@ -900,7 +916,7 @@ FSE, and Huffman coding.
|
|||||||
|
|
||||||
FSE
|
FSE
|
||||||
---
|
---
|
||||||
FSE, or FiniteStateEntropy is an entropy coding based on [ANS].
|
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
|
||||||
FSE encoding/decoding involves a state that is carried over between symbols,
|
FSE encoding/decoding involves a state that is carried over between symbols,
|
||||||
so decoding must be done in the opposite direction as encoding.
|
so decoding must be done in the opposite direction as encoding.
|
||||||
Therefore, all FSE bitstreams are read from end to beginning.
|
Therefore, all FSE bitstreams are read from end to beginning.
|
||||||
@ -909,15 +925,15 @@ For additional details on FSE, see [Finite State Entropy].
|
|||||||
|
|
||||||
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
|
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
|
||||||
|
|
||||||
FSE decoding involves a decoding table which has a power of 2 size and three elements:
|
FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
|
||||||
`Symbol`, `Num_Bits`, and `Baseline`.
|
`Symbol`, `Num_Bits`, and `Baseline`.
|
||||||
The `log2` of the table size is its `Accuracy_Log`.
|
The `log2` of the table size is its `Accuracy_Log`.
|
||||||
The FSE state represents an index in this table.
|
The FSE state represents an index in this table.
|
||||||
The next symbol in the stream is the symbol indicated by the table value for that state.
|
|
||||||
To obtain the next state value,
|
|
||||||
the decoder should consume `Num_Bits` bits from the stream as a little endian value and add it to baseline.
|
|
||||||
|
|
||||||
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a little endian value.
|
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
|
||||||
|
The next symbol in the stream is the `Symbol` indicated in the table for that state.
|
||||||
|
To obtain the next state value,
|
||||||
|
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
|
||||||
|
|
||||||
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
|
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
|
||||||
|
|
||||||
@ -929,7 +945,7 @@ An FSE distribution table describes the probabilities of all symbols
|
|||||||
from `0` to the last present one (included)
|
from `0` to the last present one (included)
|
||||||
on a normalized scale of `1 << Accuracy_Log` .
|
on a normalized scale of `1 << Accuracy_Log` .
|
||||||
|
|
||||||
It's a bitstream which is read forward, in little-endian fashion.
|
It's a bitstream which is read forward, in __little-endian__ fashion.
|
||||||
It's not necessary to know its exact size,
|
It's not necessary to know its exact size,
|
||||||
since it will be discovered and reported by the decoding process.
|
since it will be discovered and reported by the decoding process.
|
||||||
|
|
||||||
@ -1064,7 +1080,7 @@ Huffman Coding
|
|||||||
--------------
|
--------------
|
||||||
Zstandard Huffman-coded streams are read backwards,
|
Zstandard Huffman-coded streams are read backwards,
|
||||||
similar to the FSE bitstreams.
|
similar to the FSE bitstreams.
|
||||||
Therefore, to find the start of the bitstream it is therefore necessary to
|
Therefore, to find the start of the bitstream, it is therefore to
|
||||||
know the offset of the last byte of the Huffman-coded stream.
|
know the offset of the last byte of the Huffman-coded stream.
|
||||||
|
|
||||||
After writing the last bit containing information, the compressor
|
After writing the last bit containing information, the compressor
|
||||||
@ -1077,7 +1093,7 @@ byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
|
|||||||
the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
|
the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
|
||||||
begins.
|
begins.
|
||||||
|
|
||||||
The bitstream contains Huffman-coded symbols in little-endian order,
|
The bitstream contains Huffman-coded symbols in __little-endian__ order,
|
||||||
with the codes defined by the method below.
|
with the codes defined by the method below.
|
||||||
|
|
||||||
### Huffman Tree Description
|
### Huffman Tree Description
|
||||||
@ -1182,14 +1198,14 @@ The Huffman header compression uses 2 states,
|
|||||||
which share the same FSE distribution table.
|
which share the same FSE distribution table.
|
||||||
The first state (`State1`) encodes the even indexed symbols,
|
The first state (`State1`) encodes the even indexed symbols,
|
||||||
and the second (`State2`) encodes the odd indexes.
|
and the second (`State2`) encodes the odd indexes.
|
||||||
State1 is initialized first, and then State2, and they take turns decoding
|
`State1` is initialized first, and then `State2`, and they take turns
|
||||||
a single symbol and updating their state.
|
decoding a single symbol and updating their state.
|
||||||
For more details on these FSE operations, see the [FSE section](#fse).
|
For more details on these FSE operations, see the [FSE section](#fse).
|
||||||
|
|
||||||
The number of symbols to decode is determined
|
The number of symbols to decode is determined
|
||||||
by tracking bitStream overflow condition:
|
by tracking bitStream overflow condition:
|
||||||
If updating state after decoding a symbol would require more bits than
|
If updating state after decoding a symbol would require more bits than
|
||||||
remain in the stream, it is assumed the extra bits are 0. Then,
|
remain in the stream, it is assumed that extra bits are 0. Then,
|
||||||
the symbols for each of the final states are decoded and the process is complete.
|
the symbols for each of the final states are decoded and the process is complete.
|
||||||
|
|
||||||
##### Conversion from weights to Huffman prefix codes
|
##### Conversion from weights to Huffman prefix codes
|
||||||
@ -1245,7 +1261,7 @@ it would be encoded as:
|
|||||||
|Encoding|`0000`|`0001`|`01`|`1`| `10000` |
|
|Encoding|`0000`|`0001`|`01`|`1`| `10000` |
|
||||||
|
|
||||||
Starting from the end,
|
Starting from the end,
|
||||||
it's possible to read the bitstream in a little-endian fashion,
|
it's possible to read the bitstream in a __little-endian__ fashion,
|
||||||
keeping track of already used bits. Since the bitstream is encoded in reverse
|
keeping track of already used bits. Since the bitstream is encoded in reverse
|
||||||
order, by starting at the end the symbols can be read in forward order.
|
order, by starting at the end the symbols can be read in forward order.
|
||||||
|
|
||||||
@ -1258,13 +1274,14 @@ If a bitstream is not entirely and exactly consumed,
|
|||||||
hence reaching exactly its beginning position with _all_ bits consumed,
|
hence reaching exactly its beginning position with _all_ bits consumed,
|
||||||
the decoding process is considered faulty.
|
the decoding process is considered faulty.
|
||||||
|
|
||||||
|
|
||||||
Dictionary Format
|
Dictionary Format
|
||||||
-----------------
|
-----------------
|
||||||
|
|
||||||
Zstandard is compatible with "raw content" dictionaries, free of any format restriction,
|
Zstandard is compatible with "raw content" dictionaries,
|
||||||
except that they must be at least 8 bytes.
|
free of any format restriction, except that they must be at least 8 bytes.
|
||||||
These dictionaries function as if they were just the `Content` block of a formatted
|
These dictionaries function as if they were just the `Content` part
|
||||||
dictionary.
|
of a formatted dictionary.
|
||||||
|
|
||||||
But dictionaries created by `zstd --train` follow a format, described here.
|
But dictionaries created by `zstd --train` follow a format, described here.
|
||||||
|
|
||||||
@ -1274,9 +1291,9 @@ __Pre-requisites__ : a dictionary has a size,
|
|||||||
| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
|
| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
|
||||||
| -------------- | --------------- | ---------------- | --------- |
|
| -------------- | --------------- | ---------------- | --------- |
|
||||||
|
|
||||||
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, little-endian format
|
__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
|
||||||
|
|
||||||
__`Dictionary_ID`__ : 4 bytes, stored in little-endian format.
|
__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
|
||||||
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
|
`Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
|
||||||
It's used by decoders to check if they use the correct dictionary.
|
It's used by decoders to check if they use the correct dictionary.
|
||||||
|
|
||||||
@ -1284,9 +1301,9 @@ _Reserved ranges :_
|
|||||||
If the frame is going to be distributed in a private environment,
|
If the frame is going to be distributed in a private environment,
|
||||||
any `Dictionary_ID` can be used.
|
any `Dictionary_ID` can be used.
|
||||||
However, for public distribution of compressed frames,
|
However, for public distribution of compressed frames,
|
||||||
the following ranges are reserved for future use and should not be used :
|
the following ranges are reserved and shall not be used :
|
||||||
|
|
||||||
- low range : 1 - 32767
|
- low range : <= 32767
|
||||||
- high range : >= (2^31)
|
- high range : >= (2^31)
|
||||||
|
|
||||||
__`Entropy_Tables`__ : following the same format as the tables in compressed blocks.
|
__`Entropy_Tables`__ : following the same format as the tables in compressed blocks.
|
||||||
@ -1298,26 +1315,30 @@ __`Entropy_Tables`__ : following the same format as the tables in compressed blo
|
|||||||
These tables populate the Repeat Stats literals mode and
|
These tables populate the Repeat Stats literals mode and
|
||||||
Repeat distribution mode for sequence decoding.
|
Repeat distribution mode for sequence decoding.
|
||||||
It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
|
It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
|
||||||
stored in order, 4-bytes little-endian each, for a total of 12 bytes.
|
stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
|
||||||
Each recent offset must have a value < dictionary size.
|
Each recent offset must have a value < dictionary size.
|
||||||
|
|
||||||
__`Content`__ : The rest of the dictionary is its content.
|
__`Content`__ : The rest of the dictionary is its content.
|
||||||
The content act as a "past" in front of data to compress or decompress,
|
The content act as a "past" in front of data to compress or decompress,
|
||||||
so it can be referenced in sequence commands.
|
so it can be referenced in sequence commands.
|
||||||
As long as the amount of data decoded from this frame is less than or
|
As long as the amount of data decoded from this frame is less than or
|
||||||
equal to the window-size, sequence commands may specify offsets longer
|
equal to `Window_Size`, sequence commands may specify offsets longer
|
||||||
than the lenght of total decoded output so far to reference back to the
|
than the total length of decoded output so far to reference back to the
|
||||||
dictionary. After the total output has surpassed the window size however,
|
dictionary. After the total output has surpassed `Window_Size` however,
|
||||||
this is no longer allowed and the dictionary is no longer accessible.
|
this is no longer allowed and the dictionary is no longer accessible.
|
||||||
|
|
||||||
[compressed blocks]: #the-format-of-compressed_block
|
[compressed blocks]: #the-format-of-compressed_block
|
||||||
|
|
||||||
|
|
||||||
Appendix A - Decoding tables for predefined codes
|
Appendix A - Decoding tables for predefined codes
|
||||||
-------------------------------------------------
|
-------------------------------------------------
|
||||||
|
|
||||||
This appendix contains FSE decoding tables for the predefined literal length, match length, and offset
|
This appendix contains FSE decoding tables
|
||||||
codes. The tables have been constructed using the algorithm as given above in the
|
for the predefined literal length, match length, and offset codes.
|
||||||
"from normalized distribution to decoding tables" chapter. The tables here can be used as examples
|
The tables have been constructed using the algorithm as given above in chapter
|
||||||
to crosscheck that an implementation implements the decoding table generation algorithm correctly.
|
"from normalized distribution to decoding tables".
|
||||||
|
The tables here can be used as examples
|
||||||
|
to crosscheck that an implementation build its decoding tables correctly.
|
||||||
|
|
||||||
#### Literal Length Code:
|
#### Literal Length Code:
|
||||||
|
|
||||||
@ -1496,6 +1517,7 @@ to crosscheck that an implementation implements the decoding table generation al
|
|||||||
|
|
||||||
Version changes
|
Version changes
|
||||||
---------------
|
---------------
|
||||||
|
- 0.2.5 : minor typos and clarifications
|
||||||
- 0.2.4 : section restructuring, by Sean Purcell
|
- 0.2.4 : section restructuring, by Sean Purcell
|
||||||
- 0.2.3 : clarified several details, by Sean Purcell
|
- 0.2.3 : clarified several details, by Sean Purcell
|
||||||
- 0.2.2 : added predefined codes, by Johannes Rudolph
|
- 0.2.2 : added predefined codes, by Johannes Rudolph
|
||||||
|
Reference in New Issue
Block a user