1
0
mirror of https://github.com/facebook/zstd.git synced 2025-08-01 09:47:01 +03:00

update Zstandard format specification

answering a few questions from IETF RFC Discuss stage.
This commit is contained in:
Yann Collet
2018-05-31 10:47:44 -07:00
parent ac4f7ead3b
commit a4c9c4defe

View File

@ -16,7 +16,7 @@ Distribution of this document is unlimited.
### Version ### Version
0.2.7 (30/04/18) 0.2.8 (30/05/18)
Introduction Introduction
@ -27,6 +27,8 @@ that is independent of CPU type, operating system,
file system and character set, suitable for file system and character set, suitable for
file compression, pipe and streaming compression, file compression, pipe and streaming compression,
using the [Zstandard algorithm](http://www.zstandard.org). using the [Zstandard algorithm](http://www.zstandard.org).
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.
The data can be produced or consumed, The data can be produced or consumed,
even for an arbitrarily long sequentially presented input data stream, even for an arbitrarily long sequentially presented input data stream,
@ -39,11 +41,6 @@ for detection of data corruption.
The data format defined by this specification The data format defined by this specification
does not attempt to allow random access to compressed data. does not attempt to allow random access to compressed data.
This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The text of the specification assumes a basic background in programming
at the level of bits and other primitive data representations.
Unless otherwise indicated below, Unless otherwise indicated below,
a compliant compressor must produce data sets a compliant compressor must produce data sets
that conform to the specifications presented here. that conform to the specifications presented here.
@ -57,6 +54,16 @@ Whenever it does not support a parameter defined in the compressed stream,
it must produce a non-ambiguous error code and associated error message it must produce a non-ambiguous error code and associated error message
explaining which parameter is unsupported. explaining which parameter is unsupported.
This specification is intended for use by implementers of software
to compress data into Zstandard format and/or decompress data from Zstandard format.
The Zstandard format is supported by an open source reference implementation,
which also contains some useful validation tool,
such as `decodeCorpus`, which generate random valid frames,
that a compliant decoder should be able to decode,
or provide a meaningful error code explaining why it cannot.
### Overall conventions ### Overall conventions
In this document: In this document:
- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
@ -92,14 +99,14 @@ Overview
Frames Frames
------ ------
Zstandard compressed data is made of one or more __frames__. Zstandard compressed data is made of one or more __frames__.
Each frame is independent and can be decompressed indepedently of other frames. Each frame is independent and can be decompressed independently of other frames.
The decompressed content of multiple concatenated frames is the concatenation of The decompressed content of multiple concatenated frames is the concatenation of
each frame decompressed content. each frame decompressed content.
There are two frame formats defined by Zstandard: There are two frame formats defined by Zstandard:
Zstandard frames and Skippable frames. Zstandard frames and Skippable frames.
Zstandard frames contain compressed data, while Zstandard frames contain compressed data, while
skippable frames contain no data and can be used for metadata. skippable frames contain custom user metadata.
## Zstandard frames ## Zstandard frames
The structure of a single Zstandard frame is following: The structure of a single Zstandard frame is following:
@ -630,15 +637,8 @@ They follow the same enumeration :
- `Predefined_Mode` : A predefined FSE distribution table is used, defined in - `Predefined_Mode` : A predefined FSE distribution table is used, defined in
[default distributions](#default-distributions). [default distributions](#default-distributions).
No distribution table will be present. No distribution table will be present.
- `RLE_Mode` : The table description consists of a single byte. - `RLE_Mode` : The table description consists of a single byte, which contain symbol's value.
This code will be repeated for all sequences. This symbol will be used for all sequences.
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
or if this is the first block, table in the dictionary will be used
No distribution table will be present.
Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
Note that this also includes `Predefined_Mode`.
If this mode is used without any previous sequence table in the frame
(or [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
- `FSE_Compressed_Mode` : standard FSE compression. - `FSE_Compressed_Mode` : standard FSE compression.
A distribution table will be present. A distribution table will be present.
The format of this distribution table is described in [FSE Table Description](#fse-table-description). The format of this distribution table is described in [FSE Table Description](#fse-table-description).
@ -646,6 +646,13 @@ They follow the same enumeration :
and the maximum accuracy log for the offsets table is 8. and the maximum accuracy log for the offsets table is 8.
`FSE_Compressed_Mode` must not be used when only one symbol is present, `FSE_Compressed_Mode` must not be used when only one symbol is present,
`RLE_Mode` should be used instead (although any other mode will work). `RLE_Mode` should be used instead (although any other mode will work).
- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
or if this is the first block, table in the dictionary will be used.
Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
No distribution table will be present.
If this mode is used without any previous sequence table in the frame
(nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
#### The codes for literals lengths, match lengths, and offsets. #### The codes for literals lengths, match lengths, and offsets.
@ -903,16 +910,22 @@ Skippable Frames
|:--------------:|:------------:|:-----------:| |:--------------:|:------------:|:-----------:|
| 4 bytes | 4 bytes | n bytes | | 4 bytes | 4 bytes | n bytes |
Skippable frames allow the insertion of user-defined data Skippable frames allow the insertion of user-defined metadata
into a flow of concatenated frames. into a flow of concatenated frames.
Its design is pretty straightforward,
with the sole objective to allow the decoder to quickly skip
over user-defined data and continue decoding.
Skippable frames defined in this specification are compatible with [LZ4] ones. Skippable frames defined in this specification are compatible with [LZ4] ones.
[LZ4]:http://www.lz4.org [LZ4]:http://www.lz4.org
From a compliant decoder perspective, skippable frames need just be skipped,
and their content ignored, resuming decoding after the skippable frame.
It can be noted that a skippable frame
can be used to watermark a stream of concatenated frames
embedding any kind of tracking information (even just an UUID).
User wary of such usage should scan the stream of concatenated frames
in an attempt to detect such frame for analysis or removal.
__`Magic_Number`__ __`Magic_Number`__
4 Bytes, __little-endian__ format. 4 Bytes, __little-endian__ format.
@ -1196,14 +1209,15 @@ which describes how to decode the list of weights.
the top four bits and the second taking the bottom four (e.g. the following the top four bits and the second taking the bottom four (e.g. the following
operations could be used to read the weights: operations could be used to read the weights:
`Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.). `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.).
The full representation occupies `((Number_of_Symbols+1)/2)` bytes, The full representation occupies `Ceiling(Number_of_Symbols/2)` bytes,
meaning it uses a last full byte even if `Number_of_Symbols` is odd. meaning it uses only full bytes even if `Number_of_Symbols` is odd.
`Number_of_Symbols = headerByte - 127`. `Number_of_Symbols = headerByte - 127`.
Note that maximum `Number_of_Symbols` is 255-127 = 128. Note that maximum `Number_of_Symbols` is 255-127 = 128.
A larger series must necessarily use FSE compression. If any present literal has a value > 128, raw header mode is not possible.
It's necessary to use FSE compression.
- if `headerByte` < 128 : - if `headerByte` < 128 :
the series of weights is compressed by FSE. the series of weights is compressed using FSE.
The length of the FSE-compressed series is equal to `headerByte` (0-127). The length of the FSE-compressed series is equal to `headerByte` (0-127).
##### Finite State Entropy (FSE) compression of Huffman weights ##### Finite State Entropy (FSE) compression of Huffman weights
@ -1235,18 +1249,19 @@ The number of symbols to decode is determined
by tracking bitStream overflow condition: by tracking bitStream overflow condition:
If updating state after decoding a symbol would require more bits than If updating state after decoding a symbol would require more bits than
remain in the stream, it is assumed that extra bits are 0. Then, remain in the stream, it is assumed that extra bits are 0. Then,
the symbols for each of the final states are decoded and the process is complete. symbols for each of the final states are decoded and the process is complete.
##### Conversion from weights to Huffman prefix codes ##### Conversion from weights to Huffman prefix codes
All present symbols shall now have a `Weight` value. All present symbols shall now have a `Weight` value.
It is possible to transform weights into Number_of_Bits, using this formula: It is possible to transform weights into` Number_of_Bits`, using this formula:
``` ```
Number_of_Bits = Number_of_Bits ? Max_Number_of_Bits + 1 - Weight : 0 Number_of_Bits = Weight ? Max_Number_of_Bits + 1 - Weight : 0
``` ```
Symbols are sorted by `Weight`. Within same `Weight`, symbols keep natural order. Symbols are sorted by `Weight`.
Within same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed. Symbols with a `Weight` of zero are removed.
Then, starting from lowest weight, prefix codes are distributed in order. Then, starting from lowest weight, prefix codes are distributed in sequential order.
__Example__ : __Example__ :
Let's presume the following list of weights has been decoded : Let's presume the following list of weights has been decoded :
@ -1255,7 +1270,7 @@ Let's presume the following list of weights has been decoded :
| -------- | --- | --- | --- | --- | --- | --- | | -------- | --- | --- | --- | --- | --- | --- |
| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 |
Sorted by weight and then natural order, Sorted by weight and then natural sequential order,
it gives the following distribution : it gives the following distribution :
| Literal | 3 | 4 | 5 | 2 | 1 | 0 | | Literal | 3 | 4 | 5 | 2 | 1 | 0 |
@ -1265,6 +1280,7 @@ it gives the following distribution :
| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 |
### Huffman-coded Streams ### Huffman-coded Streams
Given a Huffman decoding table, Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream. it's possible to decode a Huffman-coded stream.
@ -1554,6 +1570,7 @@ to crosscheck that an implementation build its decoding tables correctly.
Version changes Version changes
--------------- ---------------
- 0.2.8 : clarifications for IETF RFC discuss
- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
- 0.2.5 : minor typos and clarifications - 0.2.5 : minor typos and clarifications