mirror of
https://github.com/facebook/zstd.git
synced 2025-05-11 16:21:35 +03:00
refactor documentation of the FSE decoding table build process
This commit is contained in:
parent
75b0f5f4f5
commit
a8b86d024a
@ -16,7 +16,7 @@ Distribution of this document is unlimited.
|
|||||||
|
|
||||||
### Version
|
### Version
|
||||||
|
|
||||||
0.4.0 (2023-06-05)
|
0.4.2 (2024-10-02)
|
||||||
|
|
||||||
|
|
||||||
Introduction
|
Introduction
|
||||||
@ -1038,53 +1038,54 @@ and to compress Huffman headers.
|
|||||||
FSE
|
FSE
|
||||||
---
|
---
|
||||||
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
|
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
|
||||||
FSE encoding/decoding involves a state that is carried over between symbols,
|
FSE encoding/decoding involves a state that is carried over between symbols.
|
||||||
so decoding must be done in the opposite direction as encoding.
|
Decoding must be done in the opposite direction as encoding.
|
||||||
Therefore, all FSE bitstreams are read from end to beginning.
|
Therefore, all FSE bitstreams are read from end to beginning.
|
||||||
Note that the order of the bits in the stream is not reversed,
|
Note that the order of the bits in the stream is not reversed,
|
||||||
we just read the elements in the reverse order they are written.
|
we just read each multi-bits element in the reverse order they are encoded.
|
||||||
|
|
||||||
For additional details on FSE, see [Finite State Entropy].
|
For additional details on FSE, see [Finite State Entropy].
|
||||||
|
|
||||||
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
|
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
|
||||||
|
|
||||||
FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
|
FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
|
||||||
`Symbol`, `Num_Bits`, and `Baseline`.
|
`Symbol`, `Num_Bits`, and `Baseline`.
|
||||||
The `log2` of the table size is its `Accuracy_Log`.
|
The `log2` of the table size is its `Accuracy_Log`.
|
||||||
An FSE state value represents an index in this table.
|
An FSE state value represents an index in this table.
|
||||||
|
|
||||||
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
|
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
|
||||||
The next symbol in the stream is the `Symbol` indicated in the table for that state.
|
The first symbol in the stream is the `Symbol` indicated in the table for that state.
|
||||||
To obtain the next state value,
|
To obtain the next state value,
|
||||||
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
|
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
|
||||||
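The state transition described above can be sketched as follows. This is a minimal illustration, not the reference decoder: `Row`, `decode_step` and the `read_bits` callable are hypothetical names, and `read_bits(n)` is assumed to return the next `n` bits of the bitstream as a little-endian value, already accounting for the backward reading direction.

```python
from typing import Callable, Mapping, NamedTuple, Tuple


class Row(NamedTuple):
    """One row of the decoding table (field names follow the text)."""
    symbol: int
    num_bits: int
    baseline: int


def decode_step(table: Mapping[int, Row], state: int,
                read_bits: Callable[[int], int]) -> Tuple[int, int]:
    """Return (decoded symbol, next state) for the current state.

    `read_bits` is an assumed helper supplied by the caller.
    """
    row = table[state]
    # Next state = Baseline + the value of the Num_Bits bits just read.
    next_state = row.baseline + read_bits(row.num_bits)
    return row.symbol, next_state
```

For instance, with a row `Row(symbol=7, num_bits=5, baseline=32)` at state `1` and a 5-bit read yielding `3`, the step emits symbol `7` and moves to state `35`. End-of-stream handling is deliberately left out of this sketch.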
|
|
||||||
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
|
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
|
||||||
|
|
||||||
### FSE Table Description
|
### FSE Table Description
|
||||||
To decode FSE streams, it is necessary to construct the decoding table.
|
To decode an FSE bitstream, it is necessary to build its FSE decoding table.
|
||||||
The Zstandard format encodes FSE table descriptions as follows:
|
The decoding table is derived from a distribution of Probabilities.
|
||||||
|
The Zstandard format encodes distributions of Probabilities as follows:
|
||||||
|
|
||||||
An FSE distribution table describes the probabilities of all symbols
|
The distribution of probabilities is described in a bitstream which is read forward,
|
||||||
from `0` to the last present one (included)
|
in __little-endian__ fashion.
|
||||||
on a normalized scale of `1 << Accuracy_Log` .
|
The amount of bytes consumed from the bitstream to describe the distribution
|
||||||
Note that there must be two or more symbols with nonzero probability.
|
is discovered at the end of the decoding process.
|
||||||
|
|
||||||
It's a bitstream which is read forward, in __little-endian__ fashion.
|
The bitstream starts by reporting on which scale the distribution operates.
|
||||||
It's not necessary to know bitstream exact size,
|
|
||||||
it will be discovered and reported by the decoding process.
|
|
||||||
|
|
||||||
The bitstream starts by reporting on which scale it operates.
|
|
||||||
Let's `low4Bits` designate the lowest 4 bits of the first byte :
|
Let's `low4Bits` designate the lowest 4 bits of the first byte :
|
||||||
`Accuracy_Log = low4bits + 5`.
|
`Accuracy_Log = low4bits + 5`.
|
||||||
|
|
||||||
Then follows each symbol value, from `0` to last present one.
|
An FSE distribution table describes the probabilities of all symbols
|
||||||
The number of bits used by each field is variable.
|
from `0` to the last present one (included) in natural order.
|
||||||
|
The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` .
|
||||||
|
There must be two or more symbols with non-zero probabilities.
|
||||||
|
|
||||||
|
The number of bits used to decode each probability is variable.
|
||||||
It depends on :
|
It depends on :
|
||||||
|
|
||||||
- Remaining probabilities + 1 :
|
- Remaining probabilities + 1 :
|
||||||
__example__ :
|
__example__ :
|
||||||
Presuming an `Accuracy_Log` of 8,
|
Presuming an `Accuracy_Log` of 8,
|
||||||
and presuming 100 probabilities points have already been distributed,
|
and presuming 100 probability points have already been distributed,
|
||||||
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
|
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
|
||||||
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
|
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
|
||||||
is the smallest integer `T` that satisfies `(1 << T) > N`.
|
is the smallest integer `T` that satisfies `(1 << T) > N`.
|
||||||
@ -1098,115 +1099,133 @@ It depends on :
|
|||||||
values from 98 to 157 use 8 bits.
|
values from 98 to 157 use 8 bits.
|
||||||
This is achieved through this scheme :
|
This is achieved through this scheme :
|
||||||
|
|
||||||
| Value read | Value decoded | Number of bits used |
|
| 8-bit field read | Value decoded | Nb of bits consumed |
|
||||||
| ---------- | ------------- | ------------------- |
|
| ---------------- | ------------- | ------------------- |
|
||||||
| 0 - 97 | 0 - 97 | 7 |
|
| 0 - 97 | 0 - 97 | 7 |
|
||||||
| 98 - 127 | 98 - 127 | 8 |
|
| 98 - 127 | 98 - 127 | 8 |
|
||||||
| 128 - 225 | 0 - 97 | 7 |
|
| 128 - 225 | 0 - 97 | 7 |
|
||||||
| 226 - 255 | 128 - 157 | 8 |
|
| 226 - 255 | 128 - 157 | 8 |
|
||||||
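The variable-length scheme in the table above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `decode_value` is a hypothetical helper, and it assumes the caller has already peeked `log2sup(max_value)` bits from the stream as a little-endian value (`field`), where `max_value` is "remaining probabilities + 1".

```python
def log2sup(n: int) -> int:
    """Smallest integer T such that (1 << T) > n, for n >= 0."""
    return n.bit_length()


def decode_value(field: int, max_value: int):
    """Return (value decoded, number of bits consumed).

    Assumes max_value >= 1 and that `field` holds log2sup(max_value) bits.
    """
    nbits = log2sup(max_value)
    threshold = 1 << (nbits - 1)
    # Number of small values that can be coded with one bit less.
    short_count = (2 * threshold - 1) - max_value
    low = field & (threshold - 1)
    if low < short_count:
        return low, nbits - 1
    value = field & (2 * threshold - 1)
    if value >= threshold:
        value -= short_count
    return value, nbits
```

With `max_value = 157`, this gives `short_count = 98`: fields `0 - 97` and `128 - 225` decode to `0 - 97` using 7 bits, while `98 - 127` and `226 - 255` decode to `98 - 157` using 8 bits, matching the table.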
|
|
||||||
Symbols probabilities are read one by one, in order.
|
Probability is derived from Value decoded using the following formula:
|
||||||
|
`Probability = Value - 1`
|
||||||
|
|
||||||
Probability is obtained from Value decoded by following formula :
|
Consequently, a Probability of `0` is described by a Value `1`.
|
||||||
`Proba = value - 1`
|
|
||||||
|
|
||||||
It means value `0` becomes negative probability `-1`.
|
A Value `0` is used to signal a special case, named "Probability `-1`".
|
||||||
`-1` is a special probability, which means "less than 1".
|
It describes a probability which should have been "less than 1".
|
||||||
Its effect on distribution table is described in the [next section].
|
Its effect on the decoding table building process is described in the [next section].
|
||||||
For the purpose of calculating total allocated probability points, it counts as one.
|
For the purpose of counting total allocated probability points, it counts as one.
|
||||||
|
|
||||||
[next section]:#from-normalized-distribution-to-decoding-tables
|
[next section]:#from-normalized-distribution-to-decoding-tables
|
||||||
|
|
||||||
When a symbol has a __probability__ of `zero`,
|
Symbols probabilities are read one by one, in order.
|
||||||
|
After each probability is decoded, the total nb of probability points is updated.
|
||||||
|
This is used to determine how many bits must be read to decode the probability of the next symbol.
|
||||||
|
|
||||||
|
When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`),
|
||||||
it is followed by a 2-bits repeat flag.
|
it is followed by a 2-bits repeat flag.
|
||||||
This repeat flag tells how many probabilities of zeroes follow the current one.
|
This repeat flag tells how many probabilities of zeroes follow the current one.
|
||||||
It provides a number ranging from 0 to 3.
|
It provides a number ranging from 0 to 3.
|
||||||
If it is a 3, another 2-bits repeat flag follows, and so on.
|
If it is a 3, another 2-bits repeat flag follows, and so on.
|
||||||
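The chaining of repeat flags can be sketched as a small loop. `count_extra_zeros` is a hypothetical name, and `read_bits(n)` is an assumed helper returning the next `n` bits of the forward bitstream as a little-endian value.

```python
from typing import Callable


def count_extra_zeros(read_bits: Callable[[int], int]) -> int:
    """Return how many additional zero probabilities follow the current one."""
    total = 0
    while True:
        flag = read_bits(2)   # 2-bit repeat flag, value 0 to 3
        total += flag
        if flag != 3:         # only a flag of 3 is followed by another flag
            return total
```

For example, the flag sequence `3, 3, 1` announces `7` additional symbols with a probability of zero.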
|
|
||||||
When last symbol reaches cumulated total of `1 << Accuracy_Log`,
|
When the Probability for a symbol makes cumulated total reach `1 << Accuracy_Log`,
|
||||||
decoding is complete.
|
then it's the last symbol, and decoding is complete.
|
||||||
If this process results in a non-zero probability for a value outside of the
|
|
||||||
valid range of values that the FSE table is defined for, even if that value is
|
|
||||||
not used, then the data is considered corrupted. In the case of offset codes,
|
|
||||||
a decoder implementation may reject a frame containing a non-zero probability
|
|
||||||
for an offset code larger than the largest offset code supported by the decoder
|
|
||||||
implementation.
|
|
||||||
|
|
||||||
Then the decoder can tell how many bytes were used in this process,
|
Then the decoder can tell how many bytes were used in this process,
|
||||||
and how many symbols are present.
|
and how many symbols are present.
|
||||||
The bitstream consumes a round number of bytes.
|
The bitstream consumes a round number of bytes.
|
||||||
Any remaining bit within the last byte is just unused.
|
Any remaining bit within the last byte is just unused.
|
||||||
|
|
||||||
|
If this process results in a non-zero probability for a symbol outside of the
|
||||||
|
valid range of symbols that the FSE table is defined for, even if that symbol is
|
||||||
|
not used, then the data is considered corrupted.
|
||||||
|
For the specific case of offset codes,
|
||||||
|
a decoder implementation may reject a frame containing a non-zero probability
|
||||||
|
for an offset code larger than the largest offset code supported by the decoder
|
||||||
|
implementation.
|
||||||
|
|
||||||
#### From normalized distribution to decoding tables
|
#### From normalized distribution to decoding tables
|
||||||
|
|
||||||
The distribution of normalized probabilities is enough
|
The normalized distribution of probabilities is enough
|
||||||
to create a unique decoding table.
|
to create a unique decoding table.
|
||||||
|
It is generated using the following build rule :
|
||||||
It follows the following build rule :
|
|
||||||
|
|
||||||
The table has a size of `Table_Size = 1 << Accuracy_Log`.
|
The table has a size of `Table_Size = 1 << Accuracy_Log`.
|
||||||
Each cell describes the symbol decoded,
|
Each row specifies the decoded symbol,
|
||||||
and instructions to get the next state (`Number_of_Bits` and `Baseline`).
|
and instructions to reach the next state (`Number_of_Bits` and `Baseline`).
|
||||||
|
|
||||||
Symbols are scanned in their natural order for "less than 1" probabilities.
|
Symbols are first scanned in their natural order for "less than 1" probabilities
|
||||||
Symbols with this probability are being attributed a single cell,
|
(previously decoded from a Value of `0`).
|
||||||
|
Symbols with this special probability are being attributed a single row,
|
||||||
starting from the end of the table and retreating.
|
starting from the end of the table and retreating.
|
||||||
These symbols define a full state reset, reading `Accuracy_Log` bits.
|
These symbols define a full state reset, reading `Accuracy_Log` bits.
|
||||||
|
|
||||||
Then, all remaining symbols, sorted in natural order, are allocated cells.
|
Then, all remaining symbols, sorted in natural order, are allocated rows.
|
||||||
Starting from symbol `0` (if it exists), and table position `0`,
|
Starting from smallest present symbol, and table position `0`,
|
||||||
each symbol gets allocated as many cells as its probability.
|
each symbol gets allocated as many rows as its probability.
|
||||||
Cell allocation is spread, not linear :
|
|
||||||
each successor position follows this rule :
|
|
||||||
|
|
||||||
|
Row allocation is not linear, it follows this order, in modular arithmetic:
|
||||||
```
|
```
|
||||||
position += (tableSize>>1) + (tableSize>>3) + 3;
|
position += (tableSize>>1) + (tableSize>>3) + 3;
|
||||||
position &= tableSize-1;
|
position &= tableSize-1;
|
||||||
```
|
```
|
||||||
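The ordering rule above can be exercised in isolation. The following sketch uses hypothetical names (`spread_positions`, `occupied`) and assumes the caller tracks the set of rows already taken by "less than 1" probability symbols, and never asks for more rows than remain free.

```python
def spread_positions(table_size: int, start: int, count: int,
                     occupied=frozenset()):
    """Return (rows allocated to one symbol, position where the next symbol resumes)."""
    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    rows = []
    position = start
    while len(rows) < count:
        if position not in occupied:   # skip rows held by "less than 1" symbols
            rows.append(position)
        position = (position + step) & mask
    return rows, position
```

With `table_size = 128`, the step is `64 + 16 + 3 = 83`, so a symbol with probability 5 starting at position `1` receives rows `1, 84, 39, 122, 77`, and the next symbol resumes at `32`.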
|
|
||||||
A position is skipped if already occupied by a "less than 1" probability symbol.
|
Using above ordering rule, each symbol gets allocated as many rows as its probability.
|
||||||
`position` does not reset between symbols, it simply iterates through
|
If a position is already occupied by a "less than 1" probability symbol,
|
||||||
each position in the table, switching to the next symbol when enough
|
it is simply skipped, and the next position is allocated instead.
|
||||||
states have been allocated to the current one.
|
Once enough rows have been allocated for the current symbol,
|
||||||
|
the allocation process continues, using the next symbol, in natural order.
|
||||||
|
This process guarantees that the table is entirely and exactly filled.
|
||||||
|
|
||||||
The process guarantees that the table is entirely filled.
|
Each row specifies a decoded symbol, and is accessed by current state value.
|
||||||
Each cell corresponds to a state value, which contains the symbol being decoded.
|
It also specifies `Number_of_Bits` and `Baseline`, which are required to determine next state value.
|
||||||
|
|
||||||
To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
|
To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
|
||||||
it's first necessary to sort all occurrences of each symbol in state order.
|
and then attribute N+1 bits to lower rows, and N bits to higher rows,
|
||||||
Lower states will need 1 more bit than higher ones.
|
following the process described below (using an example):
|
||||||
The process is repeated for each symbol.
|
|
||||||
|
|
||||||
__Example__ :
|
__Example__ :
|
||||||
Presuming a symbol has a probability of 5,
|
Presuming an `Accuracy_Log` of 7,
|
||||||
it receives 5 cells, corresponding to 5 state values.
|
let's imagine a symbol with a Probability of 5:
|
||||||
These state values are then sorted in natural order.
|
it receives 5 rows, corresponding to 5 state values between `0` and `127`.
|
||||||
|
|
||||||
Next power of 2 after 5 is 8.
|
In this example, the first state value happens to be `1` (after unspecified previous symbols).
|
||||||
Space of probabilities must be divided into 8 equal parts.
|
The next 4 states are then determined using above modular arithmetic rule,
|
||||||
Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
|
which specifies to add `64+16+3 = 83` modulo `128` to jump to next position,
|
||||||
Divided by 8, each share is 16 large.
|
producing the following series: `1`, `84`, `39`, `122`, `77` (modular arithmetic).
|
||||||
|
(note: the next symbol will then start at `32`).
|
||||||
|
|
||||||
In order to reach 8 shares, 8-5=3 lowest states will count "double",
|
These state values are then sorted in natural order,
|
||||||
|
resulting in the following series: `1`, `39`, `77`, `84`, `122`.
|
||||||
|
|
||||||
|
The next power of 2 after 5 is 8.
|
||||||
|
Therefore, the probability space will be divided into 8 equal parts.
|
||||||
|
Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large.
|
||||||
|
|
||||||
|
In order to reach 8 shares, the `8-5 = 3` lowest states will count "double",
|
||||||
doubling their shares (32 in width), hence requiring one more bit.
|
doubling their shares (32 in width), hence requiring one more bit.
|
||||||
|
|
||||||
Baseline is assigned starting from the higher states using fewer bits,
|
Baseline is assigned starting from the lowest state using fewer bits,
|
||||||
increasing at each state, then resuming at the first state,
|
continuing in natural state order, looping back at the beginning.
|
||||||
each state takes its allocated width from Baseline.
|
Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`.
|
||||||
|
|
||||||
| state order | 0 | 1 | 2 | 3 | 4 |
|
| state order | 0 | 1 | 2 | 3 | 4 |
|
||||||
| ---------------- | ----- | ----- | ------ | ---- | ------ |
|
| ---------------- | ----- | ----- | ------ | ---- | ------ |
|
||||||
| state value | 1 | 39 | 77 | 84 | 122 |
|
| state value | 1 | 39 | 77 | 84 | 122 |
|
||||||
| width | 32 | 32 | 32 | 16 | 16 |
|
| width | 32 | 32 | 32 | 16 | 16 |
|
||||||
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
|
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
|
||||||
| range number | 2 | 4 | 6 | 0 | 1 |
|
| allocation order | 3 | 4 | 5 | 1 | 2 |
|
||||||
| `Baseline` | 32 | 64 | 96 | 0 | 16 |
|
| `Baseline` | 32 | 64 | 96 | 0 | 16 |
|
||||||
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
||||||
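The table above can be recomputed from the prose rule. This is a sketch following the description in this section rather than any particular implementation; `assign_bits_and_baselines` is a hypothetical name, and its input is the list of state values allocated to a single symbol.

```python
def assign_bits_and_baselines(states, accuracy_log: int):
    """Map each state value of one symbol to (Number_of_Bits, Baseline)."""
    ordered = sorted(states)
    probability = len(ordered)
    table_size = 1 << accuracy_log
    shares = 1 << (probability - 1).bit_length()  # smallest power of 2 >= probability
    share_width = table_size // shares
    doubled = shares - probability                # lowest states get a double share
    result = {}
    baseline = 0
    # Start with the states using fewer bits, then loop back to the lowest ones.
    for i in list(range(doubled, probability)) + list(range(doubled)):
        width = 2 * share_width if i < doubled else share_width
        result[ordered[i]] = (width.bit_length() - 1, baseline)
        baseline += width
    return result
```

Called with the states `1, 84, 39, 122, 77` and an `Accuracy_Log` of 7, it yields `Number_of_Bits` 4 with `Baseline` 0 and 16 for states 84 and 122, then `Number_of_Bits` 5 with `Baseline` 32, 64 and 96 for states 1, 39 and 77.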
|
|
||||||
During decoding, the next state value is determined from current state value,
|
During decoding, the next state value is determined by using current state value as row number,
|
||||||
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
|
then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`.
|
||||||
|
|
||||||
See [Appendix A] for the results of this process applied to the default distributions.
|
Note:
|
||||||
|
as a trivial example, it follows that, for a symbol with a Probability of `1`,
|
||||||
|
`Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`.
|
||||||
|
|
||||||
|
See [Appendix A] to see the outcome of this process applied to the default distributions.
|
||||||
|
|
||||||
[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
|
[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
|
||||||
|
|
||||||
@ -1716,6 +1735,8 @@ or at least provide a meaningful error code explaining for which reason it canno
|
|||||||
|
|
||||||
Version changes
|
Version changes
|
||||||
---------------
|
---------------
|
||||||
|
- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
|
||||||
|
- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
|
||||||
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
|
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
|
||||||
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
|
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
|
||||||
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
|
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
|
||||||
|