mirror of
https://github.com/facebook/zstd.git
synced 2025-05-10 05:21:34 +03:00
refactor documentation of the FSE decoding table build process
This commit is contained in:
parent
75b0f5f4f5
commit
a8b86d024a
@ -16,7 +16,7 @@ Distribution of this document is unlimited.
|
||||
|
||||
### Version
|
||||
|
||||
0.4.0 (2023-06-05)
|
||||
0.4.2 (2024-10-02)
|
||||
|
||||
|
||||
Introduction
|
||||
@ -1038,53 +1038,54 @@ and to compress Huffman headers.
|
||||
FSE
|
||||
---
|
||||
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
|
||||
FSE encoding/decoding involves a state that is carried over between symbols,
|
||||
so decoding must be done in the opposite direction as encoding.
|
||||
FSE encoding/decoding involves a state that is carried over between symbols.
|
||||
Decoding must be done in the opposite direction as encoding.
|
||||
Therefore, all FSE bitstreams are read from end to beginning.
|
||||
Note that the order of the bits in the stream is not reversed,
|
||||
we just read the elements in the reverse order they are written.
|
||||
we just read each multi-bits element in the reverse order they are encoded.
|
||||
|
||||
For additional details on FSE, see [Finite State Entropy].
|
||||
|
||||
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
|
||||
|
||||
FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
|
||||
FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
|
||||
`Symbol`, `Num_Bits`, and `Baseline`.
|
||||
The `log2` of the table size is its `Accuracy_Log`.
|
||||
An FSE state value represents an index in this table.
|
||||
|
||||
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
|
||||
The next symbol in the stream is the `Symbol` indicated in the table for that state.
|
||||
The first symbol in the stream is the `Symbol` indicated in the table for that state.
|
||||
To obtain the next state value,
|
||||
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
|
||||
|
||||
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
|
||||
|
||||
### FSE Table Description
|
||||
To decode FSE streams, it is necessary to construct the decoding table.
|
||||
The Zstandard format encodes FSE table descriptions as follows:
|
||||
To decode an FSE bitstream, it is necessary to build its FSE decoding table.
|
||||
The decoding table is derived from a distribution of Probabilities.
|
||||
The Zstandard format encodes distributions of Probabilities as follows:
|
||||
|
||||
An FSE distribution table describes the probabilities of all symbols
|
||||
from `0` to the last present one (included)
|
||||
on a normalized scale of `1 << Accuracy_Log` .
|
||||
Note that there must be two or more symbols with nonzero probability.
|
||||
The distribution of probabilities is described in a bitstream which is read forward,
|
||||
in __little-endian__ fashion.
|
||||
The amount of bytes consumed from the bitstream to describe the distribution
|
||||
is discovered at the end of the decoding process.
|
||||
|
||||
It's a bitstream which is read forward, in __little-endian__ fashion.
|
||||
It's not necessary to know bitstream exact size,
|
||||
it will be discovered and reported by the decoding process.
|
||||
|
||||
The bitstream starts by reporting on which scale it operates.
|
||||
The bitstream starts by reporting on which scale the distribution operates.
|
||||
Let's `low4Bits` designate the lowest 4 bits of the first byte :
|
||||
`Accuracy_Log = low4bits + 5`.
|
||||
|
||||
Then follows each symbol value, from `0` to last present one.
|
||||
The number of bits used by each field is variable.
|
||||
An FSE distribution table describes the probabilities of all symbols
|
||||
from `0` to the last present one (included) in natural order.
|
||||
The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` .
|
||||
There must be two or more symbols with non-zero probabilities.
|
||||
|
||||
The number of bits used to decode each probability is variable.
|
||||
It depends on :
|
||||
|
||||
- Remaining probabilities + 1 :
|
||||
__example__ :
|
||||
Presuming an `Accuracy_Log` of 8,
|
||||
and presuming 100 probabilities points have already been distributed,
|
||||
and presuming 100 probability points have already been distributed,
|
||||
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
|
||||
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
|
||||
is the smallest integer `T` that satisfies `(1 << T) > N`.
|
||||
@ -1098,115 +1099,133 @@ It depends on :
|
||||
values from 98 to 157 use 8 bits.
|
||||
This is achieved through this scheme :
|
||||
|
||||
| Value read | Value decoded | Number of bits used |
|
||||
| ---------- | ------------- | ------------------- |
|
||||
| 0 - 97 | 0 - 97 | 7 |
|
||||
| 98 - 127 | 98 - 127 | 8 |
|
||||
| 128 - 225 | 0 - 97 | 7 |
|
||||
| 226 - 255 | 128 - 157 | 8 |
|
||||
| 8-bit field read | Value decoded | Nb of bits consumed |
|
||||
| ---------------- | ------------- | ------------------- |
|
||||
| 0 - 97 | 0 - 97 | 7 |
|
||||
| 98 - 127 | 98 - 127 | 8 |
|
||||
| 128 - 225 | 0 - 97 | 7 |
|
||||
| 226 - 255 | 128 - 157 | 8 |
|
||||
|
||||
Symbols probabilities are read one by one, in order.
|
||||
Probability is derived from Value decoded using the following formula:
|
||||
`Probality = Value - 1`
|
||||
|
||||
Probability is obtained from Value decoded by following formula :
|
||||
`Proba = value - 1`
|
||||
Consequently, a Probability of `0` is described by a Value `1`.
|
||||
|
||||
It means value `0` becomes negative probability `-1`.
|
||||
`-1` is a special probability, which means "less than 1".
|
||||
Its effect on distribution table is described in the [next section].
|
||||
For the purpose of calculating total allocated probability points, it counts as one.
|
||||
A Value `0` is used to signal a special case, named "Probability `-1`".
|
||||
It describes a probability which should have been "less than 1".
|
||||
Its effect on the decoding table building process is described in the [next section].
|
||||
For the purpose of counting total allocated probability points, it counts as one.
|
||||
|
||||
[next section]:#from-normalized-distribution-to-decoding-tables
|
||||
|
||||
When a symbol has a __probability__ of `zero`,
|
||||
Symbols probabilities are read one by one, in order.
|
||||
After each probability is decoded, the total nb of probability points is updated.
|
||||
This is used to dermine how many bits must be read to decode the probability of next symbol.
|
||||
|
||||
When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`),
|
||||
it is followed by a 2-bits repeat flag.
|
||||
This repeat flag tells how many probabilities of zeroes follow the current one.
|
||||
It provides a number ranging from 0 to 3.
|
||||
If it is a 3, another 2-bits repeat flag follows, and so on.
|
||||
|
||||
When last symbol reaches cumulated total of `1 << Accuracy_Log`,
|
||||
decoding is complete.
|
||||
If this process results in a non-zero probability for a value outside of the
|
||||
valid range of values that the FSE table is defined for, even if that value is
|
||||
not used, then the data is considered corrupted. In the case of offset codes,
|
||||
a decoder implementation may reject a frame containing a non-zero probability
|
||||
for an offset code larger than the largest offset code supported by the decoder
|
||||
implementation.
|
||||
When the Probability for a symbol makes cumulated total reach `1 << Accuracy_Log`,
|
||||
then it's the last symbol, and decoding is complete.
|
||||
|
||||
Then the decoder can tell how many bytes were used in this process,
|
||||
and how many symbols are present.
|
||||
The bitstream consumes a round number of bytes.
|
||||
Any remaining bit within the last byte is just unused.
|
||||
|
||||
If this process results in a non-zero probability for a symbol outside of the
|
||||
valid range of symbols that the FSE table is defined for, even if that symbol is
|
||||
not used, then the data is considered corrupted.
|
||||
For the specific case of offset codes,
|
||||
a decoder implementation may reject a frame containing a non-zero probability
|
||||
for an offset code larger than the largest offset code supported by the decoder
|
||||
implementation.
|
||||
|
||||
#### From normalized distribution to decoding tables
|
||||
|
||||
The distribution of normalized probabilities is enough
|
||||
The normalized distribution of probabilities is enough
|
||||
to create a unique decoding table.
|
||||
|
||||
It follows the following build rule :
|
||||
It is generated using the following build rule :
|
||||
|
||||
The table has a size of `Table_Size = 1 << Accuracy_Log`.
|
||||
Each cell describes the symbol decoded,
|
||||
and instructions to get the next state (`Number_of_Bits` and `Baseline`).
|
||||
Each row specifies the decoded symbol,
|
||||
and instructions to reach the next state (`Number_of_Bits` and `Baseline`).
|
||||
|
||||
Symbols are scanned in their natural order for "less than 1" probabilities.
|
||||
Symbols with this probability are being attributed a single cell,
|
||||
Symbols are first scanned in their natural order for "less than 1" probabilities
|
||||
(previously decoded from a Value of `0`).
|
||||
Symbols with this special probability are being attributed a single row,
|
||||
starting from the end of the table and retreating.
|
||||
These symbols define a full state reset, reading `Accuracy_Log` bits.
|
||||
|
||||
Then, all remaining symbols, sorted in natural order, are allocated cells.
|
||||
Starting from symbol `0` (if it exists), and table position `0`,
|
||||
each symbol gets allocated as many cells as its probability.
|
||||
Cell allocation is spread, not linear :
|
||||
each successor position follows this rule :
|
||||
Then, all remaining symbols, sorted in natural order, are allocated rows.
|
||||
Starting from smallest present symbol, and table position `0`,
|
||||
each symbol gets allocated as many rows as its probability.
|
||||
|
||||
Row allocation is not linear, it follows this order, in modular arithmetic:
|
||||
```
|
||||
position += (tableSize>>1) + (tableSize>>3) + 3;
|
||||
position &= tableSize-1;
|
||||
```
|
||||
|
||||
A position is skipped if already occupied by a "less than 1" probability symbol.
|
||||
`position` does not reset between symbols, it simply iterates through
|
||||
each position in the table, switching to the next symbol when enough
|
||||
states have been allocated to the current one.
|
||||
Using above ordering rule, each symbol gets allocated as many rows as its probability.
|
||||
If a position is already occupied by a "less than 1" probability symbol,
|
||||
it is simply skipped, and the next position is allocated instead.
|
||||
Once enough rows have been allocated for the current symbol,
|
||||
the allocation process continues, using the next symbol, in natural order.
|
||||
This process guarantees that the table is entirely and exactly filled.
|
||||
|
||||
The process guarantees that the table is entirely filled.
|
||||
Each cell corresponds to a state value, which contains the symbol being decoded.
|
||||
Each row specifies a decoded symbol, and is accessed by current state value.
|
||||
It also specifies `Number_of_Bits` and `Baseline`, which are required to determine next state value.
|
||||
|
||||
To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
|
||||
it's first necessary to sort all occurrences of each symbol in state order.
|
||||
Lower states will need 1 more bit than higher ones.
|
||||
The process is repeated for each symbol.
|
||||
To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
|
||||
and then attribute N+1 bits to lower rows, and N bits to higher rows,
|
||||
following the process described below (using an example):
|
||||
|
||||
__Example__ :
|
||||
Presuming a symbol has a probability of 5,
|
||||
it receives 5 cells, corresponding to 5 state values.
|
||||
These state values are then sorted in natural order.
|
||||
Presuming an `Accuracy_Log` of 7,
|
||||
let's imagine a symbol with a Probability of 5:
|
||||
it receives 5 rows, corresponding to 5 state values between `0` and `127`.
|
||||
|
||||
Next power of 2 after 5 is 8.
|
||||
Space of probabilities must be divided into 8 equal parts.
|
||||
Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
|
||||
Divided by 8, each share is 16 large.
|
||||
In this example, the first state value happens to be `1` (after unspecified previous symbols).
|
||||
The next 4 states are then determined using above modular arithmetic rule,
|
||||
which specifies to add `64+16+3 = 83` modulo `128` to jump to next position,
|
||||
producing the following series: `1`, `84`, `39`, `122`, `77` (modular arithmetic).
|
||||
(note: the next symbol will then start at `32`).
|
||||
|
||||
In order to reach 8 shares, 8-5=3 lowest states will count "double",
|
||||
These state values are then sorted in natural order,
|
||||
resulting in the following series: `1`, `39`, `77`, `84`, `122`.
|
||||
|
||||
The next power of 2 after 5 is 8.
|
||||
Therefore, the probability space will be divided into 8 equal parts.
|
||||
Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large.
|
||||
|
||||
In order to reach 8 shares, the `8-5 = 3` lowest states will count "double",
|
||||
doubling their shares (32 in width), hence requiring one more bit.
|
||||
|
||||
Baseline is assigned starting from the higher states using fewer bits,
|
||||
increasing at each state, then resuming at the first state,
|
||||
each state takes its allocated width from Baseline.
|
||||
Baseline is assigned starting from the lowest state using fewer bits,
|
||||
continuing in natural state order, looping back at the beginning.
|
||||
Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`.
|
||||
|
||||
| state order | 0 | 1 | 2 | 3 | 4 |
|
||||
| ---------------- | ----- | ----- | ------ | ---- | ------ |
|
||||
| state value | 1 | 39 | 77 | 84 | 122 |
|
||||
| width | 32 | 32 | 32 | 16 | 16 |
|
||||
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
|
||||
| range number | 2 | 4 | 6 | 0 | 1 |
|
||||
| allocation order | 3 | 4 | 5 | 1 | 2 |
|
||||
| `Baseline` | 32 | 64 | 96 | 0 | 16 |
|
||||
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
||||
|
||||
During decoding, the next state value is determined from current state value,
|
||||
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
|
||||
During decoding, the next state value is determined by using current state value as row number,
|
||||
then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`.
|
||||
|
||||
See [Appendix A] for the results of this process applied to the default distributions.
|
||||
Note:
|
||||
as a trivial example, it follows that, for a symbol with a Probability of `1`,
|
||||
`Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`.
|
||||
|
||||
See [Appendix A] to see the outcome of this process applied to the default distributions.
|
||||
|
||||
[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
|
||||
|
||||
@ -1716,6 +1735,8 @@ or at least provide a meaningful error code explaining for which reason it canno
|
||||
|
||||
Version changes
|
||||
---------------
|
||||
- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
|
||||
- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
|
||||
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
|
||||
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
|
||||
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
|
||||
|
Loading…
x
Reference in New Issue
Block a user