refactor documentation of the FSE decoding table build process

2025-07-01 09:21:47 +03:00 · 2024-10-02 22:57:36 -07:00
parent 75b0f5f4f5
commit a8b86d024a
1 changed file with 100 additions and 79 deletions
--- a/doc/zstd_compression_format.md
+++ b/doc/zstd_compression_format.md
@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
 ### Version
-0.4.0 (2023-06-05)
+0.4.2 (2024-10-02)
 Introduction
@@ -1038,53 +1038,54 @@ and to compress Huffman headers.
 FSE
 ---
 FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
-FSE encoding/decoding involves a state that is carried over between symbols,
+FSE encoding/decoding involves a state that is carried over between symbols.
-so decoding must be done in the opposite direction as encoding.
+Decoding must be done in the opposite direction as encoding.
 Therefore, all FSE bitstreams are read from end to beginning.
 Note that the order of the bits in the stream is not reversed,
-we just read the elements in the reverse order they are written.
+we just read the multi-bit elements in the reverse order in which they were encoded.
 For additional details on FSE, see [Finite State Entropy].
 [Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
-FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
+FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
 `Symbol`, `Num_Bits`, and `Baseline`.
 The `log2` of the table size is its `Accuracy_Log`.
 An FSE state value represents an index in this table.
 To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
-The next symbol in the stream is the `Symbol` indicated in the table for that state.
+The first symbol in the stream is the `Symbol` indicated in the table for that state.
 To obtain the next state value,
 the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
 [ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
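The decoding loop described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the `Row` type, the `fse_decode` function, and the `read_bits` callback (assumed to return the next `n` bits of the backward-read bitstream as an integer) are hypothetical names. Skipping the state update after the last symbol is also an assumption made to keep the sketch short.

```python
from typing import Callable, List, NamedTuple


class Row(NamedTuple):
    """One row of a decoding table (hypothetical helper type)."""
    symbol: int
    num_bits: int
    baseline: int


def fse_decode(table: List[Row], accuracy_log: int,
               read_bits: Callable[[int], int], count: int) -> List[int]:
    """Decode `count` symbols; `read_bits(n)` is an assumed bit-reader callback."""
    # The initial state is read as an Accuracy_Log-bit value.
    state = read_bits(accuracy_log)
    symbols = []
    for i in range(count):
        row = table[state]            # the state value is the row index
        symbols.append(row.symbol)
        if i + 1 < count:             # assumption: no update after the last symbol
            state = row.baseline + read_bits(row.num_bits)
    return symbols
```

With a two-row table where each symbol owns one row, every state update reads a single bit and uses a `Baseline` of `0`.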
 ### FSE Table Description
-To decode FSE streams, it is necessary to construct the decoding table.
+To decode an FSE bitstream, it is necessary to build its FSE decoding table.
-The Zstandard format encodes FSE table descriptions as follows:
+The decoding table is derived from a distribution of Probabilities.
 The Zstandard format encodes distributions of Probabilities as follows:
-An FSE distribution table describes the probabilities of all symbols
+The distribution of probabilities is described in a bitstream which is read forward,
-from `0` to the last present one (included)
+in __little-endian__ fashion.
-on a normalized scale of `1 << Accuracy_Log` .
+The amount of bytes consumed from the bitstream to describe the distribution
-Note that there must be two or more symbols with nonzero probability.
+is discovered at the end of the decoding process.
-It's a bitstream which is read forward, in __little-endian__ fashion.
+The bitstream starts by reporting on which scale the distribution operates.
 It's not necessary to know bitstream exact size,
 it will be discovered and reported by the decoding process.
 The bitstream starts by reporting on which scale it operates.
 Let `low4Bits` designate the lowest 4 bits of the first byte :
 `Accuracy_Log = low4Bits + 5`.
-Then follows each symbol value, from `0` to last present one.
+An FSE distribution table describes the probabilities of all symbols
-The number of bits used by each field is variable.
+from `0` to the last present one (included) in natural order.
 The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` .
 There must be two or more symbols with non-zero probabilities.
 The number of bits used to decode each probability is variable.
 It depends on :
 - Remaining probabilities + 1 :
  __example__ :
  Presuming an `Accuracy_Log` of 8,
-  and presuming 100 probabilities points have already been distributed,
+  and presuming 100 probability points have already been distributed,
  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
  Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
  is the smallest integer `T` that satisfies `(1 << T) > N`.
@@ -1098,115 +1099,133 @@ It depends on :
  values from 98 to 157 use 8 bits.
  This is achieved through this scheme :
-  | Value read | Value decoded | Number of bits used |
+  | 8-bit field read | Value decoded | Nb of bits consumed |
-  | ---------- | ------------- | ------------------- |
+  | ---------------- | ------------- | ------------------- |
  |         0 -  97  |   0 -  97     |  7                  |
  |        98 - 127  |  98 - 127     |  8                  |
  |       128 - 225  |   0 -  97     |  7                  |
  |       226 - 255  | 128 - 157     |  8                  |
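The scheme in the table can be reproduced with a short sketch. The function names `log2sup` (taken from the text) and `decode_value` are illustrative, and the generalization from the `157` example to an arbitrary maximum value is an assumption based on the table, not a quotation of the reference decoder.

```python
def log2sup(n: int) -> int:
    """Smallest integer T that satisfies (1 << T) > n."""
    return n.bit_length()


def decode_value(field: int, max_value: int) -> tuple:
    """Map a `log2sup(max_value)`-bit field to (value decoded, bits consumed).

    `field` holds the next bits of the stream, read little-endian.
    When fewer bits are consumed, the top bit of `field` belongs to the next element.
    """
    bits = log2sup(max_value)
    # Number of small values that can be coded with one bit less.
    threshold = (1 << bits) - 1 - max_value
    low = field & ((1 << (bits - 1)) - 1)
    if low < threshold:
        return low, bits - 1
    if field >= (1 << (bits - 1)):
        return field - threshold, bits
    return field, bits
```

For `max_value = 157`, `threshold` is `98`, which yields the four rows of the table: fields `0 - 97` and `128 - 225` consume 7 bits, the others consume 8.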
-Symbols probabilities are read one by one, in order.
+Probability is derived from Value decoded using the following formula:
 `Probability = Value - 1`
-Probability is obtained from Value decoded by following formula :
+Consequently, a Probability of `0` is described by a Value `1`.
 `Proba = value - 1`
-It means value `0` becomes negative probability `-1`.
+A Value `0` is used to signal a special case, named "Probability `-1`".
-`-1` is a special probability, which means "less than 1".
+It describes a probability which should have been "less than 1".
-Its effect on distribution table is described in the [next section].
+Its effect on the decoding table building process is described in the [next section].
-For the purpose of calculating total allocated probability points, it counts as one.
+For the purpose of counting total allocated probability points, it counts as one.
 [next section]:#from-normalized-distribution-to-decoding-tables
-When a symbol has a __probability__ of `zero`,
+Symbol probabilities are read one by one, in order.
 After each probability is decoded, the total number of probability points is updated.
 This is used to determine how many bits must be read to decode the probability of the next symbol.
 When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`),
 it is followed by a 2-bit repeat flag.
 This repeat flag tells how many additional zero-probability symbols follow the current one.
 It provides a number ranging from 0 to 3.
 If it is a 3, another 2-bit repeat flag follows, and so on.
-When last symbol reaches cumulated total of `1 << Accuracy_Log`,
+When the Probability for a symbol makes cumulated total reach `1 << Accuracy_Log`,
-decoding is complete.
+then it's the last symbol, and decoding is complete.
 If this process results in a non-zero probability for a value outside of the
 valid range of values that the FSE table is defined for, even if that value is
 not used, then the data is considered corrupted.  In the case of offset codes,
 a decoder implementation may reject a frame containing a non-zero probability
 for an offset code larger than the largest offset code supported by the decoder
 implementation.
 Then the decoder can tell how many bytes were used in this process,
 and how many symbols are present.
 The bitstream consumes a round number of bytes.
 Any remaining bit within the last byte is just unused.
 If this process results in a non-zero probability for a symbol outside of the
 valid range of symbols that the FSE table is defined for, even if that symbol is
 not used, then the data is considered corrupted.
 For the specific case of offset codes,
 a decoder implementation may reject a frame containing a non-zero probability
 for an offset code larger than the largest offset code supported by the decoder
 implementation.
 #### From normalized distribution to decoding tables
-The distribution of normalized probabilities is enough
+The normalized distribution of probabilities is enough
 to create a unique decoding table.
-
+It is generated using the following build rule :
 It follows the following build rule :
 The table has a size of `Table_Size = 1 << Accuracy_Log`.
-Each cell describes the symbol decoded,
+Each row specifies the decoded symbol,
-and instructions to get the next state (`Number_of_Bits` and `Baseline`).
+and instructions to reach the next state (`Number_of_Bits` and `Baseline`).
-Symbols are scanned in their natural order for "less than 1" probabilities.
+Symbols are first scanned in their natural order for "less than 1" probabilities
-Symbols with this probability are being attributed a single cell,
+(previously decoded from a Value of `0`).
 Symbols with this special probability are being attributed a single row,
 starting from the end of the table and retreating.
 These symbols define a full state reset, reading `Accuracy_Log` bits.
-Then, all remaining symbols, sorted in natural order, are allocated cells.
+Then, all remaining symbols, sorted in natural order, are allocated rows.
-Starting from symbol `0` (if it exists), and table position `0`,
+Starting from smallest present symbol, and table position `0`,
-each symbol gets allocated as many cells as its probability.
+each symbol gets allocated as many rows as its probability.
 Cell allocation is spread, not linear :
 each successor position follows this rule :
 Row allocation is not linear, it follows this order, in modular arithmetic:
 ```
 position += (tableSize>>1) + (tableSize>>3) + 3;
 position &= tableSize-1;
 ```
-A position is skipped if already occupied by a "less than 1" probability symbol.
+Using above ordering rule, each symbol gets allocated as many rows as its probability.
-`position` does not reset between symbols, it simply iterates through
+If a position is already occupied by a "less than 1" probability symbol,
-each position in the table, switching to the next symbol when enough
+it is simply skipped, and the next position is allocated instead.
-states have been allocated to the current one.
+Once enough rows have been allocated for the current symbol,
 the allocation process continues, using the next symbol, in natural order.
 This process guarantees that the table is entirely and exactly filled.
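The row allocation described above can be sketched as follows. This is an illustration under stated assumptions: `spread_symbols` is a hypothetical name, `probabilities` is assumed to be a list indexed by symbol that uses `-1` for "less than 1" probabilities and sums to the table size (each `-1` counting as one), and `Accuracy_Log` is assumed to be at least 5 so that the step visits every position.

```python
def spread_symbols(probabilities, accuracy_log):
    """Return a list mapping each table position to its symbol."""
    table_size = 1 << accuracy_log
    table = [None] * table_size

    # "Less than 1" symbols each take a single row, from the end of the table backwards.
    high = table_size - 1
    for symbol, probability in enumerate(probabilities):
        if probability == -1:
            table[high] = symbol
            high -= 1

    # Remaining symbols, in natural order, follow the modular stepping rule.
    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    position = 0
    for symbol, probability in enumerate(probabilities):
        for _ in range(max(probability, 0)):
            table[position] = symbol
            position = (position + step) & mask
            # Skip rows already taken by "less than 1" symbols.
            while position > high:
                position = (position + step) & mask
    return table
```

Note that `position` is never reset between symbols, so each symbol resumes where the previous one stopped.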
-The process guarantees that the table is entirely filled.
+Each row specifies a decoded symbol, and is accessed by current state value.
-Each cell corresponds to a state value, which contains the symbol being decoded.
+It also specifies `Number_of_Bits` and `Baseline`, which are required to determine next state value.
-To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
+To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
-it's first necessary to sort all occurrences of each symbol in state order.
+and then attribute N+1 bits to lower rows, and N bits to higher rows,
-Lower states will need 1 more bit than higher ones.
+following the process described below (using an example):
 The process is repeated for each symbol.
 __Example__ :
-Presuming a symbol has a probability of 5,
+Presuming an `Accuracy_Log` of 7,
-it receives 5 cells, corresponding to 5 state values.
+let's imagine a symbol with a Probability of 5:
-These state values are then sorted in natural order.
+it receives 5 rows, corresponding to 5 state values between `0` and `127`.
-Next power of 2 after 5 is 8.
+In this example, the first state value happens to be `1` (after unspecified previous symbols).
-Space of probabilities must be divided into 8 equal parts.
+The next 4 states are then determined using above modular arithmetic rule,
-Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
+which specifies to add `64+16+3 = 83` modulo `128` to jump to next position,
-Divided by 8, each share is 16 large.
+producing the following series: `1`, `84`, `39`, `122`, `77` (modular arithmetic).
 (note: the next symbol will then start at `32`).
-In order to reach 8 shares, 8-5=3 lowest states will count "double",
+These state values are then sorted in natural order,
 resulting in the following series: `1`, `39`, `77`, `84`, `122`.
 The next power of 2 after 5 is 8.
 Therefore, the probability space will be divided into 8 equal parts.
 Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large.
 In order to reach 8 shares, the `8-5 = 3` lowest states will count "double",
 doubling their shares (32 in width), hence requiring one more bit.
-Baseline is assigned starting from the higher states using fewer bits,
+Baseline is assigned starting from the lowest state using fewer bits,
-increasing at each state, then resuming at the first state,
+continuing in natural state order, looping back at the beginning.
-each state takes its allocated width from Baseline.
+Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`.
 | state order      |   0   |   1   |    2   |   3  |    4   |
 | ---------------- | ----- | ----- | ------ | ---- | ------ |
 | state value      |   1   |  39   |   77   |  84  |  122   |
 | width            |  32   |  32   |   32   |  16  |   16   |
 | `Number_of_Bits` |   5   |   5   |    5   |   4  |    4   |
-| range number     |   2   |   4   |    6   |   0  |    1   |
+| allocation order |   3   |   4   |    5   |   1  |    2   |
 | `Baseline`       |  32   |  64   |   96   |   0  |   16   |
 | range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31  |
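The assignment shown in this table can be reproduced with a small sketch. The function name `assign_rows` and its dictionary return shape are illustrative choices, not part of the format. It takes the state values held by one symbol and returns `(Number_of_Bits, Baseline)` for each of them.

```python
def assign_rows(states, accuracy_log):
    """Map each state value of one symbol to (Number_of_Bits, Baseline)."""
    probability = len(states)
    # Number of equal shares: the smallest power of 2 that is >= probability.
    shares_log = (probability - 1).bit_length()
    shares = 1 << shares_log
    doubled = shares - probability          # lowest states that count "double"
    narrow_bits = accuracy_log - shares_log

    ordered = sorted(states)
    rows = {}
    baseline = 0
    # Start with the lowest state using fewer bits, then loop back to the beginning.
    for index in list(range(doubled, probability)) + list(range(doubled)):
        num_bits = narrow_bits + 1 if index < doubled else narrow_bits
        rows[ordered[index]] = (num_bits, baseline)
        baseline += 1 << num_bits
    return rows
```

Calling `assign_rows([1, 84, 39, 122, 77], 7)` gives the `Number_of_Bits` and `Baseline` columns of the table. A symbol holding a single state gets `Number_of_Bits` equal to `Accuracy_Log` and a `Baseline` of `0`.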
-During decoding, the next state value is determined from current state value,
+During decoding, the next state value is determined by using current state value as row number,
-by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
+then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`.
-See [Appendix A] for the results of this process applied to the default distributions.
+Note:
 as a trivial example, it follows that, for a symbol with a Probability of `1`,
 `Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`.
 See [Appendix A] to see the outcome of this process applied to the default distributions.
 [Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
@@ -1716,6 +1735,8 @@ or at least provide a meaningful error code explaining for which reason it canno
 Version changes
 ---------------
 - 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
 - 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
 - 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
 - 0.3.9 : clarifications for Huffman-compressed literal sizes.
 - 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.