updated man page, providing more details for --train mode

Follows up on questions from #3111. Note: only the source markdown has been updated; the actual man page zstd.1 still needs to be processed.
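
As a rough illustration of the `--train` workflow that this change documents, the sketch below trains a small dictionary and then reuses it with `-D`. It is not part of the commit. The sample contents, file names, and the 4096-byte `--maxdict` value are hypothetical choices made for the example, and the script assumes a `zstd` binary is available on `PATH` (it skips the work otherwise).

```shell
#!/bin/sh
# Sketch only: train a dictionary on many small samples, then reuse it with -D.
# All file names and sample contents below are hypothetical.
set -eu

roundtrip_status="skipped"   # stays "skipped" when zstd is not installed

if command -v zstd >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    mkdir "$workdir/samples"

    # Generate 500 small, similar samples (the man page suggests > 100).
    i=0
    while [ "$i" -lt 500 ]; do
        printf '{"id": %d, "name": "user-%d", "role": "member", "active": true, "quota_mb": %d}\n' \
            "$i" "$i" "$((i * 7 % 1000))" > "$workdir/samples/record-$i.json"
        i=$((i + 1))
    done

    # -r points --train at a directory; --maxdict caps the dictionary size in bytes.
    zstd -q --train -r "$workdir/samples" --maxdict=4096 -o "$workdir/records.dict"

    # Compress and decompress one small file, referencing the same dictionary.
    zstd -q -D "$workdir/records.dict" "$workdir/samples/record-42.json" \
        -o "$workdir/record-42.json.zst"
    zstd -q -d -D "$workdir/records.dict" "$workdir/record-42.json.zst" \
        -o "$workdir/record-42.out"

    # The decompressed output should match the original sample byte for byte.
    if cmp -s "$workdir/samples/record-42.json" "$workdir/record-42.out"; then
        roundtrip_status="ok"
    else
        roundtrip_status="mismatch"
    fi
    rm -rf "$workdir"
fi

echo "round trip: $roundtrip_status"
```

The same dictionary file has to be supplied on both the compression and the decompression side. Using `-r` with a directory avoids listing every sample on the command line, which is the shell-expansion concern the updated text mentions.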
2025-07-29 11:21:22 +03:00 · 2022-04-13 18:51:59 -07:00
parent 460780f804
commit 0df2fd6088
1 changed files with 63 additions and 35 deletions
--- a/programs/zstd.1.md
+++ b/programs/zstd.1.md
@@ -19,8 +19,8 @@ DESCRIPTION
 with command line syntax similar to `gzip (1)` and `xz (1)`.
 It is based on the **LZ77** family, with further FSE & huff0 entropy stages.
 `zstd` offers highly configurable compression speed,
-with fast modes at > 200 MB/s per core,
-and strong modes nearing lzma compression ratios.
+from fast modes at > 200 MB/s per core,
+to strong modes with excellent compression ratios.
 It also features a very fast decoder, with speeds > 500 MB/s per core.

 `zstd` command line syntax is generally similar to gzip,
@@ -31,13 +31,12 @@ but features the following differences :
  - When compressing a single file, `zstd` displays progress notifications
    and result summary by default.
    Use `-q` to turn them off.
-  - `zstd` does not accept input from console,
-    but it properly accepts `stdin` when it's not the console.
  - `zstd` displays a short help page when command line is an error.
    Use `-q` to turn it off.
+  - `zstd` does not accept input from console,
+    though it does accept `stdin` when it's not the console.

-`zstd` compresses or decompresses each _file_ according to the selected
-operation mode.
+`zstd` processes each _file_ according to the selected operation mode.
 If no _files_ are given or _file_ is `-`, `zstd` reads from standard input
 and writes the processed data to standard output.
 `zstd` will refuse to write compressed data to standard output
@@ -54,8 +53,8 @@ whose name is derived from the source _file_ name:
  get the target filename

 ### Concatenation with .zst files
-It is possible to concatenate `.zst` files as is.
-`zstd` will decompress such files as if they were a single `.zst` file.
+It is possible to concatenate multiple `.zst` files. `zstd` will decompress
+such an agglomerated file as if it were a single `.zst` file.

 OPTIONS
 -------
@@ -85,8 +84,8 @@ the last one takes effect.
    Decompress.
 * `-t`, `--test`:
    Test the integrity of compressed _files_.
-    This option is equivalent to `--decompress --stdout` except that the
-    decompressed data is discarded instead of being written to standard output.
+    This option is equivalent to `--decompress --stdout > /dev/null`:
+    decompressed data is checksummed for errors, then discarded.
    No files are created or removed.
 * `-b#`:
    Benchmark file(s) using compression level #
@@ -96,7 +95,7 @@ the last one takes effect.
 * `-l`, `--list`:
    Display information related to a zstd compressed file, such as size, ratio, and checksum.
    Some of these fields may not be available.
-    This command can be augmented with the `-v` modifier.
+    This command's output can be augmented with the `-v` modifier.

 ### Operation modifiers

@@ -292,10 +291,10 @@ options that intend to mimic the `gzip` behavior:
    alias to the option `-9`.


-### Restricted usage of Environment Variables
+### Interactions with Environment Variables

-Using environment variables to set parameters has security implications.
-Therefore, this avenue is intentionally restricted.
+Employing environment variables to set parameters has security implications.
+Therefore, this avenue is intentionally limited.
 Only `ZSTD_CLEVEL` and `ZSTD_NBTHREADS` are currently supported.
 They set the compression level and number of threads to use during compression, respectively.

@@ -305,8 +304,8 @@ If the value of `ZSTD_CLEVEL` is not a valid integer, it will be ignored with a

 `ZSTD_NBTHREADS` can be used to set the number of threads `zstd` will attempt to use during compression.
 If the value of `ZSTD_NBTHREADS` is not a valid unsigned integer, it will be ignored with a warning message.
-`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200. `zstd` must be
-compiled with multithread support for this to have any effect.
+`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200.
+`zstd` must be compiled with multithread support for this to have any effect.

 They can both be overridden by corresponding command line arguments:
 `-#` for compression level and `-T#` for number of compression threads.
@@ -318,27 +317,36 @@ DICTIONARY BUILDER
 which greatly improves efficiency on small files and messages.
 It's possible to train `zstd` with a set of samples,
 the result of which is saved into a file called a `dictionary`.
-Then during compression and decompression, reference the same dictionary,
+Then, during compression and decompression, reference the same dictionary,
 using command `-D dictionaryFileName`.
 Compression of small files similar to the sample set will be greatly improved.

 * `--train FILEs`:
    Use FILEs as training set to create a dictionary.
-    The training set should contain a lot of small files (> 100),
+    The training set should ideally contain a lot of samples (> 100),
     and weigh typically 100x the target dictionary size
-    (for example, 10 MB for a 100 KB dictionary).
+    (for example, ~10 MB for a 100 KB dictionary).
    `--train` can be combined with `-r` to indicate a directory rather than listing all the files,
    which can be useful to circumvent shell expansion limits.

+    Since dictionary compression is mostly effective for small files,
+    the expectation is that the training set will only contain small files.
+    In the case where some samples happen to be large,
+    only the first 128 KB of these samples will be used for training.
+
    `--train` supports multithreading if `zstd` is compiled with threading support (default).
-    Additional parameters can be specified with `--train-fastcover`.
+    Additional advanced parameters can be specified with `--train-fastcover`.
    The legacy dictionary builder can be accessed with `--train-legacy`.
    The slower cover dictionary builder can be accessed with `--train-cover`.
-    Default is equivalent to `--train-fastcover=d=8,steps=4`.
-* `-o file`:
-    Dictionary saved into `file` (default name: dictionary).
+    Default `--train` is equivalent to `--train-fastcover=d=8,steps=4`.
+
+* `-o FILE`:
+    Dictionary saved into `FILE` (default name: dictionary).
 * `--maxdict=#`:
-    Limit dictionary to specified size (default: 112640).
+    Limit dictionary to specified size (default: 112640 bytes).
+    As usual, quantities are expressed in bytes by default,
+    and it's possible to employ suffixes (like `KB` or `MB`)
+    to specify larger values.
 * `-#`:
    Use `#` compression level during training (optional).
    Will generate statistics more tuned for selected compression level,
@@ -346,17 +354,37 @@ Compression of small files similar to the sample set will be greatly improved.
 * `-B#`:
    Split input files into blocks of size # (default: no split)
 * `-M#`, `--memory=#`:
-    Limit the amount of sample data loaded for training (default: 2 GB). See above for details.
+    Limit the amount of sample data loaded for training (default: 2 GB).
+    Note that the default (2 GB) is also the maximum.
+    This parameter can be useful in situations where the training set size
+    is not well controlled and could be potentially very large.
+    Since speed of the training process is directly correlated to
+    the size of the training sample set,
+    a smaller sample set leads to faster training.
+
+    In situations where the training set is larger than maximum memory,
+    the CLI will randomly select samples among the available ones,
+    up to the maximum allowed memory budget.
+    This is meant to improve dictionary relevance
+    by mitigating the potential impact of clustering,
+    such as selecting only files from the beginning of a list
+    sorted by modification date, or sorted by alphabetical order.
+    The randomization process is deterministic, so
+    training of the same list of files with the same parameters
+    will lead to the creation of the same dictionary.
+
 * `--dictID=#`:
-    A dictionary ID is a locally unique ID
-    that a decoder can use to verify it is using the right dictionary.
+    A dictionary ID is a locally unique ID.
+    The decoder will use this value to verify it is using the right dictionary.
    By default, zstd will create a 4-bytes random number ID.
-    It's possible to give a precise number instead.
-    Short numbers have an advantage : an ID < 256 will only need 1 byte in the
-    compressed frame header, and an ID < 65536 will only need 2 bytes.
-    This compares favorably to 4 bytes default.
-    However, it's up to the dictionary manager to not assign twice the same ID to
+    It's possible to provide an explicit number ID instead.
+    It's up to the dictionary manager to avoid assigning the same ID to
    2 different dictionaries.
+    Note that short numbers have an advantage :
+    an ID < 256 will only need 1 byte in the compressed frame header,
+    and an ID < 65536 will only need 2 bytes.
+    This compares favorably to the 4-byte default.
+
 * `--train-cover[=k#,d=#,steps=#,split=#,shrink[=#]]`:
    Select parameters for the default dictionary builder algorithm named cover.
    If _d_ is not specified, then it tries _d_ = 6 and _d_ = 8.
@@ -421,7 +449,7 @@ Compression of small files similar to the sample set will be greatly improved.
    Use legacy dictionary builder algorithm with the given dictionary
    _selectivity_ (default: 9).
    The smaller the _selectivity_ value, the denser the dictionary,
-    improving its efficiency but reducing its possible maximum size.
+    improving its efficiency but reducing its achievable maximum size.
    `--train-legacy=s=#` is also accepted.

    Examples:
@@ -452,14 +480,14 @@ BENCHMARK
 ADVANCED COMPRESSION OPTIONS
 ----------------------------
 ### -B#:
-Select the size of each compression job.
+Specify the size of each compression job.
 This parameter is only available when multi-threading is enabled.
 Each compression job is run in parallel, so this value indirectly impacts the nb of active threads.
 Default job size varies depending on compression level (generally  `4 * windowSize`).
 `-B#` makes it possible to manually select a custom size.
 Note that job size must respect a minimum value which is enforced transparently.
 This minimum is either 512 KB, or `overlapSize`, whichever is largest.
-Different job sizes will lead to (slightly) different compressed frames.
+Different job sizes will lead to non-identical compressed frames.

 ### --zstd[=options]:
 `zstd` provides 22 predefined compression levels.