mirror of https://github.com/facebook/zstd.git synced 2025-07-29 11:21:22 +03:00

updated man page, providing more details for --train mode

following questions from #3111.

Note : only the source markdown has been updated,
the actual man page zstd.1 still needs to be processed.
This commit is contained in:
Yann Collet
2022-04-13 18:51:59 -07:00
parent 460780f804
commit 0df2fd6088


@ -19,8 +19,8 @@ DESCRIPTION
with command line syntax similar to `gzip (1)` and `xz (1)`.
It is based on the **LZ77** family, with further FSE & huff0 entropy stages.
`zstd` offers highly configurable compression speed,
with fast modes at > 200 MB/s per core,
and strong modes nearing lzma compression ratios.
from fast modes at > 200 MB/s per core,
to strong modes with excellent compression ratios.
It also features a very fast decoder, with speeds > 500 MB/s per core.
`zstd` command line syntax is generally similar to gzip,
@ -31,13 +31,12 @@ but features the following differences :
- When compressing a single file, `zstd` displays progress notifications
and result summary by default.
Use `-q` to turn them off.
- `zstd` does not accept input from console,
but it properly accepts `stdin` when it's not the console.
- `zstd` displays a short help page when command line is an error.
Use `-q` to turn it off.
- `zstd` does not accept input from console,
though it does accept `stdin` when it's not the console.
`zstd` compresses or decompresses each _file_ according to the selected
operation mode.
`zstd` processes each _file_ according to the selected operation mode.
If no _files_ are given or _file_ is `-`, `zstd` reads from standard input
and writes the processed data to standard output.
`zstd` will refuse to write compressed data to standard output
@ -54,8 +53,8 @@ whose name is derived from the source _file_ name:
get the target filename
### Concatenation with .zst files
It is possible to concatenate `.zst` files as is.
`zstd` will decompress such files as if they were a single `.zst` file.
It is possible to concatenate multiple `.zst` files. `zstd` will decompress
such an agglomerated file as if it were a single `.zst` file.
OPTIONS
-------
@ -85,8 +84,8 @@ the last one takes effect.
Decompress.
* `-t`, `--test`:
Test the integrity of compressed _files_.
This option is equivalent to `--decompress --stdout` except that the
decompressed data is discarded instead of being written to standard output.
This option is equivalent to `--decompress --stdout > /dev/null`:
decompressed data is discarded and checksummed for errors.
No files are created or removed.
* `-b#`:
Benchmark file(s) using compression level #
@ -96,7 +95,7 @@ the last one takes effect.
* `-l`, `--list`:
Display information related to a zstd compressed file, such as size, ratio, and checksum.
Some of these fields may not be available.
This command can be augmented with the `-v` modifier.
This command's output can be augmented with the `-v` modifier.
### Operation modifiers
@ -292,10 +291,10 @@ options that intend to mimic the `gzip` behavior:
alias to the option `-9`.
### Restricted usage of Environment Variables
### Interactions with Environment Variables
Using environment variables to set parameters has security implications.
Therefore, this avenue is intentionally restricted.
Employing environment variables to set parameters has security implications.
Therefore, this avenue is intentionally limited.
Only `ZSTD_CLEVEL` and `ZSTD_NBTHREADS` are currently supported.
They set the compression level and number of threads to use during compression, respectively.
@ -305,8 +304,8 @@ If the value of `ZSTD_CLEVEL` is not a valid integer, it will be ignored with a
`ZSTD_NBTHREADS` can be used to set the number of threads `zstd` will attempt to use during compression.
If the value of `ZSTD_NBTHREADS` is not a valid unsigned integer, it will be ignored with a warning message.
`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200. `zstd` must be
compiled with multithread support for this to have any effect.
`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200.
`zstd` must be compiled with multithread support for this to have any effect.
They can both be overridden by corresponding command line arguments:
`-#` for compression level and `-T#` for number of compression threads.
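
The precedence rules for `ZSTD_NBTHREADS` described above can be sketched as follows. This is only an illustration in Python, not the CLI's actual code: the function name `resolve_nb_threads` and its parameters are hypothetical, and the sketch assumes that an invalid value falls back to the default of 1 after the warning.

```python
import os
import sys

NBWORKERS_MAX = 200  # cap quoted in the text (ZSTDMT_NBWORKERS_MAX)
DEFAULT_NB_THREADS = 1


def resolve_nb_threads(cli_value=None, environ=None):
    """Hypothetical helper: pick the thread count from -T# or ZSTD_NBTHREADS."""
    environ = os.environ if environ is None else environ
    # A command line argument (-T#) overrides the environment variable.
    # How the CLI validates its own argument is outside this sketch.
    if cli_value is not None:
        return cli_value
    raw = environ.get("ZSTD_NBTHREADS")
    if raw is None:
        return DEFAULT_NB_THREADS
    # Only a plain unsigned decimal integer is considered valid here.
    if not (raw.isascii() and raw.isdigit()):
        print("warning: ignoring invalid ZSTD_NBTHREADS", file=sys.stderr)
        return DEFAULT_NB_THREADS  # assumption: fall back to the default
    return min(int(raw), NBWORKERS_MAX)
```

With this sketch, an environment containing `ZSTD_NBTHREADS=1000` resolves to 200, while passing `cli_value=8` returns 8 regardless of the environment.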
@ -318,27 +317,36 @@ DICTIONARY BUILDER
which greatly improves efficiency on small files and messages.
It's possible to train `zstd` with a set of samples,
the result of which is saved into a file called a `dictionary`.
Then during compression and decompression, reference the same dictionary,
Then, during compression and decompression, reference the same dictionary,
using command `-D dictionaryFileName`.
Compression of small files similar to the sample set will be greatly improved.
* `--train FILEs`:
Use FILEs as training set to create a dictionary.
The training set should contain a lot of small files (> 100),
The training set should ideally contain a lot of samples (> 100),
and weigh typically 100x the target dictionary size
(for example, 10 MB for a 100 KB dictionary).
(for example, ~10 MB for a 100 KB dictionary).
`--train` can be combined with `-r` to indicate a directory rather than listing all the files,
which can be useful to circumvent shell expansion limits.
Since dictionary compression is mostly effective for small files,
the expectation is that the training set will only contain small files.
In the case where some samples happen to be large,
only the first 128 KB of these samples will be used for training.
`--train` supports multithreading if `zstd` is compiled with threading support (default).
Additional parameters can be specified with `--train-fastcover`.
Additional advanced parameters can be specified with `--train-fastcover`.
The legacy dictionary builder can be accessed with `--train-legacy`.
The slower cover dictionary builder can be accessed with `--train-cover`.
Default is equivalent to `--train-fastcover=d=8,steps=4`.
* `-o file`:
Dictionary saved into `file` (default name: dictionary).
Default `--train` is equivalent to `--train-fastcover=d=8,steps=4`.
* `-o FILE`:
Dictionary saved into `FILE` (default name: dictionary).
* `--maxdict=#`:
Limit dictionary to specified size (default: 112640).
Limit dictionary to specified size (default: 112640 bytes).
As usual, quantities are expressed in bytes by default,
and it's possible to employ suffixes (like `KB` or `MB`)
to specify larger values.
* `-#`:
Use `#` compression level during training (optional).
Will generate statistics more tuned for selected compression level,
@ -346,17 +354,37 @@ Compression of small files similar to the sample set will be greatly improved.
* `-B#`:
Split input files into blocks of size # (default: no split)
* `-M#`, `--memory=#`:
Limit the amount of sample data loaded for training (default: 2 GB). See above for details.
Limit the amount of sample data loaded for training (default: 2 GB).
Note that the default (2 GB) is also the maximum.
This parameter can be useful in situations where the training set size
is not well controlled and could be potentially very large.
Since speed of the training process is directly correlated to
the size of the training sample set,
a smaller sample set leads to faster training.
In situations where the training set is larger than maximum memory,
the CLI will randomly select samples among the available ones,
up to the maximum allowed memory budget.
This is meant to improve dictionary relevance
by mitigating the potential impact of clustering,
such as selecting only files from the beginning of a list
sorted by modification date, or sorted by alphabetical order.
The randomization process is deterministic, so
training of the same list of files with the same parameters
will lead to the creation of the same dictionary.
* `--dictID=#`:
A dictionary ID is a locally unique ID
that a decoder can use to verify it is using the right dictionary.
A dictionary ID is a locally unique ID.
The decoder will use this value to verify it is using the right dictionary.
By default, zstd will create a 4-byte random number ID.
It's possible to give a precise number instead.
Short numbers have an advantage : an ID < 256 will only need 1 byte in the
compressed frame header, and an ID < 65536 will only need 2 bytes.
This compares favorably to 4 bytes default.
However, it's up to the dictionary manager to not assign twice the same ID to
It's possible to provide an explicit number ID instead.
It's up to the dictionary manager not to assign the same ID to
2 different dictionaries.
Note that short numbers have an advantage :
an ID < 256 will only need 1 byte in the compressed frame header,
and an ID < 65536 will only need 2 bytes.
This compares favorably to 4 bytes default.
* `--train-cover[=k#,d=#,steps=#,split=#,shrink[=#]]`:
Select parameters for the default dictionary builder algorithm named cover.
If _d_ is not specified, then it tries _d_ = 6 and _d_ = 8.
@ -421,7 +449,7 @@ Compression of small files similar to the sample set will be greatly improved.
Use legacy dictionary builder algorithm with the given dictionary
_selectivity_ (default: 9).
The smaller the _selectivity_ value, the denser the dictionary,
improving its efficiency but reducing its possible maximum size.
improving its efficiency but reducing its achievable maximum size.
`--train-legacy=s=#` is also accepted.
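
Two pieces of arithmetic from the options above can be made concrete with a short Python sketch: the sample budget described under `--train` and `-M#`, and the header cost of a dictionary ID described under `--dictID`. This is not the algorithm used by `zstd` itself. The function names, the `seed` parameter and the use of Python's `random` module are assumptions made for illustration. Only the 128 KB truncation, the 2 GB budget, the deterministic selection and the 1/2/4 byte thresholds come from the text.

```python
import random

MAX_SAMPLE_BYTES = 128 * 1024        # only the first 128 KB of a sample is used
DEFAULT_MEMORY_BUDGET = 2 * 1024**3  # -M# default, which is also its maximum


def effective_size(sample_size):
    """Bytes of one sample that actually take part in training."""
    return min(sample_size, MAX_SAMPLE_BYTES)


def select_samples(sample_sizes, memory_budget=DEFAULT_MEMORY_BUDGET, seed=0):
    """Hypothetical helper: return the indices of the samples to load.

    If everything fits in the budget, all samples are kept. Otherwise
    samples are visited in a seeded pseudo-random order (an assumption
    of this sketch) and kept while they still fit.
    """
    sizes = [effective_size(size) for size in sample_sizes]
    if sum(sizes) <= memory_budget:
        return list(range(len(sizes)))
    order = list(range(len(sizes)))
    random.Random(seed).shuffle(order)  # fixed seed: same input, same choice
    chosen, used = [], 0
    for index in order:
        if used + sizes[index] <= memory_budget:
            chosen.append(index)
            used += sizes[index]
    return sorted(chosen)


def dict_id_field_bytes(dict_id):
    """Bytes needed in the frame header for a non-zero dictionary ID."""
    if not 0 < dict_id < 2**32:
        raise ValueError("this sketch expects an ID in 1..2**32-1")
    if dict_id < 256:
        return 1
    if dict_id < 65536:
        return 2
    return 4
```

Because the seed is fixed, calling `select_samples` twice with the same list and budget returns the same indices, which mirrors the determinism described for `-M#`. `dict_id_field_bytes(255)` is 1 and `dict_id_field_bytes(65536)` is 4.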
Examples:
@ -452,14 +480,14 @@ BENCHMARK
ADVANCED COMPRESSION OPTIONS
----------------------------
### -B#:
Select the size of each compression job.
Specify the size of each compression job.
This parameter is only available when multi-threading is enabled.
Each compression job is run in parallel, so this value indirectly impacts the nb of active threads.
Default job size varies depending on compression level (generally `4 * windowSize`).
`-B#` makes it possible to manually select a custom size.
Note that job size must respect a minimum value which is enforced transparently.
This minimum is either 512 KB, or `overlapSize`, whichever is largest.
Different job sizes will lead to (slightly) different compressed frames.
Different job sizes will lead to non-identical compressed frames.
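
The job size rules for `-B#` can be sketched as a small Python calculation. The helper names are hypothetical and the sketch only restates the text: the default is generally `4 * windowSize`, and a requested size is raised to the larger of 512 KB and `overlapSize` when it falls below that minimum.

```python
KB = 1024
MIN_JOB_SIZE = 512 * KB  # minimum quoted in the text


def default_job_size(window_size):
    """Default job size, which the text gives as generally 4 * windowSize."""
    return 4 * window_size


def effective_job_size(requested, overlap_size):
    """Hypothetical helper: apply the minimum that is enforced transparently."""
    minimum = max(MIN_JOB_SIZE, overlap_size)
    return max(requested, minimum)
```

For instance, requesting 100 KB with no overlap yields 512 KB, while requesting 1 MB with a 2 MB overlap yields 2 MB.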
### --zstd[=options]:
`zstd` provides 22 predefined compression levels.