Mirror of https://github.com/facebook/zstd.git (synced 2025-07-29 11:21:22 +03:00)
updated man page, providing more details for --train mode
following questions from #3111. Note: only the source markdown has been updated; the actual man page zstd.1 still needs to be processed.
@@ -19,8 +19,8 @@ DESCRIPTION
 with command line syntax similar to `gzip (1)` and `xz (1)`.
 It is based on the **LZ77** family, with further FSE & huff0 entropy stages.
 `zstd` offers highly configurable compression speed,
-with fast modes at > 200 MB/s per core,
-and strong modes nearing lzma compression ratios.
+from fast modes at > 200 MB/s per core,
+to strong modes with excellent compression ratios.
 It also features a very fast decoder, with speeds > 500 MB/s per core.

 `zstd` command line syntax is generally similar to gzip,
@@ -31,13 +31,12 @@ but features the following differences :
   - When compressing a single file, `zstd` displays progress notifications
     and result summary by default.
     Use `-q` to turn them off.
-  - `zstd` does not accept input from console,
-    but it properly accepts `stdin` when it's not the console.
   - `zstd` displays a short help page when command line is an error.
     Use `-q` to turn it off.
+  - `zstd` does not accept input from console,
+    though it does accept `stdin` when it's not the console.

-`zstd` compresses or decompresses each _file_ according to the selected
-operation mode.
+`zstd` processes each _file_ according to the selected operation mode.
 If no _files_ are given or _file_ is `-`, `zstd` reads from standard input
 and writes the processed data to standard output.
 `zstd` will refuse to write compressed data to standard output
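The stdin/stdout conventions described in this hunk can be exercised as follows. This is an illustrative sketch, not part of the man page; file and directory names are hypothetical:

```sh
# zstd accepts stdin when it is not the console, and writes to stdout when piped:
tar cf - mydir | zstd > mydir.tar.zst

# '-' explicitly designates standard input:
zstd -d - < mydir.tar.zst | tar xf -

# -q suppresses the progress notifications and result summary:
zstd -q file.txt
```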
@@ -54,8 +53,8 @@ whose name is derived from the source _file_ name:
 get the target filename

 ### Concatenation with .zst files
-It is possible to concatenate `.zst` files as is.
-`zstd` will decompress such files as if they were a single `.zst` file.
+It is possible to concatenate multiple `.zst` files. `zstd` will decompress
+such agglomerated file as if it was a single `.zst` file.

 OPTIONS
 -------
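To illustrate the concatenation property, here is a minimal sketch with hypothetical file names; decompressing the agglomerated file restores the concatenation of the original contents:

```sh
zstd file1 file2                      # produces file1.zst and file2.zst
cat file1.zst file2.zst > merged.zst  # concatenate the compressed frames
zstd -d -c merged.zst > merged.out    # decompressed as a single .zst input
cat file1 file2 | cmp - merged.out && echo "contents match"
```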
@@ -85,8 +84,8 @@ the last one takes effect.
     Decompress.
 * `-t`, `--test`:
     Test the integrity of compressed _files_.
-    This option is equivalent to `--decompress --stdout` except that the
-    decompressed data is discarded instead of being written to standard output.
+    This option is equivalent to `--decompress --stdout > /dev/null`,
+    decompressed data is discarded and checksummed for errors.
     No files are created or removed.
 * `-b#`:
     Benchmark file(s) using compression level #
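A quick illustration of `--test` (file names hypothetical); no files are created or removed either way:

```sh
zstd -t archive.zst     # checksums the decompressed data, then discards it
zstd -t backups/*.zst   # test several files at once
```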
@@ -96,7 +95,7 @@ the last one takes effect.
 * `-l`, `--list`:
     Display information related to a zstd compressed file, such as size, ratio, and checksum.
     Some of these fields may not be available.
-    This command can be augmented with the `-v` modifier.
+    This command's output can be augmented with the `-v` modifier.

 ### Operation modifiers
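For illustration (hypothetical file name):

```sh
zstd -l archive.zst      # size, ratio, checksum, when available
zstd -l -v archive.zst   # augmented output with additional frame details
```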
@@ -292,10 +291,10 @@ options that intend to mimic the `gzip` behavior:
     alias to the option `-9`.


-### Restricted usage of Environment Variables
+### Interactions with Environment Variables

-Using environment variables to set parameters has security implications.
-Therefore, this avenue is intentionally restricted.
+Employing environment variables to set parameters has security implications.
+Therefore, this avenue is intentionally limited.
 Only `ZSTD_CLEVEL` and `ZSTD_NBTHREADS` are currently supported.
 They set the compression level and number of threads to use during compression, respectively.

@@ -305,8 +304,8 @@ If the value of `ZSTD_CLEVEL` is not a valid integer, it will be ignored with a

 `ZSTD_NBTHREADS` can be used to set the number of threads `zstd` will attempt to use during compression.
 If the value of `ZSTD_NBTHREADS` is not a valid unsigned integer, it will be ignored with a warning message.
-`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200. `zstd` must be
-compiled with multithread support for this to have any effect.
+`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200.
+`zstd` must be compiled with multithread support for this to have any effect.

 They can both be overridden by corresponding command line arguments:
 `-#` for compression level and `-T#` for number of compression threads.
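The precedence rules described in these hunks can be illustrated as follows (a sketch; file names are hypothetical):

```sh
export ZSTD_CLEVEL=19      # default compression level
export ZSTD_NBTHREADS=4    # worker threads (requires a multithread build)
zstd file.txt              # compresses at level 19 with 4 threads

# Command line arguments override the environment:
zstd -7 -T2 file.txt       # level 7 and 2 threads, despite the variables
```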
@@ -318,27 +317,36 @@ DICTIONARY BUILDER
 which greatly improves efficiency on small files and messages.
 It's possible to train `zstd` with a set of samples,
 the result of which is saved into a file called a `dictionary`.
-Then during compression and decompression, reference the same dictionary,
+Then, during compression and decompression, reference the same dictionary,
 using command `-D dictionaryFileName`.
 Compression of small files similar to the sample set will be greatly improved.

 * `--train FILEs`:
     Use FILEs as training set to create a dictionary.
-    The training set should contain a lot of small files (> 100),
+    The training set should ideally contain a lot of samples (> 100),
     and weight typically 100x the target dictionary size
-    (for example, 10 MB for a 100 KB dictionary).
+    (for example, ~10 MB for a 100 KB dictionary).
+    `--train` can be combined with `-r` to indicate a directory rather than listing all the files,
+    which can be useful to circumvent shell expansion limits.
+
+    Since dictionary compression is mostly effective for small files,
+    the expectation is that the training set will only contain small files.
+    In the case where some samples happen to be large,
+    only the first 128 KB of these samples will be used for training.
+
     `--train` supports multithreading if `zstd` is compiled with threading support (default).
-    Additional parameters can be specified with `--train-fastcover`.
+    Additional advanced parameters can be specified with `--train-fastcover`.
     The legacy dictionary builder can be accessed with `--train-legacy`.
     The slower cover dictionary builder can be accessed with `--train-cover`.
-    Default is equivalent to `--train-fastcover=d=8,steps=4`.
-* `-o file`:
-    Dictionary saved into `file` (default name: dictionary).
+    Default `--train` is equivalent to `--train-fastcover=d=8,steps=4`.
+
+* `-o FILE`:
+    Dictionary saved into `FILE` (default name: dictionary).
 * `--maxdict=#`:
-    Limit dictionary to specified size (default: 112640).
+    Limit dictionary to specified size (default: 112640 bytes).
+    As usual, quantities are expressed in bytes by default,
+    and it's possible to employ suffixes (like `KB` or `MB`)
+    to specify larger values.
 * `-#`:
     Use `#` compression level during training (optional).
     Will generate statistics more tuned for selected compression level,
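Putting the options above together, a typical training round trip looks like this. A sketch assuming a directory `samples/` of small files and the hypothetical dictionary names `mydict` and `mydict64k`:

```sh
# Train from a directory (-r avoids shell expansion limits):
zstd --train -r samples/ -o mydict

# Cap the dictionary size with a suffix, as documented for --maxdict:
zstd --train -r samples/ --maxdict=64KB -o mydict64k

# Reference the same dictionary for compression and decompression:
zstd -D mydict message.json
zstd -D mydict -d message.json.zst
```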
@@ -346,17 +354,37 @@ Compression of small files similar to the sample set will be greatly improved.
 * `-B#`:
     Split input files into blocks of size # (default: no split)
 * `-M#`, `--memory=#`:
-    Limit the amount of sample data loaded for training (default: 2 GB). See above for details.
+    Limit the amount of sample data loaded for training (default: 2 GB).
+    Note that the default (2 GB) is also the maximum.
+    This parameter can be useful in situations where the training set size
+    is not well controlled and could be potentially very large.
+    Since speed of the training process is directly correlated to
+    the size of the training sample set,
+    a smaller sample set leads to faster training.
+
+    In situations where the training set is larger than maximum memory,
+    the CLI will randomly select samples among the available ones,
+    up to the maximum allowed memory budget.
+    This is meant to improve dictionary relevance
+    by mitigating the potential impact of clustering,
+    such as selecting only files from the beginning of a list
+    sorted by modification date, or sorted by alphabetical order.
+    The randomization process is deterministic, so
+    training of the same list of files with the same parameters
+    will lead to the creation of the same dictionary.
+
 * `--dictID=#`:
-    A dictionary ID is a locally unique ID
-    that a decoder can use to verify it is using the right dictionary.
+    A dictionary ID is a locally unique ID.
+    The decoder will use this value to verify it is using the right dictionary.
     By default, zstd will create a 4-bytes random number ID.
-    It's possible to give a precise number instead.
-    Short numbers have an advantage : an ID < 256 will only need 1 byte in the
-    compressed frame header, and an ID < 65536 will only need 2 bytes.
-    This compares favorably to 4 bytes default.
-    However, it's up to the dictionary manager to not assign twice the same ID to
+    It's possible to provide an explicit number ID instead.
+    It's up to the dictionary manager to not assign twice the same ID to
     2 different dictionaries.
+    Note that short numbers have an advantage :
+    an ID < 256 will only need 1 byte in the compressed frame header,
+    and an ID < 65536 will only need 2 bytes.
+    This compares favorably to 4 bytes default.
+
 * `--train-cover[=k#,d=#,steps=#,split=#,shrink[=#]]`:
     Select parameters for the default dictionary builder algorithm named cover.
     If _d_ is not specified, then it tries _d_ = 6 and _d_ = 8.
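For illustration, the two options documented in this hunk might be combined with training as follows (hypothetical paths and ID value):

```sh
# Bound training memory; with a larger corpus, samples are selected
# deterministically at random up to this budget:
zstd --train -r huge_corpus/ --memory=512MB -o mydict

# Assign an explicit short ID (< 256 needs only 1 byte in frame headers):
zstd --train -r samples/ --dictID=42 -o mydict42
```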
@@ -421,7 +449,7 @@ Compression of small files similar to the sample set will be greatly improved.
     Use legacy dictionary builder algorithm with the given dictionary
     _selectivity_ (default: 9).
     The smaller the _selectivity_ value, the denser the dictionary,
-    improving its efficiency but reducing its possible maximum size.
+    improving its efficiency but reducing its achievable maximum size.
     `--train-legacy=s=#` is also accepted.

     Examples:
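As a sketch of the legacy builder invocation (hypothetical paths):

```sh
# Lower selectivity than the default 9 yields a denser dictionary:
zstd --train-legacy=selectivity=5 -r samples/ -o legacy.dict
zstd --train-legacy=s=5 -r samples/ -o legacy.dict   # equivalent short form
```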
@@ -452,14 +480,14 @@ BENCHMARK
 ADVANCED COMPRESSION OPTIONS
 ----------------------------
 ### -B#:
-Select the size of each compression job.
+Specify the size of each compression job.
 This parameter is only available when multi-threading is enabled.
 Each compression job is run in parallel, so this value indirectly impacts the nb of active threads.
 Default job size varies depending on compression level (generally `4 * windowSize`).
 `-B#` makes it possible to manually select a custom size.
 Note that job size must respect a minimum value which is enforced transparently.
 This minimum is either 512 KB, or `overlapSize`, whichever is largest.
-Different job sizes will lead to (slightly) different compressed frames.
+Different job sizes will lead to non-identical compressed frames.

 ### --zstd[=options]:
 `zstd` provides 22 predefined compression levels.
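To see the effect of job size on multi-threaded output, the following sketch (hypothetical file name, and assuming the usual size-suffix syntax for `-B#`) produces two valid but non-identical frames:

```sh
zstd -T4 -B2M -c large_input.bin > a.zst
zstd -T4 -B8M -c large_input.bin > b.zst
cmp a.zst b.zst                                      # expected to differ
zstd -d -c a.zst | cmp - large_input.bin && echo "round-trip OK"
```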