diff --git a/programs/zstd.1.md b/programs/zstd.1.md
index 3901b58db..4e176f399 100644
--- a/programs/zstd.1.md
+++ b/programs/zstd.1.md
@@ -19,8 +19,8 @@ DESCRIPTION
 with command line syntax similar to `gzip (1)` and `xz (1)`.
 It is based on the **LZ77** family, with further FSE & huff0 entropy stages.
 `zstd` offers highly configurable compression speed,
-with fast modes at > 200 MB/s per core,
-and strong modes nearing lzma compression ratios.
+from fast modes at > 200 MB/s per core,
+to strong modes with excellent compression ratios.
 It also features a very fast decoder, with speeds > 500 MB/s per core.
 
 `zstd` command line syntax is generally similar to gzip,
@@ -31,13 +31,12 @@ but features the following differences :
   - When compressing a single file, `zstd` displays progress notifications
     and result summary by default.
     Use `-q` to turn them off.
-  - `zstd` does not accept input from console,
-    but it properly accepts `stdin` when it's not the console.
   - `zstd` displays a short help page when command line is an error.
     Use `-q` to turn it off.
+  - `zstd` does not accept input from console,
+    though it does accept `stdin` when it's not the console.
 
-`zstd` compresses or decompresses each _file_ according to the selected
-operation mode.
+`zstd` processes each _file_ according to the selected operation mode.
 If no _files_ are given or _file_ is `-`, `zstd` reads from standard input
 and writes the processed data to standard output.
 `zstd` will refuse to write compressed data to standard output
@@ -54,8 +53,8 @@ whose name is derived from the source _file_ name:
   get the target filename
 
 ### Concatenation with .zst files
-It is possible to concatenate `.zst` files as is.
-`zstd` will decompress such files as if they were a single `.zst` file.
+It is possible to concatenate multiple `.zst` files. `zstd` will decompress
+such agglomerated file as if it was a single `.zst` file.
 
 OPTIONS
 -------
@@ -85,8 +84,8 @@ the last one takes effect.
     Decompress.
 * `-t`, `--test`:
     Test the integrity of compressed _files_.
-    This option is equivalent to `--decompress --stdout` except that the
-    decompressed data is discarded instead of being written to standard output.
+    This option is equivalent to `--decompress --stdout > /dev/null`,
+    decompressed data is discarded and checksummed for errors.
     No files are created or removed.
 * `-b#`:
     Benchmark file(s) using compression level #
@@ -96,7 +95,7 @@ the last one takes effect.
 * `-l`, `--list`:
     Display information related to a zstd compressed file, such as size, ratio, and checksum.
     Some of these fields may not be available.
-    This command can be augmented with the `-v` modifier.
+    This command's output can be augmented with the `-v` modifier.
 
 ### Operation modifiers
 
@@ -292,10 +291,10 @@ options that intend to mimic the `gzip` behavior:
     alias to the option `-9`.
 
 
-### Restricted usage of Environment Variables
+### Interactions with Environment Variables
 
-Using environment variables to set parameters has security implications.
-Therefore, this avenue is intentionally restricted.
+Employing environment variables to set parameters has security implications.
+Therefore, this avenue is intentionally limited.
 Only `ZSTD_CLEVEL` and `ZSTD_NBTHREADS` are currently supported.
 They set the compression level and number of threads to use during compression, respectively.
 
@@ -305,8 +304,8 @@ If the value of `ZSTD_CLEVEL` is not a valid integer, it will be ignored with a
 
 `ZSTD_NBTHREADS` can be used to set the number of threads `zstd` will attempt to use during compression.
 If the value of `ZSTD_NBTHREADS` is not a valid unsigned integer, it will be ignored with a warning message.
-`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200. `zstd` must be
-compiled with multithread support for this to have any effect.
+`ZSTD_NBTHREADS` has a default value of (`1`), and is capped at ZSTDMT_NBWORKERS_MAX==200.
+`zstd` must be compiled with multithread support for this to have any effect.
 
 They can both be overridden by corresponding command line arguments:
 `-#` for compression level and `-T#` for number of compression threads.
@@ -318,27 +317,36 @@ DICTIONARY BUILDER
 which greatly improves efficiency on small files and messages.
 It's possible to train `zstd` with a set of samples,
 the result of which is saved into a file called a `dictionary`.
-Then during compression and decompression, reference the same dictionary,
+Then, during compression and decompression, reference the same dictionary,
 using command `-D dictionaryFileName`.
 Compression of small files similar to the sample set will be greatly improved.
 
 * `--train FILEs`:
     Use FILEs as training set to create a dictionary.
-    The training set should contain a lot of small files (> 100),
+    The training set should ideally contain a lot of samples (> 100),
     and weight typically 100x the target dictionary size
-    (for example, 10 MB for a 100 KB dictionary).
+    (for example, ~10 MB for a 100 KB dictionary).
     `--train` can be combined with `-r` to indicate a directory rather than listing all the files,
     which can be useful to circumvent shell expansion limits.
 
+    Since dictionary compression is mostly effective for small files,
+    the expectation is that the training set will only contain small files.
+    In the case where some samples happen to be large,
+    only the first 128 KB of these samples will be used for training.
+
     `--train` supports multithreading if `zstd` is compiled with threading support (default).
-    Additional parameters can be specified with `--train-fastcover`.
+    Additional advanced parameters can be specified with `--train-fastcover`.
     The legacy dictionary builder can be accessed with `--train-legacy`.
     The slower cover dictionary builder can be accessed with `--train-cover`.
-    Default is equivalent to `--train-fastcover=d=8,steps=4`.
-* `-o file`:
-    Dictionary saved into `file` (default name: dictionary).
+    Default `--train` is equivalent to `--train-fastcover=d=8,steps=4`.
+
+* `-o FILE`:
+    Dictionary saved into `FILE` (default name: dictionary).
 * `--maxdict=#`:
-    Limit dictionary to specified size (default: 112640).
+    Limit dictionary to specified size (default: 112640 bytes).
+    As usual, quantities are expressed in bytes by default,
+    and it's possible to employ suffixes (like `KB` or `MB`)
+    to specify larger values.
 * `-#`:
     Use `#` compression level during training (optional).
     Will generate statistics more tuned for selected compression level,
@@ -346,17 +354,37 @@ Compression of small files similar to the sample set will be greatly improved.
 * `-B#`:
     Split input files into blocks of size # (default: no split)
 * `-M#`, `--memory=#`:
-    Limit the amount of sample data loaded for training (default: 2 GB). See above for details.
+    Limit the amount of sample data loaded for training (default: 2 GB).
+    Note that the default (2 GB) is also the maximum.
+    This parameter can be useful in situations where the training set size
+    is not well controlled and could be potentially very large.
+    Since speed of the training process is directly correlated to
+    the size of the training sample set,
+    a smaller sample set leads to faster training.
+
+    In situations where the training set is larger than maximum memory,
+    the CLI will randomly select samples among the available ones,
+    up to the maximum allowed memory budget.
+    This is meant to improve dictionary relevance
+    by mitigating the potential impact of clustering,
+    such as selecting only files from the beginning of a list
+    sorted by modification date, or sorted by alphabetical order.
+    The randomization process is deterministic, so
+    training of the same list of files with the same parameters
+    will lead to the creation of the same dictionary.
+
 * `--dictID=#`:
-    A dictionary ID is a locally unique ID
-    that a decoder can use to verify it is using the right dictionary.
+    A dictionary ID is a locally unique ID.
+    The decoder will use this value to verify it is using the right dictionary.
     By default, zstd will create a 4-bytes random number ID.
-    It's possible to give a precise number instead.
-    Short numbers have an advantage : an ID < 256 will only need 1 byte in the
-    compressed frame header, and an ID < 65536 will only need 2 bytes.
-    This compares favorably to 4 bytes default.
-    However, it's up to the dictionary manager to not assign twice the same ID to
+    It's possible to provide an explicit number ID instead.
+    It's up to the dictionary manager to not assign twice the same ID to
     2 different dictionaries.
+    Note that short numbers have an advantage :
+    an ID < 256 will only need 1 byte in the compressed frame header,
+    and an ID < 65536 will only need 2 bytes.
+    This compares favorably to 4 bytes default.
+
 * `--train-cover[=k#,d=#,steps=#,split=#,shrink[=#]]`:
     Select parameters for the default dictionary builder algorithm named cover.
     If _d_ is not specified, then it tries _d_ = 6 and _d_ = 8.
@@ -421,7 +449,7 @@ Compression of small files similar to the sample set will be greatly improved.
     Use legacy dictionary builder algorithm with the given dictionary
     _selectivity_ (default: 9).
     The smaller the _selectivity_ value, the denser the dictionary,
-    improving its efficiency but reducing its possible maximum size.
+    improving its efficiency but reducing its achievable maximum size.
     `--train-legacy=s=#` is also accepted.
 
     Examples:
@@ -452,14 +480,14 @@ BENCHMARK
 ADVANCED COMPRESSION OPTIONS
 ----------------------------
 ### -B#:
-Select the size of each compression job.
+Specify the size of each compression job.
 This parameter is only available when multi-threading is enabled.
 Each compression job is run in parallel, so this value indirectly impacts the nb of active threads.
 Default job size varies depending on compression level (generally `4 * windowSize`).
 `-B#` makes it possible to manually select a custom size.
 Note that job size must respect a minimum value which is enforced transparently.
 This minimum is either 512 KB, or `overlapSize`, whichever is largest.
-Different job sizes will lead to (slightly) different compressed frames.
+Different job sizes will lead to non-identical compressed frames.
 
 ### --zstd[=options]:
 `zstd` provides 22 predefined compression levels.
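
Review note on the `--dictID=#` hunk: the size rule stated there (an ID < 256 needs 1 byte in the compressed frame header, an ID < 65536 needs 2 bytes, otherwise 4 bytes) can be sketched in a few lines of Python. This is only an illustration of the rule as worded in the man page text, not code taken from `zstd`. The function name `dict_id_field_size` is hypothetical, and rejecting 0 and values wider than 32 bits is an assumption of this sketch.

```python
def dict_id_field_size(dict_id: int) -> int:
    """Bytes needed for a dictionary ID in the frame header, per the man page wording.

    Hypothetical helper for illustration only. Treating 0 and IDs wider than
    32 bits as invalid is an assumption of this sketch.
    """
    if not 1 <= dict_id <= 0xFFFFFFFF:
        raise ValueError("dictionary ID must fit in 1..2**32-1 for this sketch")
    if dict_id < 256:
        return 1  # "an ID < 256 will only need 1 byte"
    if dict_id < 65536:
        return 2  # "an ID < 65536 will only need 2 bytes"
    return 4      # same cost as the default 4-byte random ID


if __name__ == "__main__":
    for candidate in (42, 1000, 123456789):
        print(candidate, dict_id_field_size(candidate))
```

A dictionary manager that hands out small sequential IDs therefore saves 2 to 3 bytes per frame compared with the random default, which matters most for the small payloads that dictionaries target.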