Mirror of https://github.com/facebook/zstd.git (synced 2025-09-05 14:04:09 +03:00)

The existing scalar implementation uses a 4-way pipelined histogram
calculation, which is very efficient on out-of-order CPUs. However,
it can be accelerated further with the SVE2 HISTSEG instruction,
which computes a histogram for 16-byte chunks in a vector register.

On a system with 128-bit vectors (VL128), 16 HISTSEG executions are
needed to compute the histogram over the whole symbol space (0..255)
for 16 bytes of input. Each execution can add up to 16 to an 8-bit
accumulator, so only 15 such 16-byte strips can be accumulated before
overflow becomes possible (15 * 16 = 240 <= 255). The 8-bit histogram
accumulators therefore have to be widened to 16 bits and saved after
every 240 bytes of input. Keeping everything in registers would need
32 128-bit registers. Longer SVE2 vectors could help here, if such
machines become available.
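The 240-byte flush interval can be illustrated with a scalar emulation. This is only a sketch under stated assumptions, not the SVE2 code from this patch: `hist_strips_u8`, `STRIP` and `MAX_STRIPS` are hypothetical names, and a plain byte loop stands in for the HISTSEG executions. It keeps 8-bit counters, lets them absorb at most 15 strips of 16 bytes, then widens them into 16-bit counters.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP      16   /* bytes covered by one 128-bit segment */
#define MAX_STRIPS 15   /* 15 * 16 = 240 <= 255, so a uint8_t counter cannot wrap */

/* Hypothetical scalar stand-in for the HISTSEG-based inner loop.
 * wide[] must hold 256 entries. The caller is assumed to guarantee that
 * no symbol occurs more than 65535 times (no 16-bit overflow). */
static void hist_strips_u8(uint16_t wide[256], const uint8_t* src, size_t size)
{
    uint8_t narrow[256];
    size_t pos = 0;

    memset(wide, 0, 256 * sizeof(wide[0]));
    while (pos < size) {
        size_t chunk = size - pos;
        size_t i;
        if (chunk > (size_t)STRIP * MAX_STRIPS)
            chunk = (size_t)STRIP * MAX_STRIPS;

        /* At most 240 increments in total, so no 8-bit counter can overflow. */
        memset(narrow, 0, sizeof(narrow));
        for (i = 0; i < chunk; i++)
            narrow[src[pos + i]]++;

        /* Widen the 8-bit counters and add them to the 16-bit totals. */
        for (i = 0; i < 256; i++)
            wide[i] = (uint16_t)(wide[i] + narrow[i]);

        pos += chunk;
    }
}
```

With a 16th strip in the same batch, an input made of a single repeated byte would reach 256 and wrap the 8-bit counter to zero, which is why the batch stops at 15 strips.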

The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
are not sufficient in the worst case. However, an LZ pass precedes the
histogram calculation, so (my assumption) the 16-bit accumulators
cannot overflow in practice.

The symbol distribution is also not uniform: lower values are more
common, so we use a 3-pass algorithm to prevent stack spilling. The
first pass only computes the histogram for the first 64 symbols (4-way
SIMD) while also computing the maximum symbol value. If any symbol
value lies beyond those first 64, a second pass computes the next 96
elements of the histogram. A final pass computes the remaining part of
the histogram (256 symbols in total) if needed. This split of the
histogram generation gave the best overall performance.
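The control flow of the 3-pass split can be sketched in scalar C. This is an illustration only, assuming the ranges are 0..63, 64..159 and 160..255 as the symbol counts above suggest. `hist_three_pass`, `count_range`, `PASS1_END` and `PASS2_END` are hypothetical names, and the real patch uses SVE2 intrinsics rather than byte loops.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed pass boundaries: 64 symbols, then 96 more, then the rest. */
#define PASS1_END  64u
#define PASS2_END 160u   /* 64 + 96 */

/* Hypothetical helper: count only the symbols in [lo, hi). */
static void count_range(unsigned count[256], const uint8_t* src, size_t size,
                        unsigned lo, unsigned hi)
{
    size_t i;
    for (i = 0; i < size; i++) {
        unsigned const s = src[i];
        if (s >= lo && s < hi) count[s]++;
    }
}

/* Hypothetical scalar sketch of the 3-pass split.
 * Returns the largest symbol value seen (0 for empty input). */
static unsigned hist_three_pass(unsigned count[256], const uint8_t* src, size_t size)
{
    unsigned maxSymbol = 0;
    size_t i;

    memset(count, 0, 256 * sizeof(count[0]));

    /* Pass 1: first 64 symbols, tracking the maximum symbol value on the way. */
    for (i = 0; i < size; i++) {
        unsigned const s = src[i];
        if (s > maxSymbol) maxSymbol = s;
        if (s < PASS1_END) count[s]++;
    }

    /* Pass 2: only needed if some symbol lies beyond the first 64. */
    if (maxSymbol >= PASS1_END)
        count_range(count, src, size, PASS1_END, PASS2_END);

    /* Pass 3: only needed if some symbol lies beyond the first 160. */
    if (maxSymbol >= PASS2_END)
        count_range(count, src, size, PASS2_END, 256u);

    return maxSymbol;
}
```

When the input only contains low symbol values, the second and third passes are skipped, so each pass works on a smaller set of live accumulators.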

This implementation performed best among the several cache-blocking
schemes tested.

Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":

                Clang-20   GCC-14
1#silesia.tar:   +6.173%  +5.987%
2#silesia.tar:   +5.200%  +5.011%
3#silesia.tar:   +4.332%  +5.031%
4#silesia.tar:   +2.789%  +3.064%
5#silesia.tar:   +2.028%  +1.838%
6#silesia.tar:   +1.562%  +1.340%
7#silesia.tar:   +1.160%  +0.959%
87 lines
3.7 KiB
C
/* ******************************************************************
 * hist : Histogram functions
 * part of Finite State Entropy project
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 *  You can contact the author at :
 *  - FSE source repository : https://github.com/Cyan4973/FiniteStateEntropy
 *  - Public forum : https://groups.google.com/forum/#!forum/lz4c
 *
 * This source code is licensed under both the BSD-style license (found in the
 * LICENSE file in the root directory of this source tree) and the GPLv2 (found
 * in the COPYING file in the root directory of this source tree).
 * You may select, at your option, one of the above-listed licenses.
****************************************************************** */

/* --- dependencies --- */
#include "../common/zstd_deps.h"   /* size_t */


/* --- simple histogram functions --- */

/*! HIST_count():
 *  Provides the precise count of each byte within a table 'count'.
 * 'count' is a table of unsigned int, of minimum size (*maxSymbolValuePtr+1).
 *  Updates *maxSymbolValuePtr with actual largest symbol value detected.
 * @return : count of the most frequent symbol (which isn't identified).
 *           or an error code, which can be tested using HIST_isError().
 *           note : if return == srcSize, there is only one symbol.
 */
size_t HIST_count(unsigned* count, unsigned* maxSymbolValuePtr,
                  const void* src, size_t srcSize);

unsigned HIST_isError(size_t code);  /**< tells if a return value is an error code */


/* --- advanced histogram functions --- */

#if defined(__ARM_FEATURE_SVE2)
#define HIST_WKSP_SIZE_U32 0
#else
#define HIST_WKSP_SIZE_U32 1024
#endif
#define HIST_WKSP_SIZE (HIST_WKSP_SIZE_U32 * sizeof(unsigned))
/** HIST_count_wksp() :
 *  Same as HIST_count(), but using an externally provided scratch buffer.
 *  Benefit is this function will use very little stack space.
 * `workSpace` is a writable buffer which must be 4-bytes aligned,
 * `workSpaceSize` must be >= HIST_WKSP_SIZE
 */
size_t HIST_count_wksp(unsigned* count, unsigned* maxSymbolValuePtr,
                       const void* src, size_t srcSize,
                       void* workSpace, size_t workSpaceSize);

/** HIST_countFast() :
 *  same as HIST_count(), but blindly trusts that all byte values within src are <= *maxSymbolValuePtr.
 *  This function is unsafe, and will segfault if any value within `src` is `> *maxSymbolValuePtr`
 */
size_t HIST_countFast(unsigned* count, unsigned* maxSymbolValuePtr,
                      const void* src, size_t srcSize);

/** HIST_countFast_wksp() :
 *  Same as HIST_countFast(), but using an externally provided scratch buffer.
 * `workSpace` is a writable buffer which must be 4-bytes aligned,
 * `workSpaceSize` must be >= HIST_WKSP_SIZE
 */
size_t HIST_countFast_wksp(unsigned* count, unsigned* maxSymbolValuePtr,
                           const void* src, size_t srcSize,
                           void* workSpace, size_t workSpaceSize);

/*! HIST_count_simple() :
 *  Same as HIST_countFast(), this function is unsafe,
 *  and will segfault if any value within `src` is `> *maxSymbolValuePtr`.
 *  It is also a bit slower for large inputs.
 *  However, it does not need any additional memory (not even on stack).
 * @return : count of the most frequent symbol.
 *  Note this function doesn't produce any error (i.e. it must succeed).
 */
unsigned HIST_count_simple(unsigned* count, unsigned* maxSymbolValuePtr,
                           const void* src, size_t srcSize);

/*! HIST_add() :
 *  Lowest level: just add nb of occurrences of characters from @src into @count.
 *  @count is not reset. @count array is presumed large enough (i.e. 1 KB).
 *  This function does not need any additional stack memory.
 */
void HIST_add(unsigned* count, const void* src, size_t srcSize);
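As a rough model of the contract documented for HIST_count_simple() above, the following standalone sketch may help. It is not the library's implementation: `hist_count_simple_model` is a hypothetical name, and the body is written only from the header comments (fill `count`, update `*maxSymbolValuePtr` to the largest symbol actually present, return the count of the most frequent symbol). Like the documented function, it assumes every byte in `src` is `<= *maxSymbolValuePtr` on entry.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the documented HIST_count_simple() contract.
 * Precondition (assumed, not checked): all bytes in src are
 * <= *maxSymbolValuePtr, and count has (*maxSymbolValuePtr + 1) entries. */
static unsigned hist_count_simple_model(unsigned* count, unsigned* maxSymbolValuePtr,
                                        const void* src, size_t srcSize)
{
    const unsigned char* const ip = (const unsigned char*)src;
    unsigned maxSymbolValue = *maxSymbolValuePtr;
    unsigned largestCount = 0;
    unsigned s;
    size_t i;

    memset(count, 0, ((size_t)maxSymbolValue + 1) * sizeof(*count));
    if (srcSize == 0) { *maxSymbolValuePtr = 0; return 0; }

    /* Count every byte; no extra memory beyond the caller's table is used. */
    for (i = 0; i < srcSize; i++)
        count[ip[i]]++;

    /* Shrink maxSymbolValue down to the largest symbol that actually occurs. */
    while (maxSymbolValue > 0 && count[maxSymbolValue] == 0)
        maxSymbolValue--;
    *maxSymbolValuePtr = maxSymbolValue;

    /* The return value is the count of the most frequent symbol. */
    for (s = 0; s <= maxSymbolValue; s++)
        if (count[s] > largestCount) largestCount = count[s];

    return largestCount;
}
```

A caller would typically start with `maxSymbolValue = 255` and a 256-entry table. If the returned value equals `srcSize`, the input consists of a single repeated symbol, as noted in the HIST_count() comment.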