Add a 4-way Neon implementation of the convertSequences_noRepcodes
function. Remove the 'static' keyword from all of its implementations
so that unit tests can be added.
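For illustration, a minimal sketch of what 4-way Neon processing means
here (a generic example, not the actual convertSequences_noRepcodes
kernel):

```c
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Hedged sketch: handle four 32-bit elements per iteration with single
 * vector instructions instead of four scalar operations. */
void addOffset4(uint32_t* dst, const uint32_t* src, uint32_t offset, size_t n)
{
    size_t i = 0;
    uint32x4_t voff = vdupq_n_u32(offset);
    for (; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(src + i);       /* load 4 lanes at once */
        vst1q_u32(dst + i, vaddq_u32(v, voff));  /* add + store 4 lanes */
    }
    for (; i < n; i++) dst[i] = src[i] + offset; /* scalar tail */
}
```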
Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`
Neoverse-V2     before       after
Clang-18:     100.000%    311.703%
Clang-19:     100.191%    311.714%
Clang-20:     100.181%    311.723%
GCC-13:       107.520%    252.309%
GCC-14:       107.652%    253.158%
GCC-15:       107.674%    253.168%

Cortex-A720     before       after
Clang-18:     100.000%    204.512%
Clang-19:     102.825%    204.600%
Clang-20:     102.807%    204.558%
GCC-13:       110.668%    203.594%
GCC-14:       110.684%    203.978%
GCC-15:       102.864%    204.299%
Co-authored-by: Thomas Daubney <Thomas.Daubney@arm.com>
Add a faster scalar implementation of ZSTD_get1BlockSummary which
removes the data dependency between the accumulators in the hot loop,
exposing the superscalar potential of recent out-of-order CPUs.
The new algorithm uses SWAR (SIMD Within A Register) to exploit 64-bit
architectures: it packs two 32-bit data elements into a single 64-bit
register, so both can be updated with one operation, while each value
stays within its 32-bit lane and no carry can cross into the
neighboring element.
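A minimal sketch of the SWAR idea (the struct and function below are
hypothetical stand-ins, not the actual ZSTD_get1BlockSummary code):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a sequence record. */
typedef struct { uint32_t litLength; uint32_t matchLength; } SeqLike;

/* Hedged SWAR sketch: both 32-bit sums are carried in one 64-bit
 * accumulator and updated with a single addition per element. Because
 * block sizes are bounded well below 2^32, the low-lane sum can never
 * carry into the high lane. */
uint64_t sumLitAndMatch(const SeqLike* seqs, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += ((uint64_t)seqs[i].matchLength << 32) | seqs[i].litLength;
    }
    return acc; /* low 32 bits: total litLength; high 32 bits: total matchLength */
}
```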
Corresponding unit tests are included.
Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`
Neoverse-V2     before       after
GCC-13:       100.000%    290.527%
GCC-14:       100.000%    291.714%
GCC-15:        99.914%    291.495%
Clang-18:     148.072%    264.524%
Clang-19:     148.075%    264.512%
Clang-20:     148.062%    264.490%

Cortex-A720     before       after
GCC-13:       100.000%    235.261%
GCC-14:       101.064%    234.903%
GCC-15:       112.977%    218.547%
Clang-18:     127.135%    180.359%
Clang-19:     127.149%    180.297%
Clang-20:     127.154%    180.260%
Co-authored-by: Thomas Daubney <Thomas.Daubney@arm.com>
ZSTDMT_freeCCtx calls ZSTDMT_releaseAllJobResources. However, ZSTDMT_freeCCtx may also be called when initialization has failed, in which case ZSTDMT_releaseAllJobResources dereferences a NULL pointer.
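A hedged sketch of the defensive shape of the fix (the struct and field
names are illustrative stand-ins, not the actual zstd definitions):

```c
#include <stddef.h>

/* Illustrative stand-in for the real ZSTDMT_CCtx; the field name is an
 * assumption for this sketch. */
typedef struct { void* jobs; } MTCtx_like;

/* Bail out early when the context was only partially initialized, so the
 * release path cannot dereference a NULL jobs table. */
void releaseAllJobResources_sketch(MTCtx_like* mtctx)
{
    if (mtctx == NULL || mtctx->jobs == NULL) return;
    /* ... free per-job buffers here, as the real function does ... */
}
```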
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However, it
can be further accelerated with the SVE2 HISTSEG instructions, which
compute a histogram for a 16-byte chunk in a vector register.
On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to cover the whole symbol space (0..255) for 16 bytes of input.
However, we can accumulate at most 15 such 16-byte strips before the
8-bit counters may overflow, so the 8-bit histogram accumulators have
to be widened and saved to 16-bit after every 240-byte chunk of input.
Keeping the whole histogram in registers would need 32 128-bit
registers; longer SVE2 vectors could help here, if such machines become
available.
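A hedged sketch of one HISTSEG strip under the VL128 assumption, using
the SVE2 ACLE intrinsics (this is not the actual zstd kernel):

```c
#include <arm_sve.h>
#include <stdint.h>

/* One 16-byte strip: for each of the 16 symbols base..base+15, count how
 * many input bytes match it. The real kernel repeats this for all 16
 * symbol sub-ranges and widens the 8-bit counts to 16-bit every 15 strips
 * (240 bytes) to avoid overflow. */
svuint8_t histsegStrip(const uint8_t* src, uint8_t base)
{
    svuint8_t data = svld1_u8(svptrue_b8(), src); /* 16 input bytes (VL128) */
    svuint8_t syms = svindex_u8(base, 1);         /* base, base+1, ..., base+15 */
    return svhistseg_u8(syms, data); /* lane i = occurrences of syms[i] in data */
}
```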
The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough in general. However, an LZ pass precedes the
histogram calculation, so it should be impossible (my assumption) to
overflow the 16-bit accumulators.
The symbol distribution is also not uniform: lower values are more
common. We therefore use a 3-pass algorithm to prevent stack spilling.
In the first pass we only compute histograms for the first 64 symbols
(4-way SIMD) while also computing the maximum symbol value. If symbol
values of 64 or above are present, a second pass computes the next 96
elements of the histogram. The final pass calculates the remaining part
of the histogram (256 symbols in total) if needed. This split of
histogram generation gave the best overall performance.
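A hedged outline of the 3-pass split (scalar, structure only; the helper
below is hypothetical and the real passes are SIMD):

```c
#include <stddef.h>

/* Hypothetical scalar helper: histogram symbols in [lo, hi) into count[]
 * (assumed zero-initialized by the caller) and return the maximum symbol
 * value seen in the input. */
static unsigned histPass(unsigned count[256], unsigned lo, unsigned hi,
                         const unsigned char* src, size_t n)
{
    unsigned maxSymbol = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned s = src[i];
        if (s > maxSymbol) maxSymbol = s;
        if (lo <= s && s < hi) count[s]++;
    }
    return maxSymbol;
}

/* Pass 1 covers symbols 0..63; later passes run only when the data needs them. */
void hist3PassSketch(unsigned count[256], const unsigned char* src, size_t n)
{
    unsigned maxSymbol = histPass(count, 0, 64, src, n);
    if (maxSymbol >= 64)  histPass(count, 64, 160, src, n);  /* next 96 symbols */
    if (maxSymbol >= 160) histPass(count, 160, 256, src, n); /* remaining 96 */
}
```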
This implementation is the best performing of a number of different
cache blocking schemes tested.
Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":
                 Clang-20    GCC-14
1#silesia.tar:    +6.173%   +5.987%
2#silesia.tar:    +5.200%   +5.011%
3#silesia.tar:    +4.332%   +5.031%
4#silesia.tar:    +2.789%   +3.064%
5#silesia.tar:    +2.028%   +1.838%
6#silesia.tar:    +1.562%   +1.340%
7#silesia.tar:    +1.160%   +0.959%
The row based match finder is slower without SIMD. We used to detect
the presence of SIMD to set the lower bound to 17, but that breaks
determinism (the same input could compress differently depending on the
build). Instead, opt into the higher bound specifically for the kernel,
because it is one of the rare targets that doesn't have SIMD support.
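A hedged sketch of the compile-time opt-in (all macro names and the
default value are hypothetical, not zstd's actual identifiers): the
bound is fixed explicitly per build target rather than derived from SIMD
detection, so the choice is deterministic and visible.

```c
/* Hedged sketch; names and the non-kernel value are placeholders. */
#define DEFAULT_LOWER_BOUND 16             /* illustrative default only */
#ifdef BUILD_FOR_LINUX_KERNEL              /* rare target without SIMD support */
#  define ROW_MATCHFINDER_LOWER_BOUND 17   /* opt into the higher bound */
#else
#  define ROW_MATCHFINDER_LOWER_BOUND DEFAULT_LOWER_BOUND
#endif
```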
This is a prototype definition error:
`_mm_storeu_si128()` should accept a `void*` pointer,
since its documentation explicitly states that it accepts unaligned addresses;
yet requiring a `__m128i*` suggests otherwise, and asks the compiler to enforce this alignment.
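The practical consequence at call sites, for illustration:

```c
#include <emmintrin.h>

/* The destination does not need to be aligned, yet the prototype forces a
 * cast to __m128i* at every call site. */
void store16(void* dst, __m128i v)
{
    _mm_storeu_si128((__m128i*)dst, v); /* cast imposed by the prototype */
}
```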
Seems like a prototype interface error:
the input parameter should have been `const void*`,
since the documentation is explicit that the input doesn't have to be aligned,
but `const __m256i*` makes the compiler enforce alignment.
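Assuming the intrinsic in question is `_mm256_loadu_si256` (an
assumption; the note doesn't name it), the same cast pattern appears on
the load side:

```c
#include <immintrin.h>

/* The source does not need to be aligned, yet the prototype forces a cast
 * to const __m256i* at every call site. */
__m256i load32(const void* src)
{
    return _mm256_loadu_si256((const __m256i*)src); /* cast imposed by the prototype */
}
```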
Do not solve the equation, even though some terms cancel each other;
this is done for clarity.
We'll let the compiler do the resolution at compile time.
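A hedged, purely illustrative example of the point (names and the
expression are hypothetical, not the actual code):

```c
#include <stddef.h>

/* The windowSize terms cancel, leaving 2*blockSize, but the unsimplified
 * form documents where each term comes from; the compiler folds it anyway. */
size_t neededSpace(size_t windowSize, size_t blockSize)
{
    return (windowSize + blockSize) - windowSize + blockSize;
}
```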