The following tests are included:
- Empty input scenario test.
- Workspace size and alignment tests.
- Symbol out-of-range tests.
- Cover multiple input sizes, vary permitted maximum symbol
values, and include diverse symbol distributions.
These tests verifies count table correctness, maxSymbolValuePtr
updates, and error-handling paths. It enables automated regression
of core histogram logic as well.
When an error occurs in BMK_isSuccessful_runOutcome, the code
previously skipped the call to BMK_freeTimedFnState(tfs),
leaking the allocated tfs object.
Fiexed by calling BMK_freeTimedFnState(tfs) before goto _cleanOut.
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions -
which compute a histogram for 16 byte chunks in a vector register.
On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram for the whole symbol space (0..255) of 16
bytes input. However we can only accumulate 15 of such 16 byte strips
before possible overflow. So we need to extend and save the 8-bit
histogram accumulators to 16-bit after every 240 byte chunks of input.
To store all in registers we would need 32 128-bit registers. Longer
SVE2 vectors could help here, if such machines become available.
The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough. However an LZ pass will prepend the histogram
calculation, so it is impossible (my assumption) to overflow the 16-bit
accumulators.
The symbol distribution is also not uniform, the lower values are more
common, so we used a 3 pass algorithm to prevent stack spilling. In the
first pass we only compute histograms for 64 symbols (4-way SIMD) while
also computing the maximum symbol value. If we have symbol values
larger than 64 we start the second pass to compute the next 96 elements
of the histogram. The final pass calculates the remaining part of the
histogram (256 symbols in total) if needed. This split of histogram
generation gave the best overall results for performance.
This implementation is the best performing of a number of different
cache blocking schemes tested.
Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":
Clang-20 GCC-14
1#silesia.tar: +6.173% +5.987%
2#silesia.tar: +5.200% +5.011%
3#silesia.tar: +4.332% +5.031%
4#silesia.tar: +2.789% +3.064%
5#silesia.tar: +2.028% +1.838%
6#silesia.tar: +1.562% +1.340%
7#silesia.tar: +1.160% +0.959%
- Split monolithic 235-line CMakeLists.txt into focused modules
- Main file reduced to 78 lines with clear section organization
- Created 5 specialized modules:
* ZstdVersion.cmake - CMake policies and version management
* ZstdOptions.cmake - Build options and platform configuration
* ZstdDependencies.cmake - External dependency management
* ZstdBuild.cmake - Build targets and validation
* ZstdPackage.cmake - Package configuration generation
Benefits:
- Improved readability and maintainability
- Better separation of concerns
- Easier debugging and modification
- Preserved 100% backward compatibility
- All existing build options and targets unchanged
The refactored build system passes all tests and maintains
identical functionality while being much easier to understand
and maintain.
- Create new .github/workflows/cmake-tests.yml with all cmake-related jobs
- Move cmake-build-and-test-check, cmake-source-directory-with-spaces, and cmake-visual-2022 jobs
- Remove cmake tests from dev-short-tests.yml to improve organization
- Maintain same trigger conditions and test configurations
- Add dedicated concurrency group for cmake tests
This separation allows cmake tests to run independently and makes
the CI configuration more modular and easier to maintain.
The FUZZ_malloc_rand() function was incorrectly always returning NULL for
zero-size allocations. The random offset generated by
FUZZ_dataProducer_int32Range() was not being added to the pointer variable,
causing the function to always return (void *)0.
Replace direct returns in error-handling branches with a unified
cleanup block that frees allocated resources before returning,
improving code quality and robustness.
In main, resources were freed on the success path but not in the error path.
This change ensures all allocated resources are released before returning.