mirror of https://github.com/facebook/zstd.git synced 2025-09-07 01:06:59 +03:00
2480 Commits

Author SHA1 Message Date
Yann Collet
40c285e0ba Merge pull request #4419 from AZero13/patch-1
Check for job before releasing resources
2025-08-19 17:02:48 -07:00
Yann Collet
6f1cb87ade Merge pull request #4443 from facebook/opt_simplify_4442
simplify sequence resolution in zstd_opt
2025-07-23 15:01:36 -08:00
Yann Collet
0055ce7a02 simplify sequence resolution in zstd_opt
initially hinted by @pitaj in #4442
2025-07-18 21:21:47 -07:00
Yann Collet
f9e26bb42b Merge pull request #4394 from AZero13/zstd
Remove redundant setting of allJobsCompleted to 1
2025-07-18 18:55:47 -08:00
Yann Collet
a1e11db08a Merge pull request #4435 from zijianli1234/dev
add riscv ci
2025-07-18 18:54:24 -08:00
Arpad Panyik
07cd78d366 AArch64: Add Neon path for convertSequences_noRepcodes
Add a 4-way Neon implementation for the convertSequences_noRepcodes
function. Remove 'static' keywords from all of its implementations to
be able to add unit tests.

Relative performance to Clang-18 using: `./fullbench -b18 -l5 enwik5`

Neoverse-V2   before     after
Clang-18:    100.000%  311.703%
Clang-19:    100.191%  311.714%
Clang-20:    100.181%  311.723%
GCC-13:      107.520%  252.309%
GCC-14:      107.652%  253.158%
GCC-15:      107.674%  253.168%

Cortex-A720   before     after
Clang-18:    100.000%  204.512%
Clang-19:    102.825%  204.600%
Clang-20:    102.807%  204.558%
GCC-13:      110.668%  203.594%
GCC-14:      110.684%  203.978%
GCC-15:      102.864%  204.299%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>
2025-07-10 18:20:57 +00:00
Arpad Panyik
8e4400463a Improve ZSTD_get1BlockSummary
Add a faster scalar implementation of ZSTD_get1BlockSummary which
removes the data dependency of the accumulators in the hot loop to
leverage the superscalar potential of recent out-of-order CPUs.
The new algorithm leverages SWAR (SIMD Within A Register) methodology
to exploit the capabilities of 64-bit architectures. It achieves this
by packing two 32-bit data elements into a single 64-bit register,
enabling parallel operations on these subcomponents while ensuring
that the 32-bit boundaries prevent overflow, thereby optimizing
computational efficiency.

Corresponding unit tests are included.

Relative performance to GCC-13 using: `./fullbench -b19 -l5 enwik5`

Neoverse-V2   before     after
GCC-13:      100.000%  290.527%
GCC-14:      100.000%  291.714%
GCC-15:       99.914%  291.495%
Clang-18:    148.072%  264.524%
Clang-19:    148.075%  264.512%
Clang-20:    148.062%  264.490%

Cortex-A720   before     after
GCC-13:      100.000%  235.261%
GCC-14:      101.064%  234.903%
GCC-15:      112.977%  218.547%
Clang-18:    127.135%  180.359%
Clang-19:    127.149%  180.297%
Clang-20:    127.154%  180.260%

Co-authored by, Thomas Daubney <Thomas.Daubney@arm.com>
2025-07-10 18:20:49 +00:00
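The SWAR idea described in the commit above can be illustrated with a minimal sketch. It is not the zstd implementation: the `Seq` and `Sums` types and the `pack` and `sum_swar` functions are hypothetical names chosen for this example. It assumes that each 32-bit lane total fits in 32 bits, so no carry crosses from the low lane into the high lane. Two independent accumulators are used so that consecutive additions do not depend on each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sequence record; not the zstd type. */
typedef struct { uint32_t litLength; uint32_t matchLength; } Seq;
typedef struct { uint64_t litSum; uint64_t matchSum; } Sums;

/* Pack both 32-bit fields into one 64-bit word: low lane = litLength,
 * high lane = matchLength. */
static uint64_t pack(Seq s)
{
    return (uint64_t)s.litLength | ((uint64_t)s.matchLength << 32);
}

/* Sum both fields with one 64-bit add per sequence.
 * Assumption: each lane total stays below 2^32 in each accumulator,
 * otherwise the low lane would carry into the high lane. */
static Sums sum_swar(const Seq* seqs, size_t n)
{
    uint64_t acc0 = 0, acc1 = 0;   /* independent dependency chains */
    Sums out;
    size_t i = 0;
    assert(seqs != NULL || n == 0);
    for (; i + 1 < n; i += 2) {
        acc0 += pack(seqs[i]);
        acc1 += pack(seqs[i + 1]);
    }
    if (i < n) acc0 += pack(seqs[i]);   /* odd tail element */
    /* Unpack the lanes and merge the two accumulators. */
    out.litSum   = (acc0 & 0xFFFFFFFFu) + (acc1 & 0xFFFFFFFFu);
    out.matchSum = (acc0 >> 32) + (acc1 >> 32);
    return out;
}
```

Calling `sum_swar` on an array of `Seq` values returns both totals at once; the caller is responsible for the lane-overflow assumption stated above.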
ZijianLi
2c3f23b018 fix dereferencing type-punned pointer error 2025-06-29 15:36:25 +08:00
Rose
4efbd56749 Check for job before releasing
ZSTDMT_freeCCtx calls ZSTDMT_releaseAllJobResources. When initialization fails, ZSTDMT_freeCCtx can still be called, and ZSTDMT_releaseAllJobResources then dereferences a NULL pointer.
2025-06-24 14:05:08 -04:00
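The kind of guard this fix describes can be sketched as follows. This is an illustration only, not the zstd code: `Job`, `Pool`, `release_all_jobs` and `free_pool` are hypothetical names, and the sketch assumes the failure mode is a job table that was never allocated because initialization stopped early.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the zstd structures. */
typedef struct { void* buffer; } Job;
typedef struct { Job* jobs; size_t nbJobs; } Pool;

/* Release per-job buffers. Returns the number of jobs visited.
 * The NULL check covers a pool whose job table was never allocated. */
static size_t release_all_jobs(Pool* pool)
{
    size_t i;
    assert(pool != NULL);
    if (pool->jobs == NULL) return 0;   /* nothing to release */
    for (i = 0; i < pool->nbJobs; i++) {
        free(pool->jobs[i].buffer);
        pool->jobs[i].buffer = NULL;
    }
    return pool->nbJobs;
}

/* Free the whole pool; safe on a partially initialized pool. */
static void free_pool(Pool* pool)
{
    if (pool == NULL) return;
    release_all_jobs(pool);
    free(pool->jobs);
    free(pool);
}
```

With the check in place, the free path can be shared between normal teardown and the error path of a failed initialization.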
Rose
50f169411b Remove redundant setting of allJobsCompleted to 1
This will do it automatically.
2025-06-24 14:04:21 -04:00
Arpad Panyik
7e4937bc75 AArch64: Add SVE2 implementation of histogram computation
The existing scalar implementation uses a 4-way pipelined histogram
calculation which is very efficient on out-of-order CPUs. However,
this can be further accelerated using the SVE2 HISTSEG instructions -
which compute a histogram for 16 byte chunks in a vector register.

On a system with 128-bit vectors (VL128) we need 16 HISTSEG executions
to compute the histogram for the whole symbol space (0..255) of 16
bytes input. However, we can only accumulate 15 such 16-byte strips
before possible overflow (15 x 16 = 240 fits in 8 bits, 16 x 16 = 256
does not). So we need to extend and save the 8-bit histogram
accumulators to 16-bit after every 240-byte chunk of input.
To store all in registers we would need 32 128-bit registers. Longer
SVE2 vectors could help here, if such machines become available.

The maximum input block size in Zstd is 128 KiB, so 16-bit accumulators
would not be enough. However, an LZ pass will precede the histogram
calculation, so it is impossible (my assumption) to overflow the 16-bit
accumulators.

The symbol distribution is also not uniform, the lower values are more
common, so we used a 3 pass algorithm to prevent stack spilling. In the
first pass we only compute histograms for 64 symbols (4-way SIMD) while
also computing the maximum symbol value. If we have symbol values
larger than 64 we start the second pass to compute the next 96 elements
of the histogram. The final pass calculates the remaining part of the
histogram (256 symbols in total) if needed. This split of histogram
generation gave the best overall results for performance.

This implementation is the best performing of a number of different
cache blocking schemes tested.

Compression uplifts on a Neoverse V2 system, using Zstd-1.5.8
(e26dde3d) as a baseline, compiled with "-O3 -march=armv8.2-a+sve2":

                 Clang-20    GCC-14
 1#silesia.tar:   +6.173%   +5.987%
 2#silesia.tar:   +5.200%   +5.011%
 3#silesia.tar:   +4.332%   +5.031%
 4#silesia.tar:   +2.789%   +3.064%
 5#silesia.tar:   +2.028%   +1.838%
 6#silesia.tar:   +1.562%   +1.340%
 7#silesia.tar:   +1.160%   +0.959%
2025-06-11 12:14:22 +00:00
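The overflow arithmetic in the commit above (15 strips of 16 bytes, so at most 240 increments per 8-bit counter) can be shown with a scalar sketch. It does not use SVE2 or HISTSEG and is not the zstd implementation: `histogram_narrow` and the constants are hypothetical names. It also widens to 32-bit totals instead of the 16-bit accumulators mentioned in the message, so that it does not rely on the author's stated assumption about input size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIP_BYTES 16                          /* one vector-sized strip */
#define MAX_STRIPS  15                          /* strips per 8-bit round */
#define CHUNK_BYTES (STRIP_BYTES * MAX_STRIPS)  /* 240 bytes */

/* Count byte values using 8-bit counters that are flushed into wide
 * counters after every 240-byte chunk. A counter can reach at most 240
 * within one chunk, so it never wraps past 255. */
static void histogram_narrow(const uint8_t* src, size_t size,
                             uint32_t hist[256])
{
    uint8_t narrow[256];
    size_t pos = 0;
    assert(src != NULL || size == 0);
    memset(hist, 0, 256 * sizeof(hist[0]));
    while (pos < size) {
        size_t const remaining = size - pos;
        size_t const chunk = remaining < CHUNK_BYTES ? remaining : CHUNK_BYTES;
        size_t i;
        unsigned s;
        memset(narrow, 0, sizeof(narrow));
        for (i = 0; i < chunk; i++) narrow[src[pos + i]]++;
        for (s = 0; s < 256; s++) hist[s] += narrow[s];  /* widen and save */
        pos += chunk;
    }
}
```

Raising `MAX_STRIPS` to 16 would allow 256 increments of a single counter within one chunk, which is exactly the wrap-around the 240-byte limit avoids.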
李子建
d95123f2e6 Improve speed of ZSTD_compressSequencesAndLiterals() using RVV 2025-06-02 17:21:02 +08:00
Yann Collet
2fec3989c1 add an assert
to help static analyzers understand there is no overflow risk there.
2025-03-22 18:23:31 -07:00
Nick Terrell
68dfd14a8c [linux] Opt out of row based match finder for the kernel
The row based match finder is slower without SIMD. We used to detect the
presence of SIMD to set the lower bound to 17, but that breaks
determinism. Instead, specifically opt into it for the kernel, because
it is one of the rare cases that doesn't have SIMD support.
2025-03-11 16:18:59 -04:00
Yann Collet
22b2fd2517 Merge pull request #4317 from hirohira9119/fix-function-signature
Fix function signature mismatch for ZSTD_convertBlockSequences
2025-02-27 13:03:03 -08:00
Yann Collet
db2d205ada fixed -Wconversion for lib/decompress/zstd_decompress_block.c 2025-02-26 10:01:05 -08:00
hirohira
2840631dc1 Fix function signature mismatch for ZSTD_convertBlockSequences 2025-02-26 08:23:48 +09:00
Yann Collet
d2c562b803 update hrlog comment 2025-02-10 10:48:56 -08:00
Yann Collet
67fad95f79 derive hashratelog from hashlog when only hashlog is set 2025-02-10 10:46:37 -08:00
Yann Collet
09d7e34ed8 adjust mml 2025-02-10 10:46:37 -08:00
Yann Collet
d5e4698267 fix boundary condition 2025-02-10 10:46:37 -08:00
Yann Collet
72406b71c3 update hrlog rule to favor compression ratio a bit more at low levels 2025-02-10 10:46:37 -08:00
Yann Collet
f26cc54f37 dynamic bucket sizes 2025-02-10 10:46:37 -08:00
Yann Collet
4609a40b89 dynamically adjust hratelog and ldmml based on strategy 2025-02-10 10:46:37 -08:00
Yann Collet
23e5f80390 Revert "pass dictionary loading method as parameter"
This reverts commit 821fc567f9.
2025-02-05 18:47:26 -08:00
Yann Collet
c7cd7dc04b better MT fluidity
--patch-from no longer blocked on first job dictionary loading
2025-02-05 18:42:00 -08:00
Yann Collet
f11bd19c7f ensure cdict is properly reset to NULL 2025-02-05 18:42:00 -08:00
Yann Collet
7406d2b6eb skips the need to create a temporary cdict for --patch-from
thus saving a bit of memory and a little bit of cpu time
2025-02-05 18:42:00 -08:00
Yann Collet
220abe6da8 reduced memory usage
by no longer duplicating in memory
a dictionary that was passed by reference.
2025-02-05 18:42:00 -08:00
Yann Collet
85a44b233a always free .cdictLocal 2025-02-05 18:41:59 -08:00
Yann Collet
e637fc64c5 update type naming convention 2025-02-05 18:41:59 -08:00
Yann Collet
34ba14437a minor boundary change
improves compression ratio at low levels
2025-02-05 18:41:59 -08:00
Yann Collet
ffa66a6971 fix speed of --patch-from at high compression mode 2025-02-05 18:41:59 -08:00
Yann Collet
e117d79e22 fix minor alignment warning 2025-02-05 16:13:58 -08:00
Yann Collet
c39424ea87 fix minor alignment warning
this is a prototype definition error:
`_mm_storeu_si128()` should accept a `void*` pointer,
since it explicitly states that it accepts unaligned addresses
yet requiring a `__m128i*` tells otherwise, and requires the compiler to enforce this alignment.
2025-02-05 16:11:54 -08:00
Yann Collet
32dff04d32 fix one minor alignment warning
seems like a prototype interface error:
input parameter should have been `const void*`,
since the documentation is explicit that input doesn't have to be aligned,
but `const __m256i*` makes the compiler enforce it.
2025-02-05 15:46:44 -08:00
Yann Collet
f0b5f65bca fixed minor static function declaration issue
in AVX2 mode only
2025-01-18 22:49:16 -08:00
Yann Collet
19025f3da0 Merge pull request #4238 from szsam/patch-1
fix out-of-bounds array index access
2025-01-15 17:56:41 -08:00
Yann Collet
87f0a4fbe0 restore full equation
do not solve the equation, even though some members cancel each other,
this is done for clarity,
we'll let the compiler do the resolution at compile time.
2025-01-15 17:11:27 -08:00
Yann Collet
8bff69af86 Alignment instruction ZSTD_ALIGNED() in common/compiler.h 2025-01-15 17:11:27 -08:00
Yann Collet
2f3ee8b530 changed code compilation test to employ ZSTD_ARCH_X86_AVX2 2025-01-15 17:11:27 -08:00
Yann Collet
debe3d20d9 removed unused branch 2025-01-15 17:11:27 -08:00
Yann Collet
e3181cfd32 minor code doc update 2025-01-15 17:11:27 -08:00
Yann Collet
aa2cdf964f added compilation-time checks to ensure AVX2 code is valid
since it depends on a specific definition of ZSTD_Sequence structure.
2025-01-15 17:11:27 -08:00
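A compile-time layout check of the kind this commit mentions might look like the sketch below. It is an illustration, not the checks added to zstd: `Sequence` is a local stand-in with four `unsigned int` fields, and `sequence_to_words` is a hypothetical helper showing why vector code cares about the layout. The sketch assumes a platform where `unsigned int` is 32 bits wide and a compiler that accepts C11 `_Static_assert`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in with four unsigned fields; hypothetical, for illustration. */
typedef struct {
    unsigned int offset;
    unsigned int litLength;
    unsigned int matchLength;
    unsigned int rep;
} Sequence;

/* Fail the build if the layout differs from what vector code would assume:
 * four tightly packed 32-bit fields, 16 bytes in total. */
_Static_assert(sizeof(Sequence) == 16, "Sequence must be 16 bytes");
_Static_assert(offsetof(Sequence, offset) == 0, "offset at byte 0");
_Static_assert(offsetof(Sequence, litLength) == 4, "litLength at byte 4");
_Static_assert(offsetof(Sequence, matchLength) == 8, "matchLength at byte 8");
_Static_assert(offsetof(Sequence, rep) == 12, "rep at byte 12");

/* Copy one Sequence as four consecutive 32-bit words, which is how a
 * 128-bit vector load would see it. Valid only because of the checks above. */
static void sequence_to_words(const Sequence* seq, uint32_t words[4])
{
    assert(seq != NULL);
    memcpy(words, seq, sizeof(*seq));
}
```

If a field were later added or reordered, the build would stop at the assertion instead of letting the vector path silently read the wrong lanes.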
Yann Collet
57a4554192 removed unused variable 2025-01-15 17:11:27 -08:00
Yann Collet
4aaf9cefe9 fix minor conversion warning 2025-01-15 17:11:27 -08:00
Yann Collet
db3d48823a no need for specialized variant
the branch is not in the hot loop
2025-01-15 17:11:27 -08:00
Yann Collet
cd53924eff removed erroneous #includes
that were automatically added by the editor without notification
2025-01-15 17:11:27 -08:00
Yann Collet
ed0a8b8be1 AVX2 version of ZSTD_get1BlockSummary() 2025-01-15 17:11:27 -08:00
Yann Collet
b6a4d5a8ba minor +10% speed improvement for scalar ZSTD_get1BlockSummary() 2025-01-15 17:11:27 -08:00