In commit 232d7a5e2d we almost got the detection logic right.
However, the XGETBV instruction would crash if Linux was started up
with the option noxsave.
have_vpclmulqdq(): Check for the XSAVE flag at the correct position
and also for the AVX flag.
This was tested on Ubuntu 22.04 by starting up its Linux 5.15 kernel
with and without the noxsave option.
It is not sufficient to check that the CPU supports the necessary
instructions. The operating system (or virtual machine hypervisor)
must also enable all the AVX registers to be saved and restored on a
context switch.
Because clang 8 does not support the compiler intrinsic _xgetbv(),
we will require clang 9 or later for enabling the use of VPCLMULQDQ
and the related AVX512 features.
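The two-stage check described above can be sketched as follows. This is a minimal illustration, not the actual have_vpclmulqdq() code: the helper and macro names are hypothetical, and the helpers take the register values as parameters so that the logic can be read (and tested) without executing CPUID or XGETBV.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits */
#define CPUID1_ECX_XSAVE   (1U << 26) /* CPU implements XSAVE and XGETBV */
#define CPUID1_ECX_OSXSAVE (1U << 27) /* OS has enabled XSAVE (CR4.OSXSAVE) */
#define CPUID1_ECX_AVX     (1U << 28) /* CPU implements AVX */

/* XCR0 state components needed for AVX512: SSE (bit 1), AVX (bit 2),
   opmask (bit 5), ZMM_Hi256 (bit 6), Hi16_ZMM (bit 7) */
#define XCR0_AVX512_MASK 0xe6U

/* Hypothetical helper: may we execute XGETBV and consider AVX at all?
   When OSXSAVE is clear (for example, Linux booted with noxsave),
   XGETBV must not be executed. */
static bool cpu_may_use_avx_state(uint32_t cpuid1_ecx)
{
  const uint32_t need =
    CPUID1_ECX_XSAVE | CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;
  return (cpuid1_ecx & need) == need;
}

/* Hypothetical helper: did the OS enable saving and restoring
   all the register state that AVX512 code would use? */
static bool os_enabled_avx512(uint64_t xcr0)
{
  return (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK;
}
```

In real code the first argument would come from CPUID leaf 1 (for example via __get_cpuid() in <cpuid.h> on GCC and clang), and _xgetbv(0) would be invoked only after cpu_may_use_avx_state() has returned true.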
In commit 9ec7819c58 the CRC-32 function signatures had been unified
somewhat, but not enough.
clang -fsanitize=undefined would flag a function pointer signature
mismatch between const char* and const void*, but not between
uint32_t and unsigned. We try to fix both inconsistencies anyway.
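As an illustration of the unified signature, here is a minimal sketch in which the function pointer type and every implementation agree exactly on uint32_t and const void*. The names crc32_fn, crc32c_generic(), crc32c_accelerated() and choose_crc32c() are hypothetical, and the "accelerated" variant is only a stand-in that forwards to the generic loop.

```c
#include <stddef.h>
#include <stdint.h>

/* One signature shared by the pointer type and all implementations */
typedef uint32_t (*crc32_fn)(uint32_t crc, const void *buf, size_t len);

/* Hypothetical portable fallback: bit-at-a-time reflected CRC-32C */
static uint32_t crc32c_generic(uint32_t crc, const void *buf, size_t len)
{
  const unsigned char *p = buf;
  crc = ~crc;
  while (len--) {
    crc ^= *p++;
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (0x82f63b78U & (0U - (crc & 1)));
  }
  return ~crc;
}

/* Hypothetical stand-in for a hardware-accelerated variant.
   It has exactly the same parameter and return types, so calling it
   through crc32_fn involves no signature mismatch. */
static uint32_t crc32c_accelerated(uint32_t crc, const void *buf, size_t len)
{
  return crc32c_generic(crc, buf, len);
}

/* Choose an implementation once; callers then use the pointer. */
static crc32_fn choose_crc32c(int have_acceleration)
{
  return have_acceleration ? crc32c_accelerated : crc32c_generic;
}
```

Had one implementation been declared with const char* instead of const void*, calling it through crc32_fn would be undefined behavior in C, which is what the sanitizer reports.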
Reviewed by: Vladislav Vaintroub
According to https://discussions.apple.com/thread/8256853
an attempt to use AVX512 registers on macOS will result in #UD
(crash at runtime).
Also, starting with clang 18 and GCC 14, we must add "evex512" to the
target flags so that AVX and SSE instructions can use AVX512-specific
encodings. This flag was introduced together with the avx10.1-512 target.
Older compiler versions do not recognize "evex512". We do not want to
write "avx10.1-512" because it could enable some AVX512 subfeatures
that we do not have any CPUID check for.
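A version-dependent target string could be selected roughly as follows. This is only a sketch: the macro name CRC32_AVX512_TARGET is hypothetical, and the list of AVX512 subfeatures shown is an assumption for illustration rather than the exact list used in the code.

```c
/* Sketch: add "evex512" only for compilers that recognize it.
   The feature list below is an illustrative assumption. */
#if __GNUC__ >= 14 || (defined __clang_major__ && __clang_major__ >= 18)
# define CRC32_AVX512_TARGET \
  "pclmul,evex512,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq"
#else
# define CRC32_AVX512_TARGET \
  "pclmul,avx512f,avx512dq,avx512bw,avx512vl,vpclmulqdq"
#endif
```

With GCC or clang, such a string would typically be applied per function, as in __attribute__((target(CRC32_AVX512_TARGET))), so that the rest of the translation unit is compiled without any special flags.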
Reviewed by: Vladislav Vaintroub
Tested on macOS by: Valerii Kravchuk
This is based on https://github.com/intel/intel-ipsec-mb/
and has been tested both on x86 and x86-64, with code that
was generated by several versions of GCC and clang.
GCC 11 or clang 8 or later should be able to compile this,
and so should recent versions of MSVC.
Thanks to Intel Corporation for providing access to hardware,
for answering my questions regarding the code, and for
providing the coefficients for the CRC-32C computation.
crc32_avx512(): Compute a reverse polynomial CRC-32 using
precomputed tables and carry-less product, for up to 256 bytes
of unaligned input per loop iteration.
Reviewed by: Vladislav Vaintroub
In our unit test, let us rely on our own reference implementation
using the reflected CRC-32 ISO 3309 and CRC-32C polynomials.
Let us also test with various lengths.
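A reference implementation of this kind can be very small. The following is a sketch under stated assumptions, not the actual unit test: crc_reference(), crc_fn and count_mismatches() are hypothetical names, and the reference processes one bit at a time, which is slow but easy to verify by inspection.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY_CRC32  0xedb88320U /* reflected CRC-32 (ISO 3309) polynomial */
#define POLY_CRC32C 0x82f63b78U /* reflected CRC-32C (Castagnoli) polynomial */

/* Hypothetical bit-at-a-time reference for a reflected CRC-32 */
static uint32_t crc_reference(uint32_t poly, uint32_t crc,
                              const void *buf, size_t len)
{
  const unsigned char *p = buf;
  crc = ~crc;
  while (len--) {
    crc ^= *p++;
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (poly & (0U - (crc & 1)));
  }
  return ~crc;
}

/* Signature assumed for the implementation under test */
typedef uint32_t (*crc_fn)(uint32_t crc, const void *buf, size_t len);

/* Compare fn against the reference for every length 0..max_len and
   return the number of lengths for which the results differ. */
static unsigned count_mismatches(crc_fn fn, uint32_t poly,
                                 const void *buf, size_t max_len)
{
  unsigned bad = 0;
  for (size_t len = 0; len <= max_len; len++)
    bad += fn(0, buf, len) != crc_reference(poly, 0, buf, len);
  return bad;
}

/* Wrappers that fix the polynomial, usable wherever a crc_fn is expected */
static uint32_t crc32_ref(uint32_t crc, const void *buf, size_t len)
{ return crc_reference(POLY_CRC32, crc, buf, len); }

static uint32_t crc32c_ref(uint32_t crc, const void *buf, size_t len)
{ return crc_reference(POLY_CRC32C, crc, buf, len); }
```

For the standard check input "123456789", the reference yields 0xCBF43926 for CRC-32 and 0xE3069283 for CRC-32C. Covering every length up to a few hundred bytes would exercise both the short-input paths and the large-block loop of an optimized implementation.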
Let us refactor the CRC-32 and CRC-32C implementations
so that no special compilation flags are needed and
some function call indirection is avoided.
pmull_supported: Remove. We will have pointers to two separate
functions crc32c_aarch64_pmull() and crc32c_aarch64().