The m68k provided an optimized version through __m81_u(fmod)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The m68k provided an optimized version through __m81_u(fmodf)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. This allows us to move the
implementation to C.
The performance on a Zen3 chip is slightly better:
reciprocal-throughput input master no-SVID improvement
i686 subnormals 22.4741 20.1571 10.31%
i686 normal 74.1631 70.3606 5.13%
i686 close-exponent 22.5625 20.2435 10.28%
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. This allows us to move the
implementation to C. The performance on a Zen3 chip is similar to
the SVID one.
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:
if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
&& X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
AVX_Fast_Unaligned_Load, !))
This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.
Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.
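With the fix, the check should read (the macro's negation argument is
simply left empty, as in the other ifunc selectors):
if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
    && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
                                    AVX_Fast_Unaligned_Load, ))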
Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.
Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirms the selected symbol is no longer SSE2.
Signed-off-by: xiejiamei <xiejiamei@hygon.com>
Signed-off-by: Li jing <lijing@hygon.cn>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
clang (before clang-19) issues:
../sysdeps/aarch64/tst-ifunc-arg-4.c:39:1: error: unused function 'resolver' [-Werror,-Wunused-function]
39 | resolver (uint64_t arg0, const uint64_t arg1[])
| ^~~~~~~~
1 error generated.
clang-19 and onwards do not trigger the warning.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
clang defaults to warning for missing fall-through, and it does not
support all of the comment-like annotations that gcc does. Use the
C23 [[fallthrough]] attribute instead.
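A minimal illustration of the attribute (hypothetical code, not from
the patch):

  int
  classify (int c)
  {
    int r = 0;
    switch (c)
      {
      case 'a':
        r += 1;
        [[fallthrough]];  /* C23 attribute; replaces gcc's fall-through comments */
      case 'b':
        r += 1;
        break;
      }
    return r;
  }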
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
The C2y function uimaxabs has been renamed to umaxabs. Implement this
change in glibc, keeping a compat symbol under the old name, copying
the test to test the new name and changing the old test to test the
compat symbol. Jakub has done the corresponding change to the
built-in function in GCC.
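A usage sketch (assuming the declaration lives in <inttypes.h> next to
imaxabs, with the signature uintmax_t umaxabs (intmax_t)):

  #include <inttypes.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Returning uintmax_t means even INTMAX_MIN is well-defined,
       unlike imaxabs (INTMAX_MIN).  */
    uintmax_t m = umaxabs (INTMAX_MIN);
    printf ("%ju\n", m);  /* one more than INTMAX_MAX */
    return 0;
  }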
Tested for x86_64 and x86.
For lgamma and tgamma, the muldd, mulddd, and polydd functions are
renamed to muldd2, mulddd2, and polydd2 respectively.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The shared internal data definitions are consolidated in
s_erf_data.c and the erfc only one are moved to s_erfc_data.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy, on three
ranges ([-DBL_MAX, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with 10e9
uniformly distributed random inputs for each range (first column is
the accuracy in ULP, with '0' being correctly rounded; second is the
number of samples with that accuracy, followed by the percentage):
* Range [-DBL_MAX, -4.2]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-4.2, 4.2]
* FE_TONEAREST
0: 9764404513 97.64%
1: 235595487 2.36%
* FE_UPWARD
0: 9468013928 94.68%
1: 531986072 5.32%
* FE_DOWNWARD
0: 9493787693 94.94%
1: 506212307 5.06%
* FE_TOWARDZERO
0: 9585271351 95.85%
1: 414728649 4.15%
* Range [4.2, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 38.2754 78.0311 -103.87%
x86_64v2 38.3325 75.7555 -97.63%
x86_64v3 34.6604 28.3182 18.30%
aarch64 23.1499 21.4307 7.43%
power10 12.3051 9.3766 23.80%
Latency master patched improvement
x86_64 84.3062 121.3580 -43.95%
x86_64v2 84.1817 117.4250 -39.49%
x86_64v3 81.0933 70.6458 12.88%
aarch64 35.012 29.5012 15.74%
power10 21.7205 18.4589 15.02%
For x86_64/x86_64-v2, most of the performance hit comes from the fma
call going through the ifunc mechanism.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy on the range
[-1, 1], with 10e9 uniformly distributed random inputs (first column
is the accuracy in ULP, with '0' being correctly rounded; second is
the number of samples with that accuracy, followed by the
percentage):
* Range [-1, 1]
* FE_TONEAREST
0: 8180011860 81.80%
1: 1819865257 18.20%
2: 122883 0.00%
* FE_UPWARD
0: 3903695744 39.04%
1: 4992324465 49.92%
2: 1096319340 10.96%
3: 7660451 0.08%
* FE_DOWNWARD
0: 3904555484 39.05%
1: 4991970864 49.92%
2: 1095447471 10.95%
3: 8026181 0.08%
* FE_TOWARDZERO
0: 7070209165 70.70%
1: 2908447434 29.08%
2: 21343401 0.21%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 26.4969 22.4625 15.23%
x86_64v2 26.0792 22.9822 11.88%
x86_64v3 25.6357 22.2147 13.34%
aarch64 20.2295 19.7001 2.62%
power10 10.0986 9.3846 7.07%
Latency master patched improvement
x86_64 80.2311 59.9745 25.25%
x86_64v2 79.7010 61.4066 22.95%
x86_64v3 78.2679 58.5804 25.15%
aarch64 34.3959 28.1523 18.15%
power10 23.2417 18.2694 21.39%
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
Since SSIZE_MAX is less than UINT_MAX on 32-bit platforms, we must
AND the expression with SSIZE_MAX.
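An illustrative sketch of the pattern (hypothetical function, not the
actual patch):

  #include <limits.h>
  #include <sys/types.h>

  ssize_t
  clamp_to_ssize (unsigned int u)
  {
    /* On 32-bit, SSIZE_MAX (2^31 - 1) < UINT_MAX (2^32 - 1), so the
       value must be masked before it can be used as a ssize_t.  */
    return (ssize_t) (u & SSIZE_MAX);
  }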
Tested on x86_64 and x86.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
- Performance testing revealed significant memcpy performance degradation
when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences are primarily in floating-point operations and
don't justify AVX usage for memory operations.
Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Changes with respect to v1:
- added a comment in e_j1f.c to explain why using float is enough
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The powl implementation for x86_64 ends up multiplying X once more than
necessary and then throwing away that result. This results in an
overflow flag being set in cases where there is no overflow.
Simplify the relevant portion by special-casing the -3 to 3 range and
simply multiplying repeatedly.
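A C sketch of the idea (the real change is to the x86_64-specific powl
implementation; this is only an illustration):

  /* For |n| <= 3, multiply repeatedly: no multiply is performed whose
     result would be thrown away, so no spurious overflow is flagged.  */
  static long double
  pow_small_int (long double x, int n)
  {
    int k = n < 0 ? -n : n;
    long double r = 1.0L;
    while (k-- > 0)
      r *= x;
    return n < 0 ? 1.0L / r : r;
  }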
Resolves: BZ #33411
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
clang-18 and onwards issues:
../sysdeps/unix/sysv/linux/speed.c:71:23: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
71 | [_cbix(__B0)] = 0,
| ^
../sysdeps/unix/sysv/linux/speed.c:70:34: note: previous initialization is here
70 | [0 ... _cbix(CBAUDMASK)] = -1,
[...]
The override is used explicitly to support the same initialization on
multiple platforms (since the baud values differ on alpha and powerpc).
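A minimal sketch of the pattern that triggers the warning (hypothetical
table, not the actual speed.c contents):

  /* The GNU range designator sets a default for every slot; the later
     entry deliberately overrides one of them, which clang flags with
     -Winitializer-overrides.  */
  static const int table[8] =
    {
      [0 ... 7] = -1,
      [2] = 0,
    };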
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Clang issues:
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: error: incompatible
redeclaration of library function 'lround'
[-Werror,-Wincompatible-library-redeclaration]
libm_alias_double (__lround, lround)
^
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: note: 'lround' is a builtin
with type 'long (double)'
Reviewed-by: Sam James <sam@gentoo.org>
clang issues:
../sysdeps/ieee754/dbl-64/e_lgamma_r.c:234:29: error: absolute value
function 'fabsf' given an argument of type 'double' but has parameter
of type 'float' which may cause truncation of value
[-Werror,-Wabsolute-value]
It should not matter because the value is 0.0, but using fabs is
simpler than adding a warning suppression.
Reviewed-by: Sam James <sam@gentoo.org>
The __syscall_cancel_arch function has an epilogue that does not match
the prologue. The stack is not used and the return address still lies in
r15 when reaching the epilogue. Fix the epilogue by simply returning
from the function.
Signed-off-by: Luc Michel <luc.michel@amd.com>
Tested-by: gopi@sankhya.com
Reviewed-by: Neal Frager <neal.frager@amd.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
clang does not accept whitespace to separate the -z option from its
value:
$ make test t=misc/tst-gcs-disabled
[...]
clang: error: no such file or directory: 'gcs=always'
Use the usual comma-separated form instead.
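For example (hypothetical makefile fragment; the actual variable name
may differ):

  LDFLAGS-tst-gcs-disabled += -Wl,-z,gcs=always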
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The constant should be used with c_cc, which for all supported ABIs
is defined as unsigned char. By using it as a literal char constant,
clang triggers an error when it is compared with a signed literal on
ABIs that define 'char' as unsigned.
On aarch64, clang shows:
../sysdeps/posix/fpathconf.c:118:21: error: right side of operator
converted from negative value to unsigned: -1 to 18446744073709551615
[-Werror]
#if _POSIX_VDISABLE == -1
~~~~~~~~~~~~~~~ ^ ~~
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Remove target implementations of roundeven(f)/lrint(f)/lround(f) and
use the math-use-builtins mechanism instead.
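Roughly, the mechanism boils down to a per-function macro (a sketch of
how such macros are wired up; exact file names may differ):

  /* A target opts in, e.g. in its math-use-builtins-roundeven.h:  */
  #define USE_ROUNDEVEN_BUILTIN 1

  /* The generic C implementation then reduces to the builtin:  */
  #if USE_ROUNDEVEN_BUILTIN
  double
  __roundeven (double x)
  {
    return __builtin_roundeven (x);
  }
  #endif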
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add the C23 memalignment function (query the alignment of a pointer)
to glibc.
Given how simple this operation is, it would make sense for compilers
to inline calls to this function, but I'm treating that as a compiler
matter (compilers should add it as a built-in function) rather than
adding an inline version to glibc headers (although such an inline
version would be reasonable as well). I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122117 for this feature
in GCC.
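Given how simple it is, a plausible sketch of the function (an
assumption, not necessarily the committed implementation):

  #include <stddef.h>
  #include <stdint.h>

  /* Return the alignment of the pointer: the largest power of two that
     the address is a multiple of, or 0 for a null pointer.  */
  size_t
  memalignment (const void *p)
  {
    return (uintptr_t) p & -(uintptr_t) p;
  }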
Tested for x86_64 and x86.
Also remove some unused entries from the fallback table.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == -0x1.da285cp-5 in FE_TONEAREST
to provide correctly rounded results.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == +/-0x1.6371e8p-4f in FE_TOWARDZERO
to provide correctly rounded results.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is not required to provide correctly rounded results, and
avoiding it helps on !__FP_FAST_FMA ISAs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is required only for inputs less than 0x1.0fd288p-127. Also
only add the extra check for !__FP_FAST_FMA targets.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is not strictly required to provide correctly rounded results,
and avoiding it helps on !__FP_FAST_FMA ABIs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
This commit adds tests for the following use cases relevant to handling
of the SME state:
- fork() and vfork()
- clone() and clone3()
- signal handler
While most cases are trivial, the case of clone3() is more complicated since
the clone3() symbol is not public in Glibc.
To avoid having to check all possible ways clone3() may be called via other
public functions (e.g. vfork() or pthread_create()), we put together a test
that links directly with clone3.o. The existing functions that contain
calls to clone3() may not actually use it, in which case the outcome of
such tests would be unexpected. Having a direct call to the clone3()
symbol in the test allows us to check precisely what we need to test:
that the __arm_za_disable() function is indeed called and has the
desired effect.
Linking to clone3.o also requires linking to __arm_za_disable.o, which
in turn requires the _dl_hwcap2 hidden symbol, which we have to provide
in the test and initialise before use.
Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This change adds a call to the __arm_za_disable() function immediately
before the SVC instruction inside clone() and clone3() wrappers. It also
adds a macro for inline clone() used in fork() and adds the same call to
the vfork implementation. This sets the ZA state of SME to "off" on return
from these functions (for both the child and the parent).
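Schematically (a C rendition of the ordering; the real wrappers are
assembly, and __do_clone_syscall below is an illustrative stand-in):

  extern void __libc_arm_za_disable (void);
  extern long __do_clone_syscall (unsigned long flags, void *stack);

  long
  clone_sketch (unsigned long flags, void *stack)
  {
    /* Force the ZA state to "off" immediately before the SVC so that
       both parent and child resume with PSTATE.ZA == 0.  */
    __libc_arm_za_disable ();
    return __do_clone_syscall (flags, stack);
  }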
The __arm_za_disable() function is described in [1] (8.1.3). Note that
the internal Glibc name for this function is __libc_arm_za_disable().
When this change was originally proposed [2,3], it generated a long
discussion where several questions and concerns were raised. Here we
will address these concerns and explain why this change is useful and,
in fact, necessary.
In a nutshell, a C library that conforms to the AAPCS64 spec [1] (for
this change, mainly chapters 6.2 and 6.6) should have a call to the
__arm_za_disable() function in clone() and clone3() wrappers. The following
explains in detail why this is the case.
When we consider using the __arm_za_disable() function inside the clone()
and clone3() libc wrappers, we talk about the C library subroutines clone()
and clone3() rather than the syscalls with similar names. In the current
version of Glibc, clone() is public and clone3() is private, but it being
private is not pertinent to this discussion.
We will begin by stating that this change is NOT a bug fix for something
in the kernel. The requirement to call __arm_za_disable() does NOT come from
the kernel. It also is NOT needed to satisfy a contract between the kernel
and userspace. This is why it is not for the kernel documentation to describe
this requirement. The requirement is instead needed to satisfy a purely
userspace scheme outlined in [1] and to make sure that software that
uses Glibc (or any other C library that handles SME states correctly,
see below) conforms to [1] without having to unnecessarily become
SME-aware and thus lose portability.
To recap (see [1] (6.2)), the SME extension defines SME state, which is
part of the processor state. Part of this SME state is the ZA state,
which is necessary to manage the ZA storage register in the context of
the ZA lazy saving scheme [1] (6.6). This scheme exists because it would
be challenging to handle the ZA storage of SME in either a callee-saved
or a caller-saved manner.
There are 3 kinds of ZA state that are defined in terms of the PSTATE.ZA
bit and the TPIDR2_EL0 register (see [1] (6.6.3)):
- "off": PSTATE.ZA == 0
- "active": PSTATE.ZA == 1 TPIDR2_EL0 == null
- "dormant": PSTATE.ZA == 1 TPIDR2_EL0 != null
As [1] (6.7.2) outlines, every subroutine has exactly one SME-interface
depending on the permitted ZA-states on entry and on normal return from
a call to this subroutine. Callers of a subroutine must know and respect
the ZA-interface of the subroutines they are using. Using a subroutine
in a way that is not permitted by its ZA-interface is undefined behaviour.
In particular, clone() and clone3() (the C library functions) have the
ZA-private interface. This means that the permitted ZA-states on entry
are "off" and "dormant" and that the permitted states on return are "off"
or "dormant" (but if and only if it was "dormant" on entry).
This means that both functions in question should correctly handle both
"off" and "dormant" ZA-states on entry. The conforming states on return
are "off" and "dormant" (if inbound state was already "dormant").
This change ensures that the ZA-state on return is always "off". Note
that, in the context of clone() and clone3(), "on return" means the
point where execution resumes at a certain address after transferring
from clone()
or clone3(). For the caller (we may refer to it as "parent") this is the
return address in the link register to which the RET instruction jumps. For
the "child", this is the target branch address.
So, the "off" state on return is permitted and conformant. Why can't we
retain the "dormant" state? In theory, we can, but we shouldn't, here is
why.
Every subroutine with a private-ZA interface, including clone() and clone3(),
must comply with the lazy saving scheme [1] (6.7.2). This puts additional
responsibility on a subroutine if ZA-state on return is "dormant" because
this state has special meaning. The "caller" (that is the place in code
where execution is transferred to, so this includes both "parent" and "child")
may check the ZA-state and use it as per the spec of the "dormant" state that
is outlined in [1] (6.6.6 and 6.6.7).
Conforming to this would require more code inside of clone() and clone3()
which is hardly desirable.
For the return to "parent" this could be achieved in theory, but given that
neither clone() nor clone3() are supposed to be used in the middle of an
SME operation, it wouldn't be useful. For the "return" to "child" this
would be particularly difficult to achieve given the complexity of these
functions and their interfaces. Most importantly, it would be illegal
and somewhat meaningless to allow a "child" to start execution in the
"dormant" ZA-state because the very essence of the "dormant" state implies
that there is a place to return to and that there is some outer context that
we are allowed to interact with.
To sum up, calling __arm_za_disable() to ensure the "off" ZA-state when the
execution resumes after a call to clone() or clone3() is correct and also
the most simple way to conform to [1].
Can there be situations when we can avoid calling __arm_za_disable()?
Calling __arm_za_disable() implies a certain (sufficiently small)
overhead, so one might rightly consider avoiding a call to this
function when we can afford not to. The most trivial cases like this (e.g. when the
calling thread doesn't have access to SME or to the TPIDR2_EL0 register)
are already handled by this function (see [1] (8.1.3 and 8.1.2)). Reasoning
about other possible use cases would require making the code inside
clone() and clone3() more complicated, which would defeat the point of
an optimisation that avoids calling __arm_za_disable().
Why can't the kernel do this instead?
The handling of SME state by the kernel is described in [4]. In short,
the kernel must not impose a specific ZA-interface onto a userspace
function. Interaction with the kernel happens (among other things) via
system calls.
In Glibc many of the system calls (notably, including SYS_clone and
SYS_clone3) are used via wrappers, and the kernel has no control of them
and, moreover, it cannot dictate how these wrappers should behave because
it is simply outside of the kernel's remit.
However, in certain cases, the kernel may ensure that a "child" doesn't
start in an incorrect state. This is what is done by the recent change
included in the 6.16 kernel [5]. This is not enough to ensure that code
that uses the clone() and clone3() functions conforms to [1] when it runs on a
system that provides SME, hence this change.
[1]: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
[2]: https://inbox.sourceware.org/libc-alpha/20250522114828.2291047-1-yury.khrustalev@arm.com
[3]: https://inbox.sourceware.org/libc-alpha/20250609121407.3316070-1-yury.khrustalev@arm.com
[4]: https://www.kernel.org/doc/html/v6.16/arch/arm64/sme.html
[5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cde5c32db55740659fca6d56c09b88800d88fd29
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>