The generic implementation is slightly more optimized than the powerpc
one: it has a faster inf/nan check (done with integer operations
instead of FP unit checks, along with branch prediction hints), and it
removes one branch by issuing trunc instead of a floor/ceil
combination (which also generates less code).
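A minimal C sketch of the two techniques described above, shown on a
hypothetical modff-style helper (illustration only, not the glibc
source; modff is used as an example name since the function is not
named in this excerpt): the inf/nan check inspects the integer
representation and hints the unlikely path, and a single truncf
replaces the floor/ceil pair.

#include <math.h>
#include <stdint.h>
#include <string.h>

float
modff_sketch (float x, float *iptr)
{
  uint32_t ix;
  memcpy (&ix, &x, sizeof ix);
  /* All exponent bits set means inf or NaN; checked with integer ops
     and marked unlikely for the branch predictor.  */
  if (__builtin_expect ((ix & 0x7f800000u) == 0x7f800000u, 0))
    {
      *iptr = x;
      return isnan (x) ? x : copysignf (0.0f, x);
    }
  /* One trunc toward zero instead of a floor/ceil branch pair.  */
  *iptr = truncf (x);
  return copysignf (x - *iptr, x);
}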
On power10 with gcc 14.2.1:
reciprocal-throughput     master    patch     difference
workload-0_1              1.1351    0.9067    20.12%
workload-1_maxint         1.4230    0.9040    36.47%
workload-maxint_maxfloat  1.5038    0.9076    39.65%
workload-integral         1.1280    0.9111    19.23%

latency                   master    patch     difference
workload-0_1              1.1440    2.7117    -137.03%
workload-1_maxint         4.0556    2.7070    33.25%
workload-maxint_maxfloat  3.2122    2.7164    15.43%
workload-integral         3.2381    2.7281    15.75%
Checked on powerpc64le-linux-gnu.
Reviewed-by: Sachin Monga <smonga@linux.ibm.com>
The generic implementation is slightly more optimized than the powerpc
one: it has a faster inf/nan check (done with integer operations
instead of FP unit checks, along with branch prediction hints), and it
removes one branch by issuing trunc instead of a floor/ceil
combination (which also generates less code).
On power10 with gcc 14.2.1:
reciprocal-throughput     master    patch     difference
workload-0_1              1.5210    1.3942    8.34%
workload-1_maxint         2.0926    1.3940    33.38%
workload-maxint_maxfloat  1.7851    1.3940    21.91%
workload-integral         1.5216    1.3941    8.37%

latency                   master    patch     difference
workload-0_1              1.5928    2.6337    -65.35%
workload-1_maxint         3.2929    2.6337    20.02%
workload-maxint_maxfloat  1.9697    2.6341    -33.73%
workload-integral         2.0597    2.6337    -27.87%
Checked on powerpc64le-linux-gnu.
Reviewed-by: Sachin Monga <smonga@linux.ibm.com>
The compiler generates the following instruction sequence for dynamic
TLS access:
leal tls_var@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT
The CALL instruction is transparent to the compiler, which assumes that
all registers, except for EFLAGS, EAX, ECX, and EDX, are unchanged
after the CALL. But ___tls_get_addr is a normal function which doesn't
preserve any vector registers.
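A hypothetical reproducer sketch of this class of bug (illustration
only; the function name and build setup are assumptions): vector state
held in xmm registers stays live across the TLS access, which the
compiler lowers to a ___tls_get_addr call, so a callee that clobbers
vector registers corrupts the accumulator. Built as 32-bit PIC code
against a shared object that defines tls_var:

#include <emmintrin.h>

extern __thread int tls_var;   /* dynamic TLS, defined in a DSO */

int
sum_with_tls (const int *v, int n)
{
  /* n is assumed to be a multiple of 4 in this sketch.  */
  __m128i acc = _mm_setzero_si128 ();
  for (int i = 0; i < n; i += 4)
    {
      acc = _mm_add_epi32 (acc, _mm_loadu_si128 ((const __m128i *) (v + i)));
      tls_var++;               /* may expand to call ___tls_get_addr@PLT */
    }
  int out[4];
  _mm_storeu_si128 ((__m128i *) out, acc);
  return out[0] + out[1] + out[2] + out[3];
}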
1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:
_dl_tlsdesc_dynamic:
/* Like all TLS resolvers, preserve call-clobbered registers.
We need two scratch regs anyway. */
subl $32, %esp
cfi_adjust_cfa_offset (32)
It is wrong to use
movl %ebx, -28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl -28(%esp), %ebx
to preserve EBX on the stack. Fix it with:
movl %ebx, 28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl 28(%esp), %ebx
4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with the
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.
This fixes BZ #32996.
Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This reverts commit 3367d8e180
Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)
Tested on ppc64le without regression.
This reverts commit b9182c793c
Reason for revert: Power10 memchr clobbers v20 vector register
(Bug 33059)
This is not a security issue, unlike CVE-2025-5745 and
CVE-2025-5702.
Tested on ppc64le without regression.
This reverts commit 90bcc8721e
This change is in the chain of the final revert that fixes the CVE,
i.e. 3367d8e180
Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)
Tested on ppc64le with no regressions.
This reverts commit 23f0d81608
Reason for revert: Power10 strncmp clobbers non-volatile vector
registers (Bug 33060)
Tested on ppc64le with no regressions.
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalidf_i/__math_invalidf_li).
Also, __glibc_unlikely is used on the error cases, since it helps
code generation on recent gcc.
The code now compiles to the following with gcc-14 on aarch64:
0000000000000000 <__ilogbf>:
0: 1e260000 fmov w0, s0
4: d3577801 ubfx x1, x0, #23, #8
8: 340000e1 cbz w1, 24 <__ilogbf+0x24>
c: 5101fc20 sub w0, w1, #0x7f
10: 7103fc3f cmp w1, #0xff
14: 54000040 b.eq 1c <__ilogbf+0x1c> // b.none
18: d65f03c0 ret
1c: 12b00000 mov w0, #0x7fffffff // #2147483647
20: 14000000 b 0 <__math_invalidf_i>
24: 53175800 lsl w0, w0, #9
28: 340000a0 cbz w0, 3c <__ilogbf+0x3c>
2c: 5ac01000 clz w0, w0
30: 12800fc1 mov w1, #0xffffff81 // #-127
34: 4b000020 sub w0, w1, w0
38: d65f03c0 ret
3c: 320107e0 mov w0, #0x80000001 // #-2147483647
40: 14000000 b 0 <__math_invalidf_i>
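A hedged C sketch mirroring the structure visible in the disassembly
above (not the glibc source; the _sketch names are hypothetical
stand-ins for the real helpers): the fast path extracts the biased
exponent, while the zero and inf/nan paths branch off to the
out-of-line invalid helper.

#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for __math_invalidf_i: raises FE_INVALID, sets errno to
   EDOM, and returns r.  */
int __math_invalidf_i_sketch (int r);

int
ilogbf_sketch (float x)
{
  uint32_t ix;
  memcpy (&ix, &x, sizeof ix);
  unsigned int e = (ix >> 23) & 0xff;          /* biased exponent */
  if (__builtin_expect (e == 0, 0))            /* zero or subnormal */
    {
      uint32_t m = ix << 9;                    /* drop sign and exponent */
      if (m == 0)                              /* +-0 -> FP_ILOGB0 */
        return __math_invalidf_i_sketch (-INT_MAX);
      return -127 - (int) __builtin_clz (m);   /* subnormal exponent */
    }
  if (__builtin_expect (e == 0xff, 0))         /* inf or NaN -> INT_MAX */
    return __math_invalidf_i_sketch (INT_MAX);
  return (int) e - 127;                        /* unbias the exponent */
}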
Some ABIs require additional adjustments:
* i386 and m68k require the template version, since
both provide __ieee754_ilogb implementations.
* loongarch uses a custom implementation as well.
* powerpc64le also has a custom implementation for POWER9, which
is also used for the float and float128 versions. The generic
e_ilogb.c implementation is moved on powerpc to keep the
current code as-is.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalid_i/__math_invalid_li).
Also, __glibc_unlikely is used on the error cases, since it helps
code generation on recent gcc.
The code now compiles to the following with gcc-14 on aarch64:
0000000000000000 <__ilogb>:
0: 9e660000 fmov x0, d0
4: d374f801 ubfx x1, x0, #52, #11
8: 340000e1 cbz w1, 24 <__ilogb+0x24>
c: 510ffc20 sub w0, w1, #0x3ff
10: 711ffc3f cmp w1, #0x7ff
14: 54000040 b.eq 1c <__ilogb+0x1c> // b.none
18: d65f03c0 ret
1c: 12b00000 mov w0, #0x7fffffff // #2147483647
20: 14000000 b 0 <__math_invalid_i>
24: d374cc00 lsl x0, x0, #12
28: b40000a0 cbz x0, 3c <__ilogb+0x3c>
2c: dac01000 clz x0, x0
30: 12807fc1 mov w1, #0xfffffc01 // #-1023
34: 4b000020 sub w0, w1, w0
38: d65f03c0 ret
3c: 320107e0 mov w0, #0x80000001 // #-2147483647
40: 14000000 b 0 <__math_invalid_i>
Some ABIs require additional adjustments:
* i386 and m68k require the template version, since
both provide __ieee754_ilogb implementations.
* loongarch uses a custom implementation as well.
* powerpc64le also has a custom implementation for POWER9, which
is also used for the float and float128 versions. The generic
e_ilogb.c implementation is moved on powerpc to keep the
current code as-is.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Since the minimum binutils version was raised to >= 2.26, the configure
check for --update-section support is no longer needed.
Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the compoundn functions, which compute (1+X) to the
power Y for integer Y (and X at least -1). The integer exponent has
type long long int in C23; it was intmax_t in TS 18661-4, and as with
other interfaces changed after their initial appearance in the TS, I
don't think we need to support the original version of the interface.
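A small usage sketch (assuming a toolchain and glibc new enough to
expose the C23 functions; the printed digits are approximate):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* compoundn (x, n) computes (1 + x) to the integer power n,
     e.g. 5% compound interest over 10 periods.  */
  double rate = 0.05;
  long long int periods = 10;
  printf ("%.6f\n", compoundn (rate, periods));   /* about 1.628895 */
  return 0;
}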
Note that these functions are "compoundn" with a trailing "n", *not*
"compound" (CORE-MATH has the wrong name, for example).
As with pown, I strongly encourage searching for worst cases for ulps
error for these implementations (necessarily non-exhaustively, given
the size of the input space). I also expect a custom implementation
for a given format could be much faster as well as more accurate (I
haven't tested or benchmarked the CORE-MATH implementation for
binary32); this is one of the more complicated and less efficient
functions to implement in a type-generic way.
As with exp2m1 and exp10m1, this showed up places where the
powerpc64le IFUNC setup is not as self-contained as one might hope (in
this case, without the changes specific to powerpc64le, there were
undefined references to __GI___expf128).
Tested for x86_64 and x86, and with build-many-glibcs.py.
These routines are not extensively used (the gnulib documentation even
recommends using a replacement [1]), and there is already a POWER8
version that uses proper vectorized instructions.
[1] https://www.gnu.org/software/gnulib/manual/gnulib.html#C-strings
Checked with a build for some powerpc variations.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
This is only needed for -mno-secure-plt, and this linkage mode
is not supported on powerpc64 and powerpc64le.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
The current approach tracks the maximum supported math errors by
explicitly setting them per function and architecture. For newer
implementations or new compiler versions, the file is updated with
higher values when the results show them. The idea is to track the
maximum known error and to update the manual with the obtained values.
The constant libm-test-ulps updates provide little value: they are
usually a mechanical change done by the maintainer, for past releases
it is usually not checked whether a ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.
This was shown by a recent update after the new correctly rounded
acosf [1] implementation, where the libm-test-ulps change did indeed
come from a compiler issue.
This patch removes all arch-specific libm-test-ulps files, adds system
generic libm-test-ulps files where applicable, and changes their
semantics. The generic files now track specific implementation
constraints, like whether the implementation is expected to be
correctly rounded, or whether the system-specific implementation has
different error expectations.
Now multiple libm-test-ulps files can be defined, and a system-specific
file overrides the generic one. This covers the case where an
arch-specific implementation might show worse precision than the
generic one, for instance cbrtf on i686.
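For illustration, a system-specific override can keep the familiar
libm-test-ulps entry format, listing only the functions and types whose
expectations differ from the generic file (the entries and values below
are hypothetical, not taken from the patch):

Function: "cbrt":
float: 1

Function: "cbrt_downward":
float: 1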
Regressions are only reported if the implementation shows errors larger
than 9 ulps (13 for IBM long double), unless overridden by a
libm-test-ulps file, and the maximum error is no longer printed at the
end of the tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.
The manual error table is also removed; Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases, so link to their work instead.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
powerpc was the only architecture with arch-specific hooks for
LD_SHOW_AUXV, and with the information moved to ld diagnostics there
is no need to keep the _dl_procinfo hook.
Checked with a build for all affected ABIs.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
The _dl_string_platform function is moved to hwcapinfo.h, since it is
only used by hwcapinfo.c and the test-get_hwcap internal test.
Checked on powerpc64le-linux-gnu.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
The ld.so diagnostics already print AT_HWCAP values, but only in
hexadecimal. To avoid duplicating the strings, consolidate the
hwcap_names from cpu-features.h into a new file, dl-hwcap-info.h
(this also improves the hwcap string descriptions with more
values).
For future AT_HWCAP3/AT_HWCAP4 extensions, it is just a matter of
adding them to dl-hwcap-info.c so that both the ld diagnostics and
the tunable filtering will parse the new values.
Checked on powerpc64le-linux-gnu.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
Code used during early static startup in elf/dl-tls.c uses
__mempcpy.
Fixes commit cbd9fd2369 ("Consolidate
TLS block allocation for static binaries with ld.so").
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux bogsucker 6.1.55-gentoo-dist-hardened #1 SMP Sun Oct 1 18:03:02 UTC 2023 ppc64le POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Linux timberdoodle 6.1.60-gentoo-dist-hardened #1 SMP Fri Dec 1 22:10:49 UTC 2023 ppc64 POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
Unconditionally define it to false for static builds.
This avoids the awkward use of weak_extern for _dl_rtld_map
in checks that cannot possibly be true in static builds.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The CORE-MATH implementation is correctly rounded (for any rounding mode),
although it shows worse performance than the current one. The current
implementation's performance comes mainly from the internal use of the
optimized expf implementation, and it shows a maximum error of 2 ULPs
for FE_TONEAREST and 3 ULPs for other rounding modes.
The code was adapted to glibc style and to use the definitions of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency                master      patched     improvement
x86_64                 40.6995     49.0737     -20.58%
x86_64v2               40.5841     44.3604     -9.30%
x86_64v3               39.3879     39.7502     -0.92%
i686                   112.3380    129.8570    -15.59%
aarch64 (Neoverse)     18.6914     17.0946     8.54%
power10                11.1343     9.3245      16.25%

reciprocal-throughput  master      patched     improvement
x86_64                 18.6471     24.1077     -29.28%
x86_64v2               17.7501     20.2946     -14.34%
x86_64v3               17.8262     17.1877     3.58%
i686                   64.1454     86.5645     -34.95%
aarch64 (Neoverse)     9.77226     12.2314     -25.16%
power10                4.0200      5.3316      -32.63%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the atan2pi functions (atan2(y,x)/pi).
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the atanpi functions (atan(x)/pi).
Tested for x86_64 and x86, and with build-many-glibcs.py.
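A small usage sketch for the half-revolution variants added by these
atan2pi/atanpi patches (assuming a libm that already provides them):

#include <math.h>
#include <stdio.h>

int
main (void)
{
  /* The *pi functions return the angle divided by pi, so a 45-degree
     angle (pi/4 radians) reads as 0.25.  */
  printf ("%g\n", atan2pi (1.0, 1.0));   /* 0.25 */
  printf ("%g\n", atanpi (1.0));         /* 0.25 */
  return 0;
}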