And add extra checks to enable for binutils 2.45 and if the architecture
explicitly enables it. When SFrame is disabled, all the related code
is also not enabled for backtrace() and _dl_find_object(), so SFrame
backtracking is not used even if the binary has the SFrame segment.
This patch also adds some other related fixes:
* Fixed an issue with AC_CHECK_PROG_VER, where the READELF_SFRAME
usage prevented specifying a different readelf through READELF
environment variable at configure time.
* Add an extra arch-specific internal definition,
libc_cv_support_sframe, to disable --enable-sframe on architectures
that have binutils but not glibc support (s390x).
* Renamed the tests without the .sframe segment and move the
tst-backtrace1 from pthread to debug.
* Use the built compiler strip to remove the .sframe segment,
instead of the system one (which might not support SFrame).
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Sam James <sam@gentoo.org>
There are 3 variants of modf and modff: SSE2, SSE4.1 and AVX. s_modf.c
and s_modff.c include the generic implementation compiled with the minimum
x86 ISA level. The IFUNC selector is used only if the minimum ISA level
is less than AVX. SSE4.1 variant is included only if the ISA level is
less than SSE4.1. AVX variant is included only the ISA level is less than
AVX.
AVX variant should be compiled with -mavx, not -msse2avx -DSSE2AVX which
are used to encode SSE assembly sources with EVEX encoding.
The routines that are shared between libc and libm should use different
rules to avoid using the same MODULE_NAME, to avoid potential issues
like BZ #33165 where __stack_chk_fail not being routed to the internal
symbol.
Tested with -march=x86-64, -march=x86-64-v2, -march=x86-64-v3 and
-march=x86-64-v4.
This fixes BZ #33165 and BZ #33173.
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Since modf and modff are compiled into both libc and libm, when glibc is
configured with --enable-stack-protector=all, ISA versions of modf and
modff should be compiled with -fno-stack-protector to avoid calling
__stack_chk_fail via PLT in libc.so.
This fixes BZ #33165.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
Since mcount_internal is called from mcount/__fentry__ which preserve
only RAX, RCX, RDX, RSI, RDI, R8 and R9, compile mcount.c with
-fno-tree-loop-distribute-patterns -mgeneral-regs-only -mno-apxf
to void vector/r16-r31 registers and memcpy/memset in mcount_internal.
This fixes BZ #33134.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
Compiler generates the following instruction sequence for dynamic TLS
access:
leal tls_var@tlsgd(,%ebx,1), %eax
call ___tls_get_addr@PLT
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL. But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.
1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:
_dl_tlsdesc_dynamic:
/* Like all TLS resolvers, preserve call-clobbered registers.
We need two scratch regs anyway. */
subl $32, %esp
cfi_adjust_cfa_offset (32)
It is wrong to use
movl %ebx, -28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl -28(%esp), %ebx
to preserve EBX on stack. Fix it with:
movl %ebx, 28(%esp)
movl %esp, %ebx
cfi_def_cfa_register(%ebx)
...
mov %ebx, %esp
cfi_def_cfa_register(%esp)
movl 28(%esp), %ebx
4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.
This fixes BZ #32996.
Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
rtld.c has
extern const ElfW(Ehdr) __ehdr_start attribute_hidden;
...
_dl_rtld_map.l_map_start = (ElfW(Addr)) &__ehdr_start;
_dl_rtld_map.l_map_end = (ElfW(Addr)) _end;
As
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120653
shows, compiler may generate run-time relocation on __ehdr_start with
movq .LC0(%rip), %xmm0
...
.section .data.rel.ro.local,"aw"
.align 8
.LC0:
.quad __ehdr_start
This won't work before run-time relocation is finished in rtld.c. Add
optimization barrier to prevent run-time relocations against __ehdr_start
and _end.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
"Recent" GCC versions (since commit fc62716fe8d1, backported to stable
branches) emit a vzeroupper instruction at the end of functions
containing AVX instructions. This causes the tst-audit10 test to fail
on CPUs lacking AVX instructions, despite the AVX512F check. The crash
occurs in the pltenter function of tst-auditmod10b.c.
Fix that by moving the code guarded by the check_avx512 function into
specific functions using the target ("avx512f") attribute. Note that
since commit 5359c3bc91 ("x86-64: Remove compiler -mavx512f check") it
is safe to assume that the compiler has AVX512F support, thus the
__AVX512F__ checks can be dropped.
Tested on non-AVX, AVX2 and AVX512F machines.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
Fixes commit 9b7091415a ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On SPR, it improves atanh bench performance by:
Before After Improvement
reciprocal-throughput 15.1715 14.8628 2%
latency 57.1941 56.1883 2%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On SPR, it improves sinh bench performance by:
Before After Improvement
reciprocal-throughput 14.2017 11.815 17%
latency 36.4917 35.2114 4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On Skylake, it improves tanh bench performance by:
Before After Improvement
max 110.89 95.826 14%
min 20.966 20.157 4%
mean 30.9601 29.8431 4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.
The constant libm-test-ulps shows little value, where it is usually a
mechanical change done by the maintainer, for past releases it is
usually ignored whether the ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.
It was shown by a recent update after the new acosf [1] implementation
that is correctly rounded, where the libm-test-ulps was indeed from a
compiler issue.
This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.
Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation. This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.
Regressions are only reported if the implementation shows larger errors
than 9 ulps (13 for IBM long double) unless it is overridden by
libm-test-ulps and the maximum error is not printed at the end of tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.
The manual error table is also removed, Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases; so link to his work instead.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the rsqrt functions (1/sqrt(x)). The test inputs are
taken from those for sqrt.
Tested for x86_64 and x86, and with build-many-glibcs.py.
commit 494d65129e
Author: Michael Jeanson <mjeanson@efficios.com>
Date: Thu Aug 1 10:35:34 2024 -0400
nptl: Introduce <rseq-access.h> for RSEQ_* accessors
added things like
asm volatile ("movl %%fs:%P1(%q2),%0" \
: "=r" (__value) \
: "i" (offsetof (struct rseq_area, member)), \
"r" (__rseq_offset)); \
But this doesn't work for x32 when __rseq_offset is negative since the
address is computed as
FS + 32-bit to 64-bit zero extension of __rseq_offset
+ offsetof (struct rseq_area, member)
Cast __rseq_offset to long long int
"r" ((long long int) __rseq_offset)); \
to sign-extend 32-bit __rseq_offset to 64-bit. This is a no-op for x86-64
since x86-64 __rseq_offset is 64-bit. This fixes BZ #32543.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
In preparation to move the rseq area to the 'extra TLS' block, we need
accessors based on the thread pointer and the rseq offset. The ONCE
variant of the accessors ensures single-copy atomicity for loads and
stores which is required for all fields once the registration is active.
A separate header is required to allow including <atomic.h> which
results in an include loop when added to <tcb-access.h>.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This reverts commit 30d3fd7f4f.
The padding is required by Chromium's MaybeUpdateGlibcTidCache
in sandbox/linux/services/namespace_sandbox.cc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
These inputs were generated with the programs from
https://gitlab.inria.fr/zimmerma/math_accuracy,
with rounding to nearest:
* for univariate binary32 functions by exhaustive search
* for other functions with the "threshold" parameter up to 10^6
On arc, the definition of TLS_DTV_UNALLOCATED now comes from
<dl-dtv.h>.
For x86-64 x32, a separate version is needed because unsigned long int
is 32 bits on this target.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This was used to manage an IA-64 ABI divergence is no longere needed
after the IA-64 removal.
(It should be possible to encode all the required information in
one machine word, so the pointer indirection is really unnecessary.
Technically, none of this is part of the ABI, so perhaps it's
possible to do this retroactively. See bug 27404.)
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add __attribute_optimization_barrier__ to disable inlining and cloning on a
function. For Clang, expand it to
__attribute__ ((optnone))
Otherwise, expand it to
__attribute__ ((noinline, clone))
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Compiler may default to -fno-semantic-interposition. But some elf test
modules must be compiled with -fsemantic-interposition to function properly.
Add a TEST_CC check for -fsemantic-interposition and use it on elf test
modules. This fixed
FAIL: elf/tst-dlclose-lazy
FAIL: elf/tst-pie1
FAIL: elf/tst-plt-rewrite1
FAIL: elf/unload4
when Clang 19 is used to test glibc.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Unlike GCC, libmvec support in Clang is hard-coded. Clang doesn't use
macros defined in <bits/libm-simd-decl-stubs.h> to support new libmvec
functions added to glibc and can't vectorize all test loops to test
libmvec ABI:
https://github.com/llvm/llvm-project/issues/120868
disable libmvec ABI test for Clang.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Since -mamx-tile is used only for testing, use LIBC_TRY_TEST_CC_COMMAND,
instead of LIBC_TRY_CC_AND_TEST_CC_COMMAND to check it and don't check
__builtin_ia32_ldtilecfg for Clang.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
After
commit 215447f5cb
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Tue Dec 17 06:18:55 2024 +0800
cet: Pass -mshstk to compiler for tst-cet-legacy-10a[-static].c
we can remove '#pragma GCC target' in tst-cet-legacy-10a[-static].c.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
This padding is difficult to use for preserving the internal
GLIBC_PRIVATE ABI. The comment is misleading. Current Address
Sanitizer uses heuristics to determine struct pthread size.
It does not depend on its precise layout. It merely scans for
pointers allocated using malloc.
Due to the removal of the padding, the assert for its start
is no longer required.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode),
although it should worse performance than current one. The current
implementation performance comes mainly from the internal usage of
the optimize expf implementation, and shows a maximum ULPs of 2 for
FE_TONEAREST and 3 for other rounding modes.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6995 49.0737 -20.58%
x86_64v2 40.5841 44.3604 -9.30%
x86_64v3 39.3879 39.7502 -0.92%
i686 112.3380 129.8570 -15.59%
aarch64 (Neoverse) 18.6914 17.0946 8.54%
power10 11.1343 9.3245 16.25%
reciprocal-throughput master patched improvement
x86_64 18.6471 24.1077 -29.28%
x86_64v2 17.7501 20.2946 -14.34%
x86_64v3 17.8262 17.1877 3.58%
i686 64.1454 86.5645 -34.95%
aarch64 (Neoverse) 9.77226 12.2314 -25.16%
power10 4.0200 5.3316 -32.63%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>