mirror of https://sourceware.org/git/glibc.git synced 2026-01-06 11:51:29 +03:00
Commit Graph

1673 Commits

Author SHA1 Message Date
Joseph Myers
26e4810210 Rename fromfp files in preparation for changing types for C23
As discussed in bug 28327, the fromfp functions changed type in C23
(compared to the version in TS 18661-1); they now return the same type
as the floating-point argument, instead of intmax_t / uintmax_t.
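
Concretely, for the double variants (a sketch based on the text
above; the commit itself only renames files and does not yet add the
C23 declarations):

  /* TS 18661-1 version, to become a compat symbol.  */
  intmax_t fromfp (double x, int round, unsigned int width);

  /* C23 version: the result type matches the argument type.  */
  double fromfp (double x, int round, unsigned int width);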

As with other such incompatible changes compared to the initial TS
18661 versions of interfaces (the types of totalorder functions, in
particular), it seems appropriate to support only the new version as
an API, not the old one (although many programs written for the old
API might in fact work with the new one as well).  Thus, the existing
implementations should become compat symbols.  They are sufficiently
different from how I'd expect to implement the new version that using
separate implementations in separate files is more convenient than
trying to share code, and directly sharing testcases would be
problematic as well.

Rename the existing fromfp implementation and test files to names
reflecting how they're intended to become compat symbols, so freeing
up the existing filenames for a subsequent implementation of the C23
versions of these functions (which is the point at which the existing
implementations would actually become compat symbols).

gen-fromfp-tests.py and gen-fromfp-tests-inputs are not renamed; I
think it will make sense to adapt the test generator to be able to
generate most tests for both versions of the functions (with extra
test inputs added that are only of interest with the C23 version).
The ldbl-opt/nldbl-* files are also not renamed; since those are for a
static-only library, no compat versions are needed, and they'll just
have their contents changed when the C23 versions are implemented.

Tested for x86_64, and with build-many-glibcs.py.
2025-11-04 23:41:35 +00:00
Adhemerval Zanella
0dfc849eff math: Remove the SVID error handling wrapper from sqrt
i386 and m68k architectures should use math-use-builtins-sqrt.h rather
than relying on architecture-specific or inline assembly implementations.

The PowerPC optimization for PPC 601/603 (30 years old) is removed.

Tested on x86_64-linux-gnu and i686-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Wilco Dijkstra
324c088a18 nptl: Remove ATOMIC_EXCHANGE_USES_CAS usage
The only usage was for pthread_spin_lock, introduced by 12d2dd7060,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.
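
As a minimal sketch of what the builtin-based code boils down to
(illustrative, not the exact glibc source):

  /* The compiler emits the best sequence for the exchange on each
     target (e.g. SWP, LL/SC, or XCHG).  */
  int
  pthread_spin_lock_sketch (volatile int *lock)
  {
    while (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) != 0)
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;  /* Spin with plain loads until the lock looks free.  */
    return 0;
  }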

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Wilco Dijkstra
53807741fb Define __HAVE_64B_ATOMICS from compiler support
Now that atomic builtins are used by default, we can rely on the
compiler to define when to use 64-bit atomic operations.

It allows the use of 64-bit atomic operations on some 32-bit ABIs where
they were not previously enabled due to missing pre-processor handling:
hppa, mips64n32, s390, and sparcv9.
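
A sketch of deriving the macro from the compiler (the exact predefine
used by the patch may differ):

  /* GCC predefines __GCC_ATOMIC_LLONG_LOCK_FREE to 2 when 8-byte
     atomic operations are always lock-free on the target.  */
  #if __GCC_ATOMIC_LLONG_LOCK_FREE == 2
  # define __HAVE_64B_ATOMICS 1
  #else
  # define __HAVE_64B_ATOMICS 0
  #endif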

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella
95a0ad1ea1 atomic: Consolidate atomic_write_barrier implementation
All ABIs, except alpha and sparc, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_RELEASE).
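
That is, the consolidated generic definition becomes (sketch):

  #define atomic_write_barrier() \
    __atomic_thread_fence (__ATOMIC_RELEASE)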

For alpha, it uses a 'wmb' which does not map to any of C11
barriers.

For sparc it uses a stronger 'membar #LoadStore | #StoreStore',
where the release barrier maps to just 'membar #StoreLoad'.  The
patch keeps the sparc definition.

For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).

Tested on aarch64-linux-gnu.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella
304b22d7f9 atomic: Consolidate atomic_read_barrier implementation
All ABIs, except alpha, powerpc, and x86_64, define it to
atomic_full_barrier/__sync_synchronize, which can be mapped to
__atomic_thread_fence (__ATOMIC_ACQUIRE) in most cases, with the
exception of aarch64 (where the acquire fence is generated as
'dmb ishld' instead of 'dmb ish').
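
So the consolidated generic definition becomes (sketch):

  #define atomic_read_barrier() \
    __atomic_thread_fence (__ATOMIC_ACQUIRE)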

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline
synchronization).

For PowerPC, it allows the use of lwsync for additional chips
(since _ARCH_PWR4 does not cover all chips that support it).

Tested on aarch64-linux-gnu, where the acquire fence produces a
different instruction than the current code.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella
70ee250fb8 atomic: Consolidate atomic_full_barrier implementation
All ABIs save for sparcv9 and s390 define it to __sync_synchronize,
which can be mapped to __atomic_thread_fence (__ATOMIC_SEQ_CST).
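
The consolidated generic definition is then (sketch):

  #define atomic_full_barrier() \
    __atomic_thread_fence (__ATOMIC_SEQ_CST)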

For Sparc, it uses a stricter #StoreStore|#LoadStore|#StoreLoad|#LoadLoad
instead of the #StoreLoad generated by __sync_synchronize.

For s390x, it defaults to a memory barrier where __sync_synchronize
emits a 'bcr 15,0' (which the manual describes as pipeline synchronization).

The barrier is used only in one place (pthread_mutex_setprioceiling),
and using a stricter barrier for s390 is ok performance-wise.

Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Adhemerval Zanella
d76e20791b powerpc: Consolidate atomic-machine.h
__HAVE_64B_ATOMICS can be defined based on __WORDSIZE, and
__ARCH_ACQ_INSTR, MUTEX_HINT_*, and the barrier definitions are
determined by the target CPU.
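
For instance, a sketch of the wordsize-based part:

  #include <bits/wordsize.h>
  #define __HAVE_64B_ATOMICS (__WORDSIZE == 64)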

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-11-04 04:14:01 -03:00
Sachin Monga
b59799f14f ppc64le: Power 10 rawmemchr clobbers v20 (bug #33091)
Replace the non-volatile register v20 with the volatile register
v17, since v20 is not restored.

Reviewed-by: Peter Bergner <bergner@tenstorrent.com>
2025-10-26 12:19:53 -05:00
Sachin Monga
2ea943f7d4 ppc64le: Restore optimized strncmp for power10
This patch addresses the actual cause of CVE-2025-5745.

The non-volatile vector registers are no longer used for the
32-byte load and comparison operations.

Additionally, the assembler workaround used earlier for the lxvp
instruction is replaced with the actual instruction.

Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Co-authored-by: Paul Murphy <paumurph@redhat.com>
2025-10-07 03:25:42 -05:00
Sachin Monga
9a40b1cda5 ppc64le: Restore optimized strcmp for power10
This patch addresses the actual cause of CVE-2025-5702.

The non-volatile vector registers are no longer used for the
32-byte load and comparison operations.

Additionally, the assembler workaround used earlier for the lxvp
instruction is replaced with the actual instruction.

Signed-off-by: Sachin Monga <smonga@linux.ibm.com>
Co-authored-by: Paul Murphy <paumurph@redhat.com>
2025-10-07 03:20:44 -05:00
Adhemerval Zanella
63ba1a1509 math: Add fetestexcept internal alias
To avoid linknamespace issues on old standards.  It is required
if/when the fallback fma implementation is also used internally
by other implementations.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
Adhemerval Zanella
2eb8836de7 math: Add feclearexcept internal alias
To avoid linknamespace issues on old standards.  It is required
if/when the fallback fma implementation is also used internally
by other implementations.
Reviewed-by: DJ Delorie <dj@redhat.com>
2025-09-11 14:46:07 -03:00
Wilco Dijkstra
b568af853b atomic: Switch power to builtin atomics
Switch power to builtin atomics.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 13:57:59 +00:00
Cupertino Miranda
3b2b88ccee elf: early conversion of elf p_flags to mprotect flags
This patch replaces the _dl_stack_flags global variable with
_dl_stack_prot_flags.
The advantage is that the conversion from p_flags to the final
mprotect flags happens once, when p_flags is loaded.  It avoids
repeated spurious conversions of _dl_stack_flags, for example in
allocate_thread_stack.
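
A sketch of the conversion that now happens once at load time (the
standard ELF-to-mmap flag mapping; not the exact glibc code):

  #include <elf.h>
  #include <sys/mman.h>

  static int
  p_flags_to_prot (unsigned int p_flags)
  {
    int prot = 0;
    if (p_flags & PF_R) prot |= PROT_READ;
    if (p_flags & PF_W) prot |= PROT_WRITE;
    if (p_flags & PF_X) prot |= PROT_EXEC;
    return prot;
  }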

This modification was suggested in:
  https://sourceware.org/pipermail/libc-alpha/2025-March/165537.html

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-08-27 10:45:45 -03:00
Adhemerval Zanella
79bfbc93de powerpc: Remove modf optimization
The generic implementation is slightly more optimized than the
powerpc one: it has a faster inf/nan check (avoiding FP-unit checks
and using branch prediction hints) and removes one branch by issuing
trunc instead of a floor/ceil combination (which also generates less
code).

On power10 with gcc 14.2.1:

reciprocal-throughput        master         patch        difference
workload-0_1                 1.1351        0.9067            20.12%
workload-1_maxint            1.4230        0.9040            36.47%
workload-maxint_maxfloat     1.5038        0.9076            39.65%
workload-integral            1.1280        0.9111            19.23%

latency                      master         patch        difference
workload-0_1                 1.1440        2.7117          -137.03%
workload-1_maxint            4.0556        2.7070            33.25%
workload-maxint_maxfloat     3.2122        2.7164            15.43%
workload-integral            3.2381        2.7281            15.75%

Checked on powerpc64le-linux-gnu.
Reviewed-by: Sachin Monga <smonga@linux.ibm.com>
2025-06-25 15:05:30 -03:00
Adhemerval Zanella
5c2b21c478 powerpc: Remove modff optimization
The generic implementation is slightly more optimized than the
powerpc one: it has a faster inf/nan check (avoiding FP-unit checks
and using branch prediction hints) and removes one branch by issuing
trunc instead of a floor/ceil combination (which also generates less
code).

On power10 with gcc 14.2.1:

reciprocal-throughput        master        patch        difference
workload-0_1                 1.5210       1.3942             8.34%
workload-1_maxint            2.0926       1.3940            33.38%
workload-maxint_maxfloat     1.7851       1.3940            21.91%
workload-integral            1.5216       1.3941             8.37%

latency                      master        patch        difference
workload-0_1                 1.5928       2.6337           -65.35%
workload-1_maxint            3.2929       2.6337            20.02%
workload-maxint_maxfloat     1.9697       2.6341           -33.73%
workload-integral            2.0597       2.6337           -27.87%

Checked on powerpc64le-linux-gnu.
Reviewed-by: Sachin Monga <smonga@linux.ibm.com>
2025-06-25 15:05:30 -03:00
Andreas Schwab
9b3730a54b powerpc: use .machine power10 in POWER10 assembler sources
They were misattributed as POWER9 sources.
2025-06-23 14:40:34 +02:00
H.J. Lu
848f0e46f0 i386: Update ___tls_get_addr to preserve vector registers
Compiler generates the following instruction sequence for dynamic TLS
access:

	leal	tls_var@tlsgd(,%ebx,1), %eax
	call	___tls_get_addr@PLT

The CALL instruction is transparent to the compiler, which assumes
all registers, except for EFLAGS, AX, CX, and DX, are unchanged
after the CALL.  But ___tls_get_addr is a normal function which
doesn't preserve any vector registers.
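
Any access to a dynamic TLS variable from position-independent code
can produce this sequence, e.g. (illustrative only):

  /* Compiled with -fPIC, the load may go through the
     tls_var@tlsgd / call ___tls_get_addr@PLT sequence shown above,
     silently clobbering vector registers the caller expects to be
     preserved.  */
  extern __thread int tls_var;

  int
  get_tls_var (void)
  {
    return tls_var;
  }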

1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:

_dl_tlsdesc_dynamic:
	/* Like all TLS resolvers, preserve call-clobbered registers.
	   We need two scratch regs anyway.  */
	subl	$32, %esp
	cfi_adjust_cfa_offset (32)

It is wrong to use

	movl	%ebx, -28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	-28(%esp), %ebx

to preserve EBX on stack.  Fix it with:

	movl	%ebx, 28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	28(%esp), %ebx

4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
the traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.

This fixes BZ #32996.

Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-19 04:30:31 +08:00
Andreas Schwab
eae5bb0f60 powerpc: Remove assembler workarounds
Now that we require at least binutils 2.39, support for POWER9 and
POWER10 instructions can be assumed.
2025-06-18 09:29:10 +02:00
Carlos O'Donell
15808c77b3 ppc64le: Revert "powerpc: Optimized strcmp for power10" (CVE-2025-5702)
This reverts commit 3367d8e180

Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)

Tested on ppc64le without regression.
2025-06-16 18:02:58 -04:00
Carlos O'Donell
a7877bb668 ppc64le: Revert "powerpc : Add optimized memchr for POWER10" (Bug 33059)
This reverts commit b9182c793c

Reason for revert: Power10 memchr clobbers v20 vector register
(Bug 33059)

This is not a security issue, unlike CVE-2025-5745 and
CVE-2025-5702.

Tested on ppc64le without regression.
2025-06-16 18:02:58 -04:00
Carlos O'Donell
c22de63588 ppc64le: Revert "powerpc: Fix performance issues of strcmp power10" (CVE-2025-5702)
This reverts commit 90bcc8721e

This change is in the chain of the final revert that fixes the CVE,
i.e. 3367d8e180

Reason for revert: Power10 strcmp clobbers non-volatile vector
registers (Bug 33056)

Tested on ppc64le with no regressions.
2025-06-16 18:02:58 -04:00
Carlos O'Donell
63c60101ce ppc64le: Revert "powerpc: Optimized strncmp for power10" (CVE-2025-5745)
This reverts commit 23f0d81608

Reason for revert: Power10 strncmp clobbers non-volatile vector
registers (Bug 33060)

Tested on ppc64le with no regressions.
2025-06-16 18:02:58 -04:00
Adhemerval Zanella
39775f00b1 math: Optimize float ilogb/llogb
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalidf_i/__math_invalidf_li).
Also, __glibc_unlikely is used on the error cases since it helps
code generation on recent gcc.

With gcc-14 on aarch64, the code now builds to:

0000000000000000 <__ilogbf>:
   0:   1e260000        fmov    w0, s0
   4:   d3577801        ubfx    x1, x0, #23, #8
   8:   340000e1        cbz     w1, 24 <__ilogbf+0x24>
   c:   5101fc20        sub     w0, w1, #0x7f
  10:   7103fc3f        cmp     w1, #0xff
  14:   54000040        b.eq    1c <__ilogbf+0x1c>  // b.none
  18:   d65f03c0        ret
  1c:   12b00000        mov     w0, #0x7fffffff                 // #2147483647
  20:   14000000        b       0 <__math_invalidf_i>
  24:   53175800        lsl     w0, w0, #9
  28:   340000a0        cbz     w0, 3c <__ilogbf+0x3c>
  2c:   5ac01000        clz     w0, w0
  30:   12800fc1        mov     w1, #0xffffff81                 // #-127
  34:   4b000020        sub     w0, w1, w0
  38:   d65f03c0        ret
  3c:   320107e0        mov     w0, #0x80000001                 // #-2147483647
  40:   14000000        b       0 <__math_invalidf_i>
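
A hedged C sketch of the same branch structure (not the glibc source;
__math_invalidf_i is assumed to raise the invalid exception, set
errno to EDOM, and return its argument):

  #include <limits.h>
  #include <stdint.h>
  #include <string.h>

  extern int __math_invalidf_i (int);

  int
  ilogbf_sketch (float x)
  {
    uint32_t u;
    memcpy (&u, &x, sizeof u);             /* fmov */
    int e = (u >> 23) & 0xff;              /* ubfx: biased exponent */
    if (__builtin_expect (e == 0, 0))
      {
        u <<= 9;                           /* drop sign/exponent bits */
        if (u == 0)                        /* +/-0: FP_ILOGB0 */
          return __math_invalidf_i (-INT_MAX);
        return -127 - __builtin_clz (u);   /* subnormal */
      }
    if (__builtin_expect (e == 0xff, 0))   /* inf/nan: FP_ILOGBNAN */
      return __math_invalidf_i (INT_MAX);
    return e - 0x7f;                       /* normal: unbias */
  }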

Some ABIs require additional adjustments:

  * i386 and m68k require the template version, since both provide
    __ieee754_ilogb implementations.

  * loongarch uses a custom implementation as well.

  * powerpc64le also has a custom implementation for POWER9, which
    is also used for the float and float128 versions.  The generic
    e_ilogb.c implementation is moved on powerpc to keep the
    current code as-is.

Checked on aarch64-linux-gnu and x86_64-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-06-02 13:32:19 -03:00
Adhemerval Zanella
c4be334400 math: Optimize double ilogb/llogb
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalid_i/__math_invalid_li).
Also, __glibc_unlikely is used on the error cases since it helps
code generation on recent gcc.

With gcc-14 on aarch64, the code now builds to:

0000000000000000 <__ilogb>:
   0:   9e660000        fmov    x0, d0
   4:   d374f801        ubfx    x1, x0, #52, #11
   8:   340000e1        cbz     w1, 24 <__ilogb+0x24>
   c:   510ffc20        sub     w0, w1, #0x3ff
  10:   711ffc3f        cmp     w1, #0x7ff
  14:   54000040        b.eq    1c <__ilogb+0x1c>  // b.none
  18:   d65f03c0        ret
  1c:   12b00000        mov     w0, #0x7fffffff                 // #2147483647
  20:   14000000        b       0 <__math_invalid_i>
  24:   d374cc00        lsl     x0, x0, #12
  28:   b40000a0        cbz     x0, 3c <__ilogb+0x3c>
  2c:   dac01000        clz     x0, x0
  30:   12807fc1        mov     w1, #0xfffffc01                 // #-1023
  34:   4b000020        sub     w0, w1, w0
  38:   d65f03c0        ret
  3c:   320107e0        mov     w0, #0x80000001                 // #-2147483647
  40:   14000000        b       0 <__math_invalid_i>

Some ABIs require additional adjustments:

  * i386 and m68k require the template version, since both provide
    __ieee754_ilogb implementations.

  * loongarch uses a custom implementation as well.

  * powerpc64le also has a custom implementation for POWER9, which
    is also used for the float and float128 versions.  The generic
    e_ilogb.c implementation is moved on powerpc to keep the
    current code as-is.

Checked on aarch64-linux-gnu and x86_64-linux-gnu.

Reviewed-by: Wilco Dijkstra  <Wilco.Dijkstra@arm.com>
2025-06-02 13:32:19 -03:00
Stefan Liebler
4b1ffb828c powerpc64le: Remove configure check for objcopy >= 2.26.
Since the minimum binutils version was raised to >= 2.26, the
configure check for --update-section support is no longer needed.
2025-05-14 10:35:55 +02:00
Joseph Myers
ae31254432 Implement C23 compoundn
C23 adds various <math.h> function families originally defined in TS
18661-4.  Add the compoundn functions, which compute (1+X) to the
power Y for integer Y (and X at least -1).  The integer exponent has
type long long int in C23; it was intmax_t in TS 18661-4, and as with
other interfaces changed after their initial appearance in the TS, I
don't think we need to support the original version of the interface.
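
So the C23 prototype is (the TS 18661-4 version had an intmax_t
exponent instead):

  /* Computes (1 + x) ** n, for x >= -1.  */
  double compoundn (double x, long long int n);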

Note that these functions are "compoundn" with a trailing "n", *not*
"compound" (CORE-MATH has the wrong name, for example).

As with pown, I strongly encourage searching for worst cases for ulps
error for these implementations (necessarily non-exhaustively, given
the size of the input space).  I also expect a custom implementation
for a given format could be much faster as well as more accurate (I
haven't tested or benchmarked the CORE-MATH implementation for
binary32); this is one of the more complicated and less efficient
functions to implement in a type-generic way.

As with exp2m1 and exp10m1, this exposed places where the
powerpc64le IFUNC setup is not as self-contained as one might hope
(in this case, without the changes specific to powerpc64le, there
were undefined references to __GI___expf128).

Tested for x86_64 and x86, and with build-many-glibcs.py.
2025-05-09 15:17:27 +00:00
Adhemerval Zanella
ac4e838289 powerpc: Remove POWER7 strncasecmp optimization
These routines are not extensively used (the gnulib documentation
even recommends using a replacement [1]), and there is already a
POWER8 version that uses proper vectorized instructions.

[1] https://www.gnu.org/software/gnulib/manual/gnulib.html#C-strings

Checked with a build for some powerpc variations.
Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-05-06 13:31:01 -03:00
Florian Weimer
77e8b40a6e powerpc: Remove relocation cache flush code for power64
This is only needed for -mno-secure-plt, and this linkage mode is
not supported with powerpc64 and powerpc64le.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-04-10 06:52:18 +02:00
Adhemerval Zanella
3e8814903c math: Refactor how to use libm-test-ulps
The current approach tracks the maximum supported math errors by
explicitly setting them per function and architecture.  On newer
implementations or new compiler versions, the file is updated with
newer values if it shows higher results.  The idea is to track the
maximum known error, to update the manual with the obtained values.

The constant libm-test-ulps updates show little value: they are
usually mechanical changes done by the maintainer; for past releases
it is usually ignored whether the ulp change resulted from a compiler
regression; and the math tests already have a maximum ulp error that
triggers a regression.

This was shown by a recent update after the new, correctly rounded
acosf implementation [1], where the libm-test-ulps change was indeed
from a compiler issue.

This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.

Now multiple libm-test-ulps files can be defined, and a
system-specific file overrides the generic one.  This covers the
case where an arch-specific implementation might show worse
precision than the generic one, for instance cbrtf on i686.
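
For reference, the entries keep the existing libm-test-ulps format,
e.g. (values illustrative):

  Function: "cbrt":
  double: 1
  float: 1
  ldouble: 1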

Regressions are only reported if the implementation shows errors
larger than 9 ulps (13 for IBM long double), unless overridden by
libm-test-ulps, and the maximum error is no longer printed at the
end of the tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.

The manual error table is also removed; Paul Zimmermann and others
have been tracking libm precision with a more comprehensive analysis
for some releases, so link to their work instead.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
2025-03-12 13:40:07 -03:00
Adhemerval Zanella
1d60b9dfda Remove dl-procinfo.h
powerpc was the only architecture with arch-specific hooks for
LD_SHOW_AUXV, and with the information moved to ld diagnostics there
is no need to keep the _dl_procinfo hook.

Checked with a build for all affected ABIs.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-03-05 11:22:09 -03:00
Adhemerval Zanella
2fd580ea46 powerpc: Remove unused dl-procinfo.h
_dl_string_platform is moved to hwcapinfo.h, since it is only used
by hwcapinfo.c and the test-get_hwcap internal test.

Checked on powerpc64le-linux-gnu.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-03-05 11:22:09 -03:00
Adhemerval Zanella
a768993c10 powerpc: Move cache geometry information to ld diagnostics
From LD_SHOW_AUXV output.

Checked on powerpc64le-linux-gnu.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-03-05 11:22:09 -03:00
Adhemerval Zanella
8a995670a8 powerpc: Move AT_HWCAP descriptions to ld diagnostics
The ld.so diagnostics already prints AT_HWCAP values, but only in
hexadecimal.  To avoid duplicating the strings, consolidate the
hwcap_names from cpu-features.h on a new file, dl-hwcap-info.h
(and it also improves the hwcap string description with more
values).

For future AT_HWCAP3/AT_HWCAP4 extensions, it is just a matter of
adding them to dl-hwcap-info.c so that both the ld diagnostics and
tunable filtering will parse the new values.

Checked on powerpc64le-linux-gnu.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-03-05 11:22:09 -03:00
Adhemerval Zanella
8f170dc819 math: Use tanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic tanpif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      85.1683        47.7990        43.88%
x86_64v2                    76.8219        41.4679        46.02%
x86_64v3                    73.7775        37.7734        48.80%
aarch64 (Neoverse)          35.4514        18.0742        49.02%
power8                      22.7604        10.1054        55.60%
power10                     22.1358         9.9553        55.03%

reciprocal-throughput        master        patched   improvement
x86_64                      41.0174        19.4718        52.53%
x86_64v2                    34.8565        11.3761        67.36%
x86_64v3                    34.0325         9.6989        71.50%
aarch64 (Neoverse)          25.4349         9.2017        63.82%
power8                      13.8626         3.8486        72.24%
power10                     11.7933         3.6420        69.12%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
de2fca9fe2 math: Use sinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic sinpif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                      master        patched   improvement
x86_64                      47.5710        38.4455        19.18%
x86_64v2                    46.8828        40.7563        13.07%
x86_64v3                    44.0034        34.1497        22.39%
aarch64 (Neoverse)          19.2493        14.1968        26.25%
power8                      23.5312        16.3854        30.37%
power10                     22.6485        10.2888        54.57%

reciprocal-throughput        master        patched   improvement
x86_64                      21.8858        11.6717        46.67%
x86_64v2                    22.0620        11.9853        45.67%
x86_64v3                    21.5653        11.3291        47.47%
aarch64 (Neoverse)          13.0615         6.5499        49.85%
power8                      16.2030         6.9580        57.06%
power10                     12.8911         4.2858        66.75%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
be85208b9f math: Use cospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic cospif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                    master        patched   improvement
x86_64                    47.4679        38.4157        19.07%
x86_64v2                  46.9686        38.3329        18.39%
x86_64v3                  43.8929        31.8510        27.43%
aarch64 (Neoverse)        18.8867        13.2089        30.06%
power8                    22.9435         7.8023        65.99%
power10                   15.4472        7.77505        49.67%

reciprocal-throughput      master        patched   improvement
x86_64                    20.9518        11.4991        45.12%
x86_64v2                  19.8699        10.5921        46.69%
x86_64v3                  19.3475         9.3998        51.42%
aarch64 (Neoverse)        12.5767         6.2158        50.58%
power8                    15.0566         3.2654        78.31%
power10                    9.2866         3.1147        66.46%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
95a01ea955 math: Use atanpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic atanpif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                     master        patched   improvement
x86_64                     66.3296        52.7558        20.46%
x86_64v2                   66.0429        51.4007        22.17%
x86_64v3                   60.6294        48.7876        19.53%
aarch64 (Neoverse)         24.3163        20.9110        14.00%
power8                     16.5766        13.3620        19.39%
power10                    16.5115        13.4072        18.80%

reciprocal-throughput       master        patched   improvement
x86_64                     30.8599        16.0866        47.87%
x86_64v2                   29.2286        15.4688        47.08%
x86_64v3                   23.0960        12.8510        44.36%
aarch64 (Neoverse)         15.4619        10.6752        30.96%
power8                      7.9200         5.2483        33.73%
power10                     6.8539         4.6262        32.50%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
1cd9ccd8c0 math: Use atan2pif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic atan2pif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 79.4006        70.8726        10.74%
x86_64v2               77.5136        69.1424        10.80%
x86_64v3               71.8050        68.1637         5.07%
aarch64 (Neoverse)     27.8363        24.7700        11.02%
power8                 39.3893        17.2929        56.10%
power10                19.7200        16.8187        14.71%

reciprocal-throughput   master        patched   improvement
x86_64                 38.3457        30.9471        19.29%
x86_64v2               37.4023        30.3112        18.96%
x86_64v3               33.0713        24.4891        25.95%
aarch64 (Neoverse)     19.3683        15.3259        20.87%
power8                 19.5507        8.27165        57.69%
power10                9.05331        7.63775        15.64%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
ae679a0aca math: Use asinpif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic asinpif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master        patched   improvement
x86_64                 46.4996        41.6126        10.51%
x86_64v2               46.7551        38.8235        16.96%
x86_64v3               42.6235        33.7603        20.79%
aarch64 (Neoverse)     17.4161        14.3604        17.55%
power8                 10.7347         9.0193        15.98%
power10                10.6420         9.0362        15.09%

reciprocal-throughput   master        patched   improvement
x86_64                 24.7208        16.5544        33.03%
x86_64v2               24.2177        14.8938        38.50%
x86_64v3               20.5617        10.5452        48.71%
aarch64 (Neoverse)     13.4827        7.17613        46.78%
power8                 6.46134        3.56089        44.89%
power10                5.79007        3.49544        39.63%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Adhemerval Zanella
edb2a8f0ae math: Use acospif from CORE-MATH
The CORE-MATH implementation is correctly rounded (for any rounding
mode) and shows better performance than the generic acospif.

The code was adapted to glibc style and to use the definitions from
math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                  master        patched   improvement
x86_64                  54.8281        42.9070        21.74%
x86_64v2                54.1717        42.7497        21.08%
x86_64v3                49.3552        34.1512        30.81%
aarch64 (Neoverse)      17.9395        14.3733        19.88%
power8                  20.3110         8.8609        56.37%
power10                 11.3113        8.84067        21.84%

reciprocal-throughput    master        patched   improvement
x86_64                  21.2301        14.4803        31.79%
x86_64v2                20.6858        13.9506        32.56%
x86_64v3                16.1944        11.3377        29.99%
aarch64 (Neoverse)      11.4474        7.13282        37.69%
power8                  10.6916        3.57547        66.56%
power10                 4.64269        3.54145        23.72%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 16:31:57 -03:00
Florian Weimer
3755ffb665 powerpc64le: Also avoid IFUNC for __mempcpy
Code used during early static startup in elf/dl-tls.c uses
__mempcpy.

Fixes commit cbd9fd2369 ("Consolidate
TLS block allocation for static binaries with ld.so").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-05 09:53:11 +01:00
Florian Weimer
7a3e2e877a Move <thread_pointer.h> to kernel-independent sysdeps directories
Hurd is expected to use the same thread ABI as Linux.

Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
2025-01-09 19:30:16 +01:00
Adhemerval Zanella
9cc9f8e11e math: Fix acosf when building with gcc <= 11
GCC <= 11 wrongly assumes that the rounding mode is to-nearest and
performs constant folding where it should instead evaluate at run
time, since the result is not exact [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57245
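
A minimal illustration of this bug class (an assumed example, not
taken from the patch):

  #include <fenv.h>

  float
  round_up_sum (void)
  {
    fesetround (FE_UPWARD);
    /* 1.0f + 0x1p-25f is inexact: to-nearest gives 1.0f, upward
       gives 1.0f + 0x1p-23f.  GCC <= 11 may fold this at compile
       time assuming to-nearest, yielding the wrong value.  */
    return 1.0f + 0x1p-25f;
  }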
2025-01-09 12:53:58 -03:00
Andreas K. Hüttel
2750548afe math: update powerpc ulps (this time LE)
Linux bogsucker 6.1.55-gentoo-dist-hardened #1 SMP Sun Oct  1 18:03:02 UTC 2023 ppc64le POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2025-01-07 15:58:45 +01:00
Andreas K. Hüttel
3674004f3f math: update powerpc ulps
Linux timberdoodle 6.1.60-gentoo-dist-hardened #1 SMP Fri Dec  1 22:10:49 UTC 2023 ppc64 POWER9 (architected), altivec supported CHRP IBM pSeries (emulated by qemu) GNU/Linux

Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2025-01-03 19:34:53 +01:00
Florian Weimer
cc74583f23 elf: Remove the remaining uses of GET_ADDR_OFFSET
Expand the macro where it is used in static definitions of
__tls_get_addr.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-01-02 13:45:27 +01:00
Florian Weimer
64d07e117d powerpc: Update acosf ulps
As seen on powerpc64le-linux-gnu with GCC 11 defaulting to POWER9
instructions.
2025-01-02 11:57:39 +01:00
Paul Eggert
ad16577ae1 Update copyright in generated files by running "make" 2025-01-01 11:22:09 -08:00