1
0
mirror of https://sourceware.org/git/glibc.git synced 2025-11-02 09:33:31 +03:00
Commit Graph

596 Commits

Author SHA1 Message Date
litenglong
00d406e77b x86: Disable AVX Fast Unaligned Load on Hygon 1/2/3
- Performance testing revealed significant memcpy performance degradation
  when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
  memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences primarily in floating-point operations don't justify
  AVX usage for memory operations.

Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-10-27 05:16:30 +08:00
Sunil K Pandey
a114e29ddd x86: Detect Intel Nova Lake Processor
Detect Intel Nova Lake Processor and tune it similar to Intel Panther
Lake.  https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-10-07 20:50:24 -07:00
Sunil K Pandey
f8dd52901b x86: Detect Intel Wildcat Lake Processor
Detect Intel Wildcat Lake Processor and tune it similar to Intel Panther
Lake.  https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-10-07 16:41:06 -07:00
Uros Bizjak
a9a8b106bb x86: Restore "*&" GCC asm memory operand workaround to installed fpu-control.h
fpu_control.h is an installed header so a wider range of compiler versions
(including ones older than GCC 9) are relevant with it than are relevant
for building glibc.

Fixes commit 3014dec3ad
('x86: Remove obsolete "*&" GCC asm memory operand workaround')

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
2025-09-24 08:04:41 +02:00
Uros Bizjak
ff8be6152b x86: Use "%v" to emit VEX encoded instructions for AVX targets
Legacy encodings of SSE instructions incur AVX-SSE domain transition
penalties on some Intel microarchitectures (e.g. Haswell, Broadwell).
Using the VEX forms avoids these penatlies and keeps all instructions
in the VEX decode domain.  Use "%v" sequence to emit the "v" prefix
for opcodes when compiling with -mavx.

No functional changes intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-09-22 17:33:25 +02:00
Uros Bizjak
3014dec3ad x86: Remove obsolete "*&" GCC asm memory operand workaround
GCC now accept plain variable names as valid lvalues for "m"
constraints, automatically spilling locals to memory if necessary.
The long-standing "*&" pattern was originally used as a defensive
workaround for older compiler versions that rejected operands
such as:

     asm ("incl %0" : "+m"(x));

with errors like "memory input is not directly addressable".

Modern compilers (GCC >= 9) reliably generate correct code
without the workaround, and the resulting assembly is identical.

No functional changes intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-09-22 17:33:25 +02:00
H.J. Lu
1fa5773eb1 x86: Don't use asm statement for trunc/truncf
Compiler inlines trunc and truncf with SSE4.1.  But older versions of GCC
doesn't inline them with -Os:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121861

Don't use asm statement for trunc and truncf if compiler can inline them
with -Os.  It removes one register move with GCC 16:

__modff_sse41:                        __modff_sse41:
.LFB23:                               .LFB23:
   .cfi_startproc                        .cfi_startproc
   endbr64                               endbr64
   subq  $24, %rsp                       subq  $24, %rsp
   .cfi_def_cfa_offset 32                .cfi_def_cfa_offset 32
   movq  %fs:40, %rax                    movq  %fs:40, %rax
   movq  %rax, 8(%rsp)                   movq  %rax, 8(%rsp)
   xorl  %eax, %eax                      xorl  %eax, %eax
   movd  %xmm0, %eax                     movd  %xmm0, %eax
   addl  %eax, %eax                      addl  %eax, %eax
   cmpl  $-16777216, %eax                cmpl  $-16777216, %eax
   je .L7                                je .L7
                                   >     movaps   %xmm0, %xmm3
   movaps   %xmm0, %xmm4                 movaps   %xmm0, %xmm4
   movss .LC0(%rip), %xmm2         |     movss .LC0(%rip), %xmm1
   movaps   %xmm2, %xmm3           |     movaps   %xmm1, %xmm2
   andps %xmm0, %xmm2              |     roundss  $11, %xmm3, %xmm3
   roundss $11, %xmm0, %xmm1       |     subss %xmm3, %xmm4
   subss %xmm1, %xmm4              |     andps %xmm0, %xmm1
   andnps   %xmm4, %xmm3           |     andnps   %xmm4, %xmm2
   orps  %xmm3, %xmm2              |     orps  %xmm2, %xmm1
.L3:                                  .L3:
   movss %xmm1, (%rdi)             |     movss %xmm3, (%rdi)
   movq  8(%rsp), %rax                   movq  8(%rsp), %rax
   subq  %fs:40, %rax                    subq  %fs:40, %rax
   jne   .L8                             jne   .L8
   movaps   %xmm2, %xmm0           |     movaps   %xmm1, %xmm0
   addq  $24, %rsp                       addq  $24, %rsp
   .cfi_remember_state                   .cfi_remember_state
   .cfi_def_cfa_offset 8                 .cfi_def_cfa_offset 8
   ret                                   ret

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Uros Bizjak <ubizjak@gmail.com>
2025-09-17 04:30:11 -07:00
Adhemerval Zanella
b5d88fa6c3 math: Fix x86_64 build for -Os (BZ 33367)
The compiler might not inline the trunc function call for
USE_TRUNC_BUILTIN [1].

This patch adds an optimized __trunc/__truncf for x86 used
on modf ifunc variant to avoid the trunc libcall.

Checked on x86_64, x86_64-v2, x86_64-v3, and x86_64-v4. Used -O2 and
-Os options. Performed a full make check on x86_64 with both
 optimizations.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121861
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-09-11 06:23:33 -07:00
Uros Bizjak
4be94f6a9c x86: Remove x86 version of thread_pointer.h
The x86 version of thread_pointer.h is the same as the generic one.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-09-10 05:30:07 -07:00
Uros Bizjak
e5222ceb73 x86: Remove stale __GNUC_PREREQ (11, 1) test from __thread_pointer()
GCC 12 is currently the minimum supported compiler version.
Remove no longer needed __GNUC_PREREQ (11, 1) test from
__thread_pointer().

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-09-10 05:30:07 -07:00
Uros Bizjak
b8253693b7 x86: Define atomic_compare_and_exchange_{val, bool}_acq using __atomic_compare_exchange_n
No functional changes.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 07:58:52 -07:00
Uros Bizjak
935ee691bc x86: Define atomic_exchange_acq using __atomic_exchange_n
The resulting libc.so is identical on both x86_64 and
i386 targets compared to unpatched builds:

$ sha1sum libc-x86_64-old.so libc-x86_64-new.so
74eca1b87f2ecc9757a984c089a582b7615d93e7  libc-x86_64-old.so
74eca1b87f2ecc9757a984c089a582b7615d93e7  libc-x86_64-new.so
$ sha1sum libc-i386-old.so libc-i386-new.so
882bbab8324f79f4fbc85224c4c914fc6822ece7  libc-i386-old.so
882bbab8324f79f4fbc85224c4c914fc6822ece7  libc-i386-new.so

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 07:51:41 -07:00
Uros Bizjak
e6b5ad1b1d x86: Define atomic_full_barrier using __sync_synchronize
For x86_64 targets, __sync_synchronize emits a full 64-bit
'LOCK ORQ $0x0,(%rsp)' instead of 'LOCK ORL $0x0,(%rsp)'.

No functional changes.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 07:44:41 -07:00
Uros Bizjak
4eef002328 x86: Remove catomic_* locking primitives
Remove obsolete catomic_* locking primitives which don't map
to standard compiler builtins.

There are still a couple of places in the tree that uses them
(malloc/arena.c and malloc/malloc.c).

x86 didn't define __arch_c_compare_and_exchange_bool_* primitives
so fallback code used __arch_c_compare_and_exchange_val_* primitives
instead.  This resulted in unoptimal code for
catomic_compare_and_exchange_bool_acq where superfluous
CMP was emitted after CMPXCHG, e.g. in arena_get2:

   775b8:	48 8d 4a 01          	lea    0x1(%rdx),%rcx
   775bc:	48 89 d0             	mov    %rdx,%rax
   775bf:	64 83 3c 25 18 00 00 	cmpl   $0x0,%fs:0x18
   775c6:	00 00
   775c8:	74 01                	je     775cb <arena_get2+0x35b>
   775ca:	f0 48 0f b1 0d 75 3d 	lock cmpxchg %rcx,0x163d75(%rip)        # 1db348 <narenas>
   775d1:	16 00
   775d3:	48 39 c2             	cmp    %rax,%rdx
   775d6:	74 7f                	je     77657 <arena_get2+0x3e7>

that now becomes:

   775b8:	48 8d 4a 01          	lea    0x1(%rdx),%rcx
   775bc:	48 89 d0             	mov    %rdx,%rax
   775bf:	f0 48 0f b1 0d 80 3d 	lock cmpxchg %rcx,0x163d80(%rip)        # 1db348 <narenas>
   775c6:	16 00
   775c8:	74 7f                	je     77649 <arena_get2+0x3d9>

OTOH, catomic_decrement does not fallback to atomic_fetch_add (, -1)
builtin but to the cmpxchg loop, so the generated code in arena_get2
regresses a bit, from using LOCK DECQ insn:

   77829:	64 83 3c 25 18 00 00 	cmpl   $0x0,%fs:0x18
   77830:	00 00
   77832:	74 01                	je     77835 <arena_get2+0x5c5>
   77834:	f0 48 ff 0d 0c 3b 16 	lock decq 0x163b0c(%rip)        # 1db348 <narenas>
   7783b:	00

to a cmpxchg loop:

   7783d:	48 8b 0d 04 3b 16 00 	mov    0x163b04(%rip),%rcx        # 1db348 <narenas>
   77844:	48 8d 71 ff          	lea    -0x1(%rcx),%rsi
   77848:	48 89 c8             	mov    %rcx,%rax
   7784b:	f0 48 0f b1 35 f4 3a 	lock cmpxchg %rsi,0x163af4(%rip)        # 1db348 <narenas>
   77852:	16 00
   77854:	0f 84 c9 fa ff ff    	je     77323 <arena_get2+0xb3>
   7785a:	eb e1                	jmp    7783d <arena_get2+0x5cd>

Defining catomic_exchange_and_add using __atomic_fetch_add solves the
above issue and generates optimal:

   77809:	f0 48 83 2d 36 3b 16 	lock subq $0x1,0x163b36(%rip)        # 1db348 <narenas>
   77810:	00 01

Depending on the target processor, the compiler may emit either
'LOCK ADD/SUB $1, m' or 'INC/DEC $1, m' instruction, due to partial
flag register stall issue.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 07:36:02 -07:00
Uros Bizjak
af5b01dc26 x86: Remove unused atomics
Remove unused atomics from <sysdeps/x86/atomic-machine.h>.

The resulting libc.so is identical on both x86_64 and
i386 targets compared to unpatched builds:

$ sha1sum libc-x86_64-old.so libc-x86_64-new.so
b89aaa2b71efd435104ebe6f4cd0f2ef89fcac90  libc-x86_64-old.so
b89aaa2b71efd435104ebe6f4cd0f2ef89fcac90  libc-x86_64-new.so
$ sha1sum libc-i386-old.so libc-i386-new.so
aa70f2d64da2f0f516634b116014cfe7af3e5b1a  libc-i386-old.so
aa70f2d64da2f0f516634b116014cfe7af3e5b1a  libc-i386-new.so

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Cc: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Collin Funk <collin.funk1@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-09-09 07:29:57 -07:00
H.J. Lu
5c522d7a58 x86: Include <bits/stdlib-bsearch.h> in dl-cacheinfo.h
On x86-64, when glibc is configured with --enable-stack-protector=all
and compiled with -Os, ld.so crashes very early:

(gdb) r --direct
Starting program: /export/build/gnu/tools-build/glibc-gitlab/build-x86_64-linux/string/test-memswap --direct

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7f41b0a in bsearch (__key=__key@entry=0x7fffffffda28,
    __base=__base@entry=0x7ffff7fca140 <intel_02_known>,
    __nmemb=__nmemb@entry=68, __size=__size@entry=8,
    __compar=__compar@entry=0x7ffff7f3b691 <intel_02_known_compare>)
    at ../bits/stdlib-bsearch.h:22
22	{
(gdb) disass
Dump of assembler code for function bsearch:
   0x00007ffff7f41af0 <+0>:	push   %r15
   0x00007ffff7f41af2 <+2>:	mov    %rcx,%r15
   0x00007ffff7f41af5 <+5>:	push   %r14
   0x00007ffff7f41af7 <+7>:	push   %r13
   0x00007ffff7f41af9 <+9>:	mov    %rsi,%r13
   0x00007ffff7f41afc <+12>:	push   %r12
   0x00007ffff7f41afe <+14>:	mov    %rdi,%r12
   0x00007ffff7f41b01 <+17>:	push   %rbp
   0x00007ffff7f41b02 <+18>:	mov    %rdx,%rbp
   0x00007ffff7f41b05 <+21>:	push   %rbx
   0x00007ffff7f41b06 <+22>:	sub    $0x18,%rsp
=> 0x00007ffff7f41b0a <+26>:	mov    %fs:0x28,%r14
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ We can't use stack protector at this point.
   0x00007ffff7f41b13 <+35>:	mov    %r14,0x8(%rsp)
   0x00007ffff7f41b18 <+40>:	mov    %r8,%r14
   0x00007ffff7f41b1b <+43>:	test   %rbp,%rbp
   0x00007ffff7f41b1e <+46>:	je     0x7ffff7f41b48 <bsearch+88>
   0x00007ffff7f41b20 <+48>:	mov    %rbp,%rbx
   0x00007ffff7f41b23 <+51>:	mov    %r12,%rdi
   0x00007ffff7f41b26 <+54>:	shr    $1,%rbx
   0x00007ffff7f41b29 <+57>:	imul   %r15,%rbx
   0x00007ffff7f41b2d <+61>:	add    %r13,%rbx
   0x00007ffff7f41b30 <+64>:	mov    %rbx,%rsi
(gdb) bt
 #0  0x00007ffff7f41b0a in bsearch (__key=__key@entry=0x7fffffffda28,
    __base=__base@entry=0x7ffff7fca140 <intel_02_known>,
    __nmemb=__nmemb@entry=68, __size=__size@entry=8,
    __compar=__compar@entry=0x7ffff7f3b691 <intel_02_known_compare>)
    at ../bits/stdlib-bsearch.h:22
 #1  0x00007ffff7f3c1be in intel_check_word (name=188, value=1979933440,
    has_level_2=has_level_2@entry=0x7fffffffda7f,
    no_level_2_or_3=no_level_2_or_3@entry=0x7fffffffda7e,
    cpu_features=<optimized out>) at ../sysdeps/x86/dl-cacheinfo.h:217
 #2  0x00007ffff7f3c29f in handle_intel (name=name@entry=188,
    cpu_features=<optimized out>) at ../sysdeps/x86/dl-cacheinfo.h:279
 #3  0x00007ffff7f3ccf9 in dl_init_cacheinfo (cpu_features=<optimized out>)
    at ../sysdeps/x86/dl-cacheinfo.h:852
 #4  init_cpu_features (cpu_features=<optimized out>)
    at ../sysdeps/x86/cpu-features.c:1153
 #5  0x00007ffff7f3d6f9 in __libc_start_main_impl (main=0x7ffff7f396dc <main>,
    argc=2, argv=0x7fffffffdbe8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=0x0, stack_end=0x7fffffffdbd8) at ../csu/libc-start.c:269
 #6  0x00007ffff7f39901 in _start () at ../sysdeps/x86_64/start.S:115
(gdb)

The problem is that since __USE_EXTERN_INLINES isn't defined with -Os,
the inline bsearch in <bits/stdlib-bsearch.h> isn't available and the
external bsearch is compiled with stack protector.  Include
<bits/stdlib-bsearch.h> in dl-cacheinfo.h fixed BZ #33374.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-09-08 20:31:36 -07:00
Uros Bizjak
119d658ac2 x86: Use flag output operands for inline asm in atomic-machine.h
Use the flag output constraints feature available in gcc 6+
("=@cc<cond>") instead of explicitly setting a boolean variable
with SETcc instruction.  This approach decouples the instruction
that sets the flags from the code that consumes them, allowing
the compiler to create better code when working with flags users.
Instead of e.g.:

   lock add %esi,(%rdi)
   sets   %sil
   test   %sil,%sil
   jne    <...>

the compiler now generates:

   lock add %esi,(%rdi)
   js     <...>

No functional changes intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: H.J.Lu <hjl.tools@gmail.com>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-08-29 09:05:23 +02:00
Henrik Lindström
c49a32d7eb x86/configure: Improve portability of isa level check
wc -l pads the output with leading spaces on some systems, e.g. FreeBSD.
This results in the check `test "$count" = 1` failing. Use -eq for integer
comparison instead.

Signed-off-by: Henrik Lindström <henrik@lxm.se>
Reviewed-by: Arjun Shankar <arjun@redhat.com>
2025-08-27 17:14:15 +02:00
H.J. Lu
027505a07b Don't pass -c to LIBC_TRY_TEST_CC_OPTION
LIBC_TRY_TEST_CC_OPTION is defined with LIBC_TRY_CC_OPTION:

dnl Test a compiler option or options with an empty input file.
dnl LIBC_TRY_CC_OPTION([options], [action-if-true], [action-if-false])
AC_DEFUN([LIBC_TRY_CC_OPTION],
[AS_IF([AC_TRY_COMMAND([${CC-cc} $1 -xc /dev/null -S -o /dev/null])],
        [$2], [$3])])

which passes -S to compiler.  Unlike gcc, when -c is also passed to clang
20, we get

configure:7838: clang -c -Werror -fsemantic-interposition -xc /dev/null -S -o /dev/null
clang: error: argument unused during compilation: '-c' [-Werror,-Wunused-command-line-argument]

Don't pass -c to LIBC_TRY_TEST_CC_OPTION since -c isn't needed.

This fixes BZ #33318.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-08-23 15:59:42 -07:00
H.J. Lu
dd4394b249 x86: Set have-protected-data to no if unsupported
If the building compiler enables no direct external data access by
default, access to protected data in shared libraries from executables
must be compiled with no direct external data access.  If the testing
compiler doesn't support it, set have-protected-data to no to disable
the tests which requires no direct external data access.

Add LIBC_TRY_CC_COMMAND to test a building compiler option or options
with an input file.

This fixes BZ #33286.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-08-22 17:55:32 -07:00
H.J. Lu
bd4628f3f1 i386: Also add GLIBC_ABI_GNU2_TLS version [BZ #33129]
Since the GNU2 TLS run-time bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=31372

affects both i386 and x86-64, also add GLIBC_ABI_GNU2_TLS version to i386
to indicate the working GNU2 TLS run-time.  For x86-64, the additional
GNU2 TLS run-time bug fix is needed for

https://sourceware.org/bugzilla/show_bug.cgi?id=31501

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-08-18 11:58:01 -07:00
H.J. Lu
aec8498873 x86-64: Properly compile ISA optimized modf and modff
There are 3 variants of modf and modff: SSE2, SSE4.1 and AVX.  s_modf.c
and s_modff.c include the generic implementation compiled with the minimum
x86 ISA level.  The IFUNC selector is used only if the minimum ISA level
is less than AVX.  SSE4.1 variant is included only if the ISA level is
less than SSE4.1.  AVX variant is included only the ISA level is less than
AVX.

AVX variant should be compiled with -mavx, not -msse2avx -DSSE2AVX which
are used to encode SSE assembly sources with EVEX encoding.

The routines that are shared between libc and libm should use different
rules to avoid using the same MODULE_NAME, to avoid potential issues
like BZ #33165 where __stack_chk_fail not being routed to the internal
symbol.

Tested with -march=x86-64, -march=x86-64-v2, -march=x86-64-v3 and
-march=x86-64-v4.

This fixes BZ #33165 and BZ #33173.

Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-07-18 10:22:19 -07:00
H.J. Lu
7130c2ae97 x86: Avoid vector/r16-r31 registers and memcpy/memset in mcount_internal
Since mcount_internal is called from mcount/__fentry__ which preserve
only RAX, RCX, RDX, RSI, RDI, R8 and R9, compile mcount.c with

-fno-tree-loop-distribute-patterns -mgeneral-regs-only -mno-apxf

to void vector/r16-r31 registers and memcpy/memset in mcount_internal.
This fixes BZ #33134.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
2025-07-09 05:33:05 +08:00
H.J. Lu
0ef7965e5b x86: Update tst-gnu2-tls2 tests
Update tst-gnu2-tls2 tests to set XMM0...XMM7 to all 1s in malloc to
verify that XMM registers are preserved when _dl_tlsdesc_dynamic is
called by clearing vectors with zeroed XMM registers before
_dl_tlsdesc_dynamic and using these XMM registers to clear vectors
after _dl_tlsdesc_dynamic.  This improves the BZ #31372 test.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-06-19 05:46:31 +08:00
H.J. Lu
848f0e46f0 i386: Update ___tls_get_addr to preserve vector registers
Compiler generates the following instruction sequence for dynamic TLS
access:

	leal	tls_var@tlsgd(,%ebx,1), %eax
	call	___tls_get_addr@PLT

CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS, AX, CX, and DX, are unchanged after CALL.  But
___tls_get_addr is a normal function which doesn't preserve any vector
registers.

1. Rename the generic __tls_get_addr function to ___tls_get_addr_internal.
2. Change ___tls_get_addr to a wrapper function with implementations for
FNSAVE, FXSAVE, XSAVE and XSAVEC to save and restore all vector registers.
3. dl-tlsdesc-dynamic.h has:

_dl_tlsdesc_dynamic:
	/* Like all TLS resolvers, preserve call-clobbered registers.
	   We need two scratch regs anyway.  */
	subl	$32, %esp
	cfi_adjust_cfa_offset (32)

It is wrong to use

	movl	%ebx, -28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	-28(%esp), %ebx

to preserve EBX on stack.  Fix it with:

	movl	%ebx, 28(%esp)
	movl	%esp, %ebx
	cfi_def_cfa_register(%ebx)
	...
	mov	%ebx, %esp
	cfi_def_cfa_register(%esp)
	movl	28(%esp), %ebx

4. Update _dl_tlsdesc_dynamic to call ___tls_get_addr_internal directly.
5. Add have-test-mtls-traditional to compile tst-tls23-mod.c with
traditional TLS variant to verify the fix.
6. Define DL_RUNTIME_RESOLVE_REALIGN_STACK in sysdeps/x86/sysdep.h.

This fixes BZ #32996.

Co-Authored-By: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-06-19 04:30:31 +08:00
H.J. Lu
0a027674a1 x86: Avoid GLRO(dl_x86_cpu_features)
In init_cpu_features, replace GLRO(dl_x86_cpu_features) with
cpu_features to avoid an extra load.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-06-09 13:03:13 +08:00
H.J. Lu
de14f1959e x86: Detect Intel Diamond Rapids
Detect Intel Diamond Rapids and tune it similar to Intel Granite Rapids.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
2025-04-12 09:43:15 -07:00
Sunil K Pandey
9f0deff558 x86: Handle unknown Intel processor with default tuning
Enable default tuning for unknown Intel processor.

Tested on x86, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-04-11 17:05:22 -07:00
Sunil K Pandey
e53eb952b9 x86: Add ARL/PTL/CWF model detection support
- Add ARROWLAKE model detection.
- Add PANTHERLAKE model detection.
- Add CLEARWATERFOREST model detection.

Intel® Architecture Instruction Set Extensions Programming Reference
https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.

No regression, validated model detection on SDE.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-04-10 07:47:08 -07:00
Sunil K Pandey
70b6488551 x86: Optimize xstate size calculation
Scan xstate IDs up to the maximum supported xstate ID.  Remove the
separate AMX xstate calculation.  Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.

Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
2025-04-05 07:51:38 -07:00
Florian Weimer
c6e2895695 x86: Link tst-gnu2-tls2-x86-noxsave{,c,xsavec} with libpthread
This fixes a test build failure on Hurd.

Fixes commit 145097dff1 ("x86: Use separate
variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)").

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-03-31 21:33:18 +02:00
YLK
dbb2880e61 Fix typo in comment 2025-03-31 10:54:52 -03:00
Florian Weimer
145097dff1 x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size.  However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.

This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so.  As a result, struct cpu_features layout does not change,
helping a future backport of this change.

Fixes commit 9b7091415a ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-03-29 09:17:38 +01:00
Florian Weimer
59585ddaa2 x86: Skip XSAVE state size reset if ISA level requires XSAVE
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need.  Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.

Related to commit befe2d3c4d
("x86-64: Don't use SSE resolvers for ISA level 3 or above").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-03-29 09:17:38 +01:00
Adhemerval Zanella
1d60b9dfda Remove dl-procinfo.h
powerpc was the only architecture with arch-specific hooks for
LD_SHOW_AUXV, and with the information moved to ld diagnostics there
is no need to keep the _dl_procinfo hook.

Checked with a build for all affected ABIs.

Reviewed-by: Peter Bergner <bergner@linux.ibm.com>
2025-03-05 11:22:09 -03:00
Wilco Dijkstra
e5893e6349 Remove unused dl-procinfo.h
Remove unused _dl_hwcap_string defines.  As a result many dl-procinfo.h headers
can be removed.  This also removes target specific _dl_procinfo implementations
which only printed HWCAP strings using dl_hwcap_string.

Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-28 16:55:18 +00:00
koraynilay
29803ed3ce math: Fix unknown type name '__float128' for clang 3.4 to 3.8.1 (bug 32694)
When compiling a program that includes <bits/floatn.h> using a clang version
between 3.4 (included) and 3.8.1 (included), clang will fail with `unknown type
name '__float128'; did you mean '__cfloat128'?`. This changes fixes the clang
prerequirements macro call in floatn.h to check for clang 3.9 instead of 3.4,
since support for __float128 was actually enabled in 3.9 by:

commit 50f29e06a1b6a38f0bba9360cbff72c82d46cdd4
Author: Nemanja Ivanovic <nemanja.i.ibm@gmail.com>
Date:   Wed Apr 13 09:49:45 2016 +0000

    Enable support for __float128 in Clang

This fixes bug 32694.

Signed-off-by: koraynilay <koray.fra@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2025-02-23 11:47:11 +08:00
H.J. Lu
5a4573be6f x86 (__HAVE_FLOAT128): Defined to 0 for Intel SYCL compiler [BZ #32723]
Intel compiler always defines __INTEL_LLVM_COMPILER.  When SYCL is
enabled by -fsycl, it also defines SYCL_LANGUAGE_VERSION.  Since Intel
SYCL compiler doesn't support _Float128:

https://github.com/intel/llvm/issues/16903

define __HAVE_FLOAT128 to 0 for Intel SYCL compiler.

This fixes BZ #32723.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2025-02-20 08:42:37 +08:00
Florian Weimer
9b71570c46 x86: Add missing #include <features.h> to <thread_pointer.h>
It is required for __GNUC_PREREQ.

Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
2025-01-09 19:30:41 +01:00
Florian Weimer
7a3e2e877a Move <thread_pointer.h> to kernel-independent sysdeps directories
Hurd is expected to use the same thread ABI as Linux.

Reviewed-by: Michael Jeanson <mjeanson@efficios.com>
2025-01-09 19:30:16 +01:00
Paul Eggert
2642002380 Update copyright dates with scripts/update-copyrights 2025-01-01 11:22:09 -08:00
Adhemerval Zanella
a2b0ff98a0 include/sys/cdefs.h: Add __attribute_optimization_barrier__
Add __attribute_optimization_barrier__ to disable inlining and cloning on a
function.  For Clang, expand it to

__attribute__ ((optnone))

Otherwise, expand it to

__attribute__ ((noinline, clone))

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2024-12-23 06:28:55 +08:00
Fangrui Song
d773aff467 x86: Define __HAVE_FLOAT128 for Clang and use __builtin_*f128 code path
Clang supports __builtin_fabsf128 (despite not supporting _Float128) but
it does not support __builtin_fabsq.  Fallback to back to
`typedef __float128 _Float128;` it clang is used.

Originally developed by Fangrui Song <maskray@google.com>.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2024-12-22 16:07:11 +08:00
Adhemerval Zanella
6412d8cc46 x86: Use inhibit_stack_protector on tst-ifunc-isa.h
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2024-12-22 13:19:12 +08:00
H.J. Lu
f5fb9fa011 x86: Include test-flt-eval-method-387 if -mfpmath=387 works
Since Clang doesn't support -mfpmath=387 on x86-64, on x86, include
test-flt-eval-method-387 only if -mfpmath=387 works.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
2024-12-22 12:54:44 +08:00
Florian Weimer
2b1dba3eb3 elf: Introduce is_rtld_link_map
Unconditionally define it to false for static builds.

This avoids the awkward use of weak_extern for _dl_rtld_map
in checks that cannot be possibly true on static builds.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-12-20 15:52:57 +01:00
H.J. Lu
a194871b13 sys/platform/x86.h: Do not depend on _Bool definition in C++ mode
Clang does not define _Bool for -std=c++98:

/usr/include/bits/platform/features.h:31:19: error: unknown type name '_Bool'
   31 | static __inline__ _Bool
      |                   ^

Change _Bool to bool to silence clang++ error.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-12-18 02:32:27 +08:00
Florian Weimer
61c3450db9 x86: Avoid integer truncation with large cache sizes (bug 32470)
Some hypervisors report 1 TiB L3 cache size.  This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.
2024-12-17 18:49:50 +01:00
H.J. Lu
dd413a4d2f Fix sysdeps/x86/fpu/Makefile: Split and sort tests
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2024-12-16 05:57:28 +08:00
H.J. Lu
57a44f27c4 sysdeps/x86/fpu/Makefile: Split and sort tests
Split and sort tests in sysdeps/x86/fpu/Makefile.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2024-12-16 05:51:02 +08:00