In init_cpu_features, replace GLRO(dl_x86_cpu_features) with
cpu_features to avoid an extra load.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Detect Intel Diamond Rapids and tune it similar to Intel Granite Rapids.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Enable default tuning for unknown Intel processor.
Tested on x86, no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
- Add ARROWLAKE model detection.
- Add PANTHERLAKE model detection.
- Add CLEARWATERFOREST model detection.
Intel® Architecture Instruction Set Extensions Programming Reference
https://cdrdv2.intel.com/v1/dl/getContent/671368 Section 1.2.
No regression, validated model detection on SDE.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Scan xstate IDs up to the maximum supported xstate ID. Remove the
separate AMX xstate calculation. Instead, exclude the AMX space from
the start of TILECFG to the end of TILEDATA in xsave_state_size.
Completed validation on SKL/SKX/SPR/SDE and compared xsave state size
with "ld.so --list-diagnostics" option, no regression.
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Previously, the initialization code reused the xsave_state_full_size
member of struct cpu_features for the TLSDESC state size. However,
the tunable processing code assumes that this member has the
original XSAVE (non-compact) state size, so that it can use its
value if XSAVEC is disabled via tunable.
This change uses a separate variable and not a struct member because
the value is only needed in ld.so and the static libc, but not in
libc.so. As a result, struct cpu_features layout does not change,
helping a future backport of this change.
Fixes commit 9b7091415a ("x86-64:
Update _dl_tlsdesc_dynamic to preserve AMX registers").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to
run with -XSAVE,-XSAVEC on such builds, but this change here disables
some unnecessary code with higher ISA levels and simplifies testing.
Related to commit befe2d3c4d
("x86-64: Don't use SSE resolvers for ISA level 3 or above").
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This patch uses 'Avoid_Non_Temporal_Memset' flag to access
the non-temporal memset implementation for hygon processors.
Test Results:
hygon1 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 0.994
4MB 0.996
8MB 0.670
16MB 0.343
32MB 0.355
hygon2 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 1
4MB 1
8MB 1.312
16MB 0.822
32MB 0.830
hygon3 arch
x86_memset_non_temporal_threshold = 8MB
size new performance time / old performance time
1MB 1
4MB 0.990
8MB 0.737
16MB 0.390
32MB 0.401
For hygon arch with this patch, non-temporal stores can improve
performance by 20% - 65%.
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Add a new architecture type arch_kind_hygon to spilt Hygon branch
from AMD. This is to facilitate the Hygon processors to make settings
that are suitable for its own characteristics.
Signed-off-by: Feifei Wang <wangfeifei@hygon.cn>
Reviewed-by: Jing Li <lijing@hygon.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The goal of this flag is to allow targets which don't prefer/have ERMS
to still access the non-temporal memset implementation.
There are 4 cases for tuning memset:
1) `Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores
2) `Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal/non-temporal stores. Non-temporal path
goes through `rep stosb` path. We accomplish this by setting
`x86_rep_stosb_threshold` to
`x86_memset_non_temporal_threshold`.
3) `!Avoid_STOSB && Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`
3) `!Avoid_STOSB && !Avoid_Non_Temporal_Memset`
- Memset with temporal stores/`rep stosb`/non-temporal stores.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
This is just a refactor and there should be no behavioral change from
this commit.
The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having
extra logic based on vendor.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX).
Further benchmarks indicate non-temporal stores may in fact by a
regression on Skylake Server.
This commit may be over-cautious in some cases, but should avoid any
regressions for 2.40.
Tested using qemu on all x86_64 cpu arch supported by both qemu +
GLIBC.
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Current 'non_temporal_threshold' set to 'non_temporal_threshold_lowbound'
on Zhaoxin processors without ERMS. The default
'non_temporal_threshold_lowbound' is too small for the KH-40000 and KX-7000
Zhaoxin processors, this patch updates the value to
'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.
Unaligned AVX load are slower on KH-40000 and KX-7000, so disable
the AVX_Fast_Unaligned_Load.
Enable Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use sse2_unaligned version of memset,strcpy and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
_dl_tlsdesc_dynamic preserves RDI, RSI and RBX before realigning stack.
After realigning stack, it saves RCX, RDX, R8, R9, R10 and R11. Define
TLSDESC_CALL_REGISTER_SAVE_AREA to allocate space for RDI, RSI and RBX
to avoid clobbering saved RDI, RSI and RBX values on stack by xsave to
STATE_SAVE_OFFSET(%rsp).
+==================+<- stack frame start aligned at 8 or 16 bytes
| |<- RDI saved in the red zone
| |<- RSI saved in the red zone
| |<- RBX saved in the red zone
| |<- paddings for stack realignment of 64 bytes
|------------------|<- xsave buffer end aligned at 64 bytes
| |<-
| |<-
| |<-
|------------------|<- xsave buffer start at STATE_SAVE_OFFSET(%rsp)
| |<- 8-byte padding for 64-byte alignment
| |<- 8-byte padding for 64-byte alignment
| |<- R11
| |<- R10
| |<- R9
| |<- R8
| |<- RDX
| |<- RCX
+==================+<- RSP aligned at 64 bytes
Define TLSDESC_CALL_REGISTER_SAVE_AREA, the total register save area size
for all integer registers by adding 24 to STATE_SAVE_OFFSET since RDI, RSI
and RBX are saved onto stack without adjusting stack pointer first, using
the red-zone. This fixes BZ #31501.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
Replace minimum ISA check ifdef conditional with if. Since
MINIMUM_X86_ISA_LEVEL and AVX_X86_ISA_LEVEL are compile time constants,
compiler will perform constant folding optimization, getting same
results.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
_dl_tlsdesc_dynamic should also preserve AMX registers which are
caller-saved. Add X86_XSTATE_TILECFG_ID and X86_XSTATE_TILEDATA_ID
to x86-64 TLSDESC_CALL_STATE_SAVE_MASK. Compute the AMX state size
and save it in xsave_state_full_size which is only used by
_dl_tlsdesc_dynamic_xsave and _dl_tlsdesc_dynamic_xsavec. This fixes
the AMX part of BZ #31372. Tested on AMX processor.
AMX test is enabled only for compilers with the fix for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114098
GCC 14 and GCC 11/12/13 branches have the bug fix.
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
When glibc is built with ISA level 3 or above enabled, SSE resolvers
aren't available and glibc fails to build:
ld: .../elf/librtld.os: in function `init_cpu_features':
.../elf/../sysdeps/x86/cpu-features.c:1200:(.text+0x1445f): undefined reference to `_dl_runtime_resolve_fxsave'
ld: .../elf/librtld.os: relocation R_X86_64_PC32 against undefined hidden symbol `_dl_runtime_resolve_fxsave' can not be used when making a shared object
/usr/local/bin/ld: final link failed: bad value
For ISA level 3 or above, don't use _dl_runtime_resolve_fxsave nor
_dl_tlsdesc_dynamic_fxsave.
This fixes BZ #31429.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Compiler generates the following instruction sequence for GNU2 dynamic
TLS access:
leaq tls_var@TLSDESC(%rip), %rax
call *tls_var@TLSCALL(%rax)
or
leal tls_var@TLSDESC(%ebx), %eax
call *tls_var@TLSCALL(%eax)
CALL instruction is transparent to compiler which assumes all registers,
except for EFLAGS and RAX/EAX, are unchanged after CALL. When
_dl_tlsdesc_dynamic is called, it calls __tls_get_addr on the slow
path. __tls_get_addr is a normal function which doesn't preserve any
caller-saved registers. _dl_tlsdesc_dynamic saved and restored integer
caller-saved registers, but didn't preserve any other caller-saved
registers. Add _dl_tlsdesc_dynamic IFUNC functions for FNSAVE, FXSAVE,
XSAVE and XSAVEC to save and restore all caller-saved registers. This
fixes BZ #31372.
Add GLRO(dl_x86_64_runtime_resolve) with GLRO(dl_x86_tlsdesc_dynamic)
to optimize elf_machine_runtime_setup.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Systemd execution environment configuration may prohibit changing a memory
mapping to become executable:
MemoryDenyWriteExecute=
Takes a boolean argument. If set, attempts to create memory mappings
that are writable and executable at the same time, or to change existing
memory mappings to become executable, or mapping shared memory segments
as executable, are prohibited.
When it is set, systemd service stops working if PLT rewrite is enabled.
Check if mprotect works before rewriting PLT. This fixes BZ #31230.
This also works with SELinux when deny_execmem is on.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
1. Remove _dl_runtime_resolve_shstk and _dl_runtime_profile_shstk.
2. Move CET offsets from x86 cpu-features-offsets.sym to x86-64
features-offsets.sym.
3. Rename x86 cet-control.h to x86-64 feature-control.h since it is only
for x86-64 and also used for PLT rewrite.
4. Add x86-64 ldsodefs.h to include feature-control.h.
5. Change TUNABLE_CALLBACK (set_plt_rewrite) to x86-64 only.
6. Move x86 dl-procruntime.c to x86-64.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add ELF_DYNAMIC_AFTER_RELOC to allow target specific processing after
relocation.
For x86-64, add
#define DT_X86_64_PLT (DT_LOPROC + 0)
#define DT_X86_64_PLTSZ (DT_LOPROC + 1)
#define DT_X86_64_PLTENT (DT_LOPROC + 3)
1. DT_X86_64_PLT: The address of the procedure linkage table.
2. DT_X86_64_PLTSZ: The total size, in bytes, of the procedure linkage
table.
3. DT_X86_64_PLTENT: The size, in bytes, of a procedure linkage table
entry.
With the r_addend field of the R_X86_64_JUMP_SLOT relocation set to the
memory offset of the indirect branch instruction.
Define ELF_DYNAMIC_AFTER_RELOC for x86-64 to rewrite the PLT section
with direct branch after relocation when the lazy binding is disabled.
PLT rewrite is disabled by default since SELinux may disallow modifying
code pages and ld.so can't detect it in all cases. Use
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=1
to enable PLT rewrite with 32-bit direct jump at run-time or
$ export GLIBC_TUNABLES=glibc.cpu.plt_rewrite=2
to enable PLT rewrite with 32-bit direct jump and on APX processors with
64-bit absolute jump at run-time.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Not all CET enabled applications and libraries have been properly tested
in CET enabled environments. Some CET enabled applications or libraries
will crash or misbehave when CET is enabled. Don't set CET active by
default so that all applications and libraries will run normally regardless
of whether CET is active or not. Shadow stack can be enabled by
$ export GLIBC_TUNABLES=glibc.cpu.hwcaps=SHSTK
at run-time if shadow stack can be enabled by kernel.
NB: This commit can be reverted if it is OK to enable CET by default for
all applications and libraries.
Previously, CET was enabled by kernel before passing control to user
space and the startup code must disable CET if applications or shared
libraries aren't CET enabled. Since the current kernel only supports
shadow stack and won't enable shadow stack before passing control to
user space, we need to enable shadow stack during startup if the
application and all shared library are shadow stack enabled. There
is no need to disable shadow stack at startup. Shadow stack can only
be enabled in a function which will never return. Otherwise, shadow
stack will underflow at the function return.
1. GL(dl_x86_feature_1) is set to the CET features which are supported
by the processor and are not disabled by the tunable. Only non-zero
features in GL(dl_x86_feature_1) should be enabled. After enabling
shadow stack with ARCH_SHSTK_ENABLE, ARCH_SHSTK_STATUS is used to check
if shadow stack is really enabled.
2. Use ARCH_SHSTK_ENABLE in RTLD_START in dynamic executable. It is
safe since RTLD_START never returns.
3. Call arch_prctl (ARCH_SHSTK_ENABLE) from ARCH_SETUP_TLS in static
executable. Since the start function using ARCH_SETUP_TLS never returns,
it is safe to enable shadow stack in ARCH_SETUP_TLS.
Sync with Linux kernel 6.6 shadow stack interface. Since only x86-64 is
supported, i386 shadow stack codes are unchanged and CET shouldn't be
enabled for i386.
1. When the shadow stack base in TCB is unset, the default shadow stack
is in use. Use the current shadow stack pointer as the marker for the
default shadow stack. It is used to identify if the current shadow stack
is the same as the target shadow stack when switching ucontexts. If yes,
INCSSP will be used to unwind shadow stack. Otherwise, shadow stack
restore token will be used.
2. Allocate shadow stack with the map_shadow_stack syscall. Since there
is no function to explicitly release ucontext, there is no place to
release shadow stack allocated by map_shadow_stack in ucontext functions.
Such shadow stacks will be leaked.
3. Rename arch_prctl CET commands to ARCH_SHSTK_XXX.
4. Rewrite the CET control functions with the current kernel shadow stack
interface.
Since CET is no longer enabled by kernel, a separate patch will enable
shadow stack during startup.
This commit add support for the new AVX10 cpu features:
https://cdrdv2-public.intel.com/784267/355989-intel-avx10-spec.pdf
We add checks for:
- `AVX10`: Check if AVX10 is present.
- `AVX10_{X,Y,Z}MM`: Check if a given vec class has AVX10 support.
`make check` passes and cpuid output was checked against GNR/DMR on an
emulator.
Different systems prefer a different divisors.
From benchmarks[1] so far the following divisors have been found:
ICX : 2
SKX : 2
BWD : 8
For Intel, we are generalizing that BWD and older prefers 8 as a
divisor, and SKL and newer prefers 2. This number can be further tuned
as benchmarks are run.
[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
Reviewed-by: DJ Delorie <dj@redhat.com>
This patch should have no affect on existing functionality.
The current code, which has a single switch for model detection and
setting prefered features, is difficult to follow/extend. The cases
use magic numbers and many microarchitectures are missing. This makes
it difficult to reason about what is implemented so far and/or
how/where to add support for new features.
This patch splits the model detection and preference setting stages so
that CPU preferences can be set based on a complete list of available
microarchitectures, rather than based on model magic numbers.
Reviewed-by: DJ Delorie <dj@redhat.com>
Linux kernel uses AT_HWCAP2 to indicate if FSGSBASE instructions are
enabled. If the HWCAP2_FSGSBASE bit in AT_HWCAP2 is set, FSGSBASE
instructions can be used in user space. Define dl_check_hwcap2 to set
the FSGSBASE feature to active on Linux when the HWCAP2_FSGSBASE bit is
set.
Add a test to verify that FSGSBASE is active on current kernels.
NB: This test will fail if the kernel doesn't set the HWCAP2_FSGSBASE
bit in AT_HWCAP2 while fsgsbase shows up in /proc/cpuinfo.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
And make always supported. The configure option was added on glibc 2.25
and some features require it (such as hwcap mask, huge pages support, and
lock elisition tuning). It also simplifies the build permutations.
Changes from v1:
* Remove glibc.rtld.dynamic_sort changes, it is orthogonal and needs
more discussion.
* Cleanup more code.
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Crossing 2GB boundaries with indirect calls and jumps can use more
branch prediction resources on Intel Golden Cove CPU (see the
"Misprediction for Branches >2GB" section in Intel 64 and IA-32
Architectures Optimization Reference Manual.) There is visible
performance improvement on workloads with many PLT calls when executable
and shared libraries are mmapped below 2GB. Add the Prefer_MAP_32BIT_EXEC
bit so that mmap will try to map executable or denywrite pages in shared
libraries with MAP_32BIT first.
NB: Prefer_MAP_32BIT_EXEC reduces bits available for address space
layout randomization (ASLR), which is always disabled for SUID programs
and can only be enabled by the tunable, glibc.cpu.prefer_map_32bit_exec,
or the environment variable, LD_PREFER_MAP_32BIT_EXEC. This works only
between shared libraries or between shared libraries and executables with
addresses below 2GB. PIEs are usually loaded at a random address above
4GB by the kernel.
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
Remove Prefer_AVX2_STRCMP to enable EVEX strcmp. When comparing 2 32-byte
strings, EVEX strcmp has been improved to require 1 load, 1 VPTESTM, 1
VPCMP, 1 KMOVD and 1 INCL instead of 2 loads, 3 VPCMPs, 2 KORDs, 1 KMOVD
and 1 TESTL while AVX2 strcmp requires 1 load, 2 VPCMPEQs, 1 VPMINU, 1
VPMOVMSKB and 1 TESTL. EVEX strcmp is now faster than AVX2 strcmp by up
to 40% on Tiger Lake and Ice Lake.