The old Intel software developer manual specified that the low byte of
EAX of CPUID leaf 2 returned the number of rounds of CPUID leaf 2
needed to retrieve the complete cache information. The newer Intel
manual has been changed so that this byte should always return 1 and be
ignored. If the lower byte isn't 1, CPUID leaf 2 can't be used.
In this case, we ignore CPUID leaf 2 and use CPUID leaf 4 instead. If
CPUID leaf 4 doesn't contain the cache information, cache information
isn't available at all. This addresses BZ #30643.
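As a standalone illustration of the check described here (a sketch using
GCC's <cpuid.h> helper, not the glibc dl-cacheinfo code):
```
#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;
  /* Per the newer SDM the low byte of EAX for leaf 2 should always be 1;
     if it isn't, fall back to CPUID leaf 4 for cache information.  */
  if (__get_cpuid (2, &eax, &ebx, &ecx, &edx) && (eax & 0xff) == 1)
    puts ("CPUID leaf 2 usable for cache information");
  else
    puts ("fall back to CPUID leaf 4 (or no cache information)");
  return 0;
}
```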
(cherry picked from commit 1493622f4f)
Some hypervisors report 1 TiB L3 cache size. This results
in some variables incorrectly getting zeroed, causing crashes
in memcpy/memmove because invariants are violated.
(cherry picked from commit 61c3450db9)
The code:
```
if (shared_per_thread > 0 && threads > 0)
shared_per_thread /= threads;
```
was accidentally moved inside the else scope. This doesn't
match how it was previously (before af992e7abd).
This patch fixes that by putting the division after the `else` block.
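A minimal sketch of the intended placement (the branches and names below
are placeholders, not the actual dl-cacheinfo.h code):
```
long int
adjust_shared_per_thread (long int shared_per_thread, long int threads,
                          int is_amd)
{
  if (is_amd)
    {
      /* ... AMD-specific adjustments would go here ... */
    }
  else
    {
      /* ... non-AMD adjustments would go here ... */
    }
  /* The division sits after the else block, so it applies on both
     paths, as it did before af992e7abd.  */
  if (shared_per_thread > 0 && threads > 0)
    shared_per_thread /= threads;
  return shared_per_thread;
}
```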
(cherry picked from commit 084fb31bc2)
On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
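A sketch of reintroducing the old value as a floor, assuming the
placeholder names below (the actual logic lives in dl-cacheinfo.h):
```
#include <stddef.h>

size_t
apply_old_lower_bound (size_t non_temporal_threshold,
                       size_t shared_per_thread)
{
  /* The previous formula (roughly 3/4 of the per-thread share of L3),
     kept as a lower bound so incomplete cache information cannot drive
     the threshold below it.  */
  size_t old_bound = shared_per_thread * 3 / 4;
  if (non_temporal_threshold < old_bound)
    non_temporal_threshold = old_bound;
  return non_temporal_threshold;
}
```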
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 8b9a0af8ca)
After:
```
commit af992e7abd
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date: Wed Jun 7 13:18:01 2023 -0500
x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```
which split `shared` (cumulative cache size) from `shared_per_thread` (cache
size per socket), `shared_per_thread` *can* be slightly off from
the previous calculation.
Previously we added `core` even if `threads_l2` was invalid, and only
used `threads_l2` to divide `core` if it was present. The changed
version only included `core` if `threads_l2` was valid.
This change restores the old behavior if `threads_l2` is invalid by
adding the entire value of `core`.
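A sketch of the restored accumulation, with placeholder names (not the
exact dl-cacheinfo.h variables):
```
void
accumulate_core (long int core, long int threads_l2,
                 long int *shared, long int *shared_per_thread)
{
  /* `core` always contributes; it is only divided by `threads_l2`
     when that count is valid, matching the pre-af992e7abd behavior.  */
  *shared += core;
  *shared_per_thread += threads_l2 > 0 ? core / threads_l2 : core;
}
```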
Reviewed-by: DJ Delorie <dj@redhat.com>
(cherry picked from commit 47f7472178)
The current `non_temporal_threshold` is set to roughly `3/4 * sizeof_L3 /
ncores_per_socket`. This patch updates that value to roughly
`sizeof_L3 / 4`.
The original value (specifically dividing by `ncores_per_socket`) was
done to limit the amount of other threads' data a `memcpy`/`memset`
could evict.
Dividing by `ncores_per_socket`, however, leads to exceedingly low
non-temporal thresholds and to using non-temporal stores in
cases where REP MOVSB is multiple times faster.
Furthermore, non-temporal stores are written directly to main memory,
so using them at a size much smaller than L3 can place soon-to-be
accessed data much further away than it otherwise could be. As well,
modern machines are able to detect streaming patterns (especially if
REP MOVSB is used) and provide LRU hints to the memory subsystem. This
in effect caps the total amount of eviction at 1/cache_associativity,
far below meaningfully thrashing the entire cache.
As best I can tell, the benchmarks that led to this small threshold
were done comparing non-temporal stores versus standard cacheable
stores. A better comparison (linked below) is against REP MOVSB which,
on the measured systems, is nearly 2x faster than non-temporal stores
at the low end of the previous threshold, and within 10% for over
100MB copies (well past even the current threshold). In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.
The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.
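In rough terms, the old and new formulas look like the following sketch
(illustrative helpers only, not the actual dl-cacheinfo.h code):
```
#include <stddef.h>

size_t
old_threshold (size_t sizeof_L3, size_t ncores_per_socket)
{
  /* Previous rough formula: 3/4 of L3 divided across the cores.  */
  return sizeof_L3 * 3 / 4 / ncores_per_socket;
}

size_t
new_threshold (size_t sizeof_L3)
{
  /* New rough formula: total L3 over a fixed middle-ground divisor.  */
  return sizeof_L3 / 4;
}
```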
Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks
Sheets results (also available in pdf on the github):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit af992e7abd)
The minimum non_temporal_threshold is 0x4040. non_temporal_threshold may
be set to less than the minimum value when the shared cache size isn't
available (e.g., in an emulator) or by the tunable. Add checks for
minimum and maximum of non_temporal_threshold.
This fixes BZ #29953.
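A sketch of the clamping, using the minimum (0x4040) and maximum
(SIZE_MAX / 16) mentioned in these patches (placeholder function, not
the actual glibc code):
```
#include <stddef.h>
#include <stdint.h>

size_t
clamp_non_temporal_threshold (size_t value)
{
  /* Lower bound assumed by memmove-vec-unaligned-erms' unrolling;
     upper bound keeps the shift by LOG_4X_MEMCPY_THRESH in range.  */
  const size_t minimum = 0x4040;
  const size_t maximum = SIZE_MAX / 16;
  if (value < minimum)
    return minimum;
  if (value > maximum)
    return maximum;
  return value;
}
```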
(cherry picked from commit 48b74865c6)
The lower-bound (16448) and upper-bound (SIZE_MAX / 16) are assumed
by memmove-vec-unaligned-erms.
The lower-bound is needed because memmove-vec-unaligned-erms unrolls
the loop aggressively in the L(large_memset_4x) case.
The upper-bound is needed because memmove-vec-unaligned-erms
right-shifts the value of `x86_non_temporal_threshold` by
LOG_4X_MEMCPY_THRESH (4) which without a bound may overflow.
The lack of lower-bound can be a correctness issue. The lack of
upper-bound cannot.
(cherry picked from commit b446822b6a)
Move the setting of `rep_movsb_stop_threshold` to after the tunables
have been collected so that the `rep_movsb_stop_threshold` (which
is used to redirect control flow to the non_temporal case) will
use any user value for `non_temporal_threshold` (set using
glibc.cpu.x86_non_temporal_threshold).
(cherry picked from commit 0355915514)
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 7061 files FOO.
I then removed trailing white space from math/tgmath.h,
support/tst-support-open-dev-null-range.c, and
sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following
obscure pre-commit check failure diagnostics from Savannah. I don't
know why I run into these diagnostics whereas others evidently do not.
remote: *** 912-#endif
remote: *** 913:
remote: *** 914-
remote: *** error: lines with trailing whitespace found
...
remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
```
commit 2d651eb926
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri Sep 18 07:55:14 2020 -0700
x86: Move x86 processor cache info to cpu_features
```
missed _SC_LEVEL1_ICACHE_LINESIZE.
1. Add level1_icache_linesize to struct cpu_features.
2. Initialize level1_icache_linesize by calling handle_intel,
handle_zhaoxin and handle_amd with _SC_LEVEL1_ICACHE_LINESIZE.
3. Return level1_icache_linesize for _SC_LEVEL1_ICACHE_LINESIZE.
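For example, after this change a query like the following reports the
L1 instruction cache line size rather than a missing value:
```
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long int linesize = sysconf (_SC_LEVEL1_ICACHE_LINESIZE);
  printf ("L1 icache line size: %ld\n", linesize);
  return 0;
}
```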
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The TUNABLE_SET interface took a primitive C type argument, which
resulted in inconsistent type conversions internally due to incorrect
dereferencing of types, especially on 32-bit architectures. This
change simplifies the TUNABLE setting logic along with the interfaces.
Now all numeric tunable values are stored as signed numbers in
tunable_num_t, which is intmax_t. All calls to set tunables cast the
input value to its primitive type and then to tunable_num_t for
storage. This relies on gcc-specific (although I suspect other
compilers would also do the same) unsigned to signed integer conversion
semantics, i.e. the bit pattern is conserved. The reverse conversion
is guaranteed by the standard.
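A small standalone illustration of the conversion chain being relied on
(not glibc code): an unsigned value is cast to a signed intmax_t for
storage and recovered exactly by the reverse cast.
```
#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uint64_t value = UINT64_MAX - 41;        /* e.g. a large size_t tunable */
  intmax_t stored = (intmax_t) value;      /* implementation-defined; GCC keeps the bit pattern */
  uint64_t round_trip = (uint64_t) stored; /* signed-to-unsigned is defined by the standard */
  printf ("%s\n", value == round_trip ? "bit pattern preserved" : "changed");
  return 0;
}
```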
In the process of optimizing memcpy for AMD machines, we have found
that the vector move operations outperform enhanced REP MOVSB for data
transfers above the L2 cache size on Zen3 architectures.
To handle this use case, we are adding an upper bound parameter on
enhanced REP MOVSB: '__x86_rep_movsb_stop_threshold'.
Based on large-bench results, we are configuring this parameter to the
L2 cache size for AMD machines, applicable from the Zen3 architecture
onward when the ERMS feature is supported.
For architectures other than AMD, it is the computed value of the
non-temporal threshold parameter.
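A sketch of the selection just described, with placeholder predicate and
values (not the actual glibc variables):
```
#include <stddef.h>

size_t
choose_rep_movsb_stop_threshold (int is_amd_zen3_or_newer_with_erms,
                                 size_t l2_cache_size,
                                 size_t non_temporal_threshold)
{
  /* On AMD Zen3+ with ERMS, stop using enhanced REP MOVSB at the L2
     size; elsewhere fall back to the non-temporal threshold.  */
  if (is_amd_zen3_or_newer_with_erms)
    return l2_cache_size;
  return non_temporal_threshold;
}
```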
Reviewed-by: Premachandra Mallappa <premachandra.mallappa@amd.com>
In <sys/platform/x86.h>, define CPU features as enum instead of using
the C preprocessor magic to make it easier to wrap this functionality
in other languages. Move the C preprocessor magic to internal header
for better GCC codegen when more than one feature is checked in a
single expression, as in x86-64 dl-hwcaps-subdirs.c (a usage sketch
follows below).
1. Rename COMMON_CPUID_INDEX_XXX to CPUID_INDEX_XXX.
2. Move CPUID_INDEX_MAX to sysdeps/x86/include/cpu-features.h.
3. Remove struct cpu_features and __x86_get_cpu_features from
<sys/platform/x86.h>.
4. Add __x86_get_cpuid_feature_leaf to <sys/platform/x86.h> and put it
in libc.
5. Make __get_cpu_features() private to glibc.
6. Replace __x86_get_cpu_features(N) with __get_cpu_features().
7. Add _dl_x86_get_cpu_features to GLIBC_PRIVATE.
8. Use a single enum index for each CPU feature detection.
9. Pass the CPUID feature leaf to __x86_get_cpuid_feature_leaf.
10. Return zero struct cpuid_feature for the older glibc binary with a
smaller CPUID_INDEX_MAX [BZ #27104].
11. Inside glibc, use the C preprocessor magic so that cpu_features data
can be loaded just once leading to more compact code for glibc.
256 bits are used for each CPUID leaf. Some leaves only contain a few
features. We can add exceptions to such leaves, but it will increase
code size and make it harder to provide backward/forward compatibility
when new features are added to such leaves in the future.
When new leaves are added, _rtld_global_ro offsets will change, which
leads to a race condition during in-place updates. We may avoid in-place
updates by
1. Rename the old glibc.
2. Install the new glibc.
3. Remove the old glibc.
NB: A function, __x86_get_cpuid_feature_leaf, is used to avoid the copy
relocation issue with IFUNC resolver as shown in IFUNC resolver tests.
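A minimal usage sketch of the public interface; this assumes the
CPU_FEATURE_PRESENT macro documented for <sys/platform/x86.h> (the
macro spelling has varied across glibc releases):
```
#include <stdio.h>
#include <sys/platform/x86.h>

int
main (void)
{
  /* The feature macro resolves through the enum index and the
     __x86_get_cpuid_feature_leaf interface described above.  */
  if (CPU_FEATURE_PRESENT (AVX2))
    puts ("AVX2 present");
  else
    puts ("AVX2 not present");
  return 0;
}
```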
1. Move x86 processor cache info to _dl_x86_cpu_features in ld.so.
2. Update tunable bounds with TUNABLE_SET_WITH_BOUNDS.
3. Move x86 cache info initialization to dl-cacheinfo.h and initialize
x86 cache info in init_cpu_features ().
4. Put x86 cache info for libc in cacheinfo.h, which is included in
libc-start.c in libc.a and is included in cacheinfo.c in libc.so.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
I used these shell commands:
../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
(cd ../glibc && git commit -am"[this commit message]")
and then ignored the output, which consisted of lines saying "FOO: warning:
copyright statement not found" for each of 6694 files FOO.
I then removed trailing white space from benchtests/bench-pthread-locks.c
and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this
diagnostic from Savannah:
remote: *** pre-commit check failed ...
remote: *** error: lines with trailing whitespace found
remote: error: hook declined to update refs/heads/master
X86 CPU features in ld.so are initialized by init_cpu_features, which is
invoked by DL_PLATFORM_INIT from _dl_sysdep_start. But when ld.so is
loaded by a static executable, DL_PLATFORM_INIT is never called. Also
x86 cache info in libc.o and libc.a is initialized by a constructor
which may be called too late. Since some fields in _rtld_global_ro
in ld.so are initialized by dynamic relocation, we can also initialize
x86 CPU features in _rtld_global_ro in ld.so and cache info in libc.so
by initializing dummy function pointers in ld.so and libc.so via IFUNC
relocation (a generic illustration of this trick follows the key points
below).
Key points:
1. IFUNC is always supported, independent of --enable-multi-arch or
--disable-multi-arch. Linker generates IFUNC relocations from input
IFUNC objects and ld.so performs IFUNC relocations.
2. There are no IFUNC dependencies in ld.so before dynamic relocations
have been performed.
3. The x86 CPU features in ld.so are initialized by DL_PLATFORM_INIT
in a dynamic executable and by IFUNC relocation in dlopen in a static
executable.
4. The x86 cache info in libc.o is initialized by IFUNC relocation.
5. In libc.a, both x86 CPU features and cache info are initialized from
ARCH_INIT_CPU_FEATURES, not by IFUNC relocation, before __libc_early_init
is called.
Note: _dl_x86_init_cpu_features can be called more than once from
DL_PLATFORM_INIT and during relocation in ld.so.
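A generic, self-contained illustration of the IFUNC-relocation trick
described above (plain GCC ifunc usage with placeholder names, not the
actual glibc symbols):
```
#include <stdio.h>

static int initialized;

typedef void (*callback_t) (void);

static void
real_callback (void)
{
}

/* The resolver runs while the IFUNC relocation is processed, before
   main, so it can perform early initialization as a side effect.  */
static callback_t
resolve_callback (void)
{
  initialized = 1;
  return real_callback;
}

void dummy_callback (void) __attribute__ ((ifunc ("resolve_callback")));

int
main (void)
{
  printf ("initialized before main: %d\n", initialized);
  return 0;
}
```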