mirror of https://sourceware.org/git/glibc.git synced 2025-09-18 10:23:24 +03:00
Commit Graph

37010 Commits

Author SHA1 Message Date
H.J. Lu
68cca6e1e8 x86-64: Add GLIBC_ABI_DT_X86_64_PLT [BZ #33212]
When the linker -z mark-plt option is used to add DT_X86_64_PLT,
DT_X86_64_PLTSZ and DT_X86_64_PLTENT, the r_addend field of the
R_X86_64_JUMP_SLOT relocation stores the offset of the indirect
branch instruction.  However, glibc versions without the commit:

commit f8587a6189
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Fri May 20 19:21:48 2022 -0700

    x86-64: Ignore r_addend for R_X86_64_GLOB_DAT/R_X86_64_JUMP_SLOT

    According to x86-64 psABI, r_addend should be ignored for R_X86_64_GLOB_DAT
    and R_X86_64_JUMP_SLOT.  Since linkers always set their r_addends to 0, we
    can ignore their r_addends.

    Reviewed-by: Fangrui Song <maskray@google.com>

won't ignore the r_addend value in the R_X86_64_JUMP_SLOT relocation.
Such programs and shared libraries will fail randomly at run time.

Add GLIBC_ABI_DT_X86_64_PLT version to indicate that glibc is compatible
with DT_X86_64_PLT.

The linker can add the glibc GLIBC_ABI_DT_X86_64_PLT version dependency
whenever -z mark-plt is passed to the linker.  The resulting programs and
shared libraries will fail to load at run time against a libc.so without the
GLIBC_ABI_DT_X86_64_PLT version, instead of failing randomly.

This fixes BZ #33212.
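
As a rough, hedged illustration of the two behaviors (not glibc's actual
relocation code; names are illustrative):

```c
/* Sketch: how a dynamic loader applies R_X86_64_JUMP_SLOT.  Per the x86-64
   psABI, r_addend must be ignored; a loader that adds it computes a bogus
   branch target once -z mark-plt stores the PLT-entry offset there.  */
#include <elf.h>
#include <stdint.h>

static void
apply_jump_slot (uintptr_t load_base, const Elf64_Rela *reloc,
                 uintptr_t symbol_value)
{
  uint64_t *got_entry = (uint64_t *) (load_base + reloc->r_offset);

  *got_entry = symbol_value;   /* correct: r_addend ignored */
  /* Pre-f8587a6189 behavior, broken under -z mark-plt:
     *got_entry = symbol_value + reloc->r_addend;  */
}
```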

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit 399384e0c8)
2025-08-18 01:06:48 +01:00
Florian Weimer
1ec499b165 posix: Fix double-free after allocation failure in regcomp (bug 33185)
If a memory allocation failure occurs during bracket expression
parsing in regcomp, a double-free error may result.
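
The hazard is the classic pattern sketched below (illustrative only, not
the actual regcomp code): an error path frees a buffer but leaves the
pointer set, so the caller's cleanup frees it again.

```c
#include <stdlib.h>

struct parse_state { char *work_area; };

static int
grow_work_area (struct parse_state *st)
{
  char *bigger = realloc (st->work_area, 1024);
  if (bigger == NULL)
    {
      free (st->work_area);   /* freed here on allocation failure ...  */
      st->work_area = NULL;   /* ... the fix: clear the stale pointer  */
      return -1;
    }
  st->work_area = bigger;
  return 0;
}

static void
free_state (struct parse_state *st)
{
  free (st->work_area);       /* without the clear: double free */
}
```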

Reported-by: Anastasia Belova <abelova@astralinux.ru>
Co-authored-by: Paul Eggert <eggert@cs.ucla.edu>
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
(cherry picked from commit 7ea06e9940)
2025-07-24 12:11:24 +02:00
Stefan Liebler
5f08d1df2c s390x: Fix segfault in wcsncmp [BZ #31934]
The z13/vector-optimized wcsncmp implementation segfaults if n=1
and there is only one character (equal in both strings) before
the page end.  It then loads and compares one character but fails
to check n again, and the following load faults.

This patch removes the extra load and compare of the first character
and starts directly with the loop that uses vector-load-to-block-boundary.
This code path also checks n.
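
Schematically, the broken and fixed control flow differ as in this hedged C
sketch (the real code is s390x vector assembly):

```c
#include <stddef.h>
#include <wchar.h>

static int
wcsncmp_sketch (const wchar_t *a, const wchar_t *b, size_t n)
{
  if (n == 0)
    return 0;
  /* Broken variant: compare a[0]/b[0] here, then enter the loop without
     rechecking n, so the next load may cross into an unmapped page.  */
  for (size_t i = 0; i < n; i++)   /* fixed: n is checked before each load */
    if (a[i] != b[i] || a[i] == L'\0')
      return (a[i] > b[i]) - (a[i] < b[i]);
  return 0;
}
```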

With this patch both tests are passing:
- the simplified one mentioned in the bugzilla 31934
- the full one in Florian Weimer's patch:
"manual: Document a GNU extension for strncmp/wcsncmp"
(https://patchwork.sourceware.org/project/glibc/patch/874j9eml6y.fsf@oldenburg.str.redhat.com/):
On s390x-linux-gnu (z16), the new wcsncmp test fails due to bug 31934.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit 9b76514103)
2024-07-16 10:34:50 +02:00
H.J. Lu
d0916db233 Force DT_RPATH for --enable-hardcoded-path-in-tests
On Fedora 40/x86-64, the linker enables --enable-new-dtags by default, which
generates DT_RUNPATH instead of DT_RPATH.  Unlike DT_RPATH, DT_RUNPATH
applies only to DT_NEEDED entries in the executable and does not apply
to DT_NEEDED entries in shared libraries which are loaded via DT_NEEDED
entries in the executable.  Some glibc tests have libstdc++.so.6 in
DT_NEEDED, which in turn has libm.so.6 in DT_NEEDED.  When DT_RUNPATH is
generated, /lib64/libm.so.6 is loaded for such tests.  If the newly built
glibc is older than glibc 2.36, these tests fail with

assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_2.36' not found (required by /lib64/libm.so.6)
assert/tst-assert-c++: /export/build/gnu/tools-build/glibc-gitlab-release/build-x86_64-linux/libc.so.6: version `GLIBC_ABI_DT_RELR' not found (required by /lib64/libm.so.6)

Pass -Wl,--disable-new-dtags to linker when building glibc tests with
--enable-hardcoded-path-in-tests.  This fixes BZ #31719.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 2dcaf70643)
2024-05-10 05:46:28 -07:00
Florian Weimer
430e825909 elf: Disable some subtests of ifuncmain1, ifuncmain5 for !PIE
(cherry picked from commit 9cc9d61ee1)
2024-05-09 16:47:01 -07:00
Florian Weimer
1e398f406b nscd: Use time_t for return type of addgetnetgrentX
Using int may give false results for future dates (timeouts after the
year 2038).
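
A hedged illustration of the truncation (not the nscd code):

```c
#include <stdio.h>
#include <time.h>

/* Returning a time_t through `int` truncates timeouts past the 32-bit
   limit (2^31 - 1 seconds after the epoch, i.e. January 2038).  */
static int    timeout_int    (time_t t) { return t; }
static time_t timeout_time_t (time_t t) { return t; }

int
main (void)
{
  time_t deadline = (time_t) 0x80000000;   /* first second past the limit */
  printf ("as int:    %d\n", timeout_int (deadline));       /* negative */
  printf ("as time_t: %lld\n", (long long) timeout_time_t (deadline));
  return 0;
}
```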

Fixes commit c04a21e050d64a1193a6daab872bca2528bda44b ("CVE-2024-33601,
CVE-2024-33602: nscd: netgroup: Use two buffers in addgetnetgrentX
(bug 31680)").

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 4bbca1a446)
2024-05-03 09:22:44 +02:00
Florian Weimer
4d27d4b9a1 CVE-2024-33601, CVE-2024-33602: nscd: netgroup: Use two buffers in addgetnetgrentX (bug 31680)
This avoids potential memory corruption when the underlying NSS
callback function does not use the buffer space to store all strings
(e.g., for constant strings).

Instead of custom buffer management, two scratch buffers are used.
This increases stack usage somewhat.

Scratch buffer allocation failure is handled by returning -1
(an invalid timeout value) instead of terminating the process.
This fixes bug 31679.
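
For context, a hedged sketch of the scratch-buffer pattern the fix adopts;
<scratch_buffer.h> is a glibc-internal header, and the surrounding logic
here is illustrative:

```c
#include <stdbool.h>
#include <scratch_buffer.h>   /* glibc-internal, not installed */

static long
two_buffer_sketch (void)
{
  struct scratch_buffer inner, outer;
  scratch_buffer_init (&inner);   /* starts with on-stack storage */
  scratch_buffer_init (&outer);

  if (!scratch_buffer_set_array_size (&inner, 64, sizeof (char *)))
    {
      scratch_buffer_free (&inner);
      scratch_buffer_free (&outer);
      return -1;                  /* invalid timeout, not termination */
    }

  /* ... run the NSS callback against inner.data / inner.length ... */

  scratch_buffer_free (&inner);
  scratch_buffer_free (&outer);
  return 0;
}
```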

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit c04a21e050)
2024-04-25 16:07:52 +02:00
Florian Weimer
e3eef1b8fb CVE-2024-33600: nscd: Avoid null pointer crashes after notfound response (bug 31678)
The addgetnetgrentX call in addinnetgrX may have failed to produce
a result, so the result variable in addinnetgrX can be NULL.
Use db->negtimeout as the fallback value if there is no result data;
the timeout is also overwritten below.

Also avoid sending a second not-found response.  (The client
disconnects after receiving the first response, so the data stream did
not go out of sync even without this fix.)  It is still beneficial to
add the negative response to the mapping, so that the client can get
it from there in the future, instead of going through the socket.
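
The NULL-result guard amounts to something like this hedged fragment
(structure and field names are illustrative, not nscd's):

```c
#include <stddef.h>
#include <time.h>

struct dataset_sketch  { time_t timeout; };
struct database_sketch { time_t negtimeout; };

static time_t
choose_timeout (const struct dataset_sketch *result,
                const struct database_sketch *db)
{
  /* Never dereference a NULL result; fall back to the database's
     negative-entry timeout.  */
  return result != NULL ? result->timeout : db->negtimeout;
}
```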

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit b048a482f0)
2024-04-25 16:07:52 +02:00
Florian Weimer
f20a8d696b CVE-2024-33600: nscd: Do not send missing not-found response in addgetnetgrentX (bug 31678)
If we failed to add a not-found response to the cache, the dataset
pointer can be null, resulting in a null pointer dereference.

Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
(cherry picked from commit 7835b00dbc)
2024-04-25 16:07:52 +02:00
Florian Weimer
5c75001a96 CVE-2024-33599: nscd: Stack-based buffer overflow in netgroup cache (bug 31677)
Using alloca matches what other caches do.  The request length is
bounded by MAXKEYLEN.
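
The bounded-alloca pattern this refers to looks roughly like the following
(hedged sketch; the request handling is illustrative):

```c
#include <alloca.h>
#include <string.h>
#include <stddef.h>

#define MAXKEYLEN 1024   /* nscd bounds request keys; shown for illustration */

static void
handle_request (const char *key, size_t key_len)
{
  if (key_len > MAXKEYLEN)
    return;                             /* oversized requests rejected */
  char *key_copy = alloca (key_len);    /* safe: length is bounded */
  memcpy (key_copy, key, key_len);
  /* ... process key_copy ... */
}
```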

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 87801a8fd0)
2024-04-25 16:07:52 +02:00
Charles Fol
ed4f16ff6b iconv: ISO-2022-CN-EXT: fix out-of-bound writes when writing escape sequence (CVE-2024-2961)
ISO-2022-CN-EXT uses escape sequences to indicate character set changes
(as specified by RFC 1922).  While the SOdesignation case has the expected
bounds checks, neither SS2designation nor SS3designation has them,
allowing a write overflow of 1, 2, or 3 bytes with fixed values:
'$+I', '$+J', '$+K', '$+L', '$+M', or '$*H'.
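
The fixed pattern is, in hedged outline (names illustrative, not the actual
iconv converter code):

```c
#include <stddef.h>

/* Before emitting a multi-byte designation escape such as "\x1b$+I",
   verify the output buffer has room instead of writing past its end.  */
static int
emit_designation (unsigned char **outp, unsigned char *outend,
                  const unsigned char esc[4])
{
  if (outend - *outp < 4)      /* the missing bounds check */
    return -1;                 /* caller must flush and retry */
  for (int i = 0; i < 4; i++)
    *(*outp)++ = esc[i];
  return 0;
}
```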

Checked on aarch64-linux-gnu.

Co-authored-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit f9dc609e06)
2024-04-17 14:03:53 -03:00
Wilco Dijkstra
91ac82d0c6 aarch64: Use memcpy_simd as the default memcpy
Since __memcpy_simd is the fastest memcpy on almost all cores, replace
the generic memcpy with it.

(cherry picked from commit e6f3fe362f)
2024-04-09 18:52:29 +01:00
Andreas Schwab
4818620ee2 aarch64: correct CFI in rawmemchr (bug 31113)
The .cfi_return_column directive changes the return column for the whole
FDE range.  But the actual intent is to tell the unwinder that the value
in x30 (lr) now resides in x15 after the move, and that is expressed by
the .cfi_register directive.

(cherry picked from commit 3f79842788)
2024-04-09 18:43:49 +01:00
Wilco Dijkstra
062d8a648f AArch64: Improve strrchr
Use shrn for narrowing the mask which simplifies code and speeds up small
strings.  Unroll the first search loop to improve performance on large
strings.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 55599d4804)
2024-04-09 18:41:06 +01:00
Wilco Dijkstra
3527d31d0b AArch64: Optimize strnlen
Optimize strnlen using the shrn instruction and improve the main loop.
Small strings are around 10% faster, large strings are 40% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ad098893ba)
2024-04-09 18:40:41 +01:00
Wilco Dijkstra
12784e1d82 AArch64: Optimize strlen
Optimize strlen by unrolling the main loop.  Large strings are 64% faster on
modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 03c8ce5000)
2024-04-09 18:40:17 +01:00
Wilco Dijkstra
b01d60fdef AArch64: Optimize strcpy
Unroll the main loop.  Large strings are around 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 349e48c01e)
2024-04-09 18:39:53 +01:00
Wilco Dijkstra
7d628ab7cf AArch64: Improve strchrnul
Unroll the main loop, which improves performance slightly.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 09ebd8549b)
2024-04-09 18:39:28 +01:00
Wilco Dijkstra
a7eae18db2 AArch64: Optimize strchr
Simplify calculation of the mask using shrn.  Unroll the main loop.
Small strings are 20% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 51541a2297)
2024-04-09 18:38:42 +01:00
Wilco Dijkstra
97b1c803cb AArch64: Improve strlen_asimd
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment.
Performance improves slightly as a result.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 1bbb1a2022)
2024-04-09 18:38:04 +01:00
Wilco Dijkstra
0604c9d255 AArch64: Optimize memrchr
Optimize the main loop - large strings are 43% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 0077624177)
2024-04-09 18:37:38 +01:00
Wilco Dijkstra
50ebfdfe8a AArch64: Optimize memchr
Optimize the main loop - large strings are 40% faster on modern CPUs.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit ce758d4f06)
2024-04-09 18:37:21 +01:00
Danila Kutenin
fa9cd4fd3c aarch64: Optimize string functions with shrn instruction
We found that string functions were using AND+ADDP
to find the nibble/syndrome mask, but there is an easier
approach: `SHRN dst.8b, src.8h, 4` (shift right every
2 bytes by 4 and narrow to 1 byte), which has the same
latency on all SIMD ARMv8 targets as ADDP.  There are
also possible gains for memcmp, but that is for
another patch.

We see 10-20% savings for small and mid-size cases (<= 128 bytes),
which are the primary cases for general workloads.
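
With NEON intrinsics, the trick looks roughly like this (hedged
illustration; the string routines themselves are hand-written assembly):

```c
#include <arm_neon.h>
#include <stdint.h>

/* One `shrn` turns a 16-byte comparison mask (0xff per matching byte)
   into a 64-bit nibble syndrome, replacing the AND+ADDP sequence.  */
static uint64_t
nibble_syndrome (uint8x16_t cmp)
{
  uint8x8_t narrowed = vshrn_n_u16 (vreinterpretq_u16_u8 (cmp), 4);
  return vget_lane_u64 (vreinterpret_u64_u8 (narrowed), 0);
}
```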

(cherry picked from commit 3c99806989)
2024-04-09 18:30:02 +01:00
Wilco Dijkstra
ea547d896d AArch64: Optimize memcmp
Rewrite memcmp to improve performance. On small and medium inputs performance
is 10-20% better. Large inputs use a SIMD loop processing 64 bytes per
iteration, which is 30-50% faster depending on the size.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit b51eb35c57)
2024-04-09 18:28:20 +01:00
Wilco Dijkstra
ce54f78828 AArch64: Improve strnlen performance
Optimize strnlen by avoiding UMINV which is slow on most cores. On Neoverse N1
large strings are 1.8x faster than the current version, and bench-strnlen is
50% faster overall. This version is MTE compatible.

Reviewed-by: Szabolcs Nagy  <szabolcs.nagy@arm.com>
(cherry picked from commit 252cad02d4)
2024-04-09 18:28:11 +01:00
Sunil K Pandey
3b8d158b38 x86_64: Optimize ffsll function code size.
The ffsll function randomly regresses by ~20%, depending on how the code
gets aligned in memory.  The ffsll function code size is 17 bytes.  Since
the default function alignment is 16 bytes, it can be loaded at 16-, 32-,
48- or 64-byte aligned memory.  When the ffsll function is loaded at 16-,
32- or 64-byte aligned memory, the entire code fits in a single 64-byte
cache line.  When it is loaded at 48-byte aligned memory, it is split
across two cache lines (48 + 17 = 65 bytes crosses the 64-byte boundary),
hence the random regression.

Reducing the ffsll function size from 17 bytes to 12 bytes ensures that it
always fits in a single 64-byte cache line (48 + 12 = 60 bytes).

This patch fixes the random performance regression of ffsll.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 9d94997b5f)
2024-01-31 18:56:23 -08:00
Noah Goldstein
1a200935e1 x86: Fix incorrect scope of setting shared_per_thread [BZ# 30745]
The code:

```
    if (shared_per_thread > 0 && threads > 0)
      shared_per_thread /= threads;
```

was accidentally moved inside the else scope, which does not match the
previous behavior (before af992e7abd).

This patch fixes that by putting the division after the `else` block.
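
Schematically (illustrative condition and variable names, not the real
cacheinfo code):

```c
/* The guarded division must execute on both branches, so it belongs
   after the if/else, not inside the else scope.  */
static long
shared_per_thread_sketch (long per_core, long shared, int has_topology,
                          long threads)
{
  long shared_per_thread = has_topology ? per_core : shared;

  if (shared_per_thread > 0 && threads > 0)   /* after the else block */
    shared_per_thread /= threads;

  return shared_per_thread;
}
```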

(cherry picked from commit 084fb31bc2)
2023-09-11 22:47:08 -05:00
Noah Goldstein
6c8475a3f7 x86: Use 3/4*sizeof(per-thread-L3) as low bound for NT threshold.
On some machines we end up with incomplete cache information. This can
make the new calculation of `sizeof(total-L3)/custom-divisor` end up
lower than intended (and lower than the prior value). So reintroduce
the old bound as a lower bound to avoid potentially regressing code
where we don't have complete information to make the decision.
Reviewed-by: DJ Delorie <dj@redhat.com>

(cherry picked from commit 8b9a0af8ca)
2023-09-11 22:47:08 -05:00
Noah Goldstein
31b06441f9 x86: Fix slight bug in shared_per_thread cache size calculation.
After:
```
    commit af992e7abd
    Author: Noah Goldstein <goldstein.w.n@gmail.com>
    Date:   Wed Jun 7 13:18:01 2023 -0500

        x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
```

split `shared` (cumulative cache size) from `shared_per_thread` (cache
size per socket), `shared_per_thread` *can* be slightly off from
the previous calculation.

Previously we added `core` even if `threads_l2` was invalid, and only
used `threads_l2` to divide `core` if it was present. The changed
version only included `core` if `threads_l2` was valid.

This change restores the old behavior if `threads_l2` is invalid by
adding the entire value of `core`.
Reviewed-by: DJ Delorie <dj@redhat.com>

(cherry picked from commit 47f7472178)
2023-09-11 22:47:08 -05:00
Noah Goldstein
5e2d2d7cbd x86: Increase non_temporal_threshold to roughly sizeof_L3 / 4
Currently `non_temporal_threshold` is set to roughly '3/4 * sizeof_L3 /
ncores_per_socket'.  This patch updates that value to roughly
'sizeof_L3 / 4'.

The original value (specifically dividing the `ncores_per_socket`) was
done to limit the amount of other threads' data a `memcpy`/`memset`
could evict.

Dividing by 'ncores_per_socket', however, leads to exceedingly low
non-temporal thresholds and to using non-temporal stores in
cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory,
so using them at a size much smaller than L3 can place soon-to-be-accessed
data much further away than it otherwise would be.  As well,
modern machines are able to detect streaming patterns (especially if
REP MOVSB is used) and provide LRU hints to the memory subsystem.  This
in effect caps the total amount of eviction at 1/cache_associativity,
far below meaningfully thrashing the entire cache.

As best I can tell, the benchmarks that led to this small threshold
were done comparing non-temporal stores with standard cacheable
stores.  A better comparison (linked below) is against REP MOVSB, which,
on the measured systems, is nearly 2x faster than non-temporal stores
at the low end of the previous threshold, and within 10% for copies over
100MB (well past even the current threshold).  In cases with a
low number of threads competing for bandwidth, REP MOVSB is ~2x faster
up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it
seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs
such as Broadwell prefer something closer to `8`. This patch is meant
to be followed up by another one to make the divisor cpu-specific, but
in the meantime (and for easier backporting), this patch settles on
`4` as a middle-ground.
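
With a hypothetical 32 MiB L3 shared by 16 cores, the change works out as
follows (hedged arithmetic only; glibc derives the real values from the
detected cache topology):

```c
#include <stdio.h>

int
main (void)
{
  long l3 = 32L * 1024 * 1024;   /* hypothetical L3 size */
  long ncores_per_socket = 16;   /* hypothetical core count */

  long old_threshold = 3 * l3 / 4 / ncores_per_socket;   /* 1.5 MiB */
  long new_threshold = l3 / 4;                           /* 8 MiB */

  printf ("old: %ld KiB, new: %ld KiB\n",
          old_threshold / 1024, new_threshold / 1024);
  return 0;
}
```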

Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable
stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Sheets results (also available as a PDF on the GitHub page):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>

(cherry picked from commit af992e7abd)
2023-09-11 22:47:08 -05:00
Florian Weimer
aef97dc952 debug: Mark libSegFault.so as NODELETE
The signal handler installed in the ELF constructor cannot easily
be removed again (because the program may have changed handlers
in the meantime).  Mark the object as NODELETE so that the registered
handler function is never unloaded.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
(cherry picked from commit 23ee92deea)
2023-07-21 16:39:46 +02:00
H.J. Lu
eed27b3e46 Document BZ #20975 fix
2023-05-23 16:47:45 -07:00
H.J. Lu
24302748fc __check_pf: Add a cancellation cleanup handler [BZ #20975]
There are reports of hangs in __check_pf:

https://github.com/JoeDog/siege/issues/4

It is reproducible only under specific configurations:

1. Large number of cores (>= 64) and large number of threads (> 3X the
number of cores) with long-lived socket connections.
2. Low power (frequency) mode.
3. Power management is enabled.

While holding the lock, __check_pf calls make_request, which calls __sendto
and __recvmsg.  Since __sendto and __recvmsg are cancellation points, the
lock held by __check_pf won't be released, which can cause a deadlock when
thread cancellation happens in __sendto or __recvmsg.  Add a cancellation
cleanup handler for __check_pf to unlock the lock when it is cancelled by
another thread.  This fixes BZ #20975 and the siege hang issue.
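
The shape of the fix, as a hedged sketch with public pthread APIs (names
illustrative; the real code uses glibc-internal locking):

```c
#include <pthread.h>

static pthread_mutex_t pf_lock = PTHREAD_MUTEX_INITIALIZER;

static void
unlock_pf (void *arg)
{
  pthread_mutex_unlock (arg);
}

static void
check_pf_like (void)
{
  pthread_mutex_lock (&pf_lock);
  pthread_cleanup_push (unlock_pf, &pf_lock);
  /* ... make_request: sendto/recvmsg may act on a pending cancellation,
     unwinding through here and running unlock_pf ... */
  pthread_cleanup_pop (1);   /* 1: also unlock on normal exit */
}
```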

(cherry picked from commit a443bd3fb2)
2023-05-23 16:06:42 -07:00
Noah Goldstein
017b9df5ea x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]
The previous implementation adjusted the length (rsi) to match the byte
count (eax), but since there is no bound on the length this can cause
overflow.

The fix is to convert the byte count (eax) to a length by dividing by
sizeof (wchar_t) before the comparison.
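
In effect (hedged sketch; the real code is AVX2 assembly):

```c
#include <stddef.h>

/* Divide the byte count down to a character count and compare in wchar_t
   units, rather than scaling the unbounded length up to bytes.  */
static int
reached_maxlen (size_t byte_count, size_t maxlen)
{
  return byte_count / sizeof (wchar_t) >= maxlen;
}
```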

Full check passes on x86-64 and build succeeds w/ and w/o multiarch.

(cherry picked from commit b0969fa53a)
2022-11-24 15:12:44 -08:00
Aurelien Jarno
67e310afb9 x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations
The AVX2 strrchr and wcsrchr implementations use the 'blsmsk'
instruction, which belongs to the BMI1 CPU feature, and the 'shrx'
instruction, which belongs to the BMI2 CPU feature.

Fixes: df7e295d18 ("x86: Optimize {str|wcs}rchr-avx2")
Partially resolves: BZ #29611
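
In terms of glibc's public x86 feature API, the required check looks
roughly like this (hedged sketch; the real selection happens in the ifunc
resolvers, and the same pattern applies to the BMI fixes below):

```c
#include <stdbool.h>
#include <sys/platform/x86.h>   /* glibc >= 2.33 */

static bool
can_use_avx2_strrchr (void)
{
  return CPU_FEATURE_ACTIVE (AVX2)
         && CPU_FEATURE_ACTIVE (BMI1)    /* 'blsmsk' */
         && CPU_FEATURE_ACTIVE (BMI2);   /* 'shrx' */
}
```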

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 7e8283170c)
2022-10-04 00:03:20 +02:00
Aurelien Jarno
38e321f4ac x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation
The AVX2 memrchr implementation uses the 'shlxl' instruction, which
belongs to the BMI2 CPU feature, and the 'lzcnt' instruction, which
belongs to the LZCNT CPU feature.

Fixes: af5306a735 ("x86: Optimize memrchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 3c0c78afab)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
068c8d5aa9 x86-64: Require BMI2 for AVX2 (raw|w)memchr implementations
The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi'
and 'sarx' instructions, which belong to the BMI2 CPU feature.

Fixes: acfd088a19 ("x86: Optimize memchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit e3e7fab7fe)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
35f655c890 x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations
The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
if the CPU doesn't support TZCNT, and produces the same result for
non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit f31a5a884e)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
050d652900 x86-64: Require BMI2 for AVX2 strncmp implementation
The AVX2 strncmp implementation uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
if the CPU doesn't support TZCNT, and produces the same result for
non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit fc7de1d9b9)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
26d4359b44 x86-64: Require BMI2 for AVX2 strcmp implementation
The AVX2 strcmp implementation uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
if the CPU doesn't support TZCNT, and produces the same result for
non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 4d64c64457)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
007581a1cc x86-64: Require BMI2 for AVX2 str(n)casecmp implementations
The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
if the CPU doesn't support TZCNT, and produces the same result for
non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit 10f79d3670)
2022-10-04 00:03:19 +02:00
Aurelien Jarno
1480535aa3 x86: include BMI1 and BMI2 in x86-64-v3 level
The "System V Application Binary Interface AMD64 Architecture Processor
Supplement" mandates the BMI1 and BMI2 CPU features for the x86-64-v3
level.

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
(cherry picked from commit b80f16adbd)
2022-10-04 00:03:19 +02:00
Wangyang Guo
ea326b2e98 nptl: Add backoff mechanism to spinlock loop
When multiple threads are waiting for a lock at the same time, once the
lock owner releases it, the waiters all see the lock as available and try
to acquire it, which may cause an expensive CAS storm.

Binary exponential backoff with random jitter is introduced.  As try-lock
attempts increase, it becomes more likely that a larger number of threads
are competing for the adaptive mutex lock, so the wait time is increased
exponentially.  Random jitter is also added to avoid synchronized try-lock
attempts from other threads.
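
A hedged sketch of the idea; the actual implementation lives in
sysdeps/nptl/pthread_mutex_backoff.h and is arch-dependent:

```c
#include <stdint.h>

static inline uint32_t
cheap_rand (uint32_t *state)
{
  *state = *state * 1103515245u + 12345u;   /* fast LCG jitter, not crypto */
  return *state >> 16;
}

/* Each failed try-lock doubles the backoff window (up to a cap) and waits
   a random number of spins within it.  */
static inline void
backoff_spin (unsigned int *shift, uint32_t *rng)
{
  unsigned int limit = 1u << *shift;
  unsigned int spins = 1 + cheap_rand (rng) % limit;    /* random jitter */
  for (volatile unsigned int i = 0; i < spins; i++)
    ;                              /* real code uses a cpu pause hint */
  if (*shift < 10)                 /* cap bounds worst-case latency */
    ++*shift;                      /* exponential growth per attempt */
}
```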

v2: Remove read-check before try-lock for performance.

v3:
1. Restore the read-check since it works well on some platforms.
2. Make the backoff arch-dependent, and enable it for x86_64.
3. Limit the max backoff to reduce latency in large critical sections.

v4: Fix strict-prototypes error in sysdeps/nptl/pthread_mutex_backoff.h

v5: Commit log updated for regression in large critical section.

Result of pthread-mutex-locks bench

Test Platform: Xeon 8280L (2 socket, 112 CPUs in total)
First Row: thread number
First Col: critical section length
Values: backoff vs upstream, time based, low is better

non-critical-length: 1
	1	2	4	8	16	32	64	112	140
0	0.99	0.58	0.52	0.49	0.43	0.44	0.46	0.52	0.54
1	0.98	0.43	0.56	0.50	0.44	0.45	0.50	0.56	0.57
2	0.99	0.41	0.57	0.51	0.45	0.47	0.48	0.60	0.61
4	0.99	0.45	0.59	0.53	0.48	0.49	0.52	0.64	0.65
8	1.00	0.66	0.71	0.63	0.56	0.59	0.66	0.72	0.71
16	0.97	0.78	0.91	0.73	0.67	0.70	0.79	0.80	0.80
32	0.95	1.17	0.98	0.87	0.82	0.86	0.89	0.90	0.90
64	0.96	0.95	1.01	1.01	0.98	1.00	1.03	0.99	0.99
128	0.99	1.01	1.01	1.17	1.08	1.12	1.02	0.97	1.02

non-critical-length: 32
	1	2	4	8	16	32	64	112	140
0	1.03	0.97	0.75	0.65	0.58	0.58	0.56	0.70	0.70
1	0.94	0.95	0.76	0.65	0.58	0.58	0.61	0.71	0.72
2	0.97	0.96	0.77	0.66	0.58	0.59	0.62	0.74	0.74
4	0.99	0.96	0.78	0.66	0.60	0.61	0.66	0.76	0.77
8	0.99	0.99	0.84	0.70	0.64	0.66	0.71	0.80	0.80
16	0.98	0.97	0.95	0.76	0.70	0.73	0.81	0.85	0.84
32	1.04	1.12	1.04	0.89	0.82	0.86	0.93	0.91	0.91
64	0.99	1.15	1.07	1.00	0.99	1.01	1.05	0.99	0.99
128	1.00	1.21	1.20	1.22	1.25	1.31	1.12	1.10	0.99

non-critical-length: 128
	1	2	4	8	16	32	64	112	140
0	1.02	1.00	0.99	0.67	0.61	0.61	0.61	0.74	0.73
1	0.95	0.99	1.00	0.68	0.61	0.60	0.60	0.74	0.74
2	1.00	1.04	1.00	0.68	0.59	0.61	0.65	0.76	0.76
4	1.00	0.96	0.98	0.70	0.63	0.63	0.67	0.78	0.77
8	1.01	1.02	0.89	0.73	0.65	0.67	0.71	0.81	0.80
16	0.99	0.96	0.96	0.79	0.71	0.73	0.80	0.84	0.84
32	0.99	0.95	1.05	0.89	0.84	0.85	0.94	0.92	0.91
64	1.00	0.99	1.16	1.04	1.00	1.02	1.06	0.99	0.99
128	1.00	1.06	0.98	1.14	1.39	1.26	1.08	1.02	0.98

There is a regression for large critical sections, but adaptive mutexes
are aimed at "quick" locks.  Small critical sections are the common case
when users choose the adaptive pthread_mutex.

Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 8162147872)
2022-09-28 15:36:46 -07:00
Noah Goldstein
870d564e7c sysdeps: Add 'get_fast_jitter' interface in fast-jitter.h
'get_fast_jitter' is meant to be used purely for performance
purposes.  In all cases where it is used, it should be acceptable to get
no randomness (see the default case).  An example use case is setting
jitter for retries between threads at a lock.  There is a
performance benefit to having jitter, but only if the jitter can
be generated very quickly, and ultimately there is no serious issue
if no jitter is generated.

The implementation generally uses 'HP_TIMING_NOW' iff it is
inlined (to avoid any potential syscall paths).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

(cherry picked from commit 911c63a51c)
2022-09-28 13:53:11 -07:00
Jangwoong Kim
8b9ad5ec93 nptl: Effectively skip CAS in spinlock loop
The commit:
"Add LLL_MUTEX_READ_LOCK [BZ #28537]"
SHA1: d672a98a1a

introduced LLL_MUTEX_READ_LOCK to skip the CAS in the spinlock loop
if the atomic load fails.  But "continue" inside a do-while loop
does not skip the evaluation of the controlling expression, so the
CAS was not skipped.

Replace the do-while with a while loop and skip LLL_MUTEX_TRYLOCK if
LLL_MUTEX_READ_LOCK fails.
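
The pitfall in miniature, as a hedged C11-atomics sketch (not the nptl
macros):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int lock_word;

static bool
try_lock (void)   /* stands in for LLL_MUTEX_TRYLOCK (a CAS) */
{
  int expected = 0;
  return atomic_compare_exchange_strong (&lock_word, &expected, 1);
}

static void
spin_lock_sketch (void)
{
  /* Broken shape:
       do { if (lock_looks_taken) continue; } while (!try_lock ());
     `continue` jumps to the while condition, so the CAS still runs.  */
  while (!try_lock ())                                   /* CAS */
    while (atomic_load_explicit (&lock_word,
                                 memory_order_relaxed) != 0)
      ;                                                  /* read-only wait */
}
```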

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 6b8dbbd03a)
2022-09-28 13:52:51 -07:00
H.J. Lu
60b8295333 Move assignment out of the CAS condition
Update

commit 49302b8fdf
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:54:01 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

and

commit 0b82747dc4
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Thu Nov 11 06:31:51 2021 -0800

    Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]

    Replace boolean CAS with value CAS to avoid the extra load.

by moving assignment out of the CAS condition.
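
Both referenced changes reduce to the same C11-atomics idiom (hedged
sketch, heavily simplified from the mutex code):

```c
#include <stdatomic.h>

static atomic_int futex_word;

/* A "value CAS": on failure, compare_exchange deposits the current value
   in `expected`, so no separate atomic load is needed, and the assignment
   to `oldval` stays outside the CAS condition.  */
static int
lock_attempt (void)
{
  int expected = 0;
  atomic_compare_exchange_strong (&futex_word, &expected, 1);
  int oldval = expected;   /* assignment moved out of the condition */
  return oldval;           /* 0: lock acquired */
}
```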

(cherry picked from commit 120ac6d238)
2022-09-28 13:52:25 -07:00
H.J. Lu
8844d9b22d Add LLL_MUTEX_READ_LOCK [BZ #28537]
The CAS instruction is expensive.  From the x86 CPU's point of view,
getting a cache line for writing is more expensive than reading.  See
Appendix A.2 Spinlock in:

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

The full compare-and-swap grabs the cache line exclusively and causes
excessive cache line bouncing.

Add LLL_MUTEX_READ_LOCK to do an atomic load and skip the CAS in the
spinlock loop if the compare is likely to fail, reducing cache line
bouncing on contended locks.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit d672a98a1a)
2022-09-28 13:47:29 -07:00
H.J. Lu
0ec276739a Avoid extra load with CAS in __pthread_mutex_clocklock_common [BZ #28537]
Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 49302b8fdf)
2022-09-28 13:47:16 -07:00
H.J. Lu
f5a70409db Avoid extra load with CAS in __pthread_mutex_lock_full [BZ #28537]
Replace boolean CAS with value CAS to avoid the extra load.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 0b82747dc4)
2022-09-28 13:47:04 -07:00
Florian Weimer
da7afc97ad elf: Call __libc_early_init for reused namespaces (bug 29528)
libc_map is never reset to NULL, neither during dlclose nor during a
dlopen call that reuses the namespace structure.  As a result, if a
namespace is reused, its libc is not initialized properly.  The most
visible result is a crash in the <ctype.h> functions.

To prevent similar bugs on namespace reuse from surfacing,
unconditionally initialize the chosen namespace to zero using memset.

(cherry picked from commit d0e357ff45)
2022-08-30 16:53:14 +02:00