commit 7fec8a5de6
Author: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Date: Thu Nov 13 14:26:08 2025 -0300
The commit "Revert __HAVE_64B_ATOMICS configure check" uses 64-bit atomic
operations on sem_t if 64-bit atomics are supported.  But sem_t may only
be 32-bit aligned on 32-bit architectures.
1. Add a macro, SEM_T_ALIGN, for sem_t alignment.
2. Add a macro, HAVE_UNALIGNED_64B_ATOMICS. Define it if unaligned 64-bit
atomic operations are supported.
3. Add a macro, USE_64B_ATOMICS_ON_SEM_T. Define it to 1 if 64-bit atomic
operations are supported and either SEM_T_ALIGN is at least 8 or
HAVE_UNALIGNED_64B_ATOMICS is defined.
4. Assert that the size and alignment of sem_t are not less than those of
the internal struct new_sem.
5. Check USE_64B_ATOMICS_ON_SEM_T, instead of USE_64B_ATOMICS, when using
64-bit atomic operations on sem_t.
This fixes BZ #33632.
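A minimal sketch of the resulting compile-time logic (macro spellings
follow the list above; the exact expressions and assertion sites are an
assumption, not the committed code):

  /* Sketch: use 64-bit atomics on sem_t only when it is safe to do so.  */
  #if USE_64B_ATOMICS \
      && (SEM_T_ALIGN >= 8 || defined HAVE_UNALIGNED_64B_ATOMICS)
  # define USE_64B_ATOMICS_ON_SEM_T 1
  #else
  # define USE_64B_ATOMICS_ON_SEM_T 0
  #endif

  /* sem_t must be able to hold the internal struct new_sem.  */
  _Static_assert (sizeof (sem_t) >= sizeof (struct new_sem),
                  "sem_t is smaller than struct new_sem");
  _Static_assert (_Alignof (sem_t) >= _Alignof (struct new_sem),
                  "sem_t is less aligned than struct new_sem");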
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The test wrapper script was used twice: once to run the test command and
a second time within the test command itself, which seems unnecessary and
results in false errors when running this test.
Fixes 332f8e62af
Reviewed-by: Frédéric Bérat <fberat@redhat.com>
The support for lock elision was already deprecated with glibc 2.42:
commit 77438db8cf
"Mark support for lock elision as deprecated."
See also discussions:
https://sourceware.org/pipermail/libc-alpha/2025-July/168492.html
This patch removes the architecture specific support for lock elision
for x86, powerpc and s390 by removing the elision-conf.h, elision-conf.c,
elision-lock.c, elision-timed.c, elision-unlock.c, elide.h, htm.h/hle.h files.
Those generic files are also removed.
The architecture specific structures are adjusted and the elision fields are
marked as unused. See struct_mutex.h files.
Furthermore, the leftover __rwelision field in struct_rwlock.h was also
removed.  It was originally removed with commit 0377a7fde6
"nptl: Remove rwlock elision definitions"
and accidentally reintroduced with commit 7df8af43ad
"nptl: Add struct_rwlock.h"
The common code (e.g. the pthread_mutex files) is changed back to its
state before lock elision was introduced with the x86 support:
- commit 1cdbe57948
"Add the low level infrastructure for pthreads lock elision with TSX"
- commit b023e4ca99
"Add new internal mutex type flags for elision."
- commit 68cc29355f
"Add minimal test suite changes for elision enabled kernels"
- commit e8c659d74e
"Add elision to pthread_mutex_{try,timed,un}lock"
- commit 49186d21ef
"Disable elision for any pthread_mutexattr_settype call"
- commit 1717da59ae
"Add a configure option to enable lock elision and disable by default"
Elision is also removed from the tunables, the initialization code, the
pretty-printers and the manual.
Some extra handling in the testsuite is removed, as well as the entire
tst-mutex10 testcase, which tested a race while enabling lock elision.
I've also searched the code for "elision", "elide" and "transaction", and
cleaned up some related comments.
I've run the testsuite on x86_64 and s390x and run the build-many-glibcs.py
script.
Thanks to Sachin Monga, this patch was also tested on powerpc.
A NEWS entry also mentions the removal.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Commit 53807741fb added a configure check for 64-bit atomic operations
that were not previously enabled on some 32-bit ABIs.
However, the NPTL semaphore code casts a sem_t to a new_sem and issues
a 64-bit atomic operation for __HAVE_64B_ATOMICS. Since sem_t has
32-bit alignment on 32-bit architectures, this prevents the use of
64-bit atomics even if the ABI supports them.
Assume 64-bit atomic support based on __WORDSIZE, which matches how glibc
defined it before the broken change.  Also rename __HAVE_64B_ATOMICS to
USE_64B_ATOMICS to better reflect the flag's meaning.
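In sketch form, the resulting definition amounts to the following (the
exact header it lands in is an assumption):

  #include <bits/wordsize.h>

  /* Sketch: assume 64-bit atomics exactly on 64-bit ABIs, matching the
     behaviour before the configure-based detection.  */
  #if __WORDSIZE == 64
  # define USE_64B_ATOMICS 1
  #else
  # define USE_64B_ATOMICS 0
  #endif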
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The only usage was for pthread_spin_lock, introduced by 12d2dd7060,
as a way to optimize the code for certain architectures. Now that atomic
builtins are used by default, let the compiler use the best code sequence
for the atomic exchange.
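Roughly, this leaves pthread_spin_lock looking like the sketch below, with
the compiler builtins choosing the instruction sequence (simplified; not
the literal glibc source):

  #include <pthread.h>

  int
  pthread_spin_lock (pthread_spinlock_t *lock)
  {
    /* Acquire the lock with an atomic exchange; the compiler emits the
       best sequence available for the target.  */
    while (__atomic_exchange_n (lock, 1, __ATOMIC_ACQUIRE) != 0)
      /* Spin with relaxed loads to avoid bouncing the cache line.  */
      while (__atomic_load_n (lock, __ATOMIC_RELAXED) != 0)
        ;
    return 0;
  }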
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Introduce the `DL_DEBUG_TLS` debug mask to enable detailed logging for
Thread-Local Storage (TLS) and Thread Control Block (TCB) management.
This change integrates a new `tls` option into the `LD_DEBUG`
environment variable, allowing developers to trace:
- TCB allocation, deallocation, and reuse events in `dl-tls.c`,
`nptl/allocatestack.c`, and `nptl/nptl-stack.c`.
- Thread startup events, including the TID and TCB address, in
`nptl/pthread_create.c`.
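For example, a message gated on the new mask could look like this (a
sketch only; `pd` stands for the thread descriptor, and the exact format
strings used in the patch may differ):

  if (__glibc_unlikely (GLRO (dl_debug_mask) & DL_DEBUG_TLS))
    _dl_debug_printf ("tls: reusing cached TCB 0x%lx for tid %lu\n",
                      (unsigned long int) pd, (unsigned long int) pd->tid);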
A new test, `tst-dl-debug-tid`, has been added to validate the
functionality of this new debug logging, ensuring that relevant messages
are correctly generated for both main and worker threads.
This enhances the debugging capabilities for diagnosing issues related
to TLS allocation and thread lifecycle within the dynamic linker.
Reviewed-by: DJ Delorie <dj@redhat.com>
Clang defaults to warning about missing fall-through annotations and does
not support all of the comment-style annotations that GCC does.  Use the
C23 [[fallthrough]] attribute instead.
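For illustration, the new spelling in a switch (identifiers here are
placeholders, not code from the patch):

  switch (cmd)
    {
    case CMD_PREPARE:
      prepare ();
      /* Previously a "Fall through." comment understood only by GCC.  */
      [[fallthrough]];
    case CMD_RUN:
      run ();
      break;
    }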
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Remove the odd atomic_forced_read which is neither atomic nor forced.
Some uses are completely redundant, so simply remove them. In other cases
the intended use is to force a memory ordering, so use acquire load for those.
In yet other cases their purpose is unclear; for example,
__nscd_cache_search appears to allow concurrent accesses to the cache
while it is being garbage collected by another thread!  Use relaxed
atomic loads there to prevent spills from accidentally reloading memory
that is being modified concurrently.
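A before/after sketch of the kind of change involved (variable names are
placeholders, not lines from the patch):

  /* Before: neither atomic nor forced; the compiler may still reload.  */
  val = atomic_forced_read (*ptr);

  /* After, where the intent was a memory ordering: */
  val = atomic_load_acquire (ptr);

  /* After, where only a single stable read of concurrently modified
     memory is wanted (e.g. __nscd_cache_search): */
  val = atomic_load_relaxed (ptr);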
Passes regression testing on AArch64.
The main issue is that setup_stack_prot fails to account for cases where
the cached thread stack lacks a guard page, which can cause madvise to
fail. Update the logic to also handle whether MADV_GUARD_INSTALL is
supported when resizing the guard page.
Checked on x86_64-linux-gnu with 6.8.0 and 6.15 kernels.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Add a check_mem_access (addr) function that checks whether the memory at
addr can be read or written, returning false if it is not accessible.
The function changes the signal handlers for SIGSEGV and SIGBUS the first
time it is called, and it is not thread-safe.
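A minimal sketch of the probing technique (an illustration of the approach
only; the committed helper installs its handlers once on first use and
keeps them, as noted above):

  #include <setjmp.h>
  #include <signal.h>
  #include <stdbool.h>

  static sigjmp_buf probe_jmp;

  static void
  probe_handler (int sig)
  {
    siglongjmp (probe_jmp, 1);
  }

  static bool
  check_mem_access (const void *addr)
  {
    struct sigaction sa, old_segv, old_bus;
    sa.sa_handler = probe_handler;
    sa.sa_flags = 0;
    sigemptyset (&sa.sa_mask);
    sigaction (SIGSEGV, &sa, &old_segv);
    sigaction (SIGBUS, &sa, &old_bus);

    bool ok = true;
    if (sigsetjmp (probe_jmp, 1) == 0)
      /* A fault here jumps back via the handler; a write probe works the
         same way with a volatile store.  */
      (void) *(const volatile char *) addr;
    else
      ok = false;

    sigaction (SIGSEGV, &old_segv, NULL);
    sigaction (SIGBUS, &old_bus, NULL);
    return ok;
  }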
Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The symbol was unintentionally leaked on ports introduced after
GLIBC_2.34; provide the compat symbol to avoid breaking the ABI on them.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
This patch replaces the _dl_stack_flags global variable with
_dl_stack_prot_flags.
The advantage is that the conversion from p_flags to the final mprotect
flags happens once, when p_flags is loaded.  This avoids repeated spurious
conversions of _dl_stack_flags, for example in allocate_thread_stack.
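The conversion itself is the usual PF_* to PROT_* mapping, now performed
once when p_flags is read instead of at every consumer (a sketch; the
function name is made up for illustration):

  #include <elf.h>
  #include <sys/mman.h>

  /* Map GNU_STACK p_flags to the mprotect flags used for stacks.  */
  static int
  stack_prot_from_p_flags (unsigned int p_flags)
  {
    return ((p_flags & PF_R ? PROT_READ : 0)
            | (p_flags & PF_W ? PROT_WRITE : 0)
            | (p_flags & PF_X ? PROT_EXEC : 0));
  }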
This modification was suggested in:
https://sourceware.org/pipermail/libc-alpha/2025-March/165537.html
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
SYSCALL_CANCEL calls __syscall_cancel, which in turn calls
__internal_syscall_cancel with an 'int' return type instead of the
expected 'long int'.  This causes issues with syscalls that return values
larger than INT_MAX, such as copy_file_range [1].
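In isolation, the failure mode is plain integer truncation (illustrative
values, assuming a 64-bit 'long int'):

  /* A syscall result above INT_MAX does not survive an 'int' return.  */
  long int ret = 0x100000000L;   /* e.g. 4 GiB reported by copy_file_range */
  int narrowed = (int) ret;      /* becomes 0 -- silently wrong */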
Checked on x86_64-linux-gnu.
[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=79139
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
The SIGCANCEL signal handler should not issue __syscall_do_cancel, which
calls __do_cancel and __pthread_unwind, if cancellation is already in
progress (the libgcc unwinder is not reentrant).  Any cancellation signal
received afterwards is ignored.
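Schematically the handler now bails out early; a sketch with a made-up
predicate (the real code inspects bits of the thread's cancelhandling
word):

  /* In sigcancel_handler: never start a second unwind; the libgcc
     unwinder is not reentrant.  */
  if (cancellation_already_in_progress (self))
    return;   /* any further SIGCANCEL is simply ignored */
  __syscall_do_cancel ();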
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Do not add the pthread_atfork routine again in nptl/Makefile; instead,
rely on sysdeps/pthread/Makefile for the integration (as this is the
directory that contains the source file).
In sysdeps/pthread/Makefile, add it to static-only-routines.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Current Bionic has this function, with enhanced error checking
(the undefined case terminates the process).
Reviewed-by: Joseph Myers <josmyers@redhat.com>
When GNU Binutils is configured with --enable-error-execstack=yes, a
handful of our tests which rely on -Wl,-z,execstack fail.  Pass
-Wl,--no-error-execstack to override the behaviour and get a warning
instead.
Bug: https://sourceware.org/PR32717
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Decorate BSS mappings with [anon: glibc: .bss <file>], for example
[anon: glibc: .bss /lib/libc.so.6]. The string ".bss" is already used
by bionic so use the same, but add the filename as well. If the name
would be longer than what the kernel allows, drop the directory part
of the path.
Refactor the glibc.mem.decorate_maps check into a separate function and
use it to avoid assembling a name that would not be used later.
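The decoration itself uses the Linux anonymous-VMA naming interface
(PR_SET_VMA_ANON_NAME, available since Linux 5.17); a sketch with
placeholder variables:

  #include <sys/prctl.h>

  /* Label the zero-filled .bss part of the mapping; if the full path
     would exceed the kernel's name-length limit, only the base name of
     the file is used.  */
  prctl (PR_SET_VMA, PR_SET_VMA_ANON_NAME,
         (unsigned long int) bss_start, bss_len,
         "glibc: .bss /lib/libc.so.6");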
Signed-off-by: Petr Malat <oss@malat.biz>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux 6.13 (662df3e5c3766) added a lightweight way to define guard areas
through the madvise syscall.  Instead of protecting the guard region with
PROT_NONE via mprotect, userland can madvise the same area with a special
flag, and the kernel ensures that accessing the area triggers a SIGSEGV
(as with a PROT_NONE mapping).
The madvise approach has the advantage of lower kernel memory consumption
for the process page table (one less VMA per guard area) and slightly
less contention in the kernel (also due to fewer VMAs being tracked).
pthread_create allocates a new thread stack in one of two ways: if a
guard area is set (the default), it allocates the required memory range
with PROT_NONE and then mprotects the usable stack area; otherwise, it
allocates the region with the required flags.
For MADV_GUARD_INSTALL support, the stack area is allocated with the
required flags and the guard region is then installed.  If the kernel
does not support it, the usual approach is used instead (and
MADV_GUARD_INSTALL is disabled for future stack creations).
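A sketch of the fallback logic (helper and flag names are illustrative;
MADV_GUARD_INSTALL and its value come from the Linux 6.13 UAPI headers):

  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/mman.h>

  #ifndef MADV_GUARD_INSTALL
  # define MADV_GUARD_INSTALL 102
  #endif

  static bool madv_guard_ok = true;

  static bool
  install_guard_area (char *guard, size_t guardsize)
  {
    if (madv_guard_ok
        && madvise (guard, guardsize, MADV_GUARD_INSTALL) == 0)
      return true;
    /* Kernel without support (or madvise failure): fall back to the
       PROT_NONE guard and stop trying madvise for future stacks.  */
    madv_guard_ok = false;
    return mprotect (guard, guardsize, PROT_NONE) == 0;
  }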
The stack allocation strategy is recorded in the pthread struct, and it
is used in case the guard region needs to be resized.  To avoid an extra
field, 'user_stack' is repurposed and renamed to 'stack_mode'.
This patch also adds a proper test for the pthread guard.
I checked on x86_64, aarch64, powerpc64le, and hppa with kernel 6.13.0-rc7.
Reviewed-by: DJ Delorie <dj@redhat.com>
Set the stack size attribute to the size of the mmap'd region only when
the remaining stack space is smaller than the mmap'd region.
The condition was reversed.  As a result, the initial stack size was only
135168 bytes. On architectures where the stack grows down, the
initial stack size is approximately 8384512 bytes with the default
rlimit settings. The small main stack size on hppa broke
applications like ruby that check for stack overflows.
Signed-off-by: John David Anglin <dave.anglin@bell.net>
The LSB of g_signals was unused. The LSB of g1_start was used to indicate
which group is G2. This was used to always go to sleep in pthread_cond_wait
if a waiter is in G2. A comment earlier in the file says that this is not
correct to do:
"Waiters cannot determine whether they are currently in G2 or G1 -- but they
do not have to because all they are interested in is whether there are
available signals"
I either would have had to update the comment, or get rid of the check. I
chose to get rid of the check. In fact I don't quite know why it was there.
There will never be available signals for group G2, so we didn't need the
special case. Even if there were, this would just be a spurious wake. This
might have caught some cases where the count has wrapped around, but it
wouldn't reliably do that (and even if it did, why would you want to force
a sleep in that case?), and we don't support that many concurrent waiters
anyway. Getting rid of it allows us to use one more bit, making us more
robust to wraparound.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This function no longer waits for threads to leave g1, so rename it to
__condvar_switch_g1.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
In my previous change I turned a nested loop into a simple loop. I'm doing
the resulting indentation changes in a separate commit to make the diff on
the previous commit easier to review.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The loop was a little more complicated than necessary. There was only one
break statement out of the inner loop, and the outer loop was nearly empty.
So just remove the outer loop, moving its code to the one break statement in
the inner loop. This allows us to replace all gotos with break statements.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This variable used to be needed to wait in group switching until all sleepers
have confirmed that they have woken. This is no longer needed. Nothing waits
on this variable so there is no need to track how many threads are currently
asleep in each group.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
pthread_cond_wait was checking whether it was in a closed group no less than
four times. Checking once is enough. Here are the four checks:
1. While spin-waiting. This was dead code: maxspin is set to 0 and has been
for years.
2. Before deciding to go to sleep, and before incrementing grefs: I kept this one.
3. After incrementing grefs. There is no reason to think that the group would
close while we do an atomic increment. Obviously it could close at any
point, but that doesn't mean we have to recheck after every step. This
check was equally good as check 2, except it has to do more work.
4. When we find ourselves in a group that has a signal. We only get here after
we check that we're not in a closed group. There is no need to check again.
The check would only have helped in cases where the compare_exchange in the
next line would also have failed. Relying on the compare_exchange is fine.
Removing the duplicate checks clarifies the code.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This wake is unnecessary. We only switch groups after every sleeper in a group
has been woken. Sure, they may take a while to actually wake up and may still
hold a reference, but waking them a second time doesn't speed that up. Instead
this just makes the code more complicated and may hide problems.
In particular this safety wake wouldn't even have helped with the bug that was
fixed by Barrus' patch: The bug there was that pthread_cond_signal would not
switch g1 when it should, so we wouldn't even have entered this code path.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Some comments were wrong after the most recent commit; this fixes that.
Also fix indentation where spaces were used instead of tabs.
Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
This fixes the lost wakeup (from a bug in signal stealing) with a change
in the usage of g_signals[] in the condition variable internal state.
It also completely eliminates the concept and handling of signal stealing,
as well as the need for signalers to block to wait for waiters to wake
up every time there is a G1/G2 switch. This greatly reduces the average
and maximum latency for pthread_cond_signal.
The g_signals[] field now contains a signal count that is relative to
the current g1_start value. Since it is a 32-bit field, and the LSB is
still reserved (though not currently used anymore), it has a 31-bit value
that corresponds to the low 31 bits of the sequence number in g1_start.
(since g1_start also has an LSB flag, this means bits 31:1 in g_signals
correspond to bits 31:1 in g1_start, plus the current signal count)
By making the signal count relative to g1_start, there is no longer
any ambiguity or A/B/A issue, and thus any checks before blocking,
including the futex call itself, are guaranteed not to block if the G1/G2
switch occurs, even if the signal count remains the same. This allows
initially safely blocking in G2 until the switch to G1 occurs, and
then transitioning from G1 to a new G1 or G2, and always being able to
distinguish the state change. This removes the race condition and A/B/A
problems that otherwise occurred if a late (pre-empted) waiter were to
resume just as the futex call attempted to block on g_signal since
otherwise there was no last opportunity to re-check things like whether
the current G1 group was already closed.
By fixing these issues, the signal stealing code can be eliminated,
since there is no concept of signal stealing anymore. The code to block
for all waiters to exit g_refs can also be removed, since any waiters
that are still in the g_refs region can be guaranteed to safely wake
up and exit. If there are still any left at this time, they are all
sent one final futex wakeup to ensure that they are not blocked any
longer, but there is no need for the signaller to block and wait for
them to wake up and exit the g_refs region.
The signal count is then effectively "zeroed" but since it is now
relative to g1_start, this is done by advancing it to a new value that
can be observed by any pending blocking waiters. Any late waiters can
always tell the difference, and can thus just cleanly exit if they are
in a stale G1 or G2. They can never steal a signal from the current
G1 if they are not in the current G1, since the signal value that has
to match in the cmpxchg has the low 31 bits of the g1_start value
contained in it, and that's first checked, and then it won't match if
there's a G1/G2 change.
Note: the 31-bit sequence number used in g_signals is designed to
handle wrap-around when checking the signal count, but if the entire
31-bit wraparound (2 billion signals) occurs while there is still a
late waiter that has not yet resumed, and it happens to then match
the current g1_start low bits, and the pre-emption occurs after the
normal "closed group" checks (which are 64-bit) but then hits the
futex syscall and signal consuming code, then an A/B/A issue could
still result and cause an incorrect assumption about whether it
should block. This particular scenario seems unlikely in practice.
Note that once awake from the futex, the waiter would notice the
closed group before consuming the signal (since that's still a 64-bit
check that would not be aliased in the wrap-around in g_signals),
so the biggest impact would be blocking on the futex until the next
full wakeup from a G1/G2 switch.
Signed-off-by: Frank Barrus <frankbarrus_sw@shaggy.cc>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Some kernels on S390 appear to return a CPU affinity mask based on
configured processors rather than the ones online. Overallocate the CPU
set to match that, but operate only on the ones online.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Co-authored-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
The rseq extensible ABI implementation moved the rseq area to the 'extra
TLS' block; remove the now-unused 'rseq_area' member of 'struct pthread'.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Move the rseq area to the newly added 'extra TLS' block; this is the last
step in adding support for the rseq extended ABI.  The size of the rseq
area is now dynamic and depends on the rseq features reported by the
kernel through the ELF auxiliary vector.  This will allow applications to
use rseq features beyond the 32 bytes of the original rseq ABI as they
become available in future kernels.
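For reference, the sizes the kernel reports can be read from the
auxiliary vector like this (AT_RSEQ_FEATURE_SIZE and AT_RSEQ_ALIGN are
provided by Linux 6.3 and later; getauxval returns 0 when an entry is
absent):

  #include <sys/auxv.h>

  unsigned long int rseq_feature_size = getauxval (AT_RSEQ_FEATURE_SIZE);
  unsigned long int rseq_align = getauxval (AT_RSEQ_ALIGN);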
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Add a couple of tests to verify that CPU affinity set using
sched_setaffinity and pthread_setaffinity_np is inherited by a child
process and a child thread, respectively.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This reverts commit 7c22dcda27.
The padding is required by Chromium's MaybeUpdateGlibcTidCache
in sandbox/linux/services/namespace_sandbox.cc.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
If a shared library loaded with dlopen/dlmopen requires an executable
stack, either implicitly because of a missing GNU_STACK ELF header (where
the ABI default flags imply the executable bit) or explicitly because of
the executable bit in GNU_STACK, the loader will try to make both the
main thread stack and all thread stacks (from the pthread cache)
executable.
Besides the issue that a __nptl_change_stack_perm failure does not undo
the previous executable transition (meaning that if the library fails to
load, some thread stacks may be left executable), this behavior was used
in a CVE [1] as a vector for RCE.
This patch changes that: if a shared library requires an executable stack
and the current stack is not executable, dlopen fails.  The change
applies only to dynamically loaded modules; if the program or any of its
dependencies requires an executable stack, the loader still changes the
main thread stack before program execution, as well as any thread created
with the default stack configuration.
[1] https://www.qualys.com/2023/07/19/cve-2023-38408/rce-openssh-forwarded-ssh-agent.txt
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The previous use of padding within a union made it impossible to
re-use the padding for GLIBC_PRIVATE ABI preservation because
tcbhead_t could use up all of the padding (as was historically the
case on x86-64). Allocating padding unconditionally addresses this
issue.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add __attribute_optimization_barrier__ to disable inlining and cloning of
a function.  For Clang, expand it to
__attribute__ ((optnone))
Otherwise, expand it to
__attribute__ ((noinline, noclone))
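In macro form this is roughly the following (a sketch; the exact feature
test used in the tree may differ):

  #if defined __clang__
  # define __attribute_optimization_barrier__ __attribute__ ((optnone))
  #else
  # define __attribute_optimization_barrier__ \
      __attribute__ ((noinline, noclone))
  #endif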
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Since a trampoline is required to test execstack, enable the execstack
tests only if the compiler supports trampolines.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
Add a descriptive comment to the tst-pthread-cpuclockid-invalid test and
also drop pthread_getcpuclockid from the TODO-testing list since it now
has full coverage.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Exercise the case where an exited thread will cause
pthread_getcpuclockid to fail.
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Florian Weimer <fweimer@redhat.com>