Commit 404526ee2e changed _start to write
the last argument to __libc_start_main without taking into consideration
that the function did not create a full stack frame, which leads to
overwriting the argv[0].
sparc start.S does not provide the final argument for
__libc_start_main, which is the highest stack address used to
update the __libc_stack_end.A
This fixes elf/tst-execstack-prog-static-tunable on sparc64.
On sparcv9 this does not happen because the kernel puts an
auxv value, which turns to point to a value in the stack itself.
Checked on sparc64-linux-gnu.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.
The constant libm-test-ulps shows little value, where it is usually a
mechanical change done by the maintainer, for past releases it is
usually ignored whether the ulp change resulted from a compiler
regression, and the math tests already have a maximum ulp error that
triggers a regression.
It was shown by a recent update after the new acosf [1] implementation
that is correctly rounded, where the libm-test-ulps was indeed from a
compiler issue.
This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.
Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation. This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.
Regressions are only reported if the implementation shows larger errors
than 9 ulps (13 for IBM long double) unless it is overridden by
libm-test-ulps and the maximum error is not printed at the end of tests.
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.
The manual error table is also removed, Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases; so link to his work instead.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
Remove unused _dl_hwcap_string defines. As a result many dl-procinfo.h headers
can be removed. This also removes target specific _dl_procinfo implementations
which only printed HWCAP strings using dl_hwcap_string.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Linux catbus 6.1.112 #1 SMP Sun Oct 13 10:52:08 PDT 2024 sparc64 sun4v UltraSparc T5 (Niagara5) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
On arc, the definition of TLS_DTV_UNALLOCATED now comes from
<dl-dtv.h>.
For x86-64 x32, a separate version is needed because unsigned long int
is 32 bits on this target.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The CORE-MATH implementation is correctly rounded (for any rounding mode),
although it should worse performance than current one. The current
implementation performance comes mainly from the internal usage of
the optimize expf implementation, and shows a maximum ULPs of 2 for
FE_TONEAREST and 3 for other rounding modes.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6995 49.0737 -20.58%
x86_64v2 40.5841 44.3604 -9.30%
x86_64v3 39.3879 39.7502 -0.92%
i686 112.3380 129.8570 -15.59%
aarch64 (Neoverse) 18.6914 17.0946 8.54%
power10 11.1343 9.3245 16.25%
reciprocal-throughput master patched improvement
x86_64 18.6471 24.1077 -29.28%
x86_64v2 17.7501 20.2946 -14.34%
x86_64v3 17.8262 17.1877 3.58%
i686 64.1454 86.5645 -34.95%
aarch64 (Neoverse) 9.77226 12.2314 -25.16%
power10 4.0200 5.3316 -32.63%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
This will be required by the rseq extensible ABI implementation on all
Linux architectures exposing the '__rseq_size' and '__rseq_offset'
symbols to set the initial value of the 'cpu_id' field which can be used
by applications to test if rseq is available and registered. As long as
the symbols are exposed it is valid for an application to perform this
test even if rseq is not yet implemented in libc for this architecture.
Compile tested with build-many-glibcs.py but I don't have access to any
hardware to run the tests.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance to the generic tanf.
The code was adapted to glibc style, to use the definition of
math_config.h, to remove errno handling, and to use a generic
128 bit routine for ABIs that do not support it natively.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (neoverse1,
gcc 13.2.1), and powerpc (POWER10, gcc 13.2.1):
latency master patched improvement
x86_64 82.3961 54.8052 33.49%
x86_64v2 82.3415 54.8052 33.44%
x86_64v3 69.3661 50.4864 27.22%
i686 219.271 45.5396 79.23%
aarch64 29.2127 19.1951 34.29%
power10 19.5060 16.2760 16.56%
reciprocal-throughput master patched improvement
x86_64 28.3976 19.7334 30.51%
x86_64v2 28.4568 19.7334 30.65%
x86_64v3 21.1815 16.1811 23.61%
i686 105.016 15.1426 85.58%
aarch64 18.1573 10.7681 40.70%
power10 8.7207 8.7097 0.13%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp2m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). The
only change is to handle FLT_MAX_EXP for FE_DOWNWARD or FE_TOWARDZERO.
The benchmark inputs are based on exp2f ones.
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 40.6042 48.7104 -19.96%
x86_64v2 40.7506 35.9032 11.90%
x86_64v3 35.2301 31.7956 9.75%
i686 102.094 94.6657 7.28%
aarch64 18.2704 15.1387 17.14%
power10 11.9444 8.2402 31.01%
reciprocal-throughput master patched improvement
x86_64 20.8683 16.1428 22.64%
x86_64v2 19.5076 10.4474 46.44%
x86_64v3 19.2106 10.4014 45.86%
i686 56.4054 59.3004 -5.13%
aarch64 12.0781 7.3953 38.77%
power10 6.5306 5.9388 9.06%
The generic implementation calls __ieee754_exp2f and x86_64 provides
an optimized ifunc version (built with -mfma -mavx2, not correctly
rounded). This explains the performance difference for x86_64.
Same for i686, where the ABI provides an optimized __ieee754_exp2f
version built with '-msse2 -mfpmath=sse'. When built wth same
flags, the new algorithm shows a better performance:
master patched improvement
latency 102.094 91.2823 10.59%
reciprocal-throughput 56.4054 52.7984 6.39%
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode)
and shows better performance compared to the generic exp10m1f.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow). I mostly
fixed some small issues in corner cases (sNaN handling, -INFINITY,
a specific overflow check).
Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):
Latency master patched improvement
x86_64 45.4690 49.5845 -9.05%
x86_64v2 46.1604 36.2665 21.43%
x86_64v3 37.8442 31.0359 17.99%
i686 121.367 93.0079 23.37%
aarch64 21.1126 15.0165 28.87%
power10 12.7426 8.4929 33.35%
reciprocal-throughput master patched improvement
x86_64 19.6005 17.4005 11.22%
x86_64v2 19.6008 11.1977 42.87%
x86_64v3 17.5427 10.2898 41.34%
i686 59.4215 60.9675 -2.60%
aarch64 13.9814 7.9173 43.37%
power10 6.7814 6.4258 5.24%
The generic implementation calls __ieee754_exp10f which has an
optimized version, although it is not correctly rounded, which is
the main culprit of the the latency difference for x86_64 and
throughp for i686.
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: DJ Delorie <dj@redhat.com>
The CORE-MATH implementation is correctly rounded (for any rounding mode).
This can be checked by exhaustive tests in a few minutes since there are
less than 2^32 values to check against for example GNU MPFR.
This patch also adds some bench values for tgammaf.
Tested on x86_64 and x86 (cfarm26).
With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700:
"tgammaf": {
"": {
"duration": 3.50188e+09,
"iterations": 2e+07,
"max": 602.891,
"min": 65.1415,
"mean": 175.094
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.30825e+09,
"iterations": 5e+07,
"max": 211.592,
"min": 32.0325,
"mean": 66.1649
}
}
With the initial GNU libc code it gave on cfarm26 (i686):
"tgammaf": {
"": {
"duration": 3.70505e+09,
"iterations": 6e+06,
"max": 2420.23,
"min": 243.154,
"mean": 617.509
}
}
With the new code:
"tgammaf": {
"": {
"duration": 3.24497e+09,
"iterations": 1.8e+07,
"max": 1238.15,
"min": 101.155,
"mean": 180.276
}
}
Signed-off-by: Alexei Sibidanov <sibid@uvic.ca>
Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
Changes in v2:
- include <math.h> (fix the linknamespace failures)
- restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file
- restored original wrapper code (math/w_tgammaf_compat.c),
except for the dealing with the sign
- removed the tgammaf/float entries in all libm-test-ulps files
- address other comments from Joseph Myers
(https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html)
Changes in v3:
- pass NULL argument for signgam from w_tgammaf_compat.c
- use of math_narrow_eval
- added more comments
Changes in v4:
- initialize local_signgam to 0 in math/w_tgamma_template.c
- replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file
Changes in v5:
- do not mention local_signgam any more in math/w_tgammaf_compat.c
- initialize local_signgam to 1 instead of 0 in w_tgamma_template.c
and added comment
Changes in v6:
- pass NULL as 2nd argument of __ieee754_gammaf_r in
w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c
Changes in v7:
- added Signed-off-by line for Alexei Sibidanov (author of the code)
Changes in v8:
- added Signed-off-by line for Paul Zimmermann (submitted of the patch)
Changes in v9:
- address comments from review by Adhemerval Zanella
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of HWCAP_IMPORTANT after removal of
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask. There HWCAP_IMPORTANT
was used as default value.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the environment variable LD_HWCAP_MASK and the tunable
glibc.cpu.hwcap_mask as those are not used anymore in common-code
after removal in elf/dl-cache.c:search_cache().
The only remaining user is sparc32 where it is used in
elf_machine_matches_host(). If sparc32 does not need it anymore,
we can get rid of it at all. Otherwise we could also move
LD_HWCAP_MASK / tunable glibc.cpu.hwcap_mask to be sparc32 specific.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Remove the definitions of _DL_HWCAP_PLATFORM as those are not used
anymore after removal in elf/dl-cache.c:search_cache().
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Despite of powerpc where the returned integer is stored in tcb,
and the diagnostics output, there is no user anymore.
Thus this patch removes the diagnostics output and
_dl_string_platform for all other platforms.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the logp1 functions (aliases for log1p functions - the
name is intended to be more consistent with the new log2p1 and
log10p1, where clearly it would have been very confusing to name those
functions log21p and log101p). As aliases rather than new functions,
the content of this patch is somewhat different from those actually
adding new functions.
Tests are shared with log1p, so this patch *does* mechanically update
all affected libm-test-ulps files to expect the same errors for both
functions.
The vector versions of log1p on aarch64 and x86_64 are *not* updated
to have logp1 aliases (and thus there are no corresponding header,
tests, abilist or ulps changes for vector functions either). It would
be reasonable for such vector aliases and corresponding changes to
other files to be made separately. For now, the log1p tests instead
avoid testing logp1 in the vector case (a Makefile change is needed to
avoid problems with grep, used in generating the .c files for vector
function tests, matching more than one ALL_RM_TEST line in a file
testing multiple functions with the same inputs, when it assumes that
the .inc file only has a single such line).
Tested for x86_64 and x86, and with build-many-glibcs.py.
The 680c597e9c commit made loader reject ill-formatted strings by
first tracking all set tunables and then applying them. However, it does
not take into consideration if the same tunable is set multiple times,
where parse_tunables_string appends the found tunable without checking
if it was already in the list. It leads to a stack-based buffer overflow
if the tunable is specified more than the total number of tunables. For
instance:
GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of
total support for different tunable).
Instead, use the index of the tunable list to get the expected tunable
entry. Since now the initial list is zero-initialized, the compiler
might emit an extra memset and this requires some minor adjustment
on some ports.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reported-by: Yuto Maeda <maeda@cyberdefense.jp>
Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp>
Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
These structs describe file formats under /var/log, and should not
depend on the definition of _TIME_BITS. This is achieved by
defining __WORDSIZE_TIME64_COMPAT32 to 1 on 32-bit ports that
support 32-bit time_t values (where __time_t is 32 bits).
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>