The m68k provided an optimized version through __m81_u(fmod)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The m68k provided an optimized version through __m81_u(fmodf)
(mathimpl.h), and gcc does not implement it through a builtin
(unlike i386).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. This allows us to move the
implementation to C.
The performance on a Zen3 chip is slightly better:
reciprocal-throughput input master no-SVID improvement
i686 subnormals 22.4741 20.1571 10.31%
i686 normal 74.1631 70.3606 5.13%
i686 close-exponent 22.5625 20.2435 10.28%
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The optimized i386 version is faster than the generic one, and gcc
implements it through the builtin. This allows us to move the
implementation to C. The performance on a Zen3 chip is similar to
the SVID one.
Tested on i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The ifunc selector for wmemset had a stray '!' in the
X86_ISA_CPU_FEATURES_ARCH_P(...) check:
if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
&& X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
AVX_Fast_Unaligned_Load, !))
This effectively negated the predicate and caused the AVX2/AVX512
paths to be skipped, making the dispatcher fall back to the SSE2
implementation even on CPUs where AVX2/AVX512 are available. The
regression leads to noticeable throughput loss for wmemset.
Remove the stray '!' so the AVX_Fast_Unaligned_Load capability is
tested as intended and the correct AVX2/EVEX variants are selected.
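With the fix, the check should read (the macro's negation argument is
simply left empty, as in the other ifunc selectors):
if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX2)
    && X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
                                    AVX_Fast_Unaligned_Load, ))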
Impact:
- On AVX2/AVX512-capable x86_64, wmemset no longer incorrectly
falls back to SSE2; perf now shows __wmemset_evex/avx2 variants.
Testing:
- benchtests/bench-wmemset shows improved bandwidth across sizes.
- perf confirms the selected symbol is no longer SSE2.
Signed-off-by: xiejiamei <xiejiamei@hygon.com>
Signed-off-by: Li jing <lijing@hygon.cn>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
clang (before clang-19) issues:
../sysdeps/aarch64/tst-ifunc-arg-4.c:39:1: error: unused function 'resolver' [-Werror,-Wunused-function]
39 | resolver (uint64_t arg0, const uint64_t arg1[])
| ^~~~~~~~
1 error generated.
clang-19 and onwards do not trigger the warning.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
clang defaults to warning for missing fall-through, and it does not
support all of the comment-like annotations that gcc does. Use the
C23 [[fallthrough]] attribute instead.
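A minimal illustration of the attribute (hypothetical code, not from
the patch):

  int
  classify (int c)
  {
    int r = 0;
    switch (c)
      {
      case 'a':
        r += 1;
        [[fallthrough]];  /* C23 attribute; replaces gcc's fall-through comments */
      case 'b':
        r += 1;
        break;
      }
    return r;
  }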
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
The C2y function uimaxabs has been renamed to umaxabs. Implement this
change in glibc, keeping a compat symbol under the old name, copying
the test to test the new name and changing the old test to test the
compat symbol. Jakub has done the corresponding change to the
built-in function in GCC.
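A usage sketch (assuming the declaration lives in <inttypes.h> next to
imaxabs, with the signature uintmax_t umaxabs (intmax_t)):

  #include <inttypes.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Returning uintmax_t means even INTMAX_MIN is well-defined,
       unlike imaxabs (INTMAX_MIN).  */
    uintmax_t m = umaxabs (INTMAX_MIN);
    printf ("%ju\n", m);  /* one more than INTMAX_MAX */
    return 0;
  }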
Tested for x86_64 and x86.
For lgamma and tgamma, the muldd, mulddd, and polydd functions are
renamed to muldd2, mulddd2, and polydd2 respectively.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The common code definitions are consolidated in s_erf_common.h
and s_erf_common.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The shared internal data definitions are consolidated in
s_erf_data.c and the erfc only one are moved to s_erfc_data.c.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy, on three
ranges ([-DBL_MAX, -4.2], [-4.2, 4.2], [4.2, DBL_MAX]) with 10e9
uniformly distributed random inputs for each range (first column is
the accuracy in ULP, with '0' being correctly rounded; second is the
number of samples with that accuracy, followed by the percentage):
* Range [-DBL_MAX, -4.2]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
* Range [-4.2, 4.2]
* FE_TONEAREST
0: 9764404513 97.64%
1: 235595487 2.36%
* FE_UPWARD
0: 9468013928 94.68%
1: 531986072 5.32%
* FE_DOWNWARD
0: 9493787693 94.94%
1: 506212307 5.06%
* FE_TOWARDZERO
0: 9585271351 95.85%
1: 414728649 4.15%
* Range [4.2, DBL_MAX]
* FE_TONEAREST
0: 10000000000 100.00%
* FE_UPWARD
0: 10000000000 100.00%
* FE_DOWNWARD
0: 10000000000 100.00%
* FE_TOWARDZERO
0: 10000000000 100.00%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 38.2754 78.0311 -103.87%
x86_64v2 38.3325 75.7555 -97.63%
x86_64v3 34.6604 28.3182 18.30%
aarch64 23.1499 21.4307 7.43%
power10 12.3051 9.3766 23.80%
Latency master patched improvement
x86_64 84.3062 121.3580 -43.95%
x86_64v2 84.1817 117.4250 -39.49%
x86_64v3 81.0933 70.6458 12.88%
aarch64 35.012 29.5012 15.74%
power10 21.7205 18.4589 15.02%
For x86_64/x86_64-v2, most of the performance hit comes from the fma
call going through the ifunc mechanism.
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
The internal data definitions are moved to s_atanh_data.c.
It helps on ABIs that build the implementation multiple times for
ifunc optimizations, like x86_64.
Reviewed-by: DJ Delorie <dj@redhat.com>
The current implementation shows the following accuracy on the range
[-1, 1], with 10e9 uniformly distributed random inputs (first column
is the accuracy in ULP, with '0' being correctly rounded; second is
the number of samples with that accuracy, followed by the
percentage):
* Range [-1, 1]
* FE_TONEAREST
0: 8180011860 81.80%
1: 1819865257 18.20%
2: 122883 0.00%
* FE_UPWARD
0: 3903695744 39.04%
1: 4992324465 49.92%
2: 1096319340 10.96%
3: 7660451 0.08%
* FE_DOWNWARD
0: 3904555484 39.05%
1: 4991970864 49.92%
2: 1095447471 10.95%
3: 8026181 0.08%
* FE_TOWARDZERO
0: 7070209165 70.70%
1: 2908447434 29.08%
2: 21343401 0.21%
The CORE-MATH implementation is correctly rounded for any rounding mode.
The code was adapted to glibc style and to use the definition of
math_config.h (to handle errno, overflow, and underflow).
Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1,
gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1) shows:
reciprocal-throughput master patched improvement
x86_64 26.4969 22.4625 15.23%
x86_64v2 26.0792 22.9822 11.88%
x86_64v3 25.6357 22.2147 13.34%
aarch64 20.2295 19.7001 2.62%
power10 10.0986 9.3846 7.07%
Latency master patched improvement
x86_64 80.2311 59.9745 25.25%
x86_64v2 79.7010 61.4066 22.95%
x86_64v3 78.2679 58.5804 25.15%
aarch64 34.3959 28.1523 18.15%
power10 23.2417 18.2694 21.39%
Checked on x86_64-linux-gnu, aarch64-linux-gnu, and
powerpc64le-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
Since SSIZE_MAX is less than UINT_MAX on 32-bit platforms, we must
AND the expression with SSIZE_MAX.
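An illustrative sketch of the pattern (hypothetical function, not the
actual patch):

  #include <limits.h>
  #include <sys/types.h>

  ssize_t
  clamp_to_ssize (unsigned int u)
  {
    /* On 32-bit, SSIZE_MAX (2^31 - 1) < UINT_MAX (2^32 - 1), so the
       value must be masked before it can be used as a ssize_t.  */
    return (ssize_t) (u & SSIZE_MAX);
  }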
Tested on x86_64 and x86.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
- Performance testing revealed significant memcpy performance degradation
when bit_arch_AVX_Fast_Unaligned_Load is enabled on Hygon 3.
- Hygon confirmed AVX performance issues in certain memory functions.
- Glibc benchmarks show SSE outperforms AVX for
memcpy/memmove/memset/strcmp/strcpy/strlen and so on.
- Hardware differences are primarily in floating-point operations and
don't justify AVX usage for memory operations.
Reviewed-by: gaoxiang <gaoxiang@kylinos.cn>
Signed-off-by: litenglong <litenglong@kylinos.cn>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
Changes with respect to v1:
- added a comment in e_j1f.c to explain why using float is enough
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The powl implementation for x86_64 ends up multiplying X once more than
necessary and then throwing away that result. This results in an
overflow flag being set in cases where there is no overflow.
Simplify the relevant portion by special-casing the -3 to 3 range and
simply multiplying repeatedly.
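A C sketch of the idea (the real change is to the x86_64-specific powl
implementation; this is only an illustration):

  /* For |n| <= 3, multiply repeatedly: no multiply is performed whose
     result would be thrown away, so no spurious overflow is flagged.  */
  static long double
  pow_small_int (long double x, int n)
  {
    int k = n < 0 ? -n : n;
    long double r = 1.0L;
    while (k-- > 0)
      r *= x;
    return n < 0 ? 1.0L / r : r;
  }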
Resolves: BZ #33411
Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
clang-18 and onwards issues:
../sysdeps/unix/sysv/linux/speed.c:71:23: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
71 | [_cbix(__B0)] = 0,
| ^
../sysdeps/unix/sysv/linux/speed.c:70:34: note: previous initialization is here
70 | [0 ... _cbix(CBAUDMASK)] = -1,
[...]
The override is used explicitly to support the same initialization on
multiple platforms (since the baud values differ on alpha and powerpc).
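A minimal sketch of the pattern that triggers the warning (hypothetical
table, not the actual speed.c contents):

  /* The GNU range designator sets a default for every slot; the later
     entry deliberately overrides one of them, which clang flags with
     -Winitializer-overrides.  */
  static const int table[8] =
    {
      [0 ... 7] = -1,
      [2] = 0,
    };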
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Clang issues:
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: error: incompatible
redeclaration of library function 'lround'
[-Werror,-Wincompatible-library-redeclaration]
libm_alias_double (__lround, lround)
^
../sysdeps/ieee754/dbl-64/s_llround.c:83:30: note: 'lround' is a builtin
with type 'long (double)'
Reviewed-by: Sam James <sam@gentoo.org>
clang issues:
../sysdeps/ieee754/dbl-64/e_lgamma_r.c:234:29: error: absolute value
function 'fabsf' given an argument of type 'double' but has parameter
of type 'float' which may cause truncation of value
[-Werror,-Wabsolute-value]
It should not matter because the value is 0.0, but using fabs is
simpler than adding a warning suppression.
Reviewed-by: Sam James <sam@gentoo.org>
The __syscall_cancel_arch function has an epilogue that does not match
the prologue. The stack is not used and the return address still lies in
r15 when reaching the epilogue. Fix the epilogue by simply returning
from the function.
Signed-off-by: Luc Michel <luc.michel@amd.com>
Tested-by: gopi@sankhya.com
Reviewed-by: Neal Frager <neal.frager@amd.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
clang does not accept whitespace to separate the -z option from its
value:
$ make test t=misc/tst-gcs-disabled
[...]
clang: error: no such file or directory: 'gcs=always'
Use the usual comma-separated form instead.
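For example (hypothetical makefile fragment; the actual variable name
may differ):

  LDFLAGS-tst-gcs-disabled += -Wl,-z,gcs=always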
Reviewed-by: Florian Weimer <fweimer@redhat.com>
The constant should be used with c_cc, which for all supported ABIs
is defined as unsigned char. By using it as a literal char constant,
clang triggers an error when it is compared with a signed literal on
ABIs that define 'char' as unsigned.
On aarch64, clang shows:
../sysdeps/posix/fpathconf.c:118:21: error: right side of operator
converted from negative value to unsigned: -1 to 18446744073709551615
[-Werror]
#if _POSIX_VDISABLE == -1
~~~~~~~~~~~~~~~ ^ ~~
Reviewed-by: Collin Funk <collin.funk1@gmail.com>
Remove target implementations of roundeven(f)/lrint(f)/lround(f) and
use the math-use-builtins mechanism instead.
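Roughly, the mechanism boils down to a per-function macro (a sketch of
how such macros are wired up; exact file names may differ):

  /* A target opts in, e.g. in its math-use-builtins-roundeven.h:  */
  #define USE_ROUNDEVEN_BUILTIN 1

  /* The generic C implementation then reduces to the builtin:  */
  #if USE_ROUNDEVEN_BUILTIN
  double
  __roundeven (double x)
  {
    return __builtin_roundeven (x);
  }
  #endif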
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Add the C23 memalignment function (query the alignment of a pointer)
to glibc.
Given how simple this operation is, it would make sense for compilers
to inline calls to this function, but I'm treating that as a compiler
matter (compilers should add it as a built-in function) rather than
adding an inline version to glibc headers (although such an inline
version would be reasonable as well). I've filed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122117 for this feature
in GCC.
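Given how simple it is, a plausible sketch of the function (an
assumption, not necessarily the committed implementation):

  #include <stddef.h>
  #include <stdint.h>

  /* Return the alignment of the pointer: the largest power of two that
     the address is a multiple of, or 0 for a null pointer.  */
  size_t
  memalignment (const void *p)
  {
    return (uintptr_t) p & -(uintptr_t) p;
  }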
Tested for x86_64 and x86.
Also remove some unused entries from the fallback table.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == -0x1.da285cp-5 in FE_TONEAREST
to provide correctly rounded results.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is required only for x == +/-0x1.6371e8p-4f in FE_TOWARDZERO
to provide correctly rounded results.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
The fma is not required to provide correctly rounded results, and
avoiding it helps on !__FP_FAST_FMA ISAs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is required only for inputs less than 0x1.0fd288p-127. Also
only add the extra check for !__FP_FAST_FMA targets.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
The fma is not strictly required to provide correctly rounded results,
and avoiding it helps on !__FP_FAST_FMA ABIs.
Checked on x86_64-linux-gnu and i686-linux-gnu.
Reviewed-by: Paul Zimmermann <Paul.Zimmermann@inria.fr>
This commit adds tests for the following use cases relevant to handling
of the SME state:
- fork() and vfork()
- clone() and clone3()
- signal handler
While most cases are trivial, the case of clone3() is more complicated since
the clone3() symbol is not public in Glibc.
To avoid having to check all possible ways clone3() may be called via other
public functions (e.g. vfork() or pthread_create()), we put together a test
that links directly with clone3.o. The existing functions that contain
calls to clone3() may not actually use it, in which case the outcome of
such tests would be unexpected. Having a direct call to the clone3()
symbol in the test allows us to check precisely what we need to test:
that the __arm_za_disable() function is indeed called and has the
desired effect.
Linking to clone3.o also requires linking to __arm_za_disable.o, which
in turn requires the _dl_hwcap2 hidden symbol, which we have to provide
in the test and initialise before use.
Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
This change adds a call to the __arm_za_disable() function immediately
before the SVC instruction inside clone() and clone3() wrappers. It also
adds a macro for inline clone() used in fork() and adds the same call to
the vfork implementation. This sets the ZA state of SME to "off" on return
from these functions (for both the child and the parent).
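Schematically (a C rendition of the ordering; the real wrappers are
assembly, and __do_clone_syscall below is an illustrative stand-in):

  extern void __libc_arm_za_disable (void);
  extern long __do_clone_syscall (unsigned long flags, void *stack);

  long
  clone_sketch (unsigned long flags, void *stack)
  {
    /* Force the ZA state to "off" immediately before the SVC so that
       both parent and child resume with PSTATE.ZA == 0.  */
    __libc_arm_za_disable ();
    return __do_clone_syscall (flags, stack);
  }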
The __arm_za_disable() function is described in [1] (8.1.3). Note that
the internal Glibc name for this function is __libc_arm_za_disable().
When this change was originally proposed [2,3], it generated a long
discussion where several questions and concerns were raised. Here we
will address these concerns and explain why this change is useful and,
in fact, necessary.
In a nutshell, a C library that conforms to the AAPCS64 spec [1] (for
this change, mainly chapters 6.2 and 6.6) should have a call to the
__arm_za_disable() function in clone() and clone3() wrappers. The following
explains in detail why this is the case.
When we consider using the __arm_za_disable() function inside the clone()
and clone3() libc wrappers, we talk about the C library subroutines clone()
and clone3() rather than the syscalls with similar names. In the current
version of Glibc, clone() is public and clone3() is private, but it being
private is not pertinent to this discussion.
We will begin by stating that this change is NOT a bug fix for something
in the kernel. The requirement to call __arm_za_disable() does NOT come from
the kernel. It also is NOT needed to satisfy a contract between the kernel
and userspace. This is why it is not for the kernel documentation to describe
this requirement. The requirement is instead needed to satisfy a purely
userspace scheme outlined in [1] and to make sure that software that
uses Glibc (or any other C library that handles SME states correctly,
see below) conforms to [1] without having to unnecessarily become
SME-aware and thus lose portability.
To recap (see [1] (6.2)), the SME extension defines SME state, which is
part of the processor state. Part of this SME state is the ZA state,
which is necessary to manage the ZA storage register in the context of
the ZA lazy saving scheme [1] (6.6). This scheme exists because it would
be challenging to handle the ZA storage of SME in either a callee-saved
or a caller-saved manner.
There are 3 kinds of ZA state that are defined in terms of the PSTATE.ZA
bit and the TPIDR2_EL0 register (see [1] (6.6.3)):
- "off": PSTATE.ZA == 0
- "active": PSTATE.ZA == 1 TPIDR2_EL0 == null
- "dormant": PSTATE.ZA == 1 TPIDR2_EL0 != null
As [1] (6.7.2) outlines, every subroutine has exactly one SME-interface
depending on the permitted ZA-states on entry and on normal return from
a call to this subroutine. Callers of a subroutine must know and respect
the ZA-interface of the subroutines they are using. Using a subroutine
in a way that is not permitted by its ZA-interface is undefined behaviour.
In particular, clone() and clone3() (the C library functions) have the
ZA-private interface. This means that the permitted ZA-states on entry
are "off" and "dormant" and that the permitted states on return are "off"
or "dormant" (but if and only if it was "dormant" on entry).
This means that both functions in question should correctly handle both
"off" and "dormant" ZA-states on entry. The conforming states on return
are "off" and "dormant" (if inbound state was already "dormant").
This change ensures that the ZA-state on return is always "off". Note
that, in the context of clone() and clone3(), "on return" means the
point where execution resumes at a certain address after transferring
from clone()
or clone3(). For the caller (we may refer to it as "parent") this is the
return address in the link register to which the RET instruction jumps. For
the "child", this is the target branch address.
So, the "off" state on return is permitted and conformant. Why can't we
retain the "dormant" state? In theory, we can, but we shouldn't, here is
why.
Every subroutine with a private-ZA interface, including clone() and clone3(),
must comply with the lazy saving scheme [1] (6.7.2). This puts additional
responsibility on a subroutine if ZA-state on return is "dormant" because
this state has special meaning. The "caller" (that is the place in code
where execution is transferred to, so this includes both "parent" and "child")
may check the ZA-state and use it as per the spec of the "dormant" state that
is outlined in [1] (6.6.6 and 6.6.7).
Conforming to this would require more code inside of clone() and clone3()
which is hardly desirable.
For the return to "parent" this could be achieved in theory, but given that
neither clone() nor clone3() are supposed to be used in the middle of an
SME operation, it wouldn't be useful. For the "return" to "child" this
would be particularly difficult to achieve given the complexity of these
functions and their interfaces. Most importantly, it would be illegal
and somewhat meaningless to allow a "child" to start execution in the
"dormant" ZA-state because the very essence of the "dormant" state implies
that there is a place to return to and that there is some outer context that
we are allowed to interact with.
To sum up, calling __arm_za_disable() to ensure the "off" ZA-state when the
execution resumes after a call to clone() or clone3() is correct and also
the most simple way to conform to [1].
Can there be situations when we can avoid calling __arm_za_disable()?
Calling __arm_za_disable() implies a certain (sufficiently small)
overhead, so one might rightly consider avoiding a call to this
function when we can afford not to. The most trivial cases like this (e.g. when the
calling thread doesn't have access to SME or to the TPIDR2_EL0 register)
are already handled by this function (see [1] (8.1.3 and 8.1.2)). Reasoning
about other possible use cases would require making the code inside
clone() and clone3() more complicated, which would defeat the point of
an optimisation that avoids calling __arm_za_disable().
Why can't the kernel do this instead?
The handling of SME state by the kernel is described in [4]. In short,
the kernel must not impose a specific ZA-interface onto a userspace
function. Interaction with the kernel happens (among other things) via
system calls.
In Glibc many of the system calls (notably, including SYS_clone and
SYS_clone3) are used via wrappers, and the kernel has no control of them
and, moreover, it cannot dictate how these wrappers should behave because
it is simply outside of the kernel's remit.
However, in certain cases, the kernel may ensure that a "child" doesn't
start in an incorrect state. This is what is done by the recent change
included in the 6.16 kernel [5]. This is not enough to ensure that code
that uses the clone() and clone3() functions conforms to [1] when it runs on a
system that provides SME, hence this change.
[1]: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst
[2]: https://inbox.sourceware.org/libc-alpha/20250522114828.2291047-1-yury.khrustalev@arm.com
[3]: https://inbox.sourceware.org/libc-alpha/20250609121407.3316070-1-yury.khrustalev@arm.com
[4]: https://www.kernel.org/doc/html/v6.16/arch/arm64/sme.html
[5]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cde5c32db55740659fca6d56c09b88800d88fd29
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>