The generic implementation is slightly more optimized than the powerpc
one: it has a cheaper inf/nan check (it avoids FP unit checks and adds
branch prediction hints), and it removes one branch by issuing trunc
instead of a combination of floor/ceil (which also generates less
code).
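As a rough illustration of an integer-based inf/nan check of the kind described above, the following C sketch tests the exponent bits directly instead of using FP comparisons. It is not the code from the patch: the name `sketch_is_inf_or_nan` is hypothetical, and `__builtin_expect` is the GCC/Clang builtin commonly used for branch prediction hints.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if x is +-Inf or NaN, decided purely
   from the bit pattern (binary32: 1 sign, 8 exponent, 23 mantissa
   bits).  */
static inline int
sketch_is_inf_or_nan (float x)
{
  uint32_t ix;
  memcpy (&ix, &x, sizeof ix);          /* Reinterpret the bits.  */
  uint32_t biased_exp = (ix >> 23) & 0xff;
  /* An all-ones exponent means Inf or NaN; hint that this is rare.  */
  return __builtin_expect (biased_exp == 0xff, 0);
}
```

The check needs only an integer shift, mask and compare, so no FP status or comparison instruction is involved.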
On power10 with gcc 14.2.1:
reciprocal-throughput       master   patch    difference
workload-0_1                1.5210   1.3942     8.34%
workload-1_maxint           2.0926   1.3940    33.38%
workload-maxint_maxfloat    1.7851   1.3940    21.91%
workload-integral           1.5216   1.3941     8.37%
latency                     master   patch    difference
workload-0_1                1.5928   2.6337   -65.35%
workload-1_maxint           3.2929   2.6337    20.02%
workload-maxint_maxfloat    1.9697   2.6341   -33.73%
workload-integral           2.0597   2.6337   -27.87%
Checked on powerpc64le-linux-gnu.
Reviewed-by: Sachin Monga <smonga@linux.ibm.com>
Refactor the generic implementation to use math_config.h definitions,
and add an alternative one if the ABI supports truncf instructions
(gated through math-use-builtins-trunc.h).
The generic implementation generates similar code on x86_64. For the
optimized path on aarch64 (where truncf is supported as a builtin
through frintz), the improvements are:
reciprocal-throughput       master   patch    difference
workload-0_1                3.0595   3.0698    -0.34%
workload-1_maxint           5.1747   3.0542    40.98%
workload-maxint_maxfloat    3.4391   3.0349    11.75%
workload-integral           3.2732   3.0293     7.45%
latency                     master   patch    difference
workload-0_1                3.5267   4.7107   -33.57%
workload-1_maxint           6.9074   4.7282    31.55%
workload-maxint_maxfloat    3.7210   4.7506   -27.67%
workload-integral           3.8634   4.8137   -24.60%
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Refactor the generic implementation to use math_config.h definitions,
and add an alternative one if the ABI supports truncf instructions
(gated through math-use-builtins-trunc.h).
The generic implementation generates similar code for x86_64. For the
optimized path on aarch64 (where truncf is supported as a builtin
through frintz), the improvements are:
reciprocal-throughput       master   patch    difference
workload-0_1                3.0740   3.0326     1.35%
workload-1_maxint           5.2231   3.0436    41.73%
workload-maxint_maxfloat    4.0962   3.0551    25.42%
workload-integral           3.7093   3.0612    17.47%
latency                     master   patch    difference
workload-0_1                3.5521   4.7313   -33.20%
workload-1_maxint           6.7148   4.7314    29.54%
workload-maxint_maxfloat    4.0458   4.7518   -17.45%
workload-integral           3.9719   4.7427   -19.40%
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalidf_i/__math_invalidf_li).
Also, __glibc_unlikely is used on the error cases since it helps
code generation on recent gcc.
With gcc-14 on aarch64, the code now builds to:
0000000000000000 <__ilogbf>:
0: 1e260000 fmov w0, s0
4: d3577801 ubfx x1, x0, #23, #8
8: 340000e1 cbz w1, 24 <__ilogbf+0x24>
c: 5101fc20 sub w0, w1, #0x7f
10: 7103fc3f cmp w1, #0xff
14: 54000040 b.eq 1c <__ilogbf+0x1c> // b.none
18: d65f03c0 ret
1c: 12b00000 mov w0, #0x7fffffff // #2147483647
20: 14000000 b 0 <__math_invalidf_i>
24: 53175800 lsl w0, w0, #9
28: 340000a0 cbz w0, 3c <__ilogbf+0x3c>
2c: 5ac01000 clz w0, w0
30: 12800fc1 mov w1, #0xffffff81 // #-127
34: 4b000020 sub w0, w1, w0
38: d65f03c0 ret
3c: 320107e0 mov w0, #0x80000001 // #-2147483647
40: 14000000 b 0 <__math_invalidf_i>
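For readers who prefer C to disassembly, the listing above corresponds roughly to the following sketch. It is reconstructed from the instructions shown, not copied from the patch: `sketch_ilogbf` and `sketch_invalid_i` are hypothetical names, and the error helper here only sets errno, whereas the real out-of-line helper may do more (for example raise a floating-point exception).

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the out-of-line error path.  */
static int
sketch_invalid_i (int r)
{
  errno = EDOM;
  return r;
}

/* Hypothetical sketch of ilogbf on the binary32 bit pattern.  */
static int
sketch_ilogbf (float x)
{
  uint32_t ix;
  memcpy (&ix, &x, sizeof ix);
  int ex = (ix >> 23) & 0xff;           /* ubfx: biased exponent.  */
  if (__builtin_expect (ex == 0, 0))
    {
      uint32_t m = ix << 9;             /* Drop sign and exponent.  */
      if (m == 0)                       /* +-0: value from the listing.  */
        return sketch_invalid_i (-INT_MAX);
      return -127 - __builtin_clz (m);  /* Subnormal: count leading zeros.  */
    }
  if (__builtin_expect (ex == 0xff, 0)) /* Inf or NaN.  */
    return sketch_invalid_i (INT_MAX);
  return ex - 127;                      /* Normal number: remove the bias.  */
}
```

The two error returns, -INT_MAX (0x80000001) and INT_MAX (0x7fffffff), are the constants loaded before the tail calls in the listing.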
Some ABIs require additional adjustments:
* i386 and m68k need to use the template version, since
  both provide __ieee754_ilogb implementations.
* loongarch uses a custom implementation as well.
* powerpc64le also has a custom implementation for POWER9, which
  is also used for the float and float128 versions. The generic
  e_ilogb.c implementation is moved on powerpc to keep the
  current code as-is.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
It removes the wrapper by moving the error/EDOM handling to an
out-of-line implementation (__math_invalid_i/__math_invalid_li).
Also, __glibc_unlikely is used on the error cases since it helps
code generation on recent gcc.
With gcc-14 on aarch64, the code now builds to:
0000000000000000 <__ilogb>:
0: 9e660000 fmov x0, d0
4: d374f801 ubfx x1, x0, #52, #11
8: 340000e1 cbz w1, 24 <__ilogb+0x24>
c: 510ffc20 sub w0, w1, #0x3ff
10: 711ffc3f cmp w1, #0x7ff
14: 54000040 b.eq 1c <__ilogb+0x1c> // b.none
18: d65f03c0 ret
1c: 12b00000 mov w0, #0x7fffffff // #2147483647
20: 14000000 b 0 <__math_invalid_i>
24: d374cc00 lsl x0, x0, #12
28: b40000a0 cbz x0, 3c <__ilogb+0x3c>
2c: dac01000 clz x0, x0
30: 12807fc1 mov w1, #0xfffffc01 // #-1023
34: 4b000020 sub w0, w1, w0
38: d65f03c0 ret
3c: 320107e0 mov w0, #0x80000001 // #-2147483647
40: 14000000 b 0 <__math_invalid_i>
Some ABIs require additional adjustments:
* i386 and m68k need to use the template version, since
  both provide __ieee754_ilogb implementations.
* loongarch uses a custom implementation as well.
* powerpc64le also has a custom implementation for POWER9, which
  is also used for the float and float128 versions. The generic
  e_ilogb.c implementation is moved on powerpc to keep the
  current code as-is.
Checked on aarch64-linux-gnu and x86_64-linux-gnu.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the rootn functions, which compute the Yth root of X for
integer Y (with a domain error if Y is 0, even if X is a NaN). The
integer exponent has type long long int in C23; it was intmax_t in TS
18661-4, and as with other interfaces changed after their initial
appearance in the TS, I don't think we need to support the original
version of the interface.
As with pown and compoundn, I strongly encourage searching for worst
cases for ulps error for these implementations (necessarily
non-exhaustively, given the size of the input space). I also expect a
custom implementation for a given format could be much faster as well
as more accurate, although the implementation is simpler than those
for pown and compoundn.
This completes adding to glibc those TS 18661-4 functions (ignoring
DFP) that are included in C23. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118592 regarding the C23
mathematical functions (not just the TS 18661-4 ones) missing built-in
functions in GCC, where such functions might usefully be added.
Tested for x86_64 and x86, and with build-many-glibcs.py.
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the compoundn functions, which compute (1+X) to the
power Y for integer Y (and X at least -1). The integer exponent has
type long long int in C23; it was intmax_t in TS 18661-4, and as with
other interfaces changed after their initial appearance in the TS, I
don't think we need to support the original version of the interface.
Note that these functions are "compoundn" with a trailing "n", *not*
"compound" (CORE-MATH has the wrong name, for example).
As with pown, I strongly encourage searching for worst cases for ulps
error for these implementations (necessarily non-exhaustively, given
the size of the input space). I also expect a custom implementation
for a given format could be much faster as well as more accurate (I
haven't tested or benchmarked the CORE-MATH implementation for
binary32); this is one of the more complicated and less efficient
functions to implement in a type-generic way.
As with exp2m1 and exp10m1, this showed up places where the
powerpc64le IFUNC setup is not as self-contained as one might hope (in
this case, without the changes specific to powerpc64le, there were
undefined references to __GI___expf128).
Tested for x86_64 and x86, and with build-many-glibcs.py.
The left shift overflows for 'int'; use uint32_t instead. It syncs
with CORE-MATH commit bbfabd99.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
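The class of bug fixed here can be pictured with a minimal C sketch. The helper name `sketch_bit` is hypothetical and the code is not taken from CORE-MATH; it only shows why the shifted operand has to be unsigned.

```c
#include <stdint.h>

/* Hypothetical helper: value with only bit k set, for 0 <= k <= 31.  */
static inline uint32_t
sketch_bit (unsigned int k)
{
  /* Writing `1 << k` would shift a signed int; for k == 31 the result
     is not representable in int, which is undefined behavior in C.
     Shifting an unsigned 32-bit value is well defined.  */
  return UINT32_C (1) << k;
}
```

The same reasoning applies to 64-bit shifts, where the operand needs a 64-bit unsigned type.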
The left shift overflows for 'int'; use uint64_t instead. It syncs
with CORE-MATH commit d0a2be200cbc1344d800d9ef0ebee9ad67dd3ad8.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int'; use uint32_t instead. It syncs
with CORE-MATH commit bbfabd993a71b049c210b0febfd06d18369fadc1.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int64_t'; use unsigned instead. It syncs
with CORE-MATH commit f7c7408d1749ec2859ea249495af699359ae559b.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int'; use uint64_t instead. It syncs
with CORE-MATH commit bbfabd99.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int'; use a literal instead. It syncs
with OPTIMIZED-ROUTINES commit 0f87f607b976820ef41fe64d004fe67dc7af8236.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int'; use uint64_t instead. It syncs
with CORE-MATH commit 4d6192d2.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
The left shift overflows for 'int'; use unsigned instead. It syncs
with CORE-MATH commit 4d6192d2.
Checked on aarch64-linux-gnu, x86_64-linux-gnu, and i686-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
As mentioned by the reporter in a pull request against gcc-mirror,
the THREEp96 constant in e_expl.c is incorrect: it is actually
0x3.p+94f128 rather than 0x3.p+96f128.
The algorithm uses that to compute the t2 integer (tval2), by whose
delta it adjusts the x+xl pair and then in the result uses the precomputed
exp value for that entry.
Using 0x3.p+94f128 rather than 0x3.p+96f128 results in tval2 sometimes
being one smaller, sometimes one larger than the desired value, thus can mean
the x+xl pair after adjustment will be larger in absolute value than it
should be.
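The effect of a wrong exponent in such a constant can be pictured with a double-precision analogue, assuming the constant is used in the common add-then-subtract idiom that rounds a value to a multiple of the constant's ulp. This is a sketch, not the e_expl.c code: the helper name and the double constants 0x3p+51 and 0x3p+49 are illustrative stand-ins for the _Float128 ones, and it assumes round-to-nearest with no excess precision.

```c
/* Hypothetical helper: round x to a multiple of ulp(big) by adding and
   then subtracting big.  The volatile temporary keeps the compiler
   from folding the two operations away.  */
static double
sketch_round_via_constant (double x, double big)
{
  volatile double t = x + big;
  t -= big;
  return t;
}
```

With big = 0x3p+51 the ulp is 1, so 2.3 becomes 2.0; with big = 0x3p+49 the ulp is 0.25, so 2.3 becomes 2.25. A constant that is off by two binary orders of magnitude therefore rounds to a grid four times finer than intended.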
DesWursters created a test program for this
https://github.com/DesWurstes/comparefloats
and his results were
total: 1135000000 not_equal: 4322 earlier_score: 674 later_score: 3648
I've modified this with
https://sourceware.org/bugzilla/show_bug.cgi?id=32411#c3
so that it actually tests pseudo-random _Float128 values with range
(-16384.,16384) with strong bias on values larger than 0.0002 in absolute
value (so that tval1/tval2 aren't zero most of the time) and that gave
total: 10000000000 not_equal: 29861 earlier_score: 4606 later_score: 25255
So, in both cases, the change mostly doesn't result in any differences,
and in those rare cases where it does, about 85% have smaller ulp than
without the patch.
Additionally I've tried
https://sourceware.org/bugzilla/show_bug.cgi?id=32411#c4
and in 2 billion iterations it didn't find any case where x+xl after the
adjustments without this change would be smaller in absolute value compared
to x+xl after the adjustments with this change.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Reject invalid formatted scanf real input data the exponent part of
which is comprised of an exponent introducing character, optionally
followed by a sign, and with no actual digits following. Such data is a
prefix of, but not itself, a matching input sequence, and ISO C requires
it to cause a matching failure.
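The rule can be sketched as a small C predicate. This only illustrates the condition being enforced and is not the scanf implementation: the name `sketch_exponent_is_complete` is hypothetical, and only the decimal exponent character is handled.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: p points at the exponent-introducing character.
   Returns true only if at least one digit follows the optional sign.  */
static bool
sketch_exponent_is_complete (const char *p)
{
  if (*p != 'e' && *p != 'E')
    return false;
  p++;
  if (*p == '+' || *p == '-')
    p++;
  return isdigit ((unsigned char) *p) != 0;
}
```

Under this rule the exponent parts of inputs such as "0e" and "0e+" are incomplete, so the conversion has to fail rather than yield the significand alone.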
Currently a matching success is instead incorrectly produced along with
the conversion result according to the input significand read and the
exponent of zero, with the significand and the exponent part wholly
consumed from input.
Correct an invalid `tstscanf.c' test accordingly that expects a matching
success for input data provided in the ISO C standard as an example for
a matching failure.
Enable input data that causes test failures without this fix in place.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Reject invalid formatted scanf real input data that is comprised of a
hexadecimal prefix, optionally preceded by a sign, and with no actual
digits following owing to the field width restriction in effect. Such
data is a prefix of, but not itself, a matching input sequence, and ISO C
requires it to cause a matching failure.
Currently a matching success is instead incorrectly produced along with
the conversion result of zero, with the prefix wholly consumed from
input. Where the end of input is marked by the end-of-file condition
rather than the field width restriction in effect a matching failure is
already correctly produced.
Enable input data that causes test failures without this fix in place.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Add Makefile infrastructure, a format-specific test skeleton providing a
data comparison implementation that ignores bits of data representation
in memory that do not participate in holding floating-point data, and
`long double' real input data for targets using the Intel/Motorola
80-bit format.
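A comparison that ignores non-participating bytes might look like the following C sketch. It is not the test skeleton from the patch: the names are hypothetical, and it assumes the x86 layout, where the 10 data bytes of the 80-bit format come first in memory and the remaining bytes of the object are padding.

```c
#include <float.h>
#include <stdbool.h>
#include <string.h>

/* Assumption: with a 64-bit significand (the 80-bit format) on x86,
   only the first 10 bytes of a long double hold data.  */
#if LDBL_MANT_DIG == 64
# define SKETCH_LD_DATA_BYTES 10
#else
# define SKETCH_LD_DATA_BYTES sizeof (long double)
#endif

/* Hypothetical helper: compare two long double object representations
   while ignoring the padding bytes.  */
static bool
sketch_ld_equal (const void *a, const void *b)
{
  return memcmp (a, b, SKETCH_LD_DATA_BYTES) == 0;
}
```

A plain memcmp over sizeof (long double) could report a mismatch for equal values whose padding bytes differ, which is what such a comparison avoids.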
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the pown functions, which are like pow but with an
integer exponent. That exponent has type long long int in C23; it was
intmax_t in TS 18661-4, and as with other interfaces changed after
their initial appearance in the TS, I don't think we need to support
the original version of the interface. The test inputs are based on
the subset of test inputs for pow that use integer exponents that fit
in long long.
As the first such template implementation that saves and restores the
rounding mode internally (to avoid possible issues with directed
rounding and intermediate overflows or underflows in the wrong
rounding mode), support also needed to be added for using
SET_RESTORE_ROUND* in such template function implementations. This
required math-type-macros-float128.h to include <fenv_private.h>, so
it can tell whether SET_RESTORE_ROUNDF128 is defined. In turn, the
include order with <fenv_private.h> included before <math_private.h>
broke loongarch builds, showing up that
sysdeps/loongarch/math_private.h is really a fenv_private.h file
(maybe implemented internally before the consistent split of those
headers in 2018?) and needed to be renamed to fenv_private.h to avoid
errors with duplicate macro definitions if <math_private.h> is
included after <fenv_private.h>.
The underlying implementation uses __ieee754_pow functions (called
more than once in some cases, where the exponent does not fit in the
floating type). I expect a custom implementation for a given format,
that only handles integer exponents but handles larger exponents
directly, could be faster and more accurate in some cases.
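To make the idea of handling integer exponents directly more concrete, here is a minimal C sketch using binary exponentiation. It is a different technique from the patch, which calls the pow implementation: `sketch_pown` is a hypothetical name, and the sketch makes no attempt at correct rounding, rounding-mode handling, or avoiding intermediate overflow for negative exponents.

```c
/* Hypothetical sketch: x raised to the integer power n by repeated
   squaring, consuming one bit of the exponent per iteration.  */
static double
sketch_pown (double x, long long int n)
{
  /* Negate in unsigned arithmetic so that LLONG_MIN is handled.  */
  unsigned long long int e = n < 0 ? -(unsigned long long int) n
                                   : (unsigned long long int) n;
  double result = 1.0;
  double base = x;
  while (e != 0)
    {
      if (e & 1)
        result *= base;
      base *= base;
      e >>= 1;
    }
  return n < 0 ? 1.0 / result : result;
}
```

Each multiplication rounds, so the error grows with the number of set bits and squarings; that accumulated error is one reason a production implementation needs more care than this.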
I encourage searching for worst cases for ulps error for these
implementations (necessarily non-exhaustively, given the size of the
input space).
Tested for x86_64 and x86, and with build-many-glibcs.py.
Add Makefile infrastructure and IBM 128-bit 'long double' real input for
targets switching between the IEEE 754 binary128 and IBM 128-bit formats
with '-mabi=ieeelongdouble' and '-mabi=ibmlongdouble'. Reuse IEEE 754
binary128 input data but with modified output file names so as not to
clash with the names used for IBM 128-bit format tests made with common
rules for the 'long double' data type.
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Add Makefile infrastructure and 64-bit `long double' real input data for
targets switching between the IEEE 754 binary64 and IEEE 754 binary128
formats with `-mlong-double-64' and `-mlong-double-128'. Use modified
output file names for the IEEE 754 binary64 format so as not to clash
with the names used for IEEE 754 binary128 format tests made with common
rules for the 'long double' data type.
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Add Makefile infrastructure and `long double' real input data for
targets using the IEEE 754 binary128 format.
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Add Makefile infrastructure and `double' real input data for targets
using the IEEE 754 binary64 format.
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
Add Makefile infrastructure and `float' real input data for targets
using the IEEE 754 binary32 format.
Keep input data disabled and referring to BZ #12701 for entries that
are currently incorrectly accepted as valid data, such as '0e', '0e+',
'0x', '0x8p', '0x0p-', etc.
Reviewed-by: Joseph Myers <josmyers@redhat.com>
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the powr functions, which are like pow, but with simpler
handling of special cases (based on exp(y*log(x)), so negative x and
0^0 are domain errors, powers of -0 are always +0 or +Inf never -0 or
-Inf, and 1^+-Inf and Inf^0 are also domain errors, while NaN^0 and
1^NaN are NaN). The test inputs are taken from those for pow, with
appropriate adjustments (including removing all tests that would be
domain errors from those in auto-libm-test-in and adding some more
such tests in libm-test-powr.inc).
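The special cases listed above can be summarized as a C predicate. This is a sketch of the classification only, not the implementation added by the patch, and the name `sketch_powr_is_domain_error` is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical helper: true if powr (x, y) is a domain error according
   to the special cases listed above.  */
static bool
sketch_powr_is_domain_error (double x, double y)
{
  const double inf = __builtin_inf ();
  if (x < 0.0)                  /* Negative x; false for -0 and NaN.  */
    return true;
  if (x == 0.0 && y == 0.0)     /* 0^0; == also matches negative zero.  */
    return true;
  if (x == 1.0 && (y == inf || y == -inf))      /* 1^+-Inf.  */
    return true;
  if (x == inf && y == 0.0)     /* Inf^0.  */
    return true;
  return false;                 /* Includes NaN^0 and 1^NaN.  */
}
```

Because comparisons with NaN are false, NaN^0 and 1^NaN fall through to the final return and are not treated as domain errors here.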
The underlying implementation uses __ieee754_pow functions after
dealing with all special cases that need to be handled differently.
It might be a little faster (avoiding a wrapper and redundant checks
for special cases) to have an underlying implementation built
separately for both pow and powr with compile-time conditionals for
special-case handling, but I expect the benefit of that would be
limited given that both functions will end up needing to use the same
logic for computing pow outside of special cases.
My understanding is that powr(negative, qNaN) should raise "invalid":
that the rule on "invalid" for an argument outside the domain of the
function takes precedence over a quiet NaN argument producing a quiet
NaN result with no exceptions raised (for rootn it's explicit that the
0th root of qNaN raises "invalid"). I've raised this on the WG14
reflector to confirm the intent.
Tested for x86_64 and x86, and with build-many-glibcs.py.
On SPR, it improves atanh bench performance by:
                       Before    After     Improvement
reciprocal-throughput  15.1715   14.8628   2%
latency                57.1941   56.1883   2%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On SPR, it improves sinh bench performance by:
                       Before    After     Improvement
reciprocal-throughput  14.2017   11.815    17%
latency                36.4917   35.2114   4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
On Skylake, it improves tanh bench performance by:
      Before    After     Improvement
max   110.89    95.826    14%
min   20.966    20.157    4%
mean  30.9601   29.8431   4%
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
The current approach tracks math maximum supported errors by explicitly
setting them per function and architecture. On newer implementations or
new compiler versions, the file is updated with newer values if it
shows higher results. The idea is to track the maximum known error, to
update the manual with the obtained values.
The constant libm-test-ulps updates show little value: they are usually
a mechanical change done by the maintainer; for past releases, whether
the ulp change resulted from a compiler regression was usually ignored;
and the math tests already have a maximum ulp error that triggers a
regression.
This was shown by a recent update after the new, correctly rounded
acosf implementation [1], where the libm-test-ulps change was indeed
caused by a compiler issue.
This patch removes all arch-specific libm-test-ulps, adds system generic
libm-test-ulps where applicable, and changes its semantics. The generic
files now track specific implementation constraints, like if it is
expected to be correctly rounded, or if the system-specific has
different error expectations.
Now multiple libm-test-ulps can be defined, and system-specific
overrides generic implementation. This is for the case where
arch-specific implementation might show worse precision than generic
implementation, for instance, the cbrtf on i686.
Regressions are only reported if the implementation shows errors larger
than 9 ulps (13 for IBM long double), unless this is overridden by
libm-test-ulps. The maximum error is no longer printed at the end of
tests.
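The default thresholds can be pictured with a short C sketch. It is not the libm test harness, which measures errors against a reference result: the names are hypothetical, and the ulp distance here is a simplification taken from the bit patterns of two finite doubles of the same sign.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: distance in units in the last place between two
   finite doubles of the same sign, taken from their bit patterns.  */
static uint64_t
sketch_ulp_distance (double a, double b)
{
  uint64_t ia, ib;
  memcpy (&ia, &a, sizeof ia);
  memcpy (&ib, &b, sizeof ib);
  return ia > ib ? ia - ib : ib - ia;
}

/* Default thresholds named above: 9 ulps, or 13 for IBM long double.  */
static bool
sketch_is_regression (uint64_t ulps, bool ibm_long_double)
{
  return ulps > (ibm_long_double ? 13u : 9u);
}
```

An entry in libm-test-ulps would replace the default limit for one function; that override is not modeled here.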
The regen-ulps rule is also removed since it does not make sense to
update the libm-test-ulps automatically.
The manual error table is also removed; Paul Zimmermann and others have
been tracking libm precision with a more comprehensive analysis for some
releases, so link to his work instead.
[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
C23 adds various <math.h> function families originally defined in TS
18661-4. Add the rsqrt functions (1/sqrt(x)). The test inputs are
taken from those for sqrt.
Tested for x86_64 and x86, and with build-many-glibcs.py.
The single-precision remainderf() and quad-precision remainderl()
implementations derived from Sun are affected by an issue when the
result is +-0. IEEE754 requires that if remainder(x, y) = 0, its sign
shall be that of x regardless of the rounding direction.
The implementation seems to have assumed that x - x = +0 in all
rounding modes, which is not the case. When the rounding direction is
roundTowardNegative, an exact zero sum (or difference) is −0.
Regression tests that triggered this erroneous behavior are added to
math/libm-test-remainder.inc.
Tested for cross riscv64 and powerpc.
Original fix by: Bruce Evans <bde@FreeBSD.org> in FreeBSD's
a2ddfa5ea726c56dbf825763ad371c261b89b7c7.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
A number of fma tests started to fail on hppa when gcc was changed to
use Ranger rather than EVRP. Eventually I found that the value of
a1 + u.d in this block of code was being computed in FE_TOWARDZERO
mode and not the original rounding mode:
  if (TININESS_AFTER_ROUNDING)
    {
      w.d = a1 + u.d;
      if (w.ieee.exponent == 109)
        return w.d * 0x1p-108;
    }
This caused the exponent value to be wrong and the wrong return path
to be used.
Here we add an optimization barrier after the rounding mode is reset
to ensure that the previous value of a1 + u.d is not reused.
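An optimization barrier of this kind can be sketched in C with an empty inline asm statement, in the spirit of glibc's math_opt_barrier. The macro and function names below are hypothetical, the rounding-mode change is only indicated in comments, and the exact barrier used by the fix is not shown here.

```c
/* Hypothetical macro: an empty asm statement that claims to read and
   modify x in memory, so the compiler cannot reuse a value that
   depends on x and was computed before this point.  */
#define SKETCH_OPT_BARRIER(x) __asm__ volatile ("" : "+m" (x))

static double
sketch_sum_after_barrier (double a, double b)
{
  double before = a + b;    /* Imagine this ran in FE_TOWARDZERO.  */
  /* ... the original rounding mode would be restored here ...  */
  SKETCH_OPT_BARRIER (a);   /* a may have changed as far as gcc knows.  */
  double after = a + b;     /* Must be recomputed, not copied.  */
  (void) before;
  return after;
}
```

Without the barrier, the compiler is free to treat both additions as the same expression and reuse the first result.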
Signed-off-by: John David Anglin <dave.anglin@bell.net>
GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp. Exp10 is improved by moving invlog10_2N later
so that neglog10_2hiN and neglog10_2loN can be loaded using load-pair.
The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2. Exp10 improves by 1.5%.
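The layout idea can be sketched in C as follows. This is a hypothetical struct, not the real exp_data: only the three field names come from the description above, the filler field is invented for illustration, and the static assertions merely check that the hi/lo pair is contiguous and starts on a 16-byte offset.

```c
#include <stddef.h>

/* Hypothetical layout: the hi/lo pair comes first and is adjacent,
   and invlog10_2N is placed after it.  */
struct sketch_exp10_data
{
  double neglog10_2hiN;
  double neglog10_2loN;
  double invlog10_2N;
  double sketch_filler;   /* Hypothetical: keeps the size a multiple of 16.  */
};

_Static_assert (offsetof (struct sketch_exp10_data, neglog10_2hiN) % 16 == 0,
                "pair starts on a 16-byte offset");
_Static_assert (offsetof (struct sketch_exp10_data, neglog10_2loN)
                == offsetof (struct sketch_exp10_data, neglog10_2hiN)
                   + sizeof (double),
                "pair is contiguous");
```

When the object itself is 16-byte aligned, two adjacent doubles at such an offset are candidates for a single load-pair instruction.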
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
The libm size improvement built with "--enable-stack-protector=strong
--enable-bind-now=yes --enable-fortify-source=2":
Before:
   text    data     bss     dec     hex filename
 585192     860      12  586064   8f150 aarch64-linux-gnu/math/libm.so
 960775    1068      12  961855   ead3f x86_64-linux-gnu/math/libm.so
1189174    5544     368 1195086  123c4e powerpc64le-linux-gnu/math/libm.so
After:
   text    data     bss     dec     hex filename
 584952     860      12  585824   8f060 aarch64-linux-gnu/math/libm.so
 960615    1068      12  961695   eac9f x86_64-linux-gnu/math/libm.so
1189078    5544     368 1194990  123bee powerpc64le-linux-gnu/math/libm.so
There are small code changes for x86_64 and powerpc64le, which do not
affect performance; but on aarch64 with gcc-14 I see slightly better
code generation due to the use of ldq for floating point constant loading.
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
The libm size improvement built with "--enable-stack-protector=strong
--enable-bind-now=yes --enable-fortify-source=2":
Before:
   text    data     bss     dec     hex filename
 587304     860      12  588176   8f990 aarch64-linux-gnu-master/math/libm.so
 962855    1068      12  963935   eb55f x86_64-linux-gnu-master/math/libm.so
1191222    5544     368 1197134  12444e powerpc64le-linux-gnu-master/math/libm.so
After:
   text    data     bss     dec     hex filename
 585192     860      12  586064   8f150 aarch64-linux-gnu/math/libm.so
 960775    1068      12  961855   ead3f x86_64-linux-gnu/math/libm.so
1189174    5544     368 1195086  123c4e powerpc64le-linux-gnu/math/libm.so
There are small code changes for x86_64 and powerpc64le, which do not
affect performance; but on aarch64 with gcc-14 I see slightly better
code generation due to the use of ldq for floating point constant loading.
Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>