The current implementation relies on setting the rounding mode
(FE_TOWARDZERO) around different calculations to obtain correctly
rounded results.  On most CPUs this adds significant performance
overhead: getting/setting the floating-point status is typically a
slow instruction, it may force a pipeline flush, and it blocks some
compiler assumptions/optimizations across the rounding-mode change.
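For illustration only, a minimal sketch of the kind of pattern being
removed (not the actual glibc code; compute_toward_zero and op are
hypothetical names), assuming <fenv.h> rounding-mode control:

  #include <fenv.h>

  /* Hypothetical sketch: temporarily force FE_TOWARDZERO around a
     computation.  The fegetround/fesetround pair is what is typically
     slow and what prevents the compiler from optimizing across the
     rounding-mode change.  */
  static double
  compute_toward_zero (double (*op) (double), double x)
  {
    int saved = fegetround ();
    fesetround (FE_TOWARDZERO);
    double r = op (x);
    fesetround (saved);
    return r;
  }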
The original implementation adds tests to handle underflow in corner
cases, whereas this implementation uses a different strategy: it
inspects both the mantissa and the result to determine whether the
result can be affected by double rounding.
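As a rough illustration of that kind of check (a sketch under the
assumption that a binary32 result is obtained by rounding a binary64
intermediate; double_rounding_possible is a hypothetical helper, not
the actual glibc code):

  #include <stdint.h>
  #include <string.h>

  /* For a binary64 value y with a normal binary32 result, double
     rounding can only change the final answer when the 29 mantissa
     bits below binary32 precision form the halfway pattern 0x10000000
     (subnormal results need separate handling).  */
  static inline int
  double_rounding_possible (double y)
  {
    uint64_t u;
    memcpy (&u, &y, sizeof u);
    return (u & 0x1fffffff) == 0x10000000;
  }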
I tested this implementation on various targets (x86_64, i686, arm,
aarch64, powerpc), including some runs with the relevant compiler
instructions manually disabled.
Performance-wise, it shows large improvements:
reciprocal-throughput   master  patched  improvement
x86_64 [1]               58.09     7.96        7.33x
i686 [1]                279.41    16.97       16.46x
aarch64 [2]              26.09     4.10        6.35x
armhf [2]                30.25     4.20        7.18x
powerpc [3]               9.46     1.46        6.45x

latency                 master  patched  improvement
x86_64                   64.50    14.25        4.53x
i686                    304.39    61.04        4.99x
aarch64                  27.71     5.74        4.82x
armhf                    33.46     7.34        4.55x
powerpc                  10.96     2.65        4.13x
Checked on x86_64-linux-gnu and i686-linux-gnu with --disable-multi-arch,
and on arm-linux-gnueabihf.
[1] gcc 15.2.1, Zen3
[2] gcc 15.2.1, Neoverse N1
[3] gcc 15.2.1, POWER10
Signed-off-by: Szabolcs Nagy <nsz@gcc.gnu.org>
Co-authored-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Co-authored-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>