1
0
mirror of https://sourceware.org/git/glibc.git synced 2025-09-15 12:01:15 +03:00
Commit Graph

505 Commits

Author SHA1 Message Date
Noah Goldstein
4af6844aa5 x86: Optimize memrchr-evex.S
Optimizations are:
1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch
   in short string case.
2. Save several instructions in len = [VEC_SIZE, 4 * VEC_SIZE] case.
3. Use more code-size efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...

Code Size Changes:
memrchr-evex.S      :  -29 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.

memrchr-evex.S      : 0.949 (Mostly from improvements in small strings)

Full results attached in email.

Full check passes on x86-64.
2022-10-19 17:31:03 -07:00
Noah Goldstein
b79f8ff26a x86: Optimize strnlen-evex.S and implement with VMM headers
Optimizations are:
1. Use the fact that bsf(0) leaves the destination unchanged to save a
   branch in short string case.
2. Restructure code so that small strings are given the hot path.
        - This is a net-zero on the benchmark suite but in general makes
      sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
   blocks / causes the basic-block to span extra cache-lines.

The optimizations (especially for point 2) make the strnlen and
strlen code essentially incompatible so split strnlen-evex
to a new file.

Code Size Changes:
strlen-evex.S       :  -23 bytes
strnlen-evex.S      : -167 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.

strlen-evex.S       : 0.992 (No real change)
strnlen-evex.S      : 0.947

Full results attached in email.

Full check passes on x86-64.
2022-10-19 17:31:03 -07:00
Noah Goldstein
69717709ec x86: Shrink / minorly optimize strchr-evex and implement with VMM headers
Size Optimizations:
1. Condence hot path for better cache-locality.
    - This is most impact for strchrnul where the logic strings with
      len <= VEC_SIZE or with a match in the first VEC no fits entirely
      in the first cache line.
2. Reuse common targets in first 4x VEC and after the loop.
3. Don't align targets so aggressively if it doesn't change the number
   of fetch blocks it will require and put more care in avoiding the
   case where targets unnecessarily split cache lines.
4. Align the loop better for DSB/LSD
5. Use more code-size efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...
6. Align labels less aggressively, especially if it doesn't save fetch
   blocks / causes the basic-block to span extra cache-lines.

Code Size Changes:
strchr-evex.S	: -63 bytes
strchrnul-evex.S: -48 bytes

Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.

strchr-evex.S (Fixed)   : 0.971
strchr-evex.S (Rand)    : 0.932
strchrnul-evex.S        : 0.965

Full results attached in email.

Full check passes on x86-64.
2022-10-19 17:31:03 -07:00
Noah Goldstein
330881763e x86: Optimize memchr-evex.S and implement with VMM headers
Optimizations are:

1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch
   in short string case.
2. Restructure code so that small strings are given the hot path.
	- This is a net-zero on the benchmark suite but in general makes
      sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
   blocks / causes the basic-block to span extra cache-lines.

The optimizations (especially for point 2) make the memchr and
rawmemchr code essentially incompatible so split rawmemchr-evex
to a new file.

Code Size Changes:
memchr-evex.S       : -107 bytes
rawmemchr-evex.S    :  -53 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time so < 1.0 is
improvement and 1.0 is regression.

memchr-evex.S       : 0.928
rawmemchr-evex.S    : 0.986 (Less targets cross cache lines)

Full results attached in email.

Full check passes on x86-64.
2022-10-19 17:31:03 -07:00
Sunil K Pandey
451c6e5854 x86_64: Implement evex512 version of memchr, rawmemchr and wmemchr
This patch implements following evex512 version of string functions.
evex512 version takes up to 30% less cycle as compared to evex,
depending on length and alignment.

- memchr function using 512 bit vectors.
- rawmemchr function using 512 bit vectors.
- wmemchr function using 512 bit vectors.

Code size data:

memchr-evex.o		762 byte
memchr-evex512.o	576 byte (-24%)

rawmemchr-evex.o	461 byte
rawmemchr-evex512.o	412 byte (-11%)

wmemchr-evex.o		794 byte
wmemchr-evex512.o	552 byte (-30%)

Placeholder function, not used by any processor at the moment.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
2022-10-18 13:26:33 -07:00
Noah Goldstein
be066536bd x86: Update strlen-evex-base to use new reg/vec macros.
To avoid duplicate the VMM / GPR / mask insn macros in all incoming
evex512 files use the macros defined in 'reg-macros.h' and
'{vec}-macros.h'

This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Noah Goldstein
47f5d51461 x86: Remove now unused vec header macros.
This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Noah Goldstein
a6784653f7 x86: Update memset to use new VEC macros
Replace %VEC(n) -> %VMM(n)

This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Noah Goldstein
4fb7d8a938 x86: Update memmove to use new VEC macros
Replace %VEC(n) -> %VMM(n)

This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Noah Goldstein
3088a66ff8 x86: Update memrchr to use new VEC macros
Replace %VEC(n) -> %VMM(n)

This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Noah Goldstein
52ab7604db x86: Update VEC macros to complete API for evex/evex512 impls
1) Copy so that backport will be easier.
2) Make section only define if there is not a previous definition
3) Add `VEC_lo` definition for proper reg-width but in the
   ymm/zmm0-15 range.
4) Add macros for accessing GPRs based on VEC_SIZE
        This is to make it easier to do think like:
        ```
            vpcmpb %VEC(0), %VEC(1), %k0
            kmov{d|q} %k0, %{eax|rax}
            test %{eax|rax}
        ```
        It adds macro s.t any GPR can get the proper width with:
            `V{upcase_GPR_name}`

        and any mask insn can get the proper width with:
            `{upcase_mask_insn_without_postfix}`

This commit does not change libc.so

Tested build on x86-64
2022-10-14 21:21:58 -07:00
Adhemerval Zanella
5355f9ca7b elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support
Besides the option being gcc specific, this approach is still fragile
and not future proof since we do not know if this will be the only
optimization option gcc will add that transforms loops to memset
(or any libcall).

This patch adds a new header, dl-symbol-redir-ifunc.h, that can b
used to redirect the compiler generated libcalls to port the generic
memset implementation if required.

Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-10-10 10:32:28 -03:00
Adhemerval Zanella Netto
9dc4e29f63 x86: Fix -Os build (BZ #29576)
The compiler might transform __stpcpy calls (which are routed to
__builtin_stpcpy as an optimization) to strcpy and x86_64 strcpy
multiarch implementation does not build any working symbol due
ISA_SHOULD_BUILD not being evaluated for IS_IN(rtld).

Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Tested-by: Carlos O'Donell <carlos@redhat.com>
2022-10-05 18:04:13 -03:00
Aurelien Jarno
7e8283170c x86-64: Require BMI1/BMI2 for AVX2 strrchr and wcsrchr implementations
The AVX2 strrchr and wcsrchr implementation uses the 'blsmsk'
instruction which belongs to the BMI1 CPU feature and the 'shrx'
instruction, which belongs to the BMI2 CPU feature.

Fixes: df7e295d18 ("x86: Optimize {str|wcs}rchr-avx2")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
3c0c78afab x86-64: Require BMI2 and LZCNT for AVX2 memrchr implementation
The AVX2 memrchr implementation uses the 'shlxl' instruction, which
belongs to the BMI2 CPU feature and uses the 'lzcnt' instruction, which
belongs to the LZCNT CPU feature.

Fixes: af5306a735 ("x86: Optimize memrchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
e3e7fab7fe x86-64: Require BMI2 for AVX2 (raw|w)memchr implementations
The AVX2 memchr, rawmemchr and wmemchr implementations use the 'bzhi'
and 'sarx' instructions, which belongs to the BMI2 CPU feature.

Fixes: acfd088a19 ("x86: Optimize memchr-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
f31a5a884e x86-64: Require BMI2 for AVX2 wcs(n)cmp implementations
The AVX2 wcs(n)cmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
fc7de1d9b9 x86-64: Require BMI2 for AVX2 strncmp implementation
The AVX2 strncmp implementations uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
4d64c64457 x86-64: Require BMI2 for AVX2 strcmp implementation
The AVX2 strcmp implementation uses the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Aurelien Jarno
10f79d3670 x86-64: Require BMI2 for AVX2 str(n)casecmp implementations
The AVX2 str(n)casecmp implementations use the 'bzhi' instruction, which
belongs to the BMI2 CPU feature.

NB: It also uses the 'tzcnt' BMI1 instruction, but it is executed as BSF
as BSF if the CPU doesn't support TZCNT, and produces the same result
for non-zero input.

Partially fixes: b77b06e0e2 ("x86: Optimize strcmp-avx2.S")
Partially resolves: BZ #29611

Reviewed-by: Noah Goldstein  <goldstein.w.n@gmail.com>
2022-10-03 23:46:11 +02:00
Noah Goldstein
b0969fa53a x86: Fix wcsnlen-avx2 page cross length comparison [BZ #29591]
Previous implementation was adjusting length (rsi) to match
bytes (eax), but since there is no bound to length this can cause
overflow.

Fix is to just convert the byte-count (eax) to length by dividing by
sizeof (wchar_t) before the comparison.

Full check passes on x86-64 and build succeeds w/ and w/o multiarch.
2022-09-28 20:15:16 -07:00
Noah Goldstein
312ded0d63 x86: Fix #define STRCPY guard in strcpy-sse2.S
`#ifndef STPCPY` is incorrect for checking if `STRCPY` is already
defined.  It doesn't end up mattering as the whole check is
guarded by `#if IS_IN (libc)` but is incorrect none the less.
2022-08-09 17:00:03 +08:00
Noah Goldstein
49889fb256 x86: Add support to build st{p|r}{n}{cpy|cat} with explicit ISA level
1. Add default ISA level selection in non-multiarch/rtld
   implementations.

2. Add ISA level build guards to different implementations.
    - I.e strcpy-avx2.S which is ISA level 3 will only build if
      compiled ISA level <= 3. Otherwise there is no reason to
      include it as we will always use one of the ISA level 4
      implementations (strcpy-evex.S).

3. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-16 03:07:59 -07:00
Noah Goldstein
192979ee35 x86: Add support to build wcscpy with explicit ISA level
1. Add ISA level build guards to different implementations.
    - wcscpy-ssse3.S is used as ISA level 2/3/4.
    - wcscpy-generic.c is only used at ISA level 1 and will
      only build if compiled with ISA level == 1. Otherwise
      there is no reason to include it as we will always use
      wcscpy-ssse3.S

2. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-16 03:07:59 -07:00
Noah Goldstein
ceabdcd130 x86: Add support to build strcmp/strlen/strchr with explicit ISA level
1. Add default ISA level selection in non-multiarch/rtld
   implementations.

2. Add ISA level build guards to different implementations.
    - I.e strcmp-avx2.S which is ISA level 3 will only build if
      compiled ISA level <= 3. Otherwise there is no reason to
      include it as we will always use one of the ISA level 4
      implementations (strcmp-evex.S).

3. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-16 03:07:59 -07:00
Noah Goldstein
42b014dd1b x86: Remove unneeded rtld-wmemcmp
wmemcmp isn't used by the dynamic loader so their no need to add an
RTLD stub for it.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
e19bb87c97 x86: Move wcslen SSE2 implementation to multiarch/wcslen-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
64479f11b7 x86: Move wcschr SSE2 implementation to multiarch/wcschr-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
72a48ec0f7 x86: Move strcat SSE2 implementation to multiarch/strcat-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
cd080d0741 x86: Move strchr SSE2 implementation to multiarch/strchr-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
425647458b x86: Move strrchr SSE2 implementation to multiarch/strrchr-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
08af081ffd x86: Move memrchr SSE2 implementation to multiarch/memrchr-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
6b9006bfb0 x86: Move strcpy SSE2 implementation to multiarch/strcpy-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
58e6cd4bcb x86: Move strlen SSE2 implementation to multiarch/strlen-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
60a583ec60 x86: Move strcmp SSE42 implementation to multiarch/strcmp-sse4_2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
427eaa2c85 x86: Move wcscmp SSE2 implementation to multiarch/wcscmp-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
d561fbb041 x86: Move strcmp SSE2 implementation to multiarch/strcmp-sse2.S
This commit doesn't affect libc.so.6, its just housekeeping to prepare
for adding explicit ISA level support.

Because strcmp-sse2.S implements so many functions (more from
avx2/evex/sse42) add a new file 'strcmp-naming.h' to assist in
getting the correct symbol name for all the function across
multiarch/non-multiarch builds.

Tested build on x86_64 and x86_32 with/without multiarch.
2022-07-13 14:55:31 -07:00
Noah Goldstein
30e57e0a21 x86: Rename STRCASECMP_NONASCII macro to STRCASECMP_L_NONASCII
The previous macro name can be confusing given that both
`__strcasecmp_l_nonascii` and `__strcasecmp_nonascii` are
functions and we use the `_l` version.
2022-07-13 14:55:31 -07:00
Noah Goldstein
f2698954ff x86: Remove __mmask intrinsics in strstr-avx512.c
The intrinsics are not available before GCC7 and using standard
operators generates code of equivalent or better quality.

Removed:
    _cvtmask64_u64
    _kshiftri_mask64
    _kand_mask64

Geometric Mean of 5 Runs of Full Benchmark Suite New / Old: 0.958
2022-07-12 15:41:14 -07:00
Noah Goldstein
9c38deec96 x86: Remove generic strncat, strncpy, and stpncpy implementations
These functions all have optimized versions:
__strncat_sse2_unaligned, __strncpy_sse2_unaligned, and
stpncpy_sse2_unaligned which are faster than their respective generic
implementations.  Since the sse2 versions can run on baseline x86_64,
we should use these as the baseline implementation and can remove the
generic implementations.

Geometric mean of N=20 runs of the entire benchmark suite on:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake)

__strncat_sse2_unaligned / __strncat_generic: .944
__strncpy_sse2_unaligned / __strncpy_generic: .726
__stpncpy_sse2_unaligned / __stpncpy_generic: .650

Tested build with and without multiarch and full check with multiarch.
2022-07-12 11:44:12 -07:00
H.J. Lu
ec9013727d x86-64: Remove redundant strcspn-generic/strpbrk-generic/strspn-generic
Remove redundant strcspn-generic, strpbrk-generic and strspn-generic
from sysdep_routines in sysdeps/x86_64/multiarch/Makefile added by

commit c69f960b01
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Sun Jul 3 21:28:07 2022 -0700

    x86: Add support for building str{c|p}{brk|spn} with explicit ISA level

since they have been added to sysdep_routines in sysdeps/x86_64/Makefile.
2022-07-08 16:06:04 -07:00
H.J. Lu
eedf7886ed x86-64: Don't mark symbols as hidden in strcmp-XXX.S
Don't mark symbols as hidden in strcmp-avx2.S, strcmp-evex.S and
strcmp-sse42.S since they are marked as hidden in the IFUNC selectors.
2022-07-07 16:38:11 -07:00
Noah Goldstein
ae308947ff x86: Add support for building {w}memcmp{eq} with explicit ISA level
1. Refactor files so that all implementations are in the multiarch
   directory
    - Moved the implementation portion of memcmp sse2 from memcmp.S to
      multiarch/memcmp-sse2.S

    - The non-multiarch file now only includes one of the
      implementations in the multiarch directory based on the compiled
      ISA level (only used for non-multiarch builds.  Otherwise we go
      through the ifunc selector).

2. Add ISA level build guards to different implementations.
    - I.e memcmp-avx2-movsb.S which is ISA level 3 will only build if
      compiled ISA level <= 3. Otherwise there is no reason to include
      it as we will always use one of the ISA level 4
      implementations (memcmp-evex-movbe.S).

3. Add new multiarch/rtld-{w}memcmp{eq}.S that just include the
   non-multiarch {w}memcmp{eq}.S which will in turn select the best
   implementation based on the compiled ISA level.

4. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-05 16:42:42 -07:00
Noah Goldstein
37ecc657b2 x86: Add support for building {w}memset{_chk} with explicit ISA level
1. Refactor files so that all implementations are in the multiarch
   directory
    - Moved the implementation portion of memset sse2 from memset.S to
      multiarch/memset-sse2.S

    - The non-multiarch file now only includes one of the
      implementations in the multiarch directory based on the compiled
      ISA level (only used for non-multiarch builds.  Otherwise we go
      through the ifunc selector).

2. Add ISA level build guards to different implementations.
    - I.e memset-avx2-unaligned-erms.S which is ISA level 3 will only
      build if compiled ISA level <= 3. Otherwise there is no reason
      to include it as we will always use one of the ISA level 4
      implementations (memset-evex-unaligned-erms.S).

3. Add new multiarch/rtld-memset.S that just include the
   non-multiarch memset.S which will in turn select the best
   implementation based on the compiled ISA level.

4. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-05 16:42:42 -07:00
Noah Goldstein
b6a02c3606 x86: Add support for building {w}memmove{_chk} with explicit ISA level
1. Refactor files so that all implementations are in the multiarch
   directory
    - Moved the implementation portion of memmove sse2 from memmove.S
      to multiarch/memmove-sse2.S

    - The non-multiarch file now only includes one of the
      implementations in the multiarch directory based on the compiled
      ISA level (only used for non-multiarch builds.  Otherwise we go
      through the ifunc selector).

2. Add ISA level build guards to different implementations.
    - I.e memmove-avx2-unaligned-erms.S which is ISA level 3 will only
      build if compiled ISA level <= 3. Otherwise there is no reason
      to include it as we will always use one of the ISA level 4
      implementations (memmove-evex-unaligned-erms.S).

3. Add new multiarch/rtld-memmove.S that just include the
   non-multiarch memmove.S which will in turn select the best
   implementation based on the compiled ISA level.

4. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
isa raising memmove
2022-07-05 16:42:42 -07:00
Noah Goldstein
c69f960b01 x86: Add support for building str{c|p}{brk|spn} with explicit ISA level
The changes for these functions are different than the others because
the best implementation (sse4_2) requires the generic
implementation as a fallback to be built as well.

Changes are:

1. Add non-multiarch functions for str{c|p}{brk|spn}.c to statically
   select the best implementation based on the configured ISA build
   level.

2. Add stubs for str{c|p}{brk|spn}-generic and varshift.c to in the
   sysdeps/x86_64 directory so that the the sse4 implementation will
   have all of its dependencies for the non-multiarch / rtld build
   when ISA level >= 2.

3. Add new multiarch/rtld-strcspn.c that just include the
   non-multiarch strcspn.c which will in turn select the best
   implementation based on the compiled ISA level.

4. Refactor the ifunc selector and ifunc implementation list to use
   the ISA level aware wrapper macros that allow functions below the
   compiled ISA level (with a guranteed replacement) to be skipped.

Tested with and without multiarch on x86_64 for ISA levels:
{generic, x86-64-v2, x86-64-v3, x86-64-v4}

And m32 with and without multiarch.
2022-07-05 16:42:42 -07:00
Noah Goldstein
baeae86fb8 x86: Add comment explaining no Slow_SSE4_2 check in ifunc-sse4_2
Just for clarities sake and so that if a future implementation is
added we remember to add the check.
2022-07-05 16:42:42 -07:00
Noah Goldstein
96ac447d91 x86: Add missing IS_IN (libc) check to strncmp-sse4_2.S
Was missing to for the multiarch build rtld-strncmp-sse4_2.os was
being built and exporting symbols:

build/glibc/string/rtld-strncmp-sse4_2.os:
0000000000000000 T __strncmp_sse42

Introduced in:

commit 11ffcacb64
Author: H.J. Lu <hjl.tools@gmail.com>
Date:   Wed Jun 21 12:10:50 2017 -0700

    x86-64: Implement strcmp family IFUNC selectors in C
2022-06-29 19:47:52 -07:00
Noah Goldstein
0aa294fb88 x86: Add missing IS_IN (libc) check to strcspn-sse4.c
Was missing to for the multiarch build rtld-strcspn-sse4.os was
being built and exporting symbols:

build/glibc/string/rtld-strcspn-sse4.os:
                 U ___m128i_shift_right
                 U __strcspn_generic
0000000000000000 T __strcspn_sse42
                 U strlen

build/glibc/string/rtld-varshift.os:
0000000000000000 R ___m128i_shift_right

Introduced in:

commit 06e51c8f3d
Author: H.J. Lu <hongjiu.lu@intel.com>
Date:   Fri Jul 3 02:48:56 2009 -0700

    Add SSE4.2 support for strcspn, strpbrk, and strspn on x86-64.
2022-06-29 19:47:52 -07:00
Noah Goldstein
8cfbbbcdf9 x86: Add missing IS_IN (libc) check to memmove-ssse3.S
Was missing to for the multiarch build rtld-memmove-ssse3.os was
being built and exporting symbols:

>$ nm string/rtld-memmove-ssse3.os
                 U __GI___chk_fail
0000000000000020 T __memcpy_chk_ssse3
0000000000000040 T __memcpy_ssse3
0000000000000020 T __memmove_chk_ssse3
0000000000000040 T __memmove_ssse3
0000000000000000 T __mempcpy_chk_ssse3
0000000000000010 T __mempcpy_ssse3
                 U __x86_shared_cache_size_half

Introduced after 2.35 in:

commit 26b2478322
Author: Noah Goldstein <goldstein.w.n@gmail.com>
Date:   Thu Apr 14 11:47:40 2022 -0500

    x86: Reduce code size of mem{move|pcpy|cpy}-ssse3
2022-06-29 19:47:52 -07:00