
Bug#1017777: bullseye-pu: package glibc/2.31-13+deb11u4



Package: release.debian.org
Severity: normal
Tags: bullseye
User: release.debian.org@packages.debian.org
Usertags: pu
X-Debbugs-Cc: debian-boot@lists.debian.org, debian-glibc@lists.debian.org

[ Reason ]
There are multiple fixes in this upload, mostly coming from the upstream
stable branch:
- One security issue with a CVE entry
- Multiple overflow fixes to wide string functions
- Failure to enforce libio vtable protection
- A performance issue with string functions affecting Skylake-X CPUs (up
  to 40% slower)
- A new NEWS.Debian.gz entry for libc6-dev explaining to users how to
  switch to the TI-RPC implementation following the removal of the Sun
  RPC implementation in glibc 2.31
- Make grantpt usable after multi-threaded fork to prevent Ansible
  deadlocks.

[ Impact ]
In case the update isn't approved, systems will be left with the above
vulnerabilities, and with the performance issue on Skylake-X CPUs.

[ Tests ]
The upstream fixes come with additional tests, which represent a
significant part of the diff.

[ Risks ]
The riskiest part is probably the string function changes, given the
amount of code touched. That said, those changes have been in
testing/sid for more than 6 months, and are also present upstream and
in other distributions. So overall the risk can be considered low.

[ Checklist ]
  [x] *all* changes are documented in the d/changelog
  [x] I reviewed all changes and I approve them
  [x] attach debdiff against the package in (old)stable
  [x] the issue is verified as fixed in unstable

[ Changes ]
Let me comment on the changelog:

  * debian/debhelper.in/libc-dev.NEWS: New file to explain how to update
    programs to use the TI-RPC library instead of the Sun RPC one.  Closes:
    #1014735.

This is just a documentation change explaining the switch from Sun RPC
to TI-RPC, as the switch is not transparent to users and requires some
changes to CFLAGS/LDFLAGS. This has been fixed in testing/sid for ~2
weeks.
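
For illustration only (this sketch is not part of the upload), a legacy
Sun RPC client would now be built against TI-RPC roughly as follows;
the host name and program/version numbers are placeholders, and the
compile command in the trailing comment assumes the libtirpc-dev and
pkg-config packages are installed:

    /* rpc_client.c -- minimal sketch of a client using the TI-RPC
       library instead of the Sun RPC code removed from glibc.  */
    #include <rpc/rpc.h>           /* provided by libtirpc-dev */
    #include <stdio.h>

    int
    main (void)
    {
      /* "remotehost", 0x20000001 and 1 are placeholder values.  */
      CLIENT *clnt = clnt_create ("remotehost", 0x20000001, 1, "tcp");
      if (clnt == NULL)
        {
          clnt_pcreateerror ("clnt_create");
          return 1;
        }
      clnt_destroy (clnt);
      return 0;
    }

    /* Build roughly with:
       cc rpc_client.c $(pkg-config --cflags libtirpc) \
          $(pkg-config --libs libtirpc) -o rpc_client  */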


  * debian/patches/git-updates.diff: update from upstream stable branch:
    - Fix an off-by-one buffer overflow/underflow in getcwd() (CVE-2021-3999).

This is a security fix which is considered low impact by the security
team and therefore didn't warrant a DSA. The fix has been in
testing/sid for ~7 months.
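
To illustrate the fixed behaviour (a sketch, not taken from the patch):
with the update, getcwd() reliably fails with ERANGE for a 1-byte
buffer instead of potentially underflowing it when the working
directory is longer than PATH_MAX:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main (void)
    {
      char buf[1];
      /* With the fix, a size of 1 byte is always rejected cleanly.  */
      if (getcwd (buf, sizeof buf) == NULL && errno == ERANGE)
        puts ("getcwd rejects the 1-byte buffer with ERANGE");
      return 0;
    }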


    - Fix an overflow bug in the SSE2 and AVX2 implementations of wmemchr.
    - Fix an overflow bug in the SSE4.1 and AVX2 implementations of wcslen and
      wcsncat.
    - Fix an overflow bug in the AVX2 and EVEX implementation of wcsncmp.

Those are corner-case fixes to the wide string functions, which have
been in testing/sid for ~6 months.
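
The corner cases are length arguments close to SIZE_MAX, where the
internal byte count (n * sizeof (wchar_t)) used to overflow in the
vectorized implementations. A small sketch in the spirit of the added
upstream tests:

    #include <stdint.h>
    #include <stdio.h>
    #include <wchar.h>

    int
    main (void)
    {
      const wchar_t *s = L"abc";
      /* Both calls are well defined since the strings are
         null-terminated; the bug was in the SSE/AVX2/EVEX code
         mishandling the huge maximum length.  */
      printf ("wcsnlen: %zu\n", wcsnlen (s, SIZE_MAX));        /* 3 */
      printf ("wcsncmp: %d\n", wcsncmp (s, L"abc", SIZE_MAX)); /* 0 */
      return 0;
    }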


    - Add a few EVEX optimized string functions to fix a performance issue (up
      to 40%) with Skylake-X processors.

On CPUs with AVX512 and transactional memory, like Skylake-X, the use
of AVX2 for string functions causes a transactional memory abort, which
results in up to a 40% performance penalty. The fix is to add EVEX
(AVX512) string functions to get back the original performance. These
changes have been in testing/sid for ~6 months.
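
For context (a hedged sketch, not part of the upload), the abort can be
observed by calling a libc string function inside a hardware
transaction on an RTM-capable CPU; it requires compiling with -mrtm,
and the buffer size and iteration count are arbitrary:

    #include <immintrin.h>
    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      char buf[4096];
      memset (buf, 'a', sizeof buf - 1);
      buf[sizeof buf - 1] = '\0';

      unsigned long aborts = 0, tries = 100000;
      volatile size_t sink = 0;
      for (unsigned long i = 0; i < tries; i++)
        {
          unsigned int status = _xbegin ();
          if (status == _XBEGIN_STARTED)
            {
              /* With the AVX2 implementations, the vzeroupper executed
                 here aborts the transaction on Skylake-X.  */
              sink += strlen (buf);
              _xend ();
            }
          else
            aborts++;
        }
      printf ("%lu aborts out of %lu transactions (sink %zu)\n",
              aborts, tries, (size_t) sink);
      return 0;
    }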

 
    - Make grantpt usable after multi-threaded fork.  Closes: #1015740.
    - debian/patches/hurd-i386/git-posix_openpt.diff: rebase.

This change basically removes code, no longer needed, that changed
pseudo-terminal ownership through pt_chown and that can cause a
deadlock on fork, affecting Ansible. Debian hasn't shipped pt_chown on
Linux for ~7 years, so this code is definitely not used, not even in
the debian-installer. The update to git-posix_openpt.diff basically
drops hunks that touched the removed code. This change has been in
testing/sid for ~8 months.
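
For context, the usage pattern that could deadlock looks roughly like
the sketch below (with a placeholder idle thread; this is not a
deterministic reproducer, build with cc -pthread): a process with
running threads forks, and the child then allocates a pseudo-terminal,
which is essentially what Ansible does.

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *
    idle_thread (void *arg)
    {
      (void) arg;
      for (;;)
        pause ();
      return NULL;
    }

    int
    main (void)
    {
      pthread_t thr;
      if (pthread_create (&thr, NULL, idle_thread, NULL) != 0)
        return 1;

      pid_t pid = fork ();
      if (pid == 0)
        {
          /* In the child of a multi-threaded fork, the old grantpt
             could block on locks taken by its helper code; the new
             implementation only performs an ioctl.  */
          int fd = posix_openpt (O_RDWR | O_NOCTTY);
          if (fd < 0 || grantpt (fd) != 0 || unlockpt (fd) != 0)
            _exit (1);
          _exit (0);
        }

      int status;
      waitpid (pid, &status, 0);
      printf ("child exited with status %d\n", WEXITSTATUS (status));
      return WEXITSTATUS (status);
    }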


  * debian/rules.d/build.mk: pass --with-default-link=no to configure to
    ensure that libio vtable protection is enabled.

It has been found recently that with newer binutils versions the libio
vtables are no longer placed in the RELRO segment. This is not
considered a security issue by itself, but a lack of hardening that
could be used in conjunction with another vulnerability. The simple
workaround is to bypass the configure script autodetection and force
the value to no. This change has been in testing/sid for ~6 months, and
has been superseded by a change of the default value since upstream
glibc 2.34.
diff --git a/debian/changelog b/debian/changelog
index b1753afe..448f3d93 100644
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,24 @@
+glibc (2.31-13+deb11u4) bullseye; urgency=medium
+
+  [ Aurelien Jarno ]
+  * debian/debhelper.in/libc-dev.NEWS: New file to explain how to update
+    programs to use the TI-RPC library instead of the Sun RPC one.  Closes:
+    #1014735.
+  * debian/patches/git-updates.diff: update from upstream stable branch:
+    - Fix an off-by-one buffer overflow/underflow in getcwd() (CVE-2021-3999).
+    - Fix an overflow bug in the SSE2 and AVX2 implementations of wmemchr.
+    - Fix an overflow bug in the SSE4.1 and AVX2 implementations of wcslen and
+      wcsncat.
+    - Fix an overflow bug in the AVX2 and EVEX implementation of wcsncmp.
+    - Add a few EVEX optimized string functions to fix a performance issue (up
+      to 40%) with Skylake-X processors.
+    - Make grantpt usable after multi-threaded fork.  Closes: #1015740.
+    - debian/patches/hurd-i386/git-posix_openpt.diff: rebase.
+  * debian/rules.d/build.mk: pass --with-default-link=no to configure to
+    ensure that libio vtable protection is enabled.
+
+ -- Aurelien Jarno <aurel32@debian.org>  Fri, 19 Aug 2022 15:57:19 +0200
+
 glibc (2.31-13+deb11u3) bullseye; urgency=medium
 
   [ Aurelien Jarno ]
diff --git a/debian/debhelper.in/libc-dev.NEWS b/debian/debhelper.in/libc-dev.NEWS
new file mode 100644
index 00000000..2911668e
--- /dev/null
+++ b/debian/debhelper.in/libc-dev.NEWS
@@ -0,0 +1,21 @@
+glibc (2.31-13+deb11u4) bullseye; urgency=medium
+
+  Starting with glibc 2.31, Sun RPC is removed from glibc. This includes the
+  rpcgen program, librpcsvc, and the Sun RPC header files. However backward
+  runtime compatibility is provided, that is to say existing binaries will
+  continue to work.
+
+  In order to link new binaries, the rpcsvc-proto package (a dependency of
+  libc6-dev) provides rpcgen and several rpcsvc header files and RPC protocol
+  definitions from Sun RPC sources that were previously shipped by glibc, and
+  an an alternative RPC library shall be used. The most used alternative
+  library is TI-RPC, the corresponding development package is libtirpc-dev.
+
+  Here are the necessary steps to switch an existing program to use the TI-RPC
+  library:
+  - Make sure the rpcsvc-proto, libtirpc-dev and pkg-config packages are
+    installed.
+  - Add the output of 'pkg-config --cflags libtirpc' to CFLAGS or equivalent.
+  - Add the output of 'pkg-config --libs libtirpc' to LDFLAGS or equivalent.
+
+ -- Aurelien Jarno <aurel32@debian.org>  Wed, 03 Aug 2022 12:07:29 +0200
diff --git a/debian/patches/git-updates.diff b/debian/patches/git-updates.diff
index a6898540..e4bcb9ee 100644
--- a/debian/patches/git-updates.diff
+++ b/debian/patches/git-updates.diff
@@ -1,10 +1,32 @@
 GIT update of https://sourceware.org/git/glibc.git/release/2.31/master from glibc-2.31
 
+diff --git a/INSTALL b/INSTALL
+index 242cb06f91..b487e18634 100644
+--- a/INSTALL
++++ b/INSTALL
+@@ -184,14 +184,9 @@ if 'CFLAGS' is specified it must enable optimization.  For example:
+ '--enable-pt_chown'
+      The file 'pt_chown' is a helper binary for 'grantpt' (*note
+      Pseudo-Terminals: Allocation.) that is installed setuid root to fix
+-     up pseudo-terminal ownership.  It is not built by default because
+-     systems using the Linux kernel are commonly built with the 'devpts'
+-     filesystem enabled and mounted at '/dev/pts', which manages
+-     pseudo-terminal ownership automatically.  By using
+-     '--enable-pt_chown', you may build 'pt_chown' and install it setuid
+-     and owned by 'root'.  The use of 'pt_chown' introduces additional
+-     security risks to the system and you should enable it only if you
+-     understand and accept those risks.
++     up pseudo-terminal ownership on GNU/Hurd.  It is not required on
++     GNU/Linux, and the GNU C Library will not use the installed
++     'pt_chown' program when configured with '--enable-pt_chown'.
+ 
+ '--disable-werror'
+      By default, the GNU C Library is built with '-Werror'.  If you wish
 diff --git a/NEWS b/NEWS
-index 292fbc595a..4402867631 100644
+index 292fbc595a..a3278be684 100644
 --- a/NEWS
 +++ b/NEWS
-@@ -5,6 +5,78 @@ See the end for copying conditions.
+@@ -5,6 +5,90 @@ See the end for copying conditions.
  Please send GNU C library bug reports via <https://sourceware.org/bugzilla/>
  using `glibc' in the "product" field.
  
@@ -42,8 +64,14 @@ index 292fbc595a..4402867631 100644
 +  [26932] libc: sh: Multiple floating point functions defined as stubs only
 +  [27130] "rep movsb" performance issue
 +  [27177] GLIBC_TUNABLES=glibc.cpu.x86_ibt=on:glibc.cpu.x86_shstk=on doesn't work
++  [27457] vzeroupper use in AVX2 multiarch string functions cause HTM aborts
++  [27974] Overflow bug in some implementation of wcsnlen, wmemchr, and wcsncat
 +  [28524] Conversion from ISO-2022-JP-3 with iconv may emit spurious NULs
++  [28755] overflow bug in wcsncmp_avx2 and wcsncmp_evex
 +  [28768] CVE-2022-23218: Buffer overflow in sunrpc svcunix_create
++  [28769] CVE-2021-3999: Off-by-one buffer overflow/underflow in getcwd()
++  [28896] strncmp-avx2-rtm and wcsncmp-avx2-rtm fallback on non-rtm
++    variants when avoiding overflow
 +
 +Security related changes:
 +
@@ -73,6 +101,12 @@ index 292fbc595a..4402867631 100644
 +  CVE-2020-29562: An assertion failure has been fixed in the iconv function
 +  when invoked with UCS4 input containing an invalid character.
 +
++  CVE-2021-3999: Passing a buffer of size exactly 1 byte to the getcwd
++  function may result in an off-by-one buffer underflow and overflow
++  when the current working directory is longer than PATH_MAX and also
++  corresponds to the / directory through an unprivileged mount
++  namespace.  Reported by Qualys.
++
 +  CVE-2022-23219: Passing an overlong file name to the clnt_create
 +  legacy function could result in a stack-based buffer overflow when
 +  using the "unix" protocol.  Reported by Martin Sebor.
@@ -83,6 +117,25 @@ index 292fbc595a..4402867631 100644
  Version 2.31
  
  Major new features:
+@@ -141,6 +225,18 @@ Changes to build and runtime requirements:
+   source tree.  ChangeLog files are located in the ChangeLog.old directory as
+   ChangeLog.N where the highest N has the latest entries.
+ 
++* On Linux, the system administrator needs to configure /dev/pts with
++  the intended access modes for pseudo-terminals.  glibc no longer
++  attemps to adjust permissions of terminal devices.  The previous glibc
++  defaults ("tty" group, user read/write and group write) already
++  corresponded to what most systems used, so that grantpt did not
++  perform any adjustments.
++
++* On Linux, the posix_openpt and getpt functions no longer attempt to
++  use legacy (BSD) pseudo-terminals and assume that if /dev/ptmx exists
++  (and pseudo-terminals are supported), a devpts file system is mounted
++  on /dev/pts.  Current systems already meet these requirements.
++
+ Security related changes:
+ 
+   CVE-2019-19126: ld.so failed to ignore the LD_PREFER_MAP_32BIT_EXEC
 diff --git a/Rules b/Rules
 index 8b771f6095..beab969fde 100644
 --- a/Rules
@@ -4359,6 +4412,300 @@ index b6c5aea08f..eddea33f4c 100644
 -#define TEST_FUNCTION do_test ()
 -#include "../test-skeleton.c"
 +#include <support/test-driver.c>
+diff --git a/string/test-memchr.c b/string/test-memchr.c
+index 5dd0aa5470..de70e794d9 100644
+--- a/string/test-memchr.c
++++ b/string/test-memchr.c
+@@ -65,8 +65,8 @@ do_one_test (impl_t *impl, const CHAR *s, int c, size_t n, CHAR *exp_res)
+   CHAR *res = CALL (impl, s, c, n);
+   if (res != exp_res)
+     {
+-      error (0, 0, "Wrong result in function %s %p %p", impl->name,
+-	     res, exp_res);
++      error (0, 0, "Wrong result in function %s (%p, %d, %zu) -> %p != %p",
++             impl->name, s, c, n, res, exp_res);
+       ret = 1;
+       return;
+     }
+@@ -91,7 +91,7 @@ do_test (size_t align, size_t pos, size_t len, size_t n, int seek_char)
+     }
+   buf[align + len] = 0;
+ 
+-  if (pos < len)
++  if (pos < MIN(n, len))
+     {
+       buf[align + pos] = seek_char;
+       buf[align + len] = -seek_char;
+@@ -107,6 +107,38 @@ do_test (size_t align, size_t pos, size_t len, size_t n, int seek_char)
+     do_one_test (impl, (CHAR *) (buf + align), seek_char, n, result);
+ }
+ 
++static void
++do_overflow_tests (void)
++{
++  size_t i, j, len;
++  const size_t one = 1;
++  uintptr_t buf_addr = (uintptr_t) buf1;
++
++  for (i = 0; i < 750; ++i)
++    {
++        do_test (0, i, 751, SIZE_MAX - i, BIG_CHAR);
++        do_test (0, i, 751, i - buf_addr, BIG_CHAR);
++        do_test (0, i, 751, -buf_addr - i, BIG_CHAR);
++        do_test (0, i, 751, SIZE_MAX - buf_addr - i, BIG_CHAR);
++        do_test (0, i, 751, SIZE_MAX - buf_addr + i, BIG_CHAR);
++
++      len = 0;
++      for (j = 8 * sizeof(size_t) - 1; j ; --j)
++        {
++          len |= one << j;
++          do_test (0, i, 751, len - i, BIG_CHAR);
++          do_test (0, i, 751, len + i, BIG_CHAR);
++          do_test (0, i, 751, len - buf_addr - i, BIG_CHAR);
++          do_test (0, i, 751, len - buf_addr + i, BIG_CHAR);
++
++          do_test (0, i, 751, ~len - i, BIG_CHAR);
++          do_test (0, i, 751, ~len + i, BIG_CHAR);
++          do_test (0, i, 751, ~len - buf_addr - i, BIG_CHAR);
++          do_test (0, i, 751, ~len - buf_addr + i, BIG_CHAR);
++        }
++    }
++}
++
+ static void
+ do_random_tests (void)
+ {
+@@ -221,6 +253,7 @@ test_main (void)
+     do_test (page_size / 2 - i, i, i, 1, 0x9B);
+ 
+   do_random_tests ();
++  do_overflow_tests ();
+   return ret;
+ }
+ 
+diff --git a/string/test-strncat.c b/string/test-strncat.c
+index abbacb95c6..0c7f68d086 100644
+--- a/string/test-strncat.c
++++ b/string/test-strncat.c
+@@ -134,6 +134,66 @@ do_test (size_t align1, size_t align2, size_t len1, size_t len2,
+     }
+ }
+ 
++static void
++do_overflow_tests (void)
++{
++  size_t i, j, len;
++  const size_t one = 1;
++  CHAR *s1, *s2;
++  uintptr_t s1_addr;
++  s1 = (CHAR *) buf1;
++  s2 = (CHAR *) buf2;
++  s1_addr = (uintptr_t)s1;
++ for (j = 0; j < 200; ++j)
++      s2[j] = 32 + 23 * j % (BIG_CHAR - 32);
++ s2[200] = 0;
++  for (i = 0; i < 750; ++i) {
++    for (j = 0; j < i; ++j)
++      s1[j] = 32 + 23 * j % (BIG_CHAR - 32);
++    s1[i] = '\0';
++
++       FOR_EACH_IMPL (impl, 0)
++    {
++      s2[200] = '\0';
++      do_one_test (impl, s2, s1, SIZE_MAX - i);
++      s2[200] = '\0';
++      do_one_test (impl, s2, s1, i - s1_addr);
++      s2[200] = '\0';
++      do_one_test (impl, s2, s1, -s1_addr - i);
++      s2[200] = '\0';
++      do_one_test (impl, s2, s1, SIZE_MAX - s1_addr - i);
++      s2[200] = '\0';
++      do_one_test (impl, s2, s1, SIZE_MAX - s1_addr + i);
++    }
++
++    len = 0;
++    for (j = 8 * sizeof(size_t) - 1; j ; --j)
++      {
++        len |= one << j;
++        FOR_EACH_IMPL (impl, 0)
++          {
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, len - i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, len + i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, len - s1_addr - i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, len - s1_addr + i);
++
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, ~len - i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, ~len + i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, ~len - s1_addr - i);
++            s2[200] = '\0';
++            do_one_test (impl, s2, s1, ~len - s1_addr + i);
++          }
++      }
++  }
++}
++
+ static void
+ do_random_tests (void)
+ {
+@@ -316,6 +376,7 @@ test_main (void)
+     }
+ 
+   do_random_tests ();
++  do_overflow_tests ();
+   return ret;
+ }
+ 
+diff --git a/string/test-strncmp.c b/string/test-strncmp.c
+index d961ac4493..02806f4ebd 100644
+--- a/string/test-strncmp.c
++++ b/string/test-strncmp.c
+@@ -403,6 +403,18 @@ check2 (void)
+   free (s2);
+ }
+ 
++static void
++check3 (void)
++{
++  const CHAR *s1 = L ("abc");
++  CHAR *s2 = STRDUP (s1);
++
++  FOR_EACH_IMPL (impl, 0)
++    check_result (impl, s1, s2, SIZE_MAX, 0);
++
++  free (s2);
++}
++
+ int
+ test_main (void)
+ {
+@@ -412,6 +424,7 @@ test_main (void)
+ 
+   check1 ();
+   check2 ();
++  check3 ();
+ 
+   printf ("%23s", "");
+   FOR_EACH_IMPL (impl, 0)
+diff --git a/string/test-strnlen.c b/string/test-strnlen.c
+index 80ac9e8602..a1a6746cc9 100644
+--- a/string/test-strnlen.c
++++ b/string/test-strnlen.c
+@@ -27,6 +27,7 @@
+ 
+ #ifndef WIDE
+ # define STRNLEN strnlen
++# define MEMSET memset
+ # define CHAR char
+ # define BIG_CHAR CHAR_MAX
+ # define MIDDLE_CHAR 127
+@@ -34,6 +35,7 @@
+ #else
+ # include <wchar.h>
+ # define STRNLEN wcsnlen
++# define MEMSET wmemset
+ # define CHAR wchar_t
+ # define BIG_CHAR WCHAR_MAX
+ # define MIDDLE_CHAR 1121
+@@ -87,6 +89,38 @@ do_test (size_t align, size_t len, size_t maxlen, int max_char)
+     do_one_test (impl, (CHAR *) (buf + align), maxlen, MIN (len, maxlen));
+ }
+ 
++static void
++do_overflow_tests (void)
++{
++  size_t i, j, len;
++  const size_t one = 1;
++  uintptr_t buf_addr = (uintptr_t) buf1;
++
++  for (i = 0; i < 750; ++i)
++    {
++      do_test (0, i, SIZE_MAX - i, BIG_CHAR);
++      do_test (0, i, i - buf_addr, BIG_CHAR);
++      do_test (0, i, -buf_addr - i, BIG_CHAR);
++      do_test (0, i, SIZE_MAX - buf_addr - i, BIG_CHAR);
++      do_test (0, i, SIZE_MAX - buf_addr + i, BIG_CHAR);
++
++      len = 0;
++      for (j = 8 * sizeof(size_t) - 1; j ; --j)
++        {
++          len |= one << j;
++          do_test (0, i, len - i, BIG_CHAR);
++          do_test (0, i, len + i, BIG_CHAR);
++          do_test (0, i, len - buf_addr - i, BIG_CHAR);
++          do_test (0, i, len - buf_addr + i, BIG_CHAR);
++
++          do_test (0, i, ~len - i, BIG_CHAR);
++          do_test (0, i, ~len + i, BIG_CHAR);
++          do_test (0, i, ~len - buf_addr - i, BIG_CHAR);
++          do_test (0, i, ~len - buf_addr + i, BIG_CHAR);
++        }
++    }
++}
++
+ static void
+ do_random_tests (void)
+ {
+@@ -153,7 +187,7 @@ do_page_tests (void)
+   size_t last_offset = (page_size / sizeof (CHAR)) - 1;
+ 
+   CHAR *s = (CHAR *) buf2;
+-  memset (s, 65, (last_offset - 1));
++  MEMSET (s, 65, (last_offset - 1));
+   s[last_offset] = 0;
+ 
+   /* Place short strings ending at page boundary.  */
+@@ -196,6 +230,35 @@ do_page_tests (void)
+     }
+ }
+ 
++/* Tests meant to unveil fail on implementations that access bytes
++   beyond the maxium length.  */
++
++static void
++do_page_2_tests (void)
++{
++  size_t i, exp_len, offset;
++  size_t last_offset = page_size / sizeof (CHAR);
++
++  CHAR *s = (CHAR *) buf2;
++  MEMSET (s, 65, last_offset);
++
++  /* Place short strings ending at page boundary without the null
++     byte.  */
++  offset = last_offset;
++  for (i = 0; i < 128; i++)
++    {
++      /* Decrease offset to stress several sizes and alignments.  */
++      offset--;
++      exp_len = last_offset - offset;
++      FOR_EACH_IMPL (impl, 0)
++	{
++	  /* If an implementation goes beyond EXP_LEN, it will trigger
++	     the segfault.  */
++	  do_one_test (impl, (CHAR *) (s + offset), exp_len, exp_len);
++	}
++    }
++}
++
+ int
+ test_main (void)
+ {
+@@ -242,6 +305,8 @@ test_main (void)
+ 
+   do_random_tests ();
+   do_page_tests ();
++  do_page_2_tests ();
++  do_overflow_tests ();
+   return ret;
+ }
+ 
 diff --git a/sunrpc/Makefile b/sunrpc/Makefile
 index d5840d0770..162a5cef53 100644
 --- a/sunrpc/Makefile
@@ -4527,6 +4874,21 @@ index 0000000000..35a4b7b0b3
 +}
 +
 +#include <support/test-driver.c>
+diff --git a/support/Makefile b/support/Makefile
+index 3325feb790..05e8c292b7 100644
+--- a/support/Makefile
++++ b/support/Makefile
+@@ -83,8 +83,10 @@ libsupport-routines = \
+   xasprintf \
+   xbind \
+   xcalloc \
++  xchdir \
+   xchroot \
+   xclock_gettime \
++  xclone \
+   xclose \
+   xconnect \
+   xcopy_file_range \
 diff --git a/support/capture_subprocess.h b/support/capture_subprocess.h
 index 9808750f80..421f657678 100644
 --- a/support/capture_subprocess.h
@@ -4632,6 +4994,18 @@ index 8b442fd5c0..34ffd02e8e 100644
  /* Wait for the subprocess indicated by PROC::PID.  Return the status
     indicate by waitpid call.  */
  int support_process_wait (struct support_subprocess *proc);
+diff --git a/support/support.h b/support/support.h
+index 77d68c2aba..0536474c41 100644
+--- a/support/support.h
++++ b/support/support.h
+@@ -23,6 +23,7 @@
+ #ifndef SUPPORT_H
+ #define SUPPORT_H
+ 
++#include <stdbool.h>
+ #include <stddef.h>
+ #include <sys/cdefs.h>
+ /* For mode_t.  */
 diff --git a/support/support_capture_subprocess.c b/support/support_capture_subprocess.c
 index eeed676e3d..28a37df67f 100644
 --- a/support/support_capture_subprocess.c
@@ -4850,120 +5224,552 @@ index 36e3a77af2..4a25828111 100644
  int
  support_process_wait (struct support_subprocess *proc)
  {
-diff --git a/sysdeps/aarch64/dl-machine.h b/sysdeps/aarch64/dl-machine.h
-index db3335e5ad..8ffa0d1c51 100644
---- a/sysdeps/aarch64/dl-machine.h
-+++ b/sysdeps/aarch64/dl-machine.h
-@@ -392,13 +392,6 @@ elf_machine_lazy_rel (struct link_map *map,
-   /* Check for unexpected PLT reloc type.  */
-   if (__builtin_expect (r_type == AARCH64_R(JUMP_SLOT), 1))
-     {
--      if (map->l_mach.plt == 0)
--	{
--	  /* Prelinking.  */
--	  *reloc_addr += l_addr;
--	  return;
--	}
--
-       if (__glibc_unlikely (map->l_info[DT_AARCH64 (VARIANT_PCS)] != NULL))
- 	{
- 	  /* Check the symbol table for variant PCS symbols.  */
-@@ -422,7 +415,10 @@ elf_machine_lazy_rel (struct link_map *map,
- 	    }
- 	}
+diff --git a/support/temp_file.c b/support/temp_file.c
+index 277c5e0cf1..e41128c2d4 100644
+--- a/support/temp_file.c
++++ b/support/temp_file.c
+@@ -1,5 +1,6 @@
+ /* Temporary file handling for tests.
+-   Copyright (C) 1998-2020 Free Software Foundation, Inc.
++   Copyright (C) 1998-2022 Free Software Foundation, Inc.
++   Copyright The GNU Tools Authors.
+    This file is part of the GNU C Library.
  
--      *reloc_addr = map->l_mach.plt;
-+      if (map->l_mach.plt == 0)
-+	*reloc_addr += l_addr;
-+      else
-+	*reloc_addr = map->l_mach.plt;
-     }
-   else if (__builtin_expect (r_type == AARCH64_R(TLSDESC), 1))
-     {
-diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
-index d0d47e90b8..e0b4c4502f 100644
---- a/sysdeps/aarch64/memcpy.S
-+++ b/sysdeps/aarch64/memcpy.S
-@@ -33,11 +33,11 @@
- #define A_l	x6
- #define A_lw	w6
- #define A_h	x7
--#define A_hw	w7
- #define B_l	x8
- #define B_lw	w8
- #define B_h	x9
- #define C_l	x10
-+#define C_lw	w10
- #define C_h	x11
- #define D_l	x12
- #define D_h	x13
-@@ -51,16 +51,6 @@
- #define H_h	srcend
- #define tmp1	x14
+    The GNU C Library is free software; you can redistribute it and/or
+@@ -20,15 +21,17 @@
+    some 32-bit platforms. */
+ #define _FILE_OFFSET_BITS 64
  
--/* Copies are split into 3 main cases: small copies of up to 32 bytes,
--   medium copies of 33..128 bytes which are fully unrolled. Large copies
--   of more than 128 bytes align the destination and use an unrolled loop
--   processing 64 bytes per iteration.
--   In order to share code with memmove, small and medium copies read all
--   data before writing, allowing any kind of overlap. So small, medium
--   and large backwards memmoves are handled by falling through into memcpy.
--   Overlapping large forward memmoves use a loop that copies backwards.
--*/
--
- #ifndef MEMMOVE
- # define MEMMOVE memmove
- #endif
-@@ -68,118 +58,115 @@
- # define MEMCPY memcpy
- #endif
++#include <support/check.h>
+ #include <support/temp_file.h>
+ #include <support/temp_file-internal.h>
+ #include <support/support.h>
  
--ENTRY_ALIGN (MEMMOVE, 6)
-+/* This implementation supports both memcpy and memmove and shares most code.
-+   It uses unaligned accesses and branchless sequences to keep the code small,
-+   simple and improve performance.
++#include <errno.h>
+ #include <paths.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h>
+-#include <unistd.h>
++#include <xunistd.h>
+ 
+ /* List of temporary files.  */
+ static struct temp_name_list
+@@ -36,14 +39,20 @@ static struct temp_name_list
+   struct temp_name_list *next;
+   char *name;
+   pid_t owner;
++  bool toolong;
+ } *temp_name_list;
+ 
+ /* Location of the temporary files.  Set by the test skeleton via
+    support_set_test_dir.  The string is not be freed.  */
+ static const char *test_dir = _PATH_TMP;
  
--	DELOUSE (0)
--	DELOUSE (1)
--	DELOUSE (2)
-+   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
-+   copies of up to 128 bytes, and large copies.  The overhead of the overlap
-+   check in memmove is negligible since it is only required for large copies.
+-void
+-add_temp_file (const char *name)
++/* Name of subdirectories in a too long temporary directory tree.  */
++static char toolong_subdir[NAME_MAX + 1];
++static bool toolong_initialized;
++static size_t toolong_path_max;
++
++static void
++add_temp_file_internal (const char *name, bool toolong)
+ {
+   struct temp_name_list *newp
+     = (struct temp_name_list *) xcalloc (sizeof (*newp), 1);
+@@ -53,21 +62,26 @@ add_temp_file (const char *name)
+       newp->name = newname;
+       newp->next = temp_name_list;
+       newp->owner = getpid ();
++      newp->toolong = toolong;
+       temp_name_list = newp;
+     }
+   else
+     free (newp);
+ }
  
--	sub	tmp1, dstin, src
--	cmp	count, 128
--	ccmp	tmp1, count, 2, hi
--	b.lo	L(move_long)
--
--	/* Common case falls through into memcpy.  */
--END (MEMMOVE)
--libc_hidden_builtin_def (MEMMOVE)
--ENTRY (MEMCPY)
-+   Large copies use a software pipelined loop processing 64 bytes per
-+   iteration.  The destination pointer is 16-byte aligned to minimize
-+   unaligned accesses.  The loop tail is handled by always copying 64 bytes
-+   from the end.
-+*/
++void
++add_temp_file (const char *name)
++{
++  add_temp_file_internal (name, false);
++}
++
+ int
+-create_temp_file (const char *base, char **filename)
++create_temp_file_in_dir (const char *base, const char *dir, char **filename)
+ {
+   char *fname;
+   int fd;
+ 
+-  fname = (char *) xmalloc (strlen (test_dir) + 1 + strlen (base)
+-			    + sizeof ("XXXXXX"));
+-  strcpy (stpcpy (stpcpy (stpcpy (fname, test_dir), "/"), base), "XXXXXX");
++  fname = xasprintf ("%s/%sXXXXXX", dir, base);
+ 
+   fd = mkstemp (fname);
+   if (fd == -1)
+@@ -86,8 +100,14 @@ create_temp_file (const char *base, char **filename)
+   return fd;
+ }
  
-+ENTRY_ALIGN (MEMCPY, 6)
- 	DELOUSE (0)
- 	DELOUSE (1)
- 	DELOUSE (2)
+-char *
+-support_create_temp_directory (const char *base)
++int
++create_temp_file (const char *base, char **filename)
++{
++  return create_temp_file_in_dir (base, test_dir, filename);
++}
++
++static char *
++create_temp_directory_internal (const char *base, bool toolong)
+ {
+   char *path = xasprintf ("%s/%sXXXXXX", test_dir, base);
+   if (mkdtemp (path) == NULL)
+@@ -95,16 +115,132 @@ support_create_temp_directory (const char *base)
+       printf ("error: mkdtemp (\"%s\"): %m", path);
+       exit (1);
+     }
+-  add_temp_file (path);
++  add_temp_file_internal (path, toolong);
+   return path;
+ }
  
--	prfm	PLDL1KEEP, [src]
- 	add	srcend, src, count
- 	add	dstend, dstin, count
--	cmp	count, 32
--	b.ls	L(copy32)
- 	cmp	count, 128
- 	b.hi	L(copy_long)
-+	cmp	count, 32
-+	b.hi	L(copy32_128)
+-/* Helper functions called by the test skeleton follow.  */
++char *
++support_create_temp_directory (const char *base)
++{
++  return create_temp_directory_internal (base, false);
++}
++
++static void
++ensure_toolong_initialized (void)
++{
++  if (!toolong_initialized)
++    FAIL_EXIT1 ("uninitialized toolong directory tree\n");
++}
++
++static void
++initialize_toolong (const char *base)
++{
++  long name_max = pathconf (base, _PC_NAME_MAX);
++  name_max = (name_max < 0 ? 64
++	      : (name_max < sizeof (toolong_subdir) ? name_max
++		 : sizeof (toolong_subdir) - 1));
++
++  long path_max = pathconf (base, _PC_PATH_MAX);
++  path_max = (path_max < 0 ? 1024
++	      : path_max <= PTRDIFF_MAX ? path_max : PTRDIFF_MAX);
++
++  /* Sanity check to ensure that the test does not create temporary directories
++     in different filesystems because this API doesn't support it.  */
++  if (toolong_initialized)
++    {
++      if (name_max != strlen (toolong_subdir))
++	FAIL_UNSUPPORTED ("name_max: Temporary directories in different"
++			  " filesystems not supported yet\n");
++      if (path_max != toolong_path_max)
++	FAIL_UNSUPPORTED ("path_max: Temporary directories in different"
++			  " filesystems not supported yet\n");
++      return;
++    }
++
++  toolong_path_max = path_max;
++
++  size_t len = name_max;
++  memset (toolong_subdir, 'X', len);
++  toolong_initialized = true;
++}
++
++char *
++support_create_and_chdir_toolong_temp_directory (const char *basename)
++{
++  char *base = create_temp_directory_internal (basename, true);
++  xchdir (base);
++
++  initialize_toolong (base);
++
++  size_t sz = strlen (toolong_subdir);
++
++  /* Create directories and descend into them so that the final path is larger
++     than PATH_MAX.  */
++  for (size_t i = 0; i <= toolong_path_max / sz; i++)
++    {
++      int ret = mkdir (toolong_subdir, S_IRWXU);
++      if (ret != 0 && errno == ENAMETOOLONG)
++	FAIL_UNSUPPORTED ("Filesystem does not support creating too long "
++			  "directory trees\n");
++      else if (ret != 0)
++	FAIL_EXIT1 ("Failed to create directory tree: %m\n");
++      xchdir (toolong_subdir);
++    }
++  return base;
++}
  
--	/* Medium copies: 33..128 bytes.  */
-+	/* Small copies: 0..32 bytes.  */
-+	cmp	count, 16
-+	b.lo	L(copy16)
+ void
+-support_set_test_dir (const char *path)
++support_chdir_toolong_temp_directory (const char *base)
+ {
+-  test_dir = path;
++  ensure_toolong_initialized ();
++
++  xchdir (base);
++
++  size_t sz = strlen (toolong_subdir);
++  for (size_t i = 0; i <= toolong_path_max / sz; i++)
++    xchdir (toolong_subdir);
++}
++
++/* Helper functions called by the test skeleton follow.  */
++
++static void
++remove_toolong_subdirs (const char *base)
++{
++  ensure_toolong_initialized ();
++
++  if (chdir (base) != 0)
++    {
++      printf ("warning: toolong cleanup base failed: chdir (\"%s\"): %m\n",
++	      base);
++      return;
++    }
++
++  /* Descend.  */
++  int levels = 0;
++  size_t sz = strlen (toolong_subdir);
++  for (levels = 0; levels <= toolong_path_max / sz; levels++)
++    if (chdir (toolong_subdir) != 0)
++      {
++	printf ("warning: toolong cleanup failed: chdir (\"%s\"): %m\n",
++		toolong_subdir);
++	break;
++      }
++
++  /* Ascend and remove.  */
++  while (--levels >= 0)
++    {
++      if (chdir ("..") != 0)
++	{
++	  printf ("warning: toolong cleanup failed: chdir (\"..\"): %m\n");
++	  return;
++	}
++      if (remove (toolong_subdir) != 0)
++	{
++	  printf ("warning: could not remove subdirectory: %s: %m\n",
++		  toolong_subdir);
++	  return;
++	}
++    }
+ }
+ 
+ void
+@@ -119,6 +255,9 @@ support_delete_temp_files (void)
+ 	 around, to prevent PID reuse.)  */
+       if (temp_name_list->owner == pid)
+ 	{
++	  if (temp_name_list->toolong)
++	    remove_toolong_subdirs (temp_name_list->name);
++
+ 	  if (remove (temp_name_list->name) != 0)
+ 	    printf ("warning: could not remove temporary file: %s: %m\n",
+ 		    temp_name_list->name);
+@@ -143,3 +282,9 @@ support_print_temp_files (FILE *f)
+       fprintf (f, ")\n");
+     }
+ }
++
++void
++support_set_test_dir (const char *path)
++{
++  test_dir = path;
++}
+diff --git a/support/temp_file.h b/support/temp_file.h
+index 8b6303a6e4..2598f82136 100644
+--- a/support/temp_file.h
++++ b/support/temp_file.h
+@@ -32,11 +32,27 @@ void add_temp_file (const char *name);
+    *FILENAME.  */
+ int create_temp_file (const char *base, char **filename);
+ 
++/* Create a temporary file in directory DIR.  Return the opened file
++   descriptor on success, or -1 on failure.  Write the file name to
++   *FILENAME if FILENAME is not NULL.  In this case, the caller is
++   expected to free *FILENAME.  */
++int create_temp_file_in_dir (const char *base, const char *dir,
++			     char **filename);
++
+ /* Create a temporary directory and schedule it for deletion.  BASE is
+    used as a prefix for the unique directory name, which the function
+    returns.  The caller should free this string.  */
+ char *support_create_temp_directory (const char *base);
+ 
++/* Create a temporary directory tree that is longer than PATH_MAX and schedule
++   it for deletion.  BASENAME is used as a prefix for the unique directory
++   name, which the function returns.  The caller should free this string.  */
++char *support_create_and_chdir_toolong_temp_directory (const char *basename);
++
++/* Change into the innermost directory of the directory tree BASE, which was
++   created using support_create_and_chdir_toolong_temp_directory.  */
++void support_chdir_toolong_temp_directory (const char *base);
++
+ __END_DECLS
+ 
+ #endif /* SUPPORT_TEMP_FILE_H */
+diff --git a/support/xchdir.c b/support/xchdir.c
+new file mode 100644
+index 0000000000..beb4feff72
+--- /dev/null
++++ b/support/xchdir.c
+@@ -0,0 +1,28 @@
++/* chdir with error checking.
++   Copyright (C) 2020 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <support/check.h>
++#include <support/xunistd.h>
++#include <unistd.h>
++
++void
++xchdir (const char *path)
++{
++  if (chdir (path) != 0)
++    FAIL_EXIT1 ("chdir (\"%s\"): %m", path);
++}
+diff --git a/support/xclone.c b/support/xclone.c
+new file mode 100644
+index 0000000000..243eee8b23
+--- /dev/null
++++ b/support/xclone.c
+@@ -0,0 +1,49 @@
++/* Auxiliary functions to issue the clone syscall.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#ifdef __linux__
++# include <support/check.h>
++# include <stackinfo.h>  /* For _STACK_GROWS_{UP,DOWN}.  */
++# include <xsched.h>
++
++pid_t
++xclone (int (*fn) (void *arg), void *arg, void *stack, size_t stack_size,
++	int flags)
++{
++  pid_t r = -1;
++
++# ifdef __ia64__
++  extern int __clone2 (int (*fn) (void *arg), void *stack, size_t stack_size,
++		       int flags, void *arg, ...);
++  r = __clone2 (fn, stack, stack_size, flags, arg, /* ptid */ NULL,
++		/* tls */ NULL, /* ctid  */ NULL);
++# else
++#  if _STACK_GROWS_DOWN
++  r = clone (fn, stack + stack_size, flags, arg, /* ptid */ NULL,
++	     /* tls */ NULL, /* ctid */  NULL);
++#  elif _STACK_GROWS_UP
++  r = clone (fn, stack, flags, arg, /* ptid */ NULL, /* tls */ NULL, NULL);
++#  endif
++# endif
++
++  if (r < 0)
++    FAIL_EXIT1 ("clone: %m");
++
++  return r;
++}
++#endif
+diff --git a/support/xsched.h b/support/xsched.h
+new file mode 100644
+index 0000000000..eefd731940
+--- /dev/null
++++ b/support/xsched.h
+@@ -0,0 +1,34 @@
++/* Wrapper for sched.h functions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#ifndef SUPPORT_XSCHED_H
++#define SUPPORT_XSCHED_H
++
++__BEGIN_DECLS
++
++#include <sched.h>
++#include <sys/types.h>
++
++#ifdef __linux__
++pid_t xclone (int (*fn) (void *arg), void *arg, void *stack,
++	      size_t stack_size, int flags);
++#endif
++
++__END_DECLS
++
++#endif
+diff --git a/support/xunistd.h b/support/xunistd.h
+index 96f498f2e5..43799d92c5 100644
+--- a/support/xunistd.h
++++ b/support/xunistd.h
+@@ -44,6 +44,7 @@ long xsysconf (int name);
+ long long xlseek (int fd, long long offset, int whence);
+ void xftruncate (int fd, long long length);
+ void xsymlink (const char *target, const char *linkpath);
++void xchdir (const char *path);
+ 
+ /* Equivalent of "mkdir -p".  */
+ void xmkdirp (const char *, mode_t);
+diff --git a/sysdeps/aarch64/dl-machine.h b/sysdeps/aarch64/dl-machine.h
+index db3335e5ad..8ffa0d1c51 100644
+--- a/sysdeps/aarch64/dl-machine.h
++++ b/sysdeps/aarch64/dl-machine.h
+@@ -392,13 +392,6 @@ elf_machine_lazy_rel (struct link_map *map,
+   /* Check for unexpected PLT reloc type.  */
+   if (__builtin_expect (r_type == AARCH64_R(JUMP_SLOT), 1))
+     {
+-      if (map->l_mach.plt == 0)
+-	{
+-	  /* Prelinking.  */
+-	  *reloc_addr += l_addr;
+-	  return;
+-	}
+-
+       if (__glibc_unlikely (map->l_info[DT_AARCH64 (VARIANT_PCS)] != NULL))
+ 	{
+ 	  /* Check the symbol table for variant PCS symbols.  */
+@@ -422,7 +415,10 @@ elf_machine_lazy_rel (struct link_map *map,
+ 	    }
+ 	}
+ 
+-      *reloc_addr = map->l_mach.plt;
++      if (map->l_mach.plt == 0)
++	*reloc_addr += l_addr;
++      else
++	*reloc_addr = map->l_mach.plt;
+     }
+   else if (__builtin_expect (r_type == AARCH64_R(TLSDESC), 1))
+     {
+diff --git a/sysdeps/aarch64/memcpy.S b/sysdeps/aarch64/memcpy.S
+index d0d47e90b8..e0b4c4502f 100644
+--- a/sysdeps/aarch64/memcpy.S
++++ b/sysdeps/aarch64/memcpy.S
+@@ -33,11 +33,11 @@
+ #define A_l	x6
+ #define A_lw	w6
+ #define A_h	x7
+-#define A_hw	w7
+ #define B_l	x8
+ #define B_lw	w8
+ #define B_h	x9
+ #define C_l	x10
++#define C_lw	w10
+ #define C_h	x11
+ #define D_l	x12
+ #define D_h	x13
+@@ -51,16 +51,6 @@
+ #define H_h	srcend
+ #define tmp1	x14
+ 
+-/* Copies are split into 3 main cases: small copies of up to 32 bytes,
+-   medium copies of 33..128 bytes which are fully unrolled. Large copies
+-   of more than 128 bytes align the destination and use an unrolled loop
+-   processing 64 bytes per iteration.
+-   In order to share code with memmove, small and medium copies read all
+-   data before writing, allowing any kind of overlap. So small, medium
+-   and large backwards memmoves are handled by falling through into memcpy.
+-   Overlapping large forward memmoves use a loop that copies backwards.
+-*/
+-
+ #ifndef MEMMOVE
+ # define MEMMOVE memmove
+ #endif
+@@ -68,118 +58,115 @@
+ # define MEMCPY memcpy
+ #endif
+ 
+-ENTRY_ALIGN (MEMMOVE, 6)
++/* This implementation supports both memcpy and memmove and shares most code.
++   It uses unaligned accesses and branchless sequences to keep the code small,
++   simple and improve performance.
+ 
+-	DELOUSE (0)
+-	DELOUSE (1)
+-	DELOUSE (2)
++   Copies are split into 3 main cases: small copies of up to 32 bytes, medium
++   copies of up to 128 bytes, and large copies.  The overhead of the overlap
++   check in memmove is negligible since it is only required for large copies.
+ 
+-	sub	tmp1, dstin, src
+-	cmp	count, 128
+-	ccmp	tmp1, count, 2, hi
+-	b.lo	L(move_long)
+-
+-	/* Common case falls through into memcpy.  */
+-END (MEMMOVE)
+-libc_hidden_builtin_def (MEMMOVE)
+-ENTRY (MEMCPY)
++   Large copies use a software pipelined loop processing 64 bytes per
++   iteration.  The destination pointer is 16-byte aligned to minimize
++   unaligned accesses.  The loop tail is handled by always copying 64 bytes
++   from the end.
++*/
+ 
++ENTRY_ALIGN (MEMCPY, 6)
+ 	DELOUSE (0)
+ 	DELOUSE (1)
+ 	DELOUSE (2)
+ 
+-	prfm	PLDL1KEEP, [src]
+ 	add	srcend, src, count
+ 	add	dstend, dstin, count
+-	cmp	count, 32
+-	b.ls	L(copy32)
+ 	cmp	count, 128
+ 	b.hi	L(copy_long)
++	cmp	count, 32
++	b.hi	L(copy32_128)
+ 
+-	/* Medium copies: 33..128 bytes.  */
++	/* Small copies: 0..32 bytes.  */
++	cmp	count, 16
++	b.lo	L(copy16)
  	ldp	A_l, A_h, [src]
 -	ldp	B_l, B_h, [src, 16]
 -	ldp	C_l, C_h, [srcend, -32]
@@ -6391,6 +7197,25 @@ index 0000000000..f59b97769d
 +}
 +
 +#include <support/test-driver.c>
+diff --git a/sysdeps/posix/getcwd.c b/sysdeps/posix/getcwd.c
+index f00b337a13..839d78d7b7 100644
+--- a/sysdeps/posix/getcwd.c
++++ b/sysdeps/posix/getcwd.c
+@@ -241,6 +241,14 @@ __getcwd (char *buf, size_t size)
+   char *path;
+ #ifndef NO_ALLOCATION
+   size_t allocated = size;
++
++  /* A size of 1 byte is never useful.  */
++  if (allocated == 1)
++    {
++      __set_errno (ERANGE);
++      return NULL;
++    }
++
+   if (size == 0)
+     {
+       if (buf != NULL)
 diff --git a/sysdeps/posix/system.c b/sysdeps/posix/system.c
 index e613e6a344..a03f478fc7 100644
 --- a/sysdeps/posix/system.c
@@ -6762,7 +7587,7 @@ index e28e801c7a..6b22b2cb45 100644
 +write		-	write		Ci:ibU	__libc_write	__write write
  writev		-	writev		Ci:ipi	__writev	writev
 diff --git a/sysdeps/unix/sysv/linux/Makefile b/sysdeps/unix/sysv/linux/Makefile
-index f12b7b1a2d..5fbde369c3 100644
+index f12b7b1a2d..0a0da00151 100644
 --- a/sysdeps/unix/sysv/linux/Makefile
 +++ b/sysdeps/unix/sysv/linux/Makefile
 @@ -60,7 +60,9 @@ sysdep_routines += adjtimex clone umount umount2 readahead \
@@ -6776,6 +7601,15 @@ index f12b7b1a2d..5fbde369c3 100644
  
  CFLAGS-gethostid.c = -fexceptions
  CFLAGS-tee.c = -fexceptions -fasynchronous-unwind-tables
+@@ -273,7 +275,7 @@ sysdep_routines += xstatconv internal_statvfs internal_statvfs64 \
+ 
+ sysdep_headers += bits/fcntl-linux.h
+ 
+-tests += tst-fallocate tst-fallocate64
++tests += tst-fallocate tst-fallocate64 tst-getcwd-smallbuff
+ endif
+ 
+ ifeq ($(subdir),elf)
 diff --git a/sysdeps/unix/sysv/linux/aarch64/arch-syscall.h b/sysdeps/unix/sysv/linux/aarch64/arch-syscall.h
 index 9378387747..c8471947b9 100644
 --- a/sysdeps/unix/sysv/linux/aarch64/arch-syscall.h
@@ -6821,22 +7655,200 @@ index a60053b914..08af68b5e8 100644
  # The dynamic loader needs __tls_get_addr for TLS.
  ld.so: __tls_get_addr
  # The main malloc is interposed into the dynamic linker, for
-diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
-index 9d8ffbe860..bf61b66b70 100644
---- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
-+++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
-@@ -36,9 +36,37 @@ typedef uintptr_t uatomicptr_t;
- typedef intmax_t atomic_max_t;
- typedef uintmax_t uatomic_max_t;
+diff --git a/sysdeps/unix/sysv/linux/getpt.c b/sysdeps/unix/sysv/linux/getpt.c
+index 1803b232c9..3cc745e11a 100644
+--- a/sysdeps/unix/sysv/linux/getpt.c
++++ b/sysdeps/unix/sysv/linux/getpt.c
+@@ -16,69 +16,18 @@
+    License along with the GNU C Library; if not, see
+    <https://www.gnu.org/licenses/>.  */
  
-+#define atomic_full_barrier() __sync_synchronize ()
-+
- #define __HAVE_64B_ATOMICS 0
- #define USE_ATOMIC_COMPILER_BUILTINS 0
+-#include <errno.h>
+ #include <fcntl.h>
+-#include <stdlib.h>
+ #include <unistd.h>
+ #include <paths.h>
+-#include <sys/statfs.h>
+-
+-#include "linux_fsinfo.h"
  
-+/* We use the compiler atomic load and store builtins as the generic
-+   defines are not atomic.  In particular, we need to use compare and
-+   exchange for stores as the implementation is synthesized.  */
+ /* Path to the master pseudo terminal cloning device.  */
+ #define _PATH_DEVPTMX _PATH_DEV "ptmx"
+-/* Directory containing the UNIX98 pseudo terminals.  */
+-#define _PATH_DEVPTS _PATH_DEV "pts"
+-
+-/* Prototype for function that opens BSD-style master pseudo-terminals.  */
+-extern int __bsd_getpt (void) attribute_hidden;
+ 
+ /* Open a master pseudo terminal and return its file descriptor.  */
+ int
+ __posix_openpt (int oflag)
+ {
+-  static int have_no_dev_ptmx;
+-  int fd;
+-
+-  if (!have_no_dev_ptmx)
+-    {
+-      fd = __open (_PATH_DEVPTMX, oflag);
+-      if (fd != -1)
+-	{
+-	  struct statfs fsbuf;
+-	  static int devpts_mounted;
+-
+-	  /* Check that the /dev/pts filesystem is mounted
+-	     or if /dev is a devfs filesystem (this implies /dev/pts).  */
+-	  if (devpts_mounted
+-	      || (__statfs (_PATH_DEVPTS, &fsbuf) == 0
+-		  && fsbuf.f_type == DEVPTS_SUPER_MAGIC)
+-	      || (__statfs (_PATH_DEV, &fsbuf) == 0
+-		  && fsbuf.f_type == DEVFS_SUPER_MAGIC))
+-	    {
+-	      /* Everything is ok.  */
+-	      devpts_mounted = 1;
+-	      return fd;
+-	    }
+-
+-	  /* If /dev/pts is not mounted then the UNIX98 pseudo terminals
+-	     are not usable.  */
+-	  __close (fd);
+-	  have_no_dev_ptmx = 1;
+-	  __set_errno (ENOENT);
+-	}
+-      else
+-	{
+-	  if (errno == ENOENT || errno == ENODEV)
+-	    have_no_dev_ptmx = 1;
+-	  else
+-	    return -1;
+-	}
+-    }
+-  else
+-    __set_errno (ENOENT);
+-
+-  return -1;
++  return __open (_PATH_DEVPTMX, oflag);
+ }
+ weak_alias (__posix_openpt, posix_openpt)
+ 
+@@ -86,16 +35,6 @@ weak_alias (__posix_openpt, posix_openpt)
+ int
+ __getpt (void)
+ {
+-  int fd = __posix_openpt (O_RDWR);
+-  if (fd == -1)
+-    fd = __bsd_getpt ();
+-  return fd;
++  return __posix_openpt (O_RDWR);
+ }
+-
+-
+-#define PTYNAME1 "pqrstuvwxyzabcde";
+-#define PTYNAME2 "0123456789abcdef";
+-
+-#define __getpt __bsd_getpt
+-#define HAVE_POSIX_OPENPT
+-#include <sysdeps/unix/bsd/getpt.c>
++weak_alias (__getpt, getpt)
+diff --git a/sysdeps/unix/sysv/linux/grantpt.c b/sysdeps/unix/sysv/linux/grantpt.c
+index 2030e07fa6..43122f9a76 100644
+--- a/sysdeps/unix/sysv/linux/grantpt.c
++++ b/sysdeps/unix/sysv/linux/grantpt.c
+@@ -1,44 +1,41 @@
+-#include <assert.h>
+-#include <ctype.h>
+-#include <dirent.h>
+-#include <errno.h>
+-#include <fcntl.h>
+-#include <paths.h>
+-#include <stdlib.h>
+-#include <unistd.h>
++/* grantpt implementation for Linux.
++   Copyright (C) 1998-2020 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++   Contributed by Zack Weinberg <zack@rabi.phys.columbia.edu>, 1998.
+ 
+-#include <not-cancel.h>
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
+ 
+-#include "pty-private.h"
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
+ 
+-#if HAVE_PT_CHOWN
+-/* Close all file descriptors except the one specified.  */
+-static void
+-close_all_fds (void)
+-{
+-  DIR *dir = __opendir ("/proc/self/fd");
+-  if (dir != NULL)
+-    {
+-      struct dirent64 *d;
+-      while ((d = __readdir64 (dir)) != NULL)
+-	if (isdigit (d->d_name[0]))
+-	  {
+-	    char *endp;
+-	    long int fd = strtol (d->d_name, &endp, 10);
+-	    if (*endp == '\0' && fd != PTY_FILENO && fd != dirfd (dir))
+-	      __close_nocancel_nostatus (fd);
+-	  }
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <errno.h>
++#include <stdlib.h>
++#include <sys/ioctl.h>
++#include <termios.h>
+ 
+-      __closedir (dir);
++int
++grantpt (int fd)
++{
++  /* Without pt_chown on Linux, we have delegated the creation of the
++     pty node with the right group and permission mode to the kernel, and
++     non-root users are unlikely to be able to change it. Therefore let's
++     consider that POSIX enforcement is the responsibility of the whole
++     system and not only the GNU libc.   */
+ 
+-      int nullfd = __open_nocancel (_PATH_DEVNULL, O_RDONLY);
+-      assert (nullfd == STDIN_FILENO);
+-      nullfd = __open_nocancel (_PATH_DEVNULL, O_WRONLY);
+-      assert (nullfd == STDOUT_FILENO);
+-      __dup2 (STDOUT_FILENO, STDERR_FILENO);
+-    }
++  /* Verify that fd refers to a ptmx descriptor.  */
++  unsigned int ptyno;
++  int ret = __ioctl (fd, TIOCGPTN, &ptyno);
++  if (ret != 0 && errno == ENOTTY)
++    /* POSIX requires EINVAL instead of ENOTTY provided by the kernel.  */
++    __set_errno (EINVAL);
++  return ret;
+ }
+-# define CLOSE_ALL_FDS() close_all_fds()
+-#endif
+-
+-#include <sysdeps/unix/grantpt.c>
+diff --git a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
+index 9d8ffbe860..bf61b66b70 100644
+--- a/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
++++ b/sysdeps/unix/sysv/linux/hppa/atomic-machine.h
+@@ -36,9 +36,37 @@ typedef uintptr_t uatomicptr_t;
+ typedef intmax_t atomic_max_t;
+ typedef uintmax_t uatomic_max_t;
+ 
++#define atomic_full_barrier() __sync_synchronize ()
++
+ #define __HAVE_64B_ATOMICS 0
+ #define USE_ATOMIC_COMPILER_BUILTINS 0
+ 
++/* We use the compiler atomic load and store builtins as the generic
++   defines are not atomic.  In particular, we need to use compare and
++   exchange for stores as the implementation is synthesized.  */
 +void __atomic_link_error (void);
 +#define __atomic_check_size_ls(mem) \
 + if ((sizeof (*mem) != 1) && (sizeof (*mem) != 2) && sizeof (*mem) != 4)    \
@@ -8335,6 +9347,143 @@ index 0000000000..944ab9b7f1
 +  return INLINE_SYSCALL_CALL (process_vm_writev, pid, local_iov,
 +			      liovcnt, remote_iov, riovcnt, flags);
 +}
+diff --git a/sysdeps/unix/sysv/linux/ptsname.c b/sysdeps/unix/sysv/linux/ptsname.c
+index 81d9d26f1e..3e9be3f0d4 100644
+--- a/sysdeps/unix/sysv/linux/ptsname.c
++++ b/sysdeps/unix/sysv/linux/ptsname.c
+@@ -21,39 +21,14 @@
+ #include <stdlib.h>
+ #include <string.h>
+ #include <sys/ioctl.h>
+-#include <sys/stat.h>
+-#include <sys/sysmacros.h>
+ #include <termios.h>
+ #include <unistd.h>
+ 
+ #include <_itoa.h>
+ 
+-/* Check if DEV corresponds to a master pseudo terminal device.  */
+-#define MASTER_P(Dev)							\
+-  (__gnu_dev_major ((Dev)) == 2						\
+-   || (__gnu_dev_major ((Dev)) == 4					\
+-       && __gnu_dev_minor ((Dev)) >= 128 && __gnu_dev_minor ((Dev)) < 192) \
+-   || (__gnu_dev_major ((Dev)) >= 128 && __gnu_dev_major ((Dev)) < 136))
+-
+-/* Check if DEV corresponds to a slave pseudo terminal device.  */
+-#define SLAVE_P(Dev)							\
+-  (__gnu_dev_major ((Dev)) == 3						\
+-   || (__gnu_dev_major ((Dev)) == 4					\
+-       && __gnu_dev_minor ((Dev)) >= 192 && __gnu_dev_minor ((Dev)) < 256) \
+-   || (__gnu_dev_major ((Dev)) >= 136 && __gnu_dev_major ((Dev)) < 144))
+-
+-/* Note that major number 4 corresponds to the old BSD style pseudo
+-   terminal devices.  As of Linux 2.1.115 these are no longer
+-   supported.  They have been replaced by major numbers 2 (masters)
+-   and 3 (slaves).  */
+-
+ /* Directory where we can find the slave pty nodes.  */
+ #define _PATH_DEVPTS "/dev/pts/"
+ 
+-/* The are declared in getpt.c.  */
+-extern const char __libc_ptyname1[] attribute_hidden;
+-extern const char __libc_ptyname2[] attribute_hidden;
+-
+ /* Static buffer for `ptsname'.  */
+ static char buffer[sizeof (_PATH_DEVPTS) + 20];
+ 
+@@ -68,19 +43,15 @@ ptsname (int fd)
+ }
+ 
+ 
++/* Store at most BUFLEN characters of the pathname of the slave pseudo
++   terminal associated with the master FD is open on in BUF.
++   Return 0 on success, otherwise an error number.  */
+ int
+-__ptsname_internal (int fd, char *buf, size_t buflen, struct stat64 *stp)
++__ptsname_r (int fd, char *buf, size_t buflen)
+ {
+   int save_errno = errno;
+   unsigned int ptyno;
+ 
+-  if (!__isatty (fd))
+-    {
+-      __set_errno (ENOTTY);
+-      return ENOTTY;
+-    }
+-
+-#ifdef TIOCGPTN
+   if (__ioctl (fd, TIOCGPTN, &ptyno) == 0)
+     {
+       /* Buffer we use to print the number in.  For a maximum size for
+@@ -101,67 +72,11 @@ __ptsname_internal (int fd, char *buf, size_t buflen, struct stat64 *stp)
+ 
+       memcpy (__stpcpy (buf, devpts), p, &numbuf[sizeof (numbuf)] - p);
+     }
+-  else if (errno != EINVAL)
+-    return errno;
+   else
+-#endif
+-    {
+-      char *p;
+-
+-      if (buflen < strlen (_PATH_TTY) + 3)
+-	{
+-	  __set_errno (ERANGE);
+-	  return ERANGE;
+-	}
+-
+-      if (__fxstat64 (_STAT_VER, fd, stp) < 0)
+-	return errno;
+-
+-      /* Check if FD really is a master pseudo terminal.  */
+-      if (! MASTER_P (stp->st_rdev))
+-	{
+-	  __set_errno (ENOTTY);
+-	  return ENOTTY;
+-	}
+-
+-      ptyno = __gnu_dev_minor (stp->st_rdev);
+-
+-      if (ptyno / 16 >= strlen (__libc_ptyname1))
+-	{
+-	  __set_errno (ENOTTY);
+-	  return ENOTTY;
+-	}
+-
+-      p = __stpcpy (buf, _PATH_TTY);
+-      p[0] = __libc_ptyname1[ptyno / 16];
+-      p[1] = __libc_ptyname2[ptyno % 16];
+-      p[2] = '\0';
+-    }
+-
+-  if (__xstat64 (_STAT_VER, buf, stp) < 0)
++    /* Bad file descriptor, or not a ptmx descriptor.  */
+     return errno;
+ 
+-  /* Check if the name we're about to return really corresponds to a
+-     slave pseudo terminal.  */
+-  if (! S_ISCHR (stp->st_mode) || ! SLAVE_P (stp->st_rdev))
+-    {
+-      /* This really is a configuration problem.  */
+-      __set_errno (ENOTTY);
+-      return ENOTTY;
+-    }
+-
+   __set_errno (save_errno);
+   return 0;
+ }
+-
+-
+-/* Store at most BUFLEN characters of the pathname of the slave pseudo
+-   terminal associated with the master FD is open on in BUF.
+-   Return 0 on success, otherwise an error number.  */
+-int
+-__ptsname_r (int fd, char *buf, size_t buflen)
+-{
+-  struct stat64 st;
+-  return __ptsname_internal (fd, buf, buflen, &st);
+-}
+ weak_alias (__ptsname_r, ptsname_r)
 diff --git a/sysdeps/unix/sysv/linux/riscv/sysdep.h b/sysdeps/unix/sysv/linux/riscv/sysdep.h
 index 201bf9a91b..2bd9b16f32 100644
 --- a/sysdeps/unix/sysv/linux/riscv/sysdep.h
@@ -8848,6 +9997,271 @@ index 5f1352ad43..52e6dafc86 100644
  memfd_create    EXTRA	memfd_create	i:si    memfd_create
  pkey_alloc	EXTRA	pkey_alloc	i:ii	pkey_alloc
  pkey_free	EXTRA	pkey_free	i:i	pkey_free
+diff --git a/sysdeps/unix/sysv/linux/tst-getcwd-smallbuff.c b/sysdeps/unix/sysv/linux/tst-getcwd-smallbuff.c
+new file mode 100644
+index 0000000000..55362f6060
+--- /dev/null
++++ b/sysdeps/unix/sysv/linux/tst-getcwd-smallbuff.c
+@@ -0,0 +1,259 @@
++/* Verify that getcwd returns ERANGE for size 1 byte and does not underflow
++   buffer when the CWD is too long and is also a mount target of /.  See bug
++   #28769 or CVE-2021-3999 for more context.
++   Copyright The GNU Toolchain Authors.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <errno.h>
++#include <fcntl.h>
++#include <intprops.h>
++#include <limits.h>
++#include <stdio.h>
++#include <stdlib.h>
++#include <string.h>
++#include <sys/mount.h>
++#include <sys/stat.h>
++#include <sys/types.h>
++#include <sys/wait.h>
++
++#include <sys/socket.h>
++#include <sys/un.h>
++#include <support/check.h>
++#include <support/temp_file.h>
++#include <support/test-driver.h>
++#include <support/xsched.h>
++#include <support/xunistd.h>
++
++static char *base;
++#define BASENAME "tst-getcwd-smallbuff"
++#define MOUNT_NAME "mpoint"
++static int sockfd[2];
++
++static void
++do_cleanup (void)
++{
++  support_chdir_toolong_temp_directory (base);
++  TEST_VERIFY_EXIT (rmdir (MOUNT_NAME) == 0);
++  free (base);
++}
++
++static void
++send_fd (const int sock, const int fd)
++{
++  struct msghdr msg = {0};
++  union
++    {
++      struct cmsghdr hdr;
++      char buf[CMSG_SPACE (sizeof (int))];
++    } cmsgbuf = {0};
++  struct cmsghdr *cmsg;
++  struct iovec vec;
++  char ch = 'A';
++  ssize_t n;
++
++  msg.msg_control = &cmsgbuf.buf;
++  msg.msg_controllen = sizeof (cmsgbuf.buf);
++
++  cmsg = CMSG_FIRSTHDR (&msg);
++  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
++  cmsg->cmsg_level = SOL_SOCKET;
++  cmsg->cmsg_type = SCM_RIGHTS;
++  memcpy (CMSG_DATA (cmsg), &fd, sizeof (fd));
++
++  vec.iov_base = &ch;
++  vec.iov_len = 1;
++  msg.msg_iov = &vec;
++  msg.msg_iovlen = 1;
++
++  while ((n = sendmsg (sock, &msg, 0)) == -1 && errno == EINTR);
++
++  TEST_VERIFY_EXIT (n == 1);
++}
++
++static int
++recv_fd (const int sock)
++{
++  struct msghdr msg = {0};
++  union
++    {
++      struct cmsghdr hdr;
++      char buf[CMSG_SPACE(sizeof(int))];
++    } cmsgbuf = {0};
++  struct cmsghdr *cmsg;
++  struct iovec vec;
++  ssize_t n;
++  char ch = '\0';
++  int fd = -1;
++
++  vec.iov_base = &ch;
++  vec.iov_len = 1;
++  msg.msg_iov = &vec;
++  msg.msg_iovlen = 1;
++
++  msg.msg_control = &cmsgbuf.buf;
++  msg.msg_controllen = sizeof (cmsgbuf.buf);
++
++  while ((n = recvmsg (sock, &msg, 0)) == -1 && errno == EINTR);
++  if (n != 1 || ch != 'A')
++    return -1;
++
++  cmsg = CMSG_FIRSTHDR (&msg);
++  if (cmsg == NULL)
++    return -1;
++  if (cmsg->cmsg_type != SCM_RIGHTS)
++    return -1;
++  memcpy (&fd, CMSG_DATA (cmsg), sizeof (fd));
++  if (fd < 0)
++    return -1;
++  return fd;
++}
++
++static int
++child_func (void * const arg)
++{
++  xclose (sockfd[0]);
++  const int sock = sockfd[1];
++  char ch;
++
++  TEST_VERIFY_EXIT (read (sock, &ch, 1) == 1);
++  TEST_VERIFY_EXIT (ch == '1');
++
++  if (mount ("/", MOUNT_NAME, NULL, MS_BIND | MS_REC, NULL))
++    FAIL_EXIT1 ("mount failed: %m\n");
++  const int fd = xopen ("mpoint",
++			O_RDONLY | O_PATH | O_DIRECTORY | O_NOFOLLOW, 0);
++
++  send_fd (sock, fd);
++  xclose (fd);
++
++  TEST_VERIFY_EXIT (read (sock, &ch, 1) == 1);
++  TEST_VERIFY_EXIT (ch == 'a');
++
++  xclose (sock);
++  return 0;
++}
++
++static void
++update_map (char * const mapping, const char * const map_file)
++{
++  const size_t map_len = strlen (mapping);
++
++  const int fd = xopen (map_file, O_WRONLY, 0);
++  xwrite (fd, mapping, map_len);
++  xclose (fd);
++}
++
++static void
++proc_setgroups_write (const long child_pid, const char * const str)
++{
++  const size_t str_len = strlen(str);
++
++  char setgroups_path[sizeof ("/proc//setgroups") + INT_STRLEN_BOUND (long)];
++
++  snprintf (setgroups_path, sizeof (setgroups_path),
++	    "/proc/%ld/setgroups", child_pid);
++
++  const int fd = open (setgroups_path, O_WRONLY);
++
++  if (fd < 0)
++    {
++      TEST_VERIFY_EXIT (errno == ENOENT);
++      FAIL_UNSUPPORTED ("/proc/%ld/setgroups not found\n", child_pid);
++    }
++
++  xwrite (fd, str, str_len);
++  xclose(fd);
++}
++
++static char child_stack[1024 * 1024];
++
++int
++do_test (void)
++{
++  base = support_create_and_chdir_toolong_temp_directory (BASENAME);
++
++  xmkdir (MOUNT_NAME, S_IRWXU);
++  atexit (do_cleanup);
++
++  /* Check whether user namespaces are supported.  */
++  {
++    pid_t pid = xfork ();
++    if (pid == 0)
++      {
++	if (unshare (CLONE_NEWUSER | CLONE_NEWNS) != 0)
++	  _exit (EXIT_UNSUPPORTED);
++	else
++	  _exit (0);
++      }
++    int status;
++    xwaitpid (pid, &status, 0);
++    TEST_VERIFY_EXIT (WIFEXITED (status));
++    if (WEXITSTATUS (status) != 0)
++      return WEXITSTATUS (status);
++  }
++
++  TEST_VERIFY_EXIT (socketpair (AF_UNIX, SOCK_STREAM, 0, sockfd) == 0);
++  pid_t child_pid = xclone (child_func, NULL, child_stack,
++			    sizeof (child_stack),
++			    CLONE_NEWUSER | CLONE_NEWNS | SIGCHLD);
++
++  xclose (sockfd[1]);
++  const int sock = sockfd[0];
++
++  char map_path[sizeof ("/proc//uid_map") + INT_STRLEN_BOUND (long)];
++  char map_buf[sizeof ("0  1") + INT_STRLEN_BOUND (long)];
++
++  snprintf (map_path, sizeof (map_path), "/proc/%ld/uid_map",
++	    (long) child_pid);
++  snprintf (map_buf, sizeof (map_buf), "0 %ld 1", (long) getuid());
++  update_map (map_buf, map_path);
++
++  proc_setgroups_write ((long) child_pid, "deny");
++  snprintf (map_path, sizeof (map_path), "/proc/%ld/gid_map",
++	    (long) child_pid);
++  snprintf (map_buf, sizeof (map_buf), "0 %ld 1", (long) getgid());
++  update_map (map_buf, map_path);
++
++  TEST_VERIFY_EXIT (send (sock, "1", 1, MSG_NOSIGNAL) == 1);
++  const int fd = recv_fd (sock);
++  TEST_VERIFY_EXIT (fd >= 0);
++  TEST_VERIFY_EXIT (fchdir (fd) == 0);
++
++  static char buf[2 * 10 + 1];
++  memset (buf, 'A', sizeof (buf));
++
++  /* Finally, call getcwd and check if it resulted in a buffer underflow.  */
++  char * cwd = getcwd (buf + sizeof (buf) / 2, 1);
++  TEST_VERIFY (cwd == NULL);
++  TEST_VERIFY (errno == ERANGE);
++
++  for (int i = 0; i < sizeof (buf); i++)
++    if (buf[i] != 'A')
++      {
++	printf ("buf[%d] = %02x\n", i, (unsigned int) buf[i]);
++	support_record_failure ();
++      }
++
++  TEST_VERIFY_EXIT (send (sock, "a", 1, MSG_NOSIGNAL) == 1);
++  xclose (sock);
++  TEST_VERIFY_EXIT (xwaitpid (child_pid, NULL, 0) == child_pid);
++
++  return 0;
++}
++
++#define CLEANUP_HANDLER do_cleanup
++#include <support/test-driver.c>
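
Note (illustration only, not part of the patch): the new tst-getcwd-smallbuff.c above creates a deeply nested temporary directory, bind-mounts / onto a subdirectory from inside a fresh user/mount namespace, fchdir()s to it from outside that namespace, and then checks that getcwd() with a 1-byte buffer fails with ERANGE instead of writing in front of the buffer. The user-visible contract it guards can be shown with a trivial stand-alone program that relies only on documented getcwd() behaviour:

  #include <errno.h>
  #include <stdio.h>
  #include <unistd.h>

  int
  main (void)
  {
    char buf[1];

    /* Any real working directory needs at least two bytes ("/" plus the
       terminating NUL), so a 1-byte buffer must fail with ERANGE and must
       not touch memory outside of buf.  */
    if (getcwd (buf, sizeof buf) == NULL && errno == ERANGE)
      puts ("getcwd reported ERANGE as expected");
    else
      puts ("unexpected getcwd result");
    return 0;
  }
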
 diff --git a/sysdeps/unix/sysv/linux/x86_64/sysdep.h b/sysdeps/unix/sysv/linux/x86_64/sysdep.h
 index c2eb37e575..c7f740a1df 100644
 --- a/sysdeps/unix/sysv/linux/x86_64/sysdep.h
@@ -9029,10 +10443,10 @@ index 5bf9eed80b..62e6f8fe11 100644
 +
  #endif /* linux/x86_64/x32/sysdep.h */
 diff --git a/sysdeps/x86/Makefile b/sysdeps/x86/Makefile
-index 95182a508c..a5112ef367 100644
+index 95182a508c..b7aec5df2b 100644
 --- a/sysdeps/x86/Makefile
 +++ b/sysdeps/x86/Makefile
-@@ -12,6 +12,12 @@ endif
+@@ -12,6 +12,42 @@ endif
  ifeq ($(subdir),setjmp)
  gen-as-const-headers += jmp_buf-ssp.sym
  sysdep_routines += __longjmp_cancel
@@ -9042,6 +10456,36 @@ index 95182a508c..a5112ef367 100644
 +tst-setjmp-cet-ENV = GLIBC_TUNABLES=glibc.cpu.x86_ibt=on:glibc.cpu.x86_shstk=on
 +endif
 +endif
++endif
++
++ifeq ($(subdir),string)
++sysdep_routines += cacheinfo
++
++tests += \
++  tst-memchr-rtm \
++  tst-memcmp-rtm \
++  tst-memmove-rtm \
++  tst-memrchr-rtm \
++  tst-memset-rtm \
++  tst-strchr-rtm \
++  tst-strcpy-rtm \
++  tst-strlen-rtm \
++  tst-strncmp-rtm \
++  tst-strrchr-rtm \
++  tst-wcsncmp-rtm \
++# tests
++
++CFLAGS-tst-memchr-rtm.c += -mrtm
++CFLAGS-tst-memcmp-rtm.c += -mrtm
++CFLAGS-tst-memmove-rtm.c += -mrtm
++CFLAGS-tst-memrchr-rtm.c += -mrtm
++CFLAGS-tst-memset-rtm.c += -mrtm
++CFLAGS-tst-strchr-rtm.c += -mrtm
++CFLAGS-tst-strcpy-rtm.c += -mrtm
++CFLAGS-tst-strlen-rtm.c += -mrtm
++CFLAGS-tst-strncmp-rtm.c += -mrtm -Wno-error
++CFLAGS-tst-strrchr-rtm.c += -mrtm
++CFLAGS-tst-wcsncmp-rtm.c += -mrtm -Wno-error
  endif
  
  ifeq ($(enable-cet),yes)
@@ -9109,6 +10553,151 @@ index e3e8ef27bb..39c13b7195 100644
  }
  
  #endif
+diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
+index 81a170a819..e1c22e3e58 100644
+--- a/sysdeps/x86/cpu-features.c
++++ b/sysdeps/x86/cpu-features.c
+@@ -333,6 +333,9 @@ init_cpu_features (struct cpu_features *cpu_features)
+ 
+       get_extended_indices (cpu_features);
+ 
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM_ALWAYS_ABORT))
++	cpu_features->cpuid[index_cpu_RTM].reg_RTM &= ~bit_cpu_RTM;
++
+       if (family == 0x06)
+ 	{
+ 	  model += extended_model;
+@@ -394,11 +397,42 @@ init_cpu_features (struct cpu_features *cpu_features)
+ 	      break;
+ 	    }
+ 
+-	 /* Disable TSX on some Haswell processors to avoid TSX on kernels that
+-	    weren't updated with the latest microcode package (which disables
+-	    broken feature by default).  */
++	 /* Disable TSX on some processors to avoid TSX on kernels that
++	    weren't updated with the latest microcode package (which
++	    disables broken feature by default).  */
+ 	 switch (model)
+ 	    {
++	    case 0x55:
++	      if (stepping <= 5)
++		goto disable_tsx;
++	      break;
++	    case 0x8e:
++	      /* NB: Although the errata documents that for model == 0x8e,
++		 only 0xb stepping or lower are impacted, the intention of
++		 the errata was to disable TSX on all client processors on
++		 all steppings.  Include 0xc stepping which is an Intel
++		 Core i7-8665U, a client mobile processor.  */
++	    case 0x9e:
++	      if (stepping > 0xc)
++		break;
++	      /* Fall through.  */
++	    case 0x4e:
++	    case 0x5e:
++	      {
++		/* Disable Intel TSX and enable RTM_ALWAYS_ABORT for
++		   processors listed in:
++
++https://www.intel.com/content/www/us/en/support/articles/000059422/processors.html
++		 */
++disable_tsx:
++		cpu_features->cpuid[index_cpu_HLE].reg_HLE
++		  &= ~bit_cpu_HLE;
++		cpu_features->cpuid[index_cpu_RTM].reg_RTM
++		  &= ~bit_cpu_RTM;
++		cpu_features->cpuid[index_cpu_RTM_ALWAYS_ABORT].reg_RTM_ALWAYS_ABORT
++		  |= bit_cpu_RTM_ALWAYS_ABORT;
++	      }
++	      break;
+ 	    case 0x3f:
+ 	      /* Xeon E7 v3 with stepping >= 4 has working TSX.  */
+ 	      if (stepping >= 4)
+@@ -424,8 +458,24 @@ init_cpu_features (struct cpu_features *cpu_features)
+ 	cpu_features->feature[index_arch_Prefer_No_VZEROUPPER]
+ 	  |= bit_arch_Prefer_No_VZEROUPPER;
+       else
+-	cpu_features->feature[index_arch_Prefer_No_AVX512]
+-	  |= bit_arch_Prefer_No_AVX512;
++	{
++	  cpu_features->feature[index_arch_Prefer_No_AVX512]
++	    |= bit_arch_Prefer_No_AVX512;
++
++	  /* Avoid RTM abort triggered by VZEROUPPER inside a
++	     transactionally executing RTM region.  */
++	  if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	    cpu_features->feature[index_arch_Prefer_No_VZEROUPPER]
++	      |= bit_arch_Prefer_No_VZEROUPPER;
++
++	  /* Since to compare 2 32-byte strings, 256-bit EVEX strcmp
++	     requires 2 loads, 3 VPCMPs and 2 KORDs while AVX2 strcmp
++	     requires 1 load, 2 VPCMPEQs, 1 VPMINU and 1 VPMOVMSKB,
++	     AVX2 strcmp is faster than EVEX strcmp.  */
++	  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable))
++	    cpu_features->feature[index_arch_Prefer_AVX2_STRCMP]
++	      |= bit_arch_Prefer_AVX2_STRCMP;
++	}
+     }
+   /* This spells out "AuthenticAMD" or "HygonGenuine".  */
+   else if ((ebx == 0x68747541 && ecx == 0x444d4163 && edx == 0x69746e65)
+diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
+index aea83e6e31..9fb97907b5 100644
+--- a/sysdeps/x86/cpu-features.h
++++ b/sysdeps/x86/cpu-features.h
+@@ -499,6 +499,7 @@ extern const struct cpu_features *__get_cpu_features (void)
+ #define bit_cpu_AVX512_4VNNIW	(1u << 2)
+ #define bit_cpu_AVX512_4FMAPS	(1u << 3)
+ #define bit_cpu_FSRM		(1u << 4)
++#define bit_cpu_RTM_ALWAYS_ABORT (1u << 11)
+ #define bit_cpu_PCONFIG		(1u << 18)
+ #define bit_cpu_IBT		(1u << 20)
+ #define bit_cpu_IBRS_IBPB	(1u << 26)
+@@ -667,6 +668,7 @@ extern const struct cpu_features *__get_cpu_features (void)
+ #define index_cpu_AVX512_4VNNIW COMMON_CPUID_INDEX_7
+ #define index_cpu_AVX512_4FMAPS	COMMON_CPUID_INDEX_7
+ #define index_cpu_FSRM		COMMON_CPUID_INDEX_7
++#define index_cpu_RTM_ALWAYS_ABORT COMMON_CPUID_INDEX_7
+ #define index_cpu_PCONFIG	COMMON_CPUID_INDEX_7
+ #define index_cpu_IBT		COMMON_CPUID_INDEX_7
+ #define index_cpu_IBRS_IBPB	COMMON_CPUID_INDEX_7
+@@ -835,6 +837,7 @@ extern const struct cpu_features *__get_cpu_features (void)
+ #define reg_AVX512_4VNNIW	edx
+ #define reg_AVX512_4FMAPS	edx
+ #define reg_FSRM		edx
++#define reg_RTM_ALWAYS_ABORT	edx
+ #define reg_PCONFIG		edx
+ #define reg_IBT			edx
+ #define reg_IBRS_IBPB		edx
+@@ -897,6 +900,7 @@ extern const struct cpu_features *__get_cpu_features (void)
+ #define bit_arch_Prefer_FSRM			(1u << 13)
+ #define bit_arch_Prefer_No_AVX512		(1u << 14)
+ #define bit_arch_MathVec_Prefer_No_AVX512	(1u << 15)
++#define bit_arch_Prefer_AVX2_STRCMP		(1u << 16)
+ 
+ #define index_arch_Fast_Rep_String		FEATURE_INDEX_2
+ #define index_arch_Fast_Copy_Backward		FEATURE_INDEX_2
+@@ -914,6 +918,7 @@ extern const struct cpu_features *__get_cpu_features (void)
+ #define index_arch_Prefer_No_AVX512		FEATURE_INDEX_2
+ #define index_arch_MathVec_Prefer_No_AVX512	FEATURE_INDEX_2
+ #define index_arch_Prefer_FSRM			FEATURE_INDEX_2
++#define index_arch_Prefer_AVX2_STRCMP		FEATURE_INDEX_2
+ 
+ /* XCR0 Feature flags.  */
+ #define bit_XMM_state		(1u << 1)
+diff --git a/sysdeps/x86/cpu-tunables.c b/sysdeps/x86/cpu-tunables.c
+index 861bd7bcaa..cb83ecc3b2 100644
+--- a/sysdeps/x86/cpu-tunables.c
++++ b/sysdeps/x86/cpu-tunables.c
+@@ -282,6 +282,9 @@ TUNABLE_CALLBACK (set_hwcaps) (tunable_val_t *valp)
+ 	      CHECK_GLIBC_IFUNC_ARCH_BOTH (n, cpu_features,
+ 					   Fast_Copy_Backward, disable,
+ 					   18);
++	      CHECK_GLIBC_IFUNC_ARCH_NEED_ARCH_BOTH
++		(n, cpu_features, Prefer_AVX2_STRCMP, AVX2_Usable,
++		 disable, 18);
+ 	    }
+ 	  break;
+ 	case 19:
 diff --git a/sysdeps/x86/dl-cet.c b/sysdeps/x86/dl-cet.c
 index ca3b5849bc..8ffaf94a00 100644
 --- a/sysdeps/x86/dl-cet.c
@@ -9126,25 +10715,779 @@ index ca3b5849bc..8ffaf94a00 100644
  
    /* Check if IBT is enabled by kernel.  */
    bool ibt_enabled
-diff --git a/sysdeps/x86/tst-setjmp-cet.c b/sysdeps/x86/tst-setjmp-cet.c
+diff --git a/sysdeps/x86/tst-get-cpu-features.c b/sysdeps/x86/tst-get-cpu-features.c
+index 0f55987ae5..bbb5cd356d 100644
+--- a/sysdeps/x86/tst-get-cpu-features.c
++++ b/sysdeps/x86/tst-get-cpu-features.c
+@@ -176,6 +176,7 @@ do_test (void)
+   CHECK_CPU_FEATURE (AVX512_4VNNIW);
+   CHECK_CPU_FEATURE (AVX512_4FMAPS);
+   CHECK_CPU_FEATURE (FSRM);
++  CHECK_CPU_FEATURE (RTM_ALWAYS_ABORT);
+   CHECK_CPU_FEATURE (PCONFIG);
+   CHECK_CPU_FEATURE (IBT);
+   CHECK_CPU_FEATURE (IBRS_IBPB);
+diff --git a/sysdeps/x86/tst-memchr-rtm.c b/sysdeps/x86/tst-memchr-rtm.c
 new file mode 100644
-index 0000000000..42c795d2a8
+index 0000000000..e47494011e
 --- /dev/null
-+++ b/sysdeps/x86/tst-setjmp-cet.c
-@@ -0,0 +1 @@
-+#include <setjmp/tst-setjmp.c>
-diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
-old mode 100644
-new mode 100755
-index 84f82c2406..fc1840e23f
---- a/sysdeps/x86_64/configure
-+++ b/sysdeps/x86_64/configure
-@@ -107,39 +107,6 @@ if test x"$build_mathvec" = xnotset; then
-   build_mathvec=yes
- fi
- 
--if test "$static_pie" = yes; then
--  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for linker static PIE support" >&5
++++ b/sysdeps/x86/tst-memchr-rtm.c
+@@ -0,0 +1,54 @@
++/* Test case for memchr inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  string1[100] = 'c';
++  string1[STRING_SIZE - 100] = 'c';
++  char *p = memchr (string1, 'c', STRING_SIZE);
++  if (p == &string1[100])
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  char *p = memchr (string1, 'c', STRING_SIZE);
++  if (p == &string1[100])
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("memchr", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-memcmp-rtm.c b/sysdeps/x86/tst-memcmp-rtm.c
+new file mode 100644
+index 0000000000..e4c8a623bb
+--- /dev/null
++++ b/sysdeps/x86/tst-memcmp-rtm.c
+@@ -0,0 +1,52 @@
++/* Test case for memcmp inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++char string2[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  memset (string2, 'a', STRING_SIZE);
++  if (memcmp (string1, string2, STRING_SIZE) == 0)
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  if (memcmp (string1, string2, STRING_SIZE) == 0)
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("memcmp", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-memmove-rtm.c b/sysdeps/x86/tst-memmove-rtm.c
+new file mode 100644
+index 0000000000..4bf97ef1e3
+--- /dev/null
++++ b/sysdeps/x86/tst-memmove-rtm.c
+@@ -0,0 +1,53 @@
++/* Test case for memmove inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++char string2[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  if (memmove (string2, string1, STRING_SIZE) == string2
++      && memcmp (string2, string1, STRING_SIZE) == 0)
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  if (memmove (string2, string1, STRING_SIZE) == string2
++      && memcmp (string2, string1, STRING_SIZE) == 0)
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("memmove", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-memrchr-rtm.c b/sysdeps/x86/tst-memrchr-rtm.c
+new file mode 100644
+index 0000000000..a57a5a8eb9
+--- /dev/null
++++ b/sysdeps/x86/tst-memrchr-rtm.c
+@@ -0,0 +1,54 @@
++/* Test case for memrchr inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  string1[100] = 'c';
++  string1[STRING_SIZE - 100] = 'c';
++  char *p = memrchr (string1, 'c', STRING_SIZE);
++  if (p == &string1[STRING_SIZE - 100])
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  char *p = memrchr (string1, 'c', STRING_SIZE);
++  if (p == &string1[STRING_SIZE - 100])
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("memrchr", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-memset-rtm.c b/sysdeps/x86/tst-memset-rtm.c
+new file mode 100644
+index 0000000000..bf343a4dad
+--- /dev/null
++++ b/sysdeps/x86/tst-memset-rtm.c
+@@ -0,0 +1,45 @@
++/* Test case for memset inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  return EXIT_SUCCESS;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  memset (string1, 'a', STRING_SIZE);
++  return 0;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("memset", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-setjmp-cet.c b/sysdeps/x86/tst-setjmp-cet.c
+new file mode 100644
+index 0000000000..42c795d2a8
+--- /dev/null
++++ b/sysdeps/x86/tst-setjmp-cet.c
+@@ -0,0 +1 @@
++#include <setjmp/tst-setjmp.c>
+diff --git a/sysdeps/x86/tst-strchr-rtm.c b/sysdeps/x86/tst-strchr-rtm.c
+new file mode 100644
+index 0000000000..a82e29c072
+--- /dev/null
++++ b/sysdeps/x86/tst-strchr-rtm.c
+@@ -0,0 +1,54 @@
++/* Test case for strchr inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE - 1);
++  string1[100] = 'c';
++  string1[STRING_SIZE - 100] = 'c';
++  char *p = strchr (string1, 'c');
++  if (p == &string1[100])
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  char *p = strchr (string1, 'c');
++  if (p == &string1[100])
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("strchr", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-strcpy-rtm.c b/sysdeps/x86/tst-strcpy-rtm.c
+new file mode 100644
+index 0000000000..2b2a583fb4
+--- /dev/null
++++ b/sysdeps/x86/tst-strcpy-rtm.c
+@@ -0,0 +1,53 @@
++/* Test case for strcpy inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++char string2[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE - 1);
++  if (strcpy (string2, string1) == string2
++      && strcmp (string2, string1) == 0)
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  if (strcpy (string2, string1) == string2
++      && strcmp (string2, string1) == 0)
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("strcpy", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-string-rtm.h b/sysdeps/x86/tst-string-rtm.h
+new file mode 100644
+index 0000000000..6ed9eca017
+--- /dev/null
++++ b/sysdeps/x86/tst-string-rtm.h
+@@ -0,0 +1,72 @@
++/* Test string function in a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <string.h>
++#include <x86intrin.h>
++#include <cpu-features.h>
++#include <support/check.h>
++#include <support/test-driver.h>
++
++static int
++do_test_1 (const char *name, unsigned int loop, int (*prepare) (void),
++	   int (*function) (void))
++{
++  if (!CPU_FEATURE_USABLE (RTM))
++    return EXIT_UNSUPPORTED;
++
++  int status = prepare ();
++  if (status != EXIT_SUCCESS)
++    return status;
++
++  unsigned int i;
++  unsigned int naborts = 0;
++  unsigned int failed = 0;
++  for (i = 0; i < loop; i++)
++    {
++      failed |= function ();
++      if (_xbegin() == _XBEGIN_STARTED)
++	{
++	  failed |= function ();
++	  _xend();
++	}
++      else
++	{
++	  failed |= function ();
++	  ++naborts;
++	}
++    }
++
++  if (failed)
++    FAIL_EXIT1 ("%s() failed", name);
++
++  if (naborts)
++    {
++      /* NB: Low single digit (<= 5%) noise-level aborts are normal for
++	 TSX.  */
++      double rate = 100 * ((double) naborts) / ((double) loop);
++      if (rate > 5)
++	FAIL_EXIT1 ("TSX abort rate: %.2f%% (%d out of %d)",
++		    rate, naborts, loop);
++    }
++
++  return EXIT_SUCCESS;
++}
++
++static int do_test (void);
++
++#include <support/test-driver.c>
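
Note (illustration only, not part of the patch): tst-string-rtm.h above runs each string function both inside a hardware transaction and on the abort fallback path, and fails if more than 5% of the transactions abort. For readers unfamiliar with the RTM intrinsics, this is the basic _xbegin()/_xend() pattern it builds on; the helper names below are made up, and the sketch assumes GCC or Clang with -mrtm and an RTM-capable CPU (it performs no CPUID check, unlike the real test):

  #include <stdio.h>
  #include <immintrin.h>   /* _xbegin, _xend, _XBEGIN_STARTED */

  /* Run WORK once: inside an RTM transaction when one can be started,
     otherwise on the abort fallback path.  Returns 1 if the transaction
     aborted, 0 if it committed.  */
  static int
  run_in_transaction (void (*work) (void))
  {
    if (_xbegin () == _XBEGIN_STARTED)
      {
        work ();   /* transactional execution */
        _xend ();  /* commit */
        return 0;
      }
    work ();       /* fallback: execute non-transactionally */
    return 1;
  }

  static void
  noop (void)
  {
  }

  int
  main (void)
  {
    printf ("transaction %s\n",
            run_in_transaction (noop) ? "aborted" : "committed");
    return 0;
  }
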
+diff --git a/sysdeps/x86/tst-strlen-rtm.c b/sysdeps/x86/tst-strlen-rtm.c
+new file mode 100644
+index 0000000000..0dcf14db87
+--- /dev/null
++++ b/sysdeps/x86/tst-strlen-rtm.c
+@@ -0,0 +1,53 @@
++/* Test case for strlen inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE - 1);
++  string1[STRING_SIZE - 100] = '\0';
++  size_t len = strlen (string1);
++  if (len == STRING_SIZE - 100)
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  size_t len = strlen (string1);
++  if (len == STRING_SIZE - 100)
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("strlen", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-strncmp-rtm.c b/sysdeps/x86/tst-strncmp-rtm.c
+new file mode 100644
+index 0000000000..aef9866cf2
+--- /dev/null
++++ b/sysdeps/x86/tst-strncmp-rtm.c
+@@ -0,0 +1,81 @@
++/* Test case for strncmp inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <stdint.h>
++#include <tst-string-rtm.h>
++
++#ifdef WIDE
++# define CHAR wchar_t
++# define MEMSET wmemset
++# define STRNCMP wcsncmp
++# define TEST_NAME "wcsncmp"
++#else /* !WIDE */
++# define CHAR char
++# define MEMSET memset
++# define STRNCMP strncmp
++# define TEST_NAME "strncmp"
++#endif /* !WIDE */
++
++
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++CHAR string1[STRING_SIZE];
++CHAR string2[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  MEMSET (string1, 'a', STRING_SIZE - 1);
++  MEMSET (string2, 'a', STRING_SIZE - 1);
++  if (STRNCMP (string1, string2, STRING_SIZE) == 0)
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  if (STRNCMP (string1, string2, STRING_SIZE) == 0)
++    return 0;
++  else
++    return 1;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function_overflow (void)
++{
++  if (STRNCMP (string1, string2, SIZE_MAX) == 0)
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  int status = do_test_1 (TEST_NAME, LOOP, prepare, function);
++  if (status != EXIT_SUCCESS)
++    return status;
++  status = do_test_1 (TEST_NAME, LOOP, prepare, function_overflow);
++  return status;
++}
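
Note (illustration only, not part of the patch): the function_overflow case above passes SIZE_MAX as the count on purpose. Because string1 and string2 are identical up to their terminating null character, a conforming strncmp/wcsncmp stops comparing there, so the oversized count is harmless and only trips an implementation that mis-scales the count internally, which is the overflow scenario the test is meant to catch. A trivial stand-alone version of the same idea, using plain strncmp:

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  int
  main (void)
  {
    char a[] = "same";
    char b[] = "same";

    /* strncmp compares at most n characters but never past a null
       character, so SIZE_MAX is a valid count for equal strings.  */
    assert (strncmp (a, b, SIZE_MAX) == 0);
    return 0;
  }
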
+diff --git a/sysdeps/x86/tst-strrchr-rtm.c b/sysdeps/x86/tst-strrchr-rtm.c
+new file mode 100644
+index 0000000000..e32bfaf5f5
+--- /dev/null
++++ b/sysdeps/x86/tst-strrchr-rtm.c
+@@ -0,0 +1,53 @@
++/* Test case for strrchr inside a transactionally executing RTM region.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <tst-string-rtm.h>
++
++#define LOOP 3000
++#define STRING_SIZE 1024
++char string1[STRING_SIZE];
++
++__attribute__ ((noinline, noclone))
++static int
++prepare (void)
++{
++  memset (string1, 'a', STRING_SIZE - 1);
++  string1[STRING_SIZE - 100] = 'c';
++  char *p = strrchr (string1, 'c');
++  if (p == &string1[STRING_SIZE - 100])
++    return EXIT_SUCCESS;
++  else
++    return EXIT_FAILURE;
++}
++
++__attribute__ ((noinline, noclone))
++static int
++function (void)
++{
++  char *p = strrchr (string1, 'c');
++  if (p == &string1[STRING_SIZE - 100])
++    return 0;
++  else
++    return 1;
++}
++
++static int
++do_test (void)
++{
++  return do_test_1 ("strrchr", LOOP, prepare, function);
++}
+diff --git a/sysdeps/x86/tst-wcsncmp-rtm.c b/sysdeps/x86/tst-wcsncmp-rtm.c
+new file mode 100644
+index 0000000000..bad3b86378
+--- /dev/null
++++ b/sysdeps/x86/tst-wcsncmp-rtm.c
+@@ -0,0 +1,21 @@
++/* Test case for wcsncmp inside a transactionally executing RTM region.
++   Copyright (C) 2022 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#define WIDE 1
++#include <wchar.h>
++#include "tst-strncmp-rtm.c"
+diff --git a/sysdeps/x86_64/Makefile b/sysdeps/x86_64/Makefile
+index d51cf03ac9..b1951adce9 100644
+--- a/sysdeps/x86_64/Makefile
++++ b/sysdeps/x86_64/Makefile
+@@ -20,6 +20,8 @@ endif
+ ifeq ($(subdir),string)
+ sysdep_routines += cacheinfo strcasecmp_l-nonascii strncase_l-nonascii
+ gen-as-const-headers += locale-defines.sym
++tests += \
++  tst-rsi-strlen
+ endif
+ 
+ ifeq ($(subdir),elf)
+@@ -150,6 +152,11 @@ ifeq ($(subdir),csu)
+ gen-as-const-headers += tlsdesc.sym rtld-offsets.sym
+ endif
+ 
++ifeq ($(subdir),wcsmbs)
++tests += \
++  tst-rsi-wcslen
++endif
++
+ $(objpfx)x86_64/tst-x86_64mod-1.os: $(objpfx)tst-x86_64mod-1.os
+ 	$(make-target-directory)
+ 	rm -f $@
+diff --git a/sysdeps/x86_64/configure b/sysdeps/x86_64/configure
+old mode 100644
+new mode 100755
+index 84f82c2406..fc1840e23f
+--- a/sysdeps/x86_64/configure
++++ b/sysdeps/x86_64/configure
+@@ -107,39 +107,6 @@ if test x"$build_mathvec" = xnotset; then
+   build_mathvec=yes
+ fi
+ 
+-if test "$static_pie" = yes; then
+-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for linker static PIE support" >&5
 -$as_echo_n "checking for linker static PIE support... " >&6; }
 -if ${libc_cv_ld_static_pie+:} false; then :
 -  $as_echo_n "(cached) " >&6
@@ -9176,175 +11519,10311 @@ index 84f82c2406..fc1840e23f
 -  fi
 -fi
 -
- $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h
+ $as_echo "#define PI_STATIC_AND_HIDDEN 1" >>confdefs.h
+ 
+ 
+diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
+index cdaba0c075..611a7d9ba3 100644
+--- a/sysdeps/x86_64/configure.ac
++++ b/sysdeps/x86_64/configure.ac
+@@ -53,31 +53,6 @@ if test x"$build_mathvec" = xnotset; then
+   build_mathvec=yes
+ fi
+ 
+-dnl Check if linker supports static PIE with the fix for
+-dnl
+-dnl https://sourceware.org/bugzilla/show_bug.cgi?id=21782
+-dnl
+-if test "$static_pie" = yes; then
+-  AC_CACHE_CHECK(for linker static PIE support, libc_cv_ld_static_pie, [dnl
+-cat > conftest.s <<\EOF
+-	.text
+-	.global _start
+-	.weak foo
+-_start:
+-	leaq	foo(%rip), %rax
+-EOF
+-  libc_cv_pie_option="-Wl,-pie"
+-  if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS $LDFLAGS -nostartfiles -nostdlib $no_ssp $libc_cv_pie_option -o conftest conftest.s 1>&AS_MESSAGE_LOG_FD); then
+-    libc_cv_ld_static_pie=yes
+-  else
+-    libc_cv_ld_static_pie=no
+-  fi
+-rm -f conftest*])
+-  if test "$libc_cv_ld_static_pie" != yes; then
+-    AC_MSG_ERROR([linker support for static PIE needed])
+-  fi
+-fi
+-
+ dnl It is always possible to access static and hidden symbols in an
+ dnl position independent way.
+ AC_DEFINE(PI_STATIC_AND_HIDDEN)
+diff --git a/sysdeps/x86_64/dl-machine.h b/sysdeps/x86_64/dl-machine.h
+index 8e9baffeb4..74029871d8 100644
+--- a/sysdeps/x86_64/dl-machine.h
++++ b/sysdeps/x86_64/dl-machine.h
+@@ -315,16 +315,22 @@ elf_machine_rela (struct link_map *map, const ElfW(Rela) *reloc,
+ 	{
+ # ifndef RTLD_BOOTSTRAP
+ 	  if (sym_map != map
+-	      && sym_map->l_type != lt_executable
+ 	      && !sym_map->l_relocated)
+ 	    {
+ 	      const char *strtab
+ 		= (const char *) D_PTR (map, l_info[DT_STRTAB]);
+-	      _dl_error_printf ("\
++	      if (sym_map->l_type == lt_executable)
++		_dl_fatal_printf ("\
++%s: IFUNC symbol '%s' referenced in '%s' is defined in the executable \
++and creates an unsatisfiable circular dependency.\n",
++				  RTLD_PROGNAME, strtab + refsym->st_name,
++				  map->l_name);
++	      else
++		_dl_error_printf ("\
+ %s: Relink `%s' with `%s' for IFUNC symbol `%s'\n",
+-				RTLD_PROGNAME, map->l_name,
+-				sym_map->l_name,
+-				strtab + refsym->st_name);
++				  RTLD_PROGNAME, map->l_name,
++				  sym_map->l_name,
++				  strtab + refsym->st_name);
+ 	    }
+ # endif
+ 	  value = ((ElfW(Addr) (*) (void)) value) ();
+diff --git a/sysdeps/x86_64/memchr.S b/sysdeps/x86_64/memchr.S
+index a5c879d2af..070e5ef90b 100644
+--- a/sysdeps/x86_64/memchr.S
++++ b/sysdeps/x86_64/memchr.S
+@@ -21,9 +21,11 @@
+ #ifdef USE_AS_WMEMCHR
+ # define MEMCHR		wmemchr
+ # define PCMPEQ		pcmpeqd
++# define CHAR_PER_VEC	4
+ #else
+ # define MEMCHR		memchr
+ # define PCMPEQ		pcmpeqb
++# define CHAR_PER_VEC	16
+ #endif
+ 
+ /* fast SSE2 version with using pmaxub and 64 byte loop */
+@@ -33,15 +35,14 @@ ENTRY(MEMCHR)
+ 	movd	%esi, %xmm1
+ 	mov	%edi, %ecx
+ 
++#ifdef __ILP32__
++	/* Clear the upper 32 bits.  */
++	movl	%edx, %edx
++#endif
+ #ifdef USE_AS_WMEMCHR
+ 	test	%RDX_LP, %RDX_LP
+ 	jz	L(return_null)
+-	shl	$2, %RDX_LP
+ #else
+-# ifdef __ILP32__
+-	/* Clear the upper 32 bits.  */
+-	movl	%edx, %edx
+-# endif
+ 	punpcklbw %xmm1, %xmm1
+ 	test	%RDX_LP, %RDX_LP
+ 	jz	L(return_null)
+@@ -60,13 +61,16 @@ ENTRY(MEMCHR)
+ 	test	%eax, %eax
+ 
+ 	jnz	L(matches_1)
+-	sub	$16, %rdx
++	sub	$CHAR_PER_VEC, %rdx
+ 	jbe	L(return_null)
+ 	add	$16, %rdi
+ 	and	$15, %ecx
+ 	and	$-16, %rdi
++#ifdef USE_AS_WMEMCHR
++	shr	$2, %ecx
++#endif
+ 	add	%rcx, %rdx
+-	sub	$64, %rdx
++	sub	$(CHAR_PER_VEC * 4), %rdx
+ 	jbe	L(exit_loop)
+ 	jmp	L(loop_prolog)
+ 
+@@ -77,16 +81,21 @@ L(crosscache):
+ 	movdqa	(%rdi), %xmm0
+ 
+ 	PCMPEQ	%xmm1, %xmm0
+-/* Check if there is a match.  */
++	/* Check if there is a match.  */
+ 	pmovmskb %xmm0, %eax
+-/* Remove the leading bytes.  */
++	/* Remove the leading bytes.  */
+ 	sar	%cl, %eax
+ 	test	%eax, %eax
+ 	je	L(unaligned_no_match)
+-/* Check which byte is a match.  */
++	/* Check which byte is a match.  */
+ 	bsf	%eax, %eax
+-
++#ifdef USE_AS_WMEMCHR
++	mov	%eax, %esi
++	shr	$2, %esi
++	sub	%rsi, %rdx
++#else
+ 	sub	%rax, %rdx
++#endif
+ 	jbe	L(return_null)
+ 	add	%rdi, %rax
+ 	add	%rcx, %rax
+@@ -94,15 +103,18 @@ L(crosscache):
+ 
+ 	.p2align 4
+ L(unaligned_no_match):
+-        /* "rcx" is less than 16.  Calculate "rdx + rcx - 16" by using
++	/* "rcx" is less than 16.  Calculate "rdx + rcx - 16" by using
+ 	   "rdx - (16 - rcx)" instead of "(rdx + rcx) - 16" to void
+ 	   possible addition overflow.  */
+ 	neg	%rcx
+ 	add	$16, %rcx
++#ifdef USE_AS_WMEMCHR
++	shr	$2, %ecx
++#endif
+ 	sub	%rcx, %rdx
+ 	jbe	L(return_null)
+ 	add	$16, %rdi
+-	sub	$64, %rdx
++	sub	$(CHAR_PER_VEC * 4), %rdx
+ 	jbe	L(exit_loop)
+ 
+ 	.p2align 4
+@@ -135,7 +147,7 @@ L(loop_prolog):
+ 	test	$0x3f, %rdi
+ 	jz	L(align64_loop)
+ 
+-	sub	$64, %rdx
++	sub	$(CHAR_PER_VEC * 4), %rdx
+ 	jbe	L(exit_loop)
+ 
+ 	movdqa	(%rdi), %xmm0
+@@ -167,11 +179,14 @@ L(loop_prolog):
+ 	mov	%rdi, %rcx
+ 	and	$-64, %rdi
+ 	and	$63, %ecx
++#ifdef USE_AS_WMEMCHR
++	shr	$2, %ecx
++#endif
+ 	add	%rcx, %rdx
+ 
+ 	.p2align 4
+ L(align64_loop):
+-	sub	$64, %rdx
++	sub	$(CHAR_PER_VEC * 4), %rdx
+ 	jbe	L(exit_loop)
+ 	movdqa	(%rdi), %xmm0
+ 	movdqa	16(%rdi), %xmm2
+@@ -218,7 +233,7 @@ L(align64_loop):
+ 
+ 	.p2align 4
+ L(exit_loop):
+-	add	$32, %edx
++	add	$(CHAR_PER_VEC * 2), %edx
+ 	jle	L(exit_loop_32)
+ 
+ 	movdqa	(%rdi), %xmm0
+@@ -238,7 +253,7 @@ L(exit_loop):
+ 	pmovmskb %xmm3, %eax
+ 	test	%eax, %eax
+ 	jnz	L(matches32_1)
+-	sub	$16, %edx
++	sub	$CHAR_PER_VEC, %edx
+ 	jle	L(return_null)
+ 
+ 	PCMPEQ	48(%rdi), %xmm1
+@@ -250,13 +265,13 @@ L(exit_loop):
+ 
+ 	.p2align 4
+ L(exit_loop_32):
+-	add	$32, %edx
++	add	$(CHAR_PER_VEC * 2), %edx
+ 	movdqa	(%rdi), %xmm0
+ 	PCMPEQ	%xmm1, %xmm0
+ 	pmovmskb %xmm0, %eax
+ 	test	%eax, %eax
+ 	jnz	L(matches_1)
+-	sub	$16, %edx
++	sub	$CHAR_PER_VEC, %edx
+ 	jbe	L(return_null)
+ 
+ 	PCMPEQ	16(%rdi), %xmm1
+@@ -293,7 +308,13 @@ L(matches32):
+ 	.p2align 4
+ L(matches_1):
+ 	bsf	%eax, %eax
++#ifdef USE_AS_WMEMCHR
++	mov	%eax, %esi
++	shr	$2, %esi
++	sub	%rsi, %rdx
++#else
+ 	sub	%rax, %rdx
++#endif
+ 	jbe	L(return_null)
+ 	add	%rdi, %rax
+ 	ret
+@@ -301,7 +322,13 @@ L(matches_1):
+ 	.p2align 4
+ L(matches16_1):
+ 	bsf	%eax, %eax
++#ifdef USE_AS_WMEMCHR
++	mov	%eax, %esi
++	shr	$2, %esi
++	sub	%rsi, %rdx
++#else
+ 	sub	%rax, %rdx
++#endif
+ 	jbe	L(return_null)
+ 	lea	16(%rdi, %rax), %rax
+ 	ret
+@@ -309,7 +336,13 @@ L(matches16_1):
+ 	.p2align 4
+ L(matches32_1):
+ 	bsf	%eax, %eax
++#ifdef USE_AS_WMEMCHR
++	mov	%eax, %esi
++	shr	$2, %esi
++	sub	%rsi, %rdx
++#else
+ 	sub	%rax, %rdx
++#endif
+ 	jbe	L(return_null)
+ 	lea	32(%rdi, %rax), %rax
+ 	ret
+@@ -317,7 +350,13 @@ L(matches32_1):
+ 	.p2align 4
+ L(matches48_1):
+ 	bsf	%eax, %eax
++#ifdef USE_AS_WMEMCHR
++	mov	%eax, %esi
++	shr	$2, %esi
++	sub	%rsi, %rdx
++#else
+ 	sub	%rax, %rdx
++#endif
+ 	jbe	L(return_null)
+ 	lea	48(%rdi, %rax), %rax
+ 	ret
+diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
+index 395e432c09..da1446d731 100644
+--- a/sysdeps/x86_64/multiarch/Makefile
++++ b/sysdeps/x86_64/multiarch/Makefile
+@@ -43,7 +43,45 @@ sysdep_routines += strncat-c stpncpy-c strncpy-c \
+ 		   memmove-avx512-unaligned-erms \
+ 		   memset-sse2-unaligned-erms \
+ 		   memset-avx2-unaligned-erms \
+-		   memset-avx512-unaligned-erms
++		   memset-avx512-unaligned-erms \
++		   memchr-avx2-rtm \
++		   memcmp-avx2-movbe-rtm \
++		   memmove-avx-unaligned-erms-rtm \
++		   memrchr-avx2-rtm \
++		   memset-avx2-unaligned-erms-rtm \
++		   rawmemchr-avx2-rtm \
++		   strchr-avx2-rtm \
++		   strcmp-avx2-rtm \
++		   strchrnul-avx2-rtm \
++		   stpcpy-avx2-rtm \
++		   stpncpy-avx2-rtm \
++		   strcat-avx2-rtm \
++		   strcpy-avx2-rtm \
++		   strlen-avx2-rtm \
++		   strncat-avx2-rtm \
++		   strncmp-avx2-rtm \
++		   strncpy-avx2-rtm \
++		   strnlen-avx2-rtm \
++		   strrchr-avx2-rtm \
++		   memchr-evex \
++		   memcmp-evex-movbe \
++		   memmove-evex-unaligned-erms \
++		   memrchr-evex \
++		   memset-evex-unaligned-erms \
++		   rawmemchr-evex \
++		   stpcpy-evex \
++		   stpncpy-evex \
++		   strcat-evex \
++		   strchr-evex \
++		   strchrnul-evex \
++		   strcmp-evex \
++		   strcpy-evex \
++		   strlen-evex \
++		   strncat-evex \
++		   strncmp-evex \
++		   strncpy-evex \
++		   strnlen-evex \
++		   strrchr-evex
+ CFLAGS-varshift.c += -msse4
+ CFLAGS-strcspn-c.c += -msse4
+ CFLAGS-strpbrk-c.c += -msse4
+@@ -59,8 +97,24 @@ sysdep_routines += wmemcmp-sse4 wmemcmp-ssse3 wmemcmp-c \
+ 		   wcscpy-ssse3 wcscpy-c \
+ 		   wcschr-sse2 wcschr-avx2 \
+ 		   wcsrchr-sse2 wcsrchr-avx2 \
+-		   wcsnlen-sse4_1 wcsnlen-c \
+-		   wcslen-sse2 wcslen-avx2 wcsnlen-avx2
++		   wcslen-sse2 wcslen-sse4_1 wcslen-avx2 \
++		   wcsnlen-c wcsnlen-sse4_1 wcsnlen-avx2 \
++		   wcschr-avx2-rtm \
++		   wcscmp-avx2-rtm \
++		   wcslen-avx2-rtm \
++		   wcsncmp-avx2-rtm \
++		   wcsnlen-avx2-rtm \
++		   wcsrchr-avx2-rtm \
++		   wmemchr-avx2-rtm \
++		   wmemcmp-avx2-movbe-rtm \
++		   wcschr-evex \
++		   wcscmp-evex \
++		   wcslen-evex \
++		   wcsncmp-evex \
++		   wcsnlen-evex \
++		   wcsrchr-evex \
++		   wmemchr-evex \
++		   wmemcmp-evex-movbe
+ endif
+ 
+ ifeq ($(subdir),debug)
+diff --git a/sysdeps/x86_64/multiarch/ifunc-avx2.h b/sysdeps/x86_64/multiarch/ifunc-avx2.h
+index 69f30398ae..74189b6aa5 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-avx2.h
++++ b/sysdeps/x86_64/multiarch/ifunc-avx2.h
+@@ -21,16 +21,28 @@
+ 
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable)
++	  && CPU_FEATURES_CPU_P (cpu_features, BMI2))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
+ 
+   return OPTIMIZE (sse2);
+ }
+diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+index ce7eb1eecf..56b05ee741 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
++++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+@@ -43,6 +43,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, memchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __memchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, memchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, memchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __memchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, memchr, 1, __memchr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/memcmp.c.  */
+@@ -51,6 +60,16 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      (HAS_ARCH_FEATURE (AVX2_Usable)
+ 			       && HAS_CPU_FEATURE (MOVBE)),
+ 			      __memcmp_avx2_movbe)
++	      IFUNC_IMPL_ADD (array, i, memcmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (MOVBE)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memcmp_avx2_movbe_rtm)
++	      IFUNC_IMPL_ADD (array, i, memcmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (MOVBE)),
++			      __memcmp_evex_movbe)
+ 	      IFUNC_IMPL_ADD (array, i, memcmp, HAS_CPU_FEATURE (SSE4_1),
+ 			      __memcmp_sse4_1)
+ 	      IFUNC_IMPL_ADD (array, i, memcmp, HAS_CPU_FEATURE (SSSE3),
+@@ -64,10 +83,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __memmove_chk_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memmove_chk_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memmove_chk_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+@@ -75,6 +94,20 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __memmove_chk_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memmove_chk_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memmove_chk_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memmove_chk_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memmove_chk_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memmove_chk,
+ 			      HAS_CPU_FEATURE (SSSE3),
+ 			      __memmove_chk_ssse3_back)
+@@ -97,14 +130,28 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, memmove,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __memmove_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, memmove,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memmove_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, memmove,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memmove_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, memmove,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memmove_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, memmove,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memmove_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memmove,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __memmove_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, memmove,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memmove_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, memmove,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memmove_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memmove, HAS_CPU_FEATURE (SSSE3),
+ 			      __memmove_ssse3_back)
+@@ -121,6 +168,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, memrchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __memrchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, memrchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memrchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, memrchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __memrchr_evex)
++
+ 	      IFUNC_IMPL_ADD (array, i, memrchr, 1, __memrchr_sse2))
+ 
+ #ifdef SHARED
+@@ -139,10 +195,28 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __memset_chk_avx2_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memset_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memset_chk_avx2_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memset_chk,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memset_chk_avx2_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memset_chk,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __memset_chk_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, __memset_chk,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __memset_chk_evex_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, __memset_chk,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
+ 			      __memset_chk_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memset_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
+ 			      __memset_chk_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, __memset_chk,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+@@ -164,10 +238,28 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __memset_avx2_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memset,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memset_avx2_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, memset,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memset_avx2_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, memset,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __memset_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, memset,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __memset_evex_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, memset,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
+ 			      __memset_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memset,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
+ 			      __memset_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, memset,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+@@ -179,20 +271,51 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, rawmemchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __rawmemchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, rawmemchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __rawmemchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, rawmemchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __rawmemchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, rawmemchr, 1, __rawmemchr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/strlen.c.  */
+   IFUNC_IMPL (i, name, strlen,
+ 	      IFUNC_IMPL_ADD (array, i, strlen,
+-			      HAS_ARCH_FEATURE (AVX2_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
+ 			      __strlen_avx2)
++	      IFUNC_IMPL_ADD (array, i, strlen,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strlen_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strlen,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __strlen_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strlen, 1, __strlen_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/strnlen.c.  */
+   IFUNC_IMPL (i, name, strnlen,
+ 	      IFUNC_IMPL_ADD (array, i, strnlen,
+-			      HAS_ARCH_FEATURE (AVX2_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
+ 			      __strnlen_avx2)
++	      IFUNC_IMPL_ADD (array, i, strnlen,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strnlen_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strnlen,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __strnlen_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strnlen, 1, __strnlen_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/stpncpy.c.  */
+@@ -201,6 +324,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      __stpncpy_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, stpncpy, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __stpncpy_avx2)
++	      IFUNC_IMPL_ADD (array, i, stpncpy,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __stpncpy_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, stpncpy,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __stpncpy_evex)
+ 	      IFUNC_IMPL_ADD (array, i, stpncpy, 1,
+ 			      __stpncpy_sse2_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, stpncpy, 1, __stpncpy_sse2))
+@@ -211,6 +342,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      __stpcpy_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __stpcpy_avx2)
++	      IFUNC_IMPL_ADD (array, i, stpcpy,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __stpcpy_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, stpcpy,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __stpcpy_evex)
+ 	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
+ 
+@@ -245,6 +384,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+   IFUNC_IMPL (i, name, strcat,
+ 	      IFUNC_IMPL_ADD (array, i, strcat, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strcat_avx2)
++	      IFUNC_IMPL_ADD (array, i, strcat,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strcat_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strcat,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strcat_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strcat, HAS_CPU_FEATURE (SSSE3),
+ 			      __strcat_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, strcat, 1, __strcat_sse2_unaligned)
+@@ -255,6 +402,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, strchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, strchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __strchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strchr, 1, __strchr_sse2_no_bsf)
+ 	      IFUNC_IMPL_ADD (array, i, strchr, 1, __strchr_sse2))
+ 
+@@ -263,6 +419,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, strchrnul,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strchrnul_avx2)
++	      IFUNC_IMPL_ADD (array, i, strchrnul,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strchrnul_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strchrnul,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __strchrnul_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strchrnul, 1, __strchrnul_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/strrchr.c.  */
+@@ -270,6 +435,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, strrchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strrchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, strrchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strrchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strrchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strrchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strrchr, 1, __strrchr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/strcmp.c.  */
+@@ -277,6 +450,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, strcmp,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strcmp_avx2)
++	      IFUNC_IMPL_ADD (array, i, strcmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strcmp_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strcmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __strcmp_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSE4_2),
+ 			      __strcmp_sse42)
+ 	      IFUNC_IMPL_ADD (array, i, strcmp, HAS_CPU_FEATURE (SSSE3),
+@@ -288,6 +470,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+   IFUNC_IMPL (i, name, strcpy,
+ 	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strcpy_avx2)
++	      IFUNC_IMPL_ADD (array, i, strcpy,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strcpy_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strcpy,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strcpy_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_CPU_FEATURE (SSSE3),
+ 			      __strcpy_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
+@@ -331,6 +521,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+   IFUNC_IMPL (i, name, strncat,
+ 	      IFUNC_IMPL_ADD (array, i, strncat, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strncat_avx2)
++	      IFUNC_IMPL_ADD (array, i, strncat,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strncat_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strncat,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strncat_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strncat, HAS_CPU_FEATURE (SSSE3),
+ 			      __strncat_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, strncat, 1,
+@@ -341,6 +539,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+   IFUNC_IMPL (i, name, strncpy,
+ 	      IFUNC_IMPL_ADD (array, i, strncpy, HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strncpy_avx2)
++	      IFUNC_IMPL_ADD (array, i, strncpy,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strncpy_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strncpy,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strncpy_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strncpy, HAS_CPU_FEATURE (SSSE3),
+ 			      __strncpy_ssse3)
+ 	      IFUNC_IMPL_ADD (array, i, strncpy, 1,
+@@ -370,6 +576,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, wcschr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wcschr_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcschr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcschr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcschr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcschr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wcschr, 1, __wcschr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wcsrchr.c.  */
+@@ -377,6 +592,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, wcsrchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wcsrchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcsrchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcsrchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcsrchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcsrchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wcsrchr, 1, __wcsrchr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wcscmp.c.  */
+@@ -384,6 +608,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, wcscmp,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wcscmp_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcscmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcscmp_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcscmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcscmp_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wcscmp, 1, __wcscmp_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wcsncmp.c.  */
+@@ -391,6 +624,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, wcsncmp,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wcsncmp_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcsncmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcsncmp_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcsncmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcsncmp_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wcsncmp, 1, __wcsncmp_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wcscpy.c.  */
+@@ -402,15 +644,40 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+   /* Support sysdeps/x86_64/multiarch/wcslen.c.  */
+   IFUNC_IMPL (i, name, wcslen,
+ 	      IFUNC_IMPL_ADD (array, i, wcslen,
+-			      HAS_ARCH_FEATURE (AVX2_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
+ 			      __wcslen_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcslen,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcslen_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcslen,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcslen_evex)
++	      IFUNC_IMPL_ADD (array, i, wcslen,
++			      CPU_FEATURE_USABLE (SSE4_1),
++			      __wcslen_sse4_1)
+ 	      IFUNC_IMPL_ADD (array, i, wcslen, 1, __wcslen_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wcsnlen.c.  */
+   IFUNC_IMPL (i, name, wcsnlen,
+ 	      IFUNC_IMPL_ADD (array, i, wcsnlen,
+-			      HAS_ARCH_FEATURE (AVX2_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
+ 			      __wcsnlen_avx2)
++	      IFUNC_IMPL_ADD (array, i, wcsnlen,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (BMI2)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wcsnlen_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wcsnlen,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wcsnlen_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wcsnlen,
+ 			      HAS_CPU_FEATURE (SSE4_1),
+ 			      __wcsnlen_sse4_1)
+@@ -421,6 +688,15 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, wmemchr,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wmemchr_avx2)
++	      IFUNC_IMPL_ADD (array, i, wmemchr,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wmemchr_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, wmemchr,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (BMI2)),
++			      __wmemchr_evex)
+ 	      IFUNC_IMPL_ADD (array, i, wmemchr, 1, __wmemchr_sse2))
+ 
+   /* Support sysdeps/x86_64/multiarch/wmemcmp.c.  */
+@@ -429,6 +705,16 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      (HAS_ARCH_FEATURE (AVX2_Usable)
+ 			       && HAS_CPU_FEATURE (MOVBE)),
+ 			      __wmemcmp_avx2_movbe)
++	      IFUNC_IMPL_ADD (array, i, wmemcmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (MOVBE)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wmemcmp_avx2_movbe_rtm)
++	      IFUNC_IMPL_ADD (array, i, wmemcmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)
++			       && HAS_CPU_FEATURE (MOVBE)),
++			      __wmemcmp_evex_movbe)
+ 	      IFUNC_IMPL_ADD (array, i, wmemcmp, HAS_CPU_FEATURE (SSE4_1),
+ 			      __wmemcmp_sse4_1)
+ 	      IFUNC_IMPL_ADD (array, i, wmemcmp, HAS_CPU_FEATURE (SSSE3),
+@@ -443,7 +729,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wmemset_avx2_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, wmemset,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __wmemset_avx2_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, wmemset,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __wmemset_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, wmemset,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __wmemset_avx512_unaligned))
+ 
+ #ifdef SHARED
+@@ -453,10 +746,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __memcpy_chk_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memcpy_chk_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memcpy_chk_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+@@ -464,6 +757,20 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __memcpy_chk_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memcpy_chk_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memcpy_chk_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memcpy_chk_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memcpy_chk_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __memcpy_chk,
+ 			      HAS_CPU_FEATURE (SSSE3),
+ 			      __memcpy_chk_ssse3_back)
+@@ -486,6 +793,20 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, memcpy,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __memcpy_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, memcpy,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memcpy_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, memcpy,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __memcpy_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, memcpy,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memcpy_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, memcpy,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __memcpy_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy, HAS_CPU_FEATURE (SSSE3),
+ 			      __memcpy_ssse3_back)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy, HAS_CPU_FEATURE (SSSE3),
+@@ -494,10 +815,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __memcpy_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memcpy_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __memcpy_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_sse2_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, memcpy, 1,
+@@ -511,10 +832,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __mempcpy_chk_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __mempcpy_chk_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __mempcpy_chk_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+@@ -522,6 +843,20 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __mempcpy_chk_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __mempcpy_chk_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __mempcpy_chk_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __mempcpy_chk_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __mempcpy_chk_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, __mempcpy_chk,
+ 			      HAS_CPU_FEATURE (SSSE3),
+ 			      __mempcpy_chk_ssse3_back)
+@@ -542,10 +877,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __mempcpy_avx512_no_vzeroupper)
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __mempcpy_avx512_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy,
+-			      HAS_ARCH_FEATURE (AVX512F_Usable),
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
+ 			      __mempcpy_avx512_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+@@ -553,6 +888,20 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy,
+ 			      HAS_ARCH_FEATURE (AVX_Usable),
+ 			      __mempcpy_avx_unaligned_erms)
++	      IFUNC_IMPL_ADD (array, i, mempcpy,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __mempcpy_avx_unaligned_rtm)
++	      IFUNC_IMPL_ADD (array, i, mempcpy,
++			      (HAS_ARCH_FEATURE (AVX_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __mempcpy_avx_unaligned_erms_rtm)
++	      IFUNC_IMPL_ADD (array, i, mempcpy,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __mempcpy_evex_unaligned)
++	      IFUNC_IMPL_ADD (array, i, mempcpy,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __mempcpy_evex_unaligned_erms)
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy, HAS_CPU_FEATURE (SSSE3),
+ 			      __mempcpy_ssse3_back)
+ 	      IFUNC_IMPL_ADD (array, i, mempcpy, HAS_CPU_FEATURE (SSSE3),
+@@ -568,6 +917,14 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, strncmp,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __strncmp_avx2)
++	      IFUNC_IMPL_ADD (array, i, strncmp,
++			      (HAS_ARCH_FEATURE (AVX2_Usable)
++			       && HAS_CPU_FEATURE (RTM)),
++			      __strncmp_avx2_rtm)
++	      IFUNC_IMPL_ADD (array, i, strncmp,
++			      (HAS_ARCH_FEATURE (AVX512VL_Usable)
++			       && HAS_ARCH_FEATURE (AVX512BW_Usable)),
++			      __strncmp_evex)
+ 	      IFUNC_IMPL_ADD (array, i, strncmp, HAS_CPU_FEATURE (SSE4_2),
+ 			      __strncmp_sse42)
+ 	      IFUNC_IMPL_ADD (array, i, strncmp, HAS_CPU_FEATURE (SSSE3),
+@@ -582,6 +939,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
+ 	      IFUNC_IMPL_ADD (array, i, __wmemset_chk,
+ 			      HAS_ARCH_FEATURE (AVX2_Usable),
+ 			      __wmemset_chk_avx2_unaligned)
++	      IFUNC_IMPL_ADD (array, i, __wmemset_chk,
++			      HAS_ARCH_FEATURE (AVX512VL_Usable),
++			      __wmemset_chk_evex_unaligned)
+ 	      IFUNC_IMPL_ADD (array, i, __wmemset_chk,
+ 			      HAS_ARCH_FEATURE (AVX512F_Usable),
+ 			      __wmemset_chk_avx512_unaligned))
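
The ifunc-impl-list.c hunks above only register the new __*_avx2_rtm and __*_evex implementations (and tighten the AVX512 entries to require AVX512VL/AVX512BW) in the list used by the glibc testsuite; as far as I can tell they do not affect dispatch, which is done by the ifunc-*.h selector headers changed below. For readers not familiar with the mechanism, this is a minimal sketch of how a GNU ifunc is wired up, using plain GCC attributes rather than glibc's internal macros; the function names are made up:

  #include <string.h>

  /* Two stand-in implementations; glibc has SSE2/AVX2/EVEX/... here.  */
  static void *my_memcpy_generic (void *d, const void *s, size_t n)
  { return memcpy (d, s, n); }
  static void *my_memcpy_fast (void *d, const void *s, size_t n)
  { return memcpy (d, s, n); }

  /* The resolver runs once, at relocation time, and returns the
     implementation used for the rest of the process.  */
  static void *(*resolve_my_memcpy (void)) (void *, const void *, size_t)
  {
    __builtin_cpu_init ();
    return __builtin_cpu_supports ("avx2") ? my_memcpy_fast
                                           : my_memcpy_generic;
  }

  void *my_memcpy (void *, const void *, size_t)
       __attribute__ ((ifunc ("resolve_my_memcpy")));
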
+diff --git a/sysdeps/x86_64/multiarch/ifunc-memcmp.h b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
+index c14db39cf4..ebbb0c01cf 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-memcmp.h
++++ b/sysdeps/x86_64/multiarch/ifunc-memcmp.h
+@@ -23,17 +23,28 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_movbe_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_movbe) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_CPU_P (cpu_features, MOVBE)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2_movbe);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable))
++	return OPTIMIZE (evex_movbe);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_movbe_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2_movbe);
++    }
+ 
+   if (CPU_FEATURES_CPU_P (cpu_features, SSE4_1))
+     return OPTIMIZE (sse4_1);
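
In the ifunc-memcmp.h selector above, the EVEX variant is preferred when AVX512VL and AVX512BW are usable, then the RTM-safe AVX2 variant when the CPU advertises RTM, and only then the plain AVX2 variant (still gated on Prefer_No_VZEROUPPER). The reason the _rtm variants exist is that the usual AVX2 epilogue executes vzeroupper, which aborts a hardware transaction that happens to be active around the libc call. A rough illustration of that interaction, not a benchmark; build with -mrtm and run on an RTM-capable CPU, and keep in mind transactions can abort for unrelated reasons too:

  #include <immintrin.h>
  #include <stdio.h>
  #include <string.h>

  int main (void)
  {
    char a[256] = { 1 }, b[256] = { 1 };
    unsigned int status = _xbegin ();
    if (status == _XBEGIN_STARTED)
      {
        /* A string function called here runs inside the transaction.
           With the AVX2-only code path its vzeroupper on return could
           trigger an abort; the _rtm/_evex variants avoid that.  */
        volatile int r = memcmp (a, b, sizeof a);
        (void) r;
        _xend ();
        puts ("committed");
      }
    else
      printf ("aborted, status 0x%x\n", status);
    return 0;
  }
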
+diff --git a/sysdeps/x86_64/multiarch/ifunc-memmove.h b/sysdeps/x86_64/multiarch/ifunc-memmove.h
+index 81673d2019..dfc5a28487 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-memmove.h
++++ b/sysdeps/x86_64/multiarch/ifunc-memmove.h
+@@ -29,6 +29,14 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3_back) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms)
+   attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms)
++  attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned)
+   attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms)
+@@ -48,21 +56,42 @@ IFUNC_SELECTOR (void)
+   if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+       && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+     {
+-      if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
+-	return OPTIMIZE (avx512_no_vzeroupper);
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable))
++	{
++	if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx512_unaligned_erms);
+ 
+-      if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
+-	return OPTIMIZE (avx512_unaligned_erms);
++	  return OPTIMIZE (avx512_unaligned);
++	}
+ 
+-      return OPTIMIZE (avx512_unaligned);
++      return OPTIMIZE (avx512_no_vzeroupper);
+     }
+ 
+   if (CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+     {
+-      if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
+-	return OPTIMIZE (avx_unaligned_erms);
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (evex_unaligned_erms);
++
++	  return OPTIMIZE (evex_unaligned);
++	}
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx_unaligned_erms_rtm);
++
++	  return OPTIMIZE (avx_unaligned_rtm);
++	}
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx_unaligned_erms);
+ 
+-      return OPTIMIZE (avx_unaligned);
++	  return OPTIMIZE (avx_unaligned);
++	}
+     }
+ 
+   if (!CPU_FEATURES_CPU_P (cpu_features, SSSE3)
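
The memmove/memcpy selector now has four tiers: AVX512 unaligned (when AVX512VL is usable and Prefer_No_AVX512 is not set), EVEX, RTM-safe AVX2 and plain AVX2, each with an ERMS sub-variant. All of the new gates map to CPUID leaf 7 feature bits; here is a small, self-contained check of those bits on a given machine. Note that glibc's *_Usable flags additionally require XSAVE/OS support, which this sketch ignores:

  #include <cpuid.h>
  #include <stdio.h>

  int main (void)
  {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx))
      return 1;
    printf ("AVX2     %d\n", !!(ebx & (1u << 5)));
    printf ("BMI2     %d\n", !!(ebx & (1u << 8)));
    printf ("RTM      %d\n", !!(ebx & (1u << 11)));
    printf ("AVX512F  %d\n", !!(ebx & (1u << 16)));
    printf ("AVX512BW %d\n", !!(ebx & (1u << 30)));
    printf ("AVX512VL %d\n", !!(ebx & (1u << 31)));
    return 0;
  }
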
+diff --git a/sysdeps/x86_64/multiarch/ifunc-memset.h b/sysdeps/x86_64/multiarch/ifunc-memset.h
+index d690293385..48fdb24b02 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-memset.h
++++ b/sysdeps/x86_64/multiarch/ifunc-memset.h
+@@ -27,6 +27,14 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned_erms)
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned_erms)
+   attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned_rtm)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned_erms_rtm)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms)
++  attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned)
+   attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms)
+@@ -45,21 +53,44 @@ IFUNC_SELECTOR (void)
+   if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+       && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+     {
+-      if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
+-	return OPTIMIZE (avx512_no_vzeroupper);
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx512_unaligned_erms);
+ 
+-      if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
+-	return OPTIMIZE (avx512_unaligned_erms);
++	  return OPTIMIZE (avx512_unaligned);
++	}
+ 
+-      return OPTIMIZE (avx512_unaligned);
++      return OPTIMIZE (avx512_no_vzeroupper);
+     }
+ 
+   if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable))
+     {
+-      if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
+-	return OPTIMIZE (avx2_unaligned_erms);
+-      else
+-	return OPTIMIZE (avx2_unaligned);
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (evex_unaligned_erms);
++
++	  return OPTIMIZE (evex_unaligned);
++	}
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx2_unaligned_erms_rtm);
++
++	  return OPTIMIZE (avx2_unaligned_rtm);
++	}
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	{
++	  if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
++	    return OPTIMIZE (avx2_unaligned_erms);
++
++	  return OPTIMIZE (avx2_unaligned);
++	}
+     }
+ 
+   if (CPU_FEATURES_CPU_P (cpu_features, ERMS))
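
The same restructuring is applied to memset. A word on the _erms suffix that appears throughout: when the CPU reports Enhanced REP MOVSB/STOSB, those variants fall back to the microcoded "rep stosb"/"rep movsb" path for large sizes. In isolation that kernel is roughly the following; a sketch, not glibc's code, which only switches to it above a tuned size threshold:

  #include <stddef.h>

  /* x86-64 GCC/Clang inline asm: store AL into RDI, RCX times.  */
  static void *memset_rep_stosb (void *dst, int c, size_t n)
  {
    void *ret = dst;
    asm volatile ("rep stosb"
                  : "+D" (dst), "+c" (n)
                  : "a" ((unsigned char) c)
                  : "memory");
    return ret;
  }
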
+diff --git a/sysdeps/x86_64/multiarch/ifunc-strcpy.h b/sysdeps/x86_64/multiarch/ifunc-strcpy.h
+index ae4f451803..f38a3b7501 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-strcpy.h
++++ b/sysdeps/x86_64/multiarch/ifunc-strcpy.h
+@@ -25,16 +25,27 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned)
+   attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
+ 
+   if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
+     return OPTIMIZE (sse2_unaligned);
+diff --git a/sysdeps/x86_64/multiarch/ifunc-wcslen.h b/sysdeps/x86_64/multiarch/ifunc-wcslen.h
+new file mode 100644
+index 0000000000..564cc8cbec
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/ifunc-wcslen.h
+@@ -0,0 +1,52 @@
++/* Common definition for ifunc selections for wcslen and wcsnlen
++   All versions must be listed in ifunc-impl-list.c.
++   Copyright (C) 2017-2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <init-arch.h>
++
++extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
++
++static inline void *
++IFUNC_SELECTOR (void)
++{
++  const struct cpu_features* cpu_features = __get_cpu_features ();
++
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++      && CPU_FEATURES_CPU_P (cpu_features, BMI2)
++      && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
++
++  if (CPU_FEATURES_CPU_P (cpu_features, SSE4_1))
++    return OPTIMIZE (sse4_1);
++
++  return OPTIMIZE (sse2);
++}
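
ifunc-wcslen.h is a new header shared by wcslen and wcsnlen; besides adding the EVEX and RTM variants it requires BMI2 for the AVX2 paths and registers the SSE4.1 version. The overflow-prone corner in this function family is length handling: the length argument counts wchar_t elements, and naively scaling a very large maxlen to a byte count can wrap. A trivial way to exercise that corner, which on a fixed glibc simply prints 8:

  #include <stdint.h>
  #include <stdio.h>
  #include <wchar.h>

  int main (void)
  {
    wchar_t s[] = L"bullseye";
    /* maxlen * sizeof (wchar_t) would wrap if computed in bytes.  */
    size_t len = wcsnlen (s, SIZE_MAX);
    printf ("%zu\n", len);
    return 0;
  }
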
+diff --git a/sysdeps/x86_64/multiarch/ifunc-wmemset.h b/sysdeps/x86_64/multiarch/ifunc-wmemset.h
+index 583f6310a1..0ce29a229d 100644
+--- a/sysdeps/x86_64/multiarch/ifunc-wmemset.h
++++ b/sysdeps/x86_64/multiarch/ifunc-wmemset.h
+@@ -20,6 +20,9 @@
+ 
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_unaligned_rtm)
++  attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned) attribute_hidden;
+ 
+ static inline void *
+@@ -27,14 +30,21 @@ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+     {
+-      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+-	  && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+-	return OPTIMIZE (avx512_unaligned);
+-      else
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable))
++	{
++	  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
++	    return OPTIMIZE (avx512_unaligned);
++
++	  return OPTIMIZE (evex_unaligned);
++	}
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_unaligned_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
+ 	return OPTIMIZE (avx2_unaligned);
+     }
+ 
+diff --git a/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..87b076c7c4
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memchr-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef MEMCHR
++# define MEMCHR __memchr_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "memchr-avx2.S"
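
memchr-avx2-rtm.S shows the pattern used for every _rtm flavour: the AVX2 source is reused unchanged, placed in a .text.avx.rtm section, with the return macros overridden so that the function checks via xtest whether a transaction is active and, if so, avoids the transaction-aborting vzeroupper (the upstream macro uses vzeroall on that path instead). In intrinsics form the epilogue idea is roughly the following; needs -mavx -mrtm, sketch only:

  #include <immintrin.h>

  static inline void avx_return_epilogue (void)
  {
    if (_xtest ())
      /* Transaction active: vzeroupper would abort it.  */
      _mm256_zeroall ();
    else
      /* Normal path: cheap transition back to SSE state.  */
      _mm256_zeroupper ();
  }
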
+diff --git a/sysdeps/x86_64/multiarch/memchr-avx2.S b/sysdeps/x86_64/multiarch/memchr-avx2.S
+index e5a9abd211..0987616a1b 100644
+--- a/sysdeps/x86_64/multiarch/memchr-avx2.S
++++ b/sysdeps/x86_64/multiarch/memchr-avx2.S
+@@ -26,319 +26,407 @@
+ 
+ # ifdef USE_AS_WMEMCHR
+ #  define VPCMPEQ	vpcmpeqd
++#  define VPBROADCAST	vpbroadcastd
++#  define CHAR_SIZE	4
+ # else
+ #  define VPCMPEQ	vpcmpeqb
++#  define VPBROADCAST	vpbroadcastb
++#  define CHAR_SIZE	1
++# endif
++
++# ifdef USE_AS_RAWMEMCHR
++#  define ERAW_PTR_REG	ecx
++#  define RRAW_PTR_REG	rcx
++#  define ALGN_PTR_REG	rdi
++# else
++#  define ERAW_PTR_REG	edi
++#  define RRAW_PTR_REG	rdi
++#  define ALGN_PTR_REG	rcx
+ # endif
+ 
+ # ifndef VZEROUPPER
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ # define VEC_SIZE 32
++# define PAGE_SIZE 4096
++# define CHAR_PER_VEC	(VEC_SIZE / CHAR_SIZE)
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (MEMCHR)
+ # ifndef USE_AS_RAWMEMCHR
+ 	/* Check for zero length.  */
++#  ifdef __ILP32__
++	/* Clear upper bits.  */
++	and	%RDX_LP, %RDX_LP
++#  else
+ 	test	%RDX_LP, %RDX_LP
++#  endif
+ 	jz	L(null)
+ # endif
+-	movl	%edi, %ecx
+-	/* Broadcast CHAR to YMM0.  */
++	/* Broadcast CHAR to YMMMATCH.  */
+ 	vmovd	%esi, %xmm0
+-# ifdef USE_AS_WMEMCHR
+-	shl	$2, %RDX_LP
+-	vpbroadcastd %xmm0, %ymm0
+-# else
+-#  ifdef __ILP32__
+-	/* Clear the upper 32 bits.  */
+-	movl	%edx, %edx
+-#  endif
+-	vpbroadcastb %xmm0, %ymm0
+-# endif
++	VPBROADCAST %xmm0, %ymm0
+ 	/* Check if we may cross page boundary with one vector load.  */
+-	andl	$(2 * VEC_SIZE - 1), %ecx
+-	cmpl	$VEC_SIZE, %ecx
+-	ja	L(cros_page_boundary)
++	movl	%edi, %eax
++	andl	$(PAGE_SIZE - 1), %eax
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	ja	L(cross_page_boundary)
+ 
+ 	/* Check the first VEC_SIZE bytes.  */
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
++	VPCMPEQ	(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-
+ # ifndef USE_AS_RAWMEMCHR
+-	jnz	L(first_vec_x0_check)
+-	/* Adjust length and check the end of data.  */
+-	subq	$VEC_SIZE, %rdx
+-	jbe	L(zero)
+-# else
+-	jnz	L(first_vec_x0)
++	/* If length < CHAR_PER_VEC handle special.  */
++	cmpq	$CHAR_PER_VEC, %rdx
++	jbe	L(first_vec_x0)
+ # endif
+-
+-	/* Align data for aligned loads in the loop.  */
+-	addq	$VEC_SIZE, %rdi
+-	andl	$(VEC_SIZE - 1), %ecx
+-	andq	$-VEC_SIZE, %rdi
++	testl	%eax, %eax
++	jz	L(aligned_more)
++	tzcntl	%eax, %eax
++	addq	%rdi, %rax
++	VZEROUPPER_RETURN
+ 
+ # ifndef USE_AS_RAWMEMCHR
+-	/* Adjust length.  */
+-	addq	%rcx, %rdx
++	.p2align 5
++L(first_vec_x0):
++	/* Check if first match was before length.  */
++	tzcntl	%eax, %eax
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %edx
++#  endif
++	xorl	%ecx, %ecx
++	cmpl	%eax, %edx
++	leaq	(%rdi, %rax), %rax
++	cmovle	%rcx, %rax
++	VZEROUPPER_RETURN
+ 
+-	subq	$(VEC_SIZE * 4), %rdx
+-	jbe	L(last_4x_vec_or_less)
++L(null):
++	xorl	%eax, %eax
++	ret
+ # endif
+-	jmp	L(more_4x_vec)
+-
+ 	.p2align 4
+-L(cros_page_boundary):
+-	andl	$(VEC_SIZE - 1), %ecx
+-	andq	$-VEC_SIZE, %rdi
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
++L(cross_page_boundary):
++	/* Save pointer before aligning as its original value is
++	   necessary for computer return address if byte is found or
++	   adjusting length if it is not and this is memchr.  */
++	movq	%rdi, %rcx
++	/* Align data to VEC_SIZE - 1. ALGN_PTR_REG is rcx for memchr
++	   and rdi for rawmemchr.  */
++	orq	$(VEC_SIZE - 1), %ALGN_PTR_REG
++	VPCMPEQ	-(VEC_SIZE - 1)(%ALGN_PTR_REG), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++# ifndef USE_AS_RAWMEMCHR
++	/* Calculate length until end of page (length checked for a
++	   match).  */
++	leaq	1(%ALGN_PTR_REG), %rsi
++	subq	%RRAW_PTR_REG, %rsi
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %esi
++#  endif
++# endif
+ 	/* Remove the leading bytes.  */
+-	sarl	%cl, %eax
+-	testl	%eax, %eax
+-	jz	L(aligned_more)
+-	tzcntl	%eax, %eax
++	sarxl	%ERAW_PTR_REG, %eax, %eax
+ # ifndef USE_AS_RAWMEMCHR
+ 	/* Check the end of data.  */
+-	cmpq	%rax, %rdx
+-	jbe	L(zero)
++	cmpq	%rsi, %rdx
++	jbe	L(first_vec_x0)
+ # endif
++	testl	%eax, %eax
++	jz	L(cross_page_continue)
++	tzcntl	%eax, %eax
++	addq	%RRAW_PTR_REG, %rax
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
++
++	.p2align 4
++L(first_vec_x1):
++	tzcntl	%eax, %eax
++	incq	%rdi
+ 	addq	%rdi, %rax
+-	addq	%rcx, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(aligned_more):
+-# ifndef USE_AS_RAWMEMCHR
+-        /* Calculate "rdx + rcx - VEC_SIZE" with "rdx - (VEC_SIZE - rcx)"
+-	   instead of "(rdx + rcx) - VEC_SIZE" to void possible addition
+-	   overflow.  */
+-	negq	%rcx
+-	addq	$VEC_SIZE, %rcx
++L(first_vec_x2):
++	tzcntl	%eax, %eax
++	addq	$(VEC_SIZE + 1), %rdi
++	addq	%rdi, %rax
++	VZEROUPPER_RETURN
+ 
+-	/* Check the end of data.  */
+-	subq	%rcx, %rdx
+-	jbe	L(zero)
+-# endif
++	.p2align 4
++L(first_vec_x3):
++	tzcntl	%eax, %eax
++	addq	$(VEC_SIZE * 2 + 1), %rdi
++	addq	%rdi, %rax
++	VZEROUPPER_RETURN
+ 
+-	addq	$VEC_SIZE, %rdi
+ 
+-# ifndef USE_AS_RAWMEMCHR
+-	subq	$(VEC_SIZE * 4), %rdx
+-	jbe	L(last_4x_vec_or_less)
+-# endif
++	.p2align 4
++L(first_vec_x4):
++	tzcntl	%eax, %eax
++	addq	$(VEC_SIZE * 3 + 1), %rdi
++	addq	%rdi, %rax
++	VZEROUPPER_RETURN
+ 
+-L(more_4x_vec):
++	.p2align 4
++L(aligned_more):
+ 	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+ 	   since data is only aligned to VEC_SIZE.  */
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
+ 
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
++# ifndef USE_AS_RAWMEMCHR
++L(cross_page_continue):
++	/* Align data to VEC_SIZE - 1.  */
++	xorl	%ecx, %ecx
++	subl	%edi, %ecx
++	orq	$(VEC_SIZE - 1), %rdi
++	/* esi is for adjusting length to see if near the end.  */
++	leal	(VEC_SIZE * 4 + 1)(%rdi, %rcx), %esi
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %esi
++#  endif
++# else
++	orq	$(VEC_SIZE - 1), %rdi
++L(cross_page_continue):
++# endif
++	/* Load first VEC regardless.  */
++	VPCMPEQ	1(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++# ifndef USE_AS_RAWMEMCHR
++	/* Adjust length. If near end handle specially.  */
++	subq	%rsi, %rdx
++	jbe	L(last_4x_vec_or_less)
++# endif
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x1)
+ 
+-	VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x2)
+ 
+-	VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x3)
+ 
+-	addq	$(VEC_SIZE * 4), %rdi
+-
+-# ifndef USE_AS_RAWMEMCHR
+-	subq	$(VEC_SIZE * 4), %rdx
+-	jbe	L(last_4x_vec_or_less)
+-# endif
+-
+-	/* Align data to 4 * VEC_SIZE.  */
+-	movq	%rdi, %rcx
+-	andl	$(4 * VEC_SIZE - 1), %ecx
+-	andq	$-(4 * VEC_SIZE), %rdi
++	VPCMPEQ	(VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x4)
+ 
+ # ifndef USE_AS_RAWMEMCHR
+-	/* Adjust length.  */
++	/* Check if at last VEC_SIZE * 4 length.  */
++	subq	$(CHAR_PER_VEC * 4), %rdx
++	jbe	L(last_4x_vec_or_less_cmpeq)
++	/* Align data to VEC_SIZE * 4 - 1 for the loop and readjust
++	   length.  */
++	incq	%rdi
++	movl	%edi, %ecx
++	orq	$(VEC_SIZE * 4 - 1), %rdi
++	andl	$(VEC_SIZE * 4 - 1), %ecx
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++#  endif
+ 	addq	%rcx, %rdx
++# else
++	/* Align data to VEC_SIZE * 4 - 1 for loop.  */
++	incq	%rdi
++	orq	$(VEC_SIZE * 4 - 1), %rdi
+ # endif
+ 
++	/* Compare 4 * VEC at a time forward.  */
+ 	.p2align 4
+ L(loop_4x_vec):
+-	/* Compare 4 * VEC at a time forward.  */
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm2
+-	VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm3
+-	VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm4
+-
++	VPCMPEQ	1(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm2
++	VPCMPEQ	(VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm3
++	VPCMPEQ	(VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm4
+ 	vpor	%ymm1, %ymm2, %ymm5
+ 	vpor	%ymm3, %ymm4, %ymm6
+ 	vpor	%ymm5, %ymm6, %ymm5
+ 
+-	vpmovmskb %ymm5, %eax
+-	testl	%eax, %eax
+-	jnz	L(4x_vec_end)
+-
+-	addq	$(VEC_SIZE * 4), %rdi
+-
++	vpmovmskb %ymm5, %ecx
+ # ifdef USE_AS_RAWMEMCHR
+-	jmp	L(loop_4x_vec)
++	subq	$-(VEC_SIZE * 4), %rdi
++	testl	%ecx, %ecx
++	jz	L(loop_4x_vec)
+ # else
+-	subq	$(VEC_SIZE * 4), %rdx
+-	ja	L(loop_4x_vec)
++	testl	%ecx, %ecx
++	jnz	L(loop_4x_vec_end)
+ 
+-L(last_4x_vec_or_less):
+-	/* Less than 4 * VEC and aligned to VEC_SIZE.  */
+-	addl	$(VEC_SIZE * 2), %edx
+-	jle	L(last_2x_vec)
++	subq	$-(VEC_SIZE * 4), %rdi
+ 
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
++	subq	$(CHAR_PER_VEC * 4), %rdx
++	ja	L(loop_4x_vec)
+ 
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
++	/* Fall through into less than 4 remaining vectors of length
++	   case.  */
++	VPCMPEQ	(VEC_SIZE * 0 + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++	.p2align 4
++L(last_4x_vec_or_less):
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %edx
++#  endif
++	/* Check if first VEC contained match.  */
+ 	testl	%eax, %eax
+-	jnz	L(first_vec_x1)
++	jnz	L(first_vec_x1_check)
+ 
+-	VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
++	/* If remaining length > VEC_SIZE * 2.  */
++	addl	$(VEC_SIZE * 2), %edx
++	jg	L(last_4x_vec)
+ 
+-	jnz	L(first_vec_x2_check)
+-	subl	$VEC_SIZE, %edx
+-	jle	L(zero)
++L(last_2x_vec):
++	/* If remaining length < VEC_SIZE.  */
++	addl	$VEC_SIZE, %edx
++	jle	L(zero_end)
+ 
+-	VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
++	/* Check VEC2 and compare any match with remaining length.  */
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-
+-	jnz	L(first_vec_x3_check)
+-	xorl	%eax, %eax
+-	VZEROUPPER
+-	ret
++	tzcntl	%eax, %eax
++	cmpl	%eax, %edx
++	jbe	L(set_zero_end)
++	addq	$(VEC_SIZE + 1), %rdi
++	addq	%rdi, %rax
++L(zero_end):
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(last_2x_vec):
+-	addl	$(VEC_SIZE * 2), %edx
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
++L(loop_4x_vec_end):
++# endif
++	/* rawmemchr will fall through into this if match was found in
++	   loop.  */
++
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
++	jnz	L(last_vec_x1_return)
+ 
+-	jnz	L(first_vec_x0_check)
+-	subl	$VEC_SIZE, %edx
+-	jle	L(zero)
+-
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
++	vpmovmskb %ymm2, %eax
+ 	testl	%eax, %eax
+-	jnz	L(first_vec_x1_check)
+-	xorl	%eax, %eax
+-	VZEROUPPER
+-	ret
++	jnz	L(last_vec_x2_return)
+ 
+-	.p2align 4
+-L(first_vec_x0_check):
+-	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rdx
+-	jbe	L(zero)
++	vpmovmskb %ymm3, %eax
++	/* Combine VEC3 matches (eax) with VEC4 matches (ecx).  */
++	salq	$32, %rcx
++	orq	%rcx, %rax
++	tzcntq	%rax, %rax
++# ifdef USE_AS_RAWMEMCHR
++	subq	$(VEC_SIZE * 2 - 1), %rdi
++# else
++	subq	$-(VEC_SIZE * 2 + 1), %rdi
++# endif
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
++# ifndef USE_AS_RAWMEMCHR
+ 
+ 	.p2align 4
+ L(first_vec_x1_check):
+ 	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rdx
+-	jbe	L(zero)
+-	addq	$VEC_SIZE, %rax
++	/* Adjust length.  */
++	subl	$-(VEC_SIZE * 4), %edx
++	/* Check if match within remaining length.  */
++	cmpl	%eax, %edx
++	jbe	L(set_zero_end)
++	incq	%rdi
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
++	.p2align 4
++L(set_zero_end):
++	xorl	%eax, %eax
++	VZEROUPPER_RETURN
++# endif
+ 
+ 	.p2align 4
+-L(first_vec_x2_check):
++L(last_vec_x1_return):
+ 	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rdx
+-	jbe	L(zero)
+-	addq	$(VEC_SIZE * 2), %rax
++# ifdef USE_AS_RAWMEMCHR
++	subq	$(VEC_SIZE * 4 - 1), %rdi
++# else
++	incq	%rdi
++# endif
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(first_vec_x3_check):
++L(last_vec_x2_return):
+ 	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rdx
+-	jbe	L(zero)
+-	addq	$(VEC_SIZE * 3), %rax
++# ifdef USE_AS_RAWMEMCHR
++	subq	$(VEC_SIZE * 3 - 1), %rdi
++# else
++	subq	$-(VEC_SIZE + 1), %rdi
++# endif
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
++# ifndef USE_AS_RAWMEMCHR
+ 	.p2align 4
+-L(zero):
+-	VZEROUPPER
+-L(null):
+-	xorl	%eax, %eax
+-	ret
+-# endif
++L(last_4x_vec_or_less_cmpeq):
++	VPCMPEQ	(VEC_SIZE * 4 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %edx
++#  endif
++	subq	$-(VEC_SIZE * 4), %rdi
++	/* Check first VEC regardless.  */
++	testl	%eax, %eax
++	jnz	L(first_vec_x1_check)
+ 
++	/* If remaining length <= CHAR_PER_VEC * 2.  */
++	addl	$(VEC_SIZE * 2), %edx
++	jle	L(last_2x_vec)
+ 	.p2align 4
+-L(first_vec_x0):
+-	tzcntl	%eax, %eax
+-	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++L(last_4x_vec):
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2_return)
+ 
+-	.p2align 4
+-L(first_vec_x1):
+-	tzcntl	%eax, %eax
+-	addq	$VEC_SIZE, %rax
+-	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VPCMPEQ	(VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
+ 
+-	.p2align 4
+-L(first_vec_x2):
++	/* Create mask for possible matches within remaining length.  */
++	movq	$-1, %rcx
++	bzhiq	%rdx, %rcx, %rcx
++
++	/* Test matches in data against length match.  */
++	andl	%ecx, %eax
++	jnz	L(last_vec_x3)
++
++	/* if remaining length <= VEC_SIZE * 3 (Note this is after
++	   remaining length was found to be > VEC_SIZE * 2.  */
++	subl	$VEC_SIZE, %edx
++	jbe	L(zero_end2)
++
++	VPCMPEQ	(VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	/* Shift remaining length mask for last VEC.  */
++	shrq	$32, %rcx
++	andl	%ecx, %eax
++	jz	L(zero_end2)
+ 	tzcntl	%eax, %eax
+-	addq	$(VEC_SIZE * 2), %rax
++	addq	$(VEC_SIZE * 3 + 1), %rdi
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++L(zero_end2):
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(4x_vec_end):
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
+-	vpmovmskb %ymm2, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x1)
+-	vpmovmskb %ymm3, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x2)
+-	vpmovmskb %ymm4, %eax
+-	testl	%eax, %eax
+-L(first_vec_x3):
++L(last_vec_x3):
+ 	tzcntl	%eax, %eax
+-	addq	$(VEC_SIZE * 3), %rax
++	subq	$-(VEC_SIZE * 2 + 1), %rdi
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
++# endif
+ 
+ END (MEMCHR)
+ #endif
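
Besides adopting the RTM-safe return path, the memchr-avx2.S rewrite changes how wmemchr handles its length: the old prologue converted the element count to bytes up front (shl $2, %RDX_LP), which can wrap for huge counts, while the new code keeps working in wchar_t units and only multiplies by 4 where the value is already known to be small. The failure mode being avoided looks like this:

  #include <stddef.h>
  #include <stdio.h>

  int main (void)
  {
    /* An "effectively unbounded" element count, as callers sometimes pass.  */
    size_t nchars = ((size_t) 1 << 62) + 1;
    size_t nbytes = nchars * sizeof (wchar_t);   /* 4 bytes per wchar_t on Linux */
    printf ("%zu wchar_t -> %zu bytes after wrap-around\n", nchars, nbytes);
    return 0;
  }
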
+diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S
+new file mode 100644
+index 0000000000..f3fdad4fda
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memchr-evex.S
+@@ -0,0 +1,478 @@
++/* memchr/wmemchr optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef MEMCHR
++#  define MEMCHR	__memchr_evex
++# endif
++
++# ifdef USE_AS_WMEMCHR
++#  define VPBROADCAST	vpbroadcastd
++#  define VPMINU	vpminud
++#  define VPCMP	vpcmpd
++#  define VPCMPEQ	vpcmpeqd
++#  define CHAR_SIZE	4
++# else
++#  define VPBROADCAST	vpbroadcastb
++#  define VPMINU	vpminub
++#  define VPCMP	vpcmpb
++#  define VPCMPEQ	vpcmpeqb
++#  define CHAR_SIZE	1
++# endif
++
++# ifdef USE_AS_RAWMEMCHR
++#  define RAW_PTR_REG	rcx
++#  define ALGN_PTR_REG	rdi
++# else
++#  define RAW_PTR_REG	rdi
++#  define ALGN_PTR_REG	rcx
++# endif
++
++# define XMMZERO	xmm23
++# define YMMZERO	ymm23
++# define XMMMATCH	xmm16
++# define YMMMATCH	ymm16
++# define YMM1		ymm17
++# define YMM2		ymm18
++# define YMM3		ymm19
++# define YMM4		ymm20
++# define YMM5		ymm21
++# define YMM6		ymm22
++
++# define VEC_SIZE 32
++# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
++# define PAGE_SIZE 4096
++
++	.section .text.evex,"ax",@progbits
++ENTRY (MEMCHR)
++# ifndef USE_AS_RAWMEMCHR
++	/* Check for zero length.  */
++	test	%RDX_LP, %RDX_LP
++	jz	L(zero)
++
++#  ifdef __ILP32__
++	/* Clear the upper 32 bits.  */
++	movl	%edx, %edx
++#  endif
++# endif
++	/* Broadcast CHAR to YMMMATCH.  */
++	VPBROADCAST %esi, %YMMMATCH
++	/* Check if we may cross page boundary with one vector load.  */
++	movl	%edi, %eax
++	andl	$(PAGE_SIZE - 1), %eax
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	ja	L(cross_page_boundary)
++
++	/* Check the first VEC_SIZE bytes.  */
++	VPCMP	$0, (%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++# ifndef USE_AS_RAWMEMCHR
++	/* If length < CHAR_PER_VEC handle special.  */
++	cmpq	$CHAR_PER_VEC, %rdx
++	jbe	L(first_vec_x0)
++# endif
++	testl	%eax, %eax
++	jz	L(aligned_more)
++	tzcntl	%eax, %eax
++# ifdef USE_AS_WMEMCHR
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(%rdi, %rax, CHAR_SIZE), %rax
++# else
++	addq	%rdi, %rax
++# endif
++	ret
++
++# ifndef USE_AS_RAWMEMCHR
++L(zero):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 5
++L(first_vec_x0):
++	/* Check if first match was before length.  */
++	tzcntl	%eax, %eax
++	xorl	%ecx, %ecx
++	cmpl	%eax, %edx
++	leaq	(%rdi, %rax, CHAR_SIZE), %rax
++	cmovle	%rcx, %rax
++	ret
++# else
++	/* NB: first_vec_x0 is 17 bytes which will leave
++	   cross_page_boundary (which is relatively cold) close enough
++	   to ideal alignment. So only realign L(cross_page_boundary) if
++	   rawmemchr.  */
++	.p2align 4
++# endif
++L(cross_page_boundary):
++	/* Save pointer before aligning as its original value is
++	   necessary for computer return address if byte is found or
++	   adjusting length if it is not and this is memchr.  */
++	movq	%rdi, %rcx
++	/* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi
++	   for rawmemchr.  */
++	andq	$-VEC_SIZE, %ALGN_PTR_REG
++	VPCMP	$0, (%ALGN_PTR_REG), %YMMMATCH, %k0
++	kmovd	%k0, %r8d
++# ifdef USE_AS_WMEMCHR
++	/* NB: Divide shift count by 4 since each bit in K0 represent 4
++	   bytes.  */
++	sarl	$2, %eax
++# endif
++# ifndef USE_AS_RAWMEMCHR
++	movl	$(PAGE_SIZE / CHAR_SIZE), %esi
++	subl	%eax, %esi
++# endif
++# ifdef USE_AS_WMEMCHR
++	andl	$(CHAR_PER_VEC - 1), %eax
++# endif
++	/* Remove the leading bytes.  */
++	sarxl	%eax, %r8d, %eax
++# ifndef USE_AS_RAWMEMCHR
++	/* Check the end of data.  */
++	cmpq	%rsi, %rdx
++	jbe	L(first_vec_x0)
++# endif
++	testl	%eax, %eax
++	jz	L(cross_page_continue)
++	tzcntl	%eax, %eax
++# ifdef USE_AS_WMEMCHR
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(%RAW_PTR_REG, %rax, CHAR_SIZE), %rax
++# else
++	addq	%RAW_PTR_REG, %rax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x1):
++	tzcntl	%eax, %eax
++	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++
++	.p2align 4
++L(first_vec_x2):
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++
++	.p2align 4
++L(first_vec_x3):
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++
++	.p2align 4
++L(first_vec_x4):
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++
++	.p2align 5
++L(aligned_more):
++	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
++	   since data is only aligned to VEC_SIZE.  */
++
++# ifndef USE_AS_RAWMEMCHR
++	/* Align data to VEC_SIZE.  */
++L(cross_page_continue):
++	xorl	%ecx, %ecx
++	subl	%edi, %ecx
++	andq	$-VEC_SIZE, %rdi
++	/* esi is for adjusting length to see if near the end.  */
++	leal	(VEC_SIZE * 5)(%rdi, %rcx), %esi
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %esi
++#  endif
++# else
++	andq	$-VEC_SIZE, %rdi
++L(cross_page_continue):
++# endif
++	/* Load first VEC regardless.  */
++	VPCMP	$0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++# ifndef USE_AS_RAWMEMCHR
++	/* Adjust length. If near end handle specially.  */
++	subq	%rsi, %rdx
++	jbe	L(last_4x_vec_or_less)
++# endif
++	testl	%eax, %eax
++	jnz	L(first_vec_x1)
++
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x2)
++
++	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x3)
++
++	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x4)
++
++
++# ifndef USE_AS_RAWMEMCHR
++	/* Check if at last CHAR_PER_VEC * 4 length.  */
++	subq	$(CHAR_PER_VEC * 4), %rdx
++	jbe	L(last_4x_vec_or_less_cmpeq)
++	addq	$VEC_SIZE, %rdi
++
++	/* Align data to VEC_SIZE * 4 for the loop and readjust length.
++	 */
++#  ifdef USE_AS_WMEMCHR
++	movl	%edi, %ecx
++	andq	$-(4 * VEC_SIZE), %rdi
++	andl	$(VEC_SIZE * 4 - 1), %ecx
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++	addq	%rcx, %rdx
++#  else
++	addq	%rdi, %rdx
++	andq	$-(4 * VEC_SIZE), %rdi
++	subq	%rdi, %rdx
++#  endif
++# else
++	addq	$VEC_SIZE, %rdi
++	andq	$-(4 * VEC_SIZE), %rdi
++# endif
++
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++
++	/* Compare 4 * VEC at a time forward.  */
++	.p2align 4
++L(loop_4x_vec):
++	/* It would be possible to save some instructions using 4x VPCMP,
++	   but the bottleneck on port 5 makes it not worth it.  */
++	VPCMP	$4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1
++	/* xor will set bytes matching esi to zero.  */
++	vpxorq	(VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2
++	vpxorq	(VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3
++	VPCMP	$0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3
++	/* Reduce VEC2 / VEC3 with min and VEC1 with zero mask.  */
++	VPMINU	%YMM2, %YMM3, %YMM3{%k1}{z}
++	VPCMP	$0, %YMM3, %YMMZERO, %k2
++# ifdef USE_AS_RAWMEMCHR
++	subq	$-(VEC_SIZE * 4), %rdi
++	kortestd %k2, %k3
++	jz	L(loop_4x_vec)
++# else
++	kortestd %k2, %k3
++	jnz	L(loop_4x_vec_end)
++
++	subq	$-(VEC_SIZE * 4), %rdi
++
++	subq	$(CHAR_PER_VEC * 4), %rdx
++	ja	L(loop_4x_vec)
++
++	/* Fall through into the case of fewer than 4 vectors of length
++	   remaining.  */
++	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	addq	$(VEC_SIZE * 3), %rdi
++	.p2align 4
++L(last_4x_vec_or_less):
++	/* Check if first VEC contained match.  */
++	testl	%eax, %eax
++	jnz	L(first_vec_x1_check)
++
++	/* If remaining length > CHAR_PER_VEC * 2.  */
++	addl	$(CHAR_PER_VEC * 2), %edx
++	jg	L(last_4x_vec)
++
++L(last_2x_vec):
++	/* If remaining length < CHAR_PER_VEC.  */
++	addl	$CHAR_PER_VEC, %edx
++	jle	L(zero_end)
++
++	/* Check VEC2 and compare any match with remaining length.  */
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++	cmpl	%eax, %edx
++	jbe	L(set_zero_end)
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
++L(zero_end):
++	ret
++
++
++	.p2align 4
++L(first_vec_x1_check):
++	tzcntl	%eax, %eax
++	/* Adjust length.  */
++	subl	$-(CHAR_PER_VEC * 4), %edx
++	/* Check if match within remaining length.  */
++	cmpl	%eax, %edx
++	jbe	L(set_zero_end)
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++L(set_zero_end):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(loop_4x_vec_end):
++# endif
++	/* rawmemchr will fall through into this if a match was found in
++	   the loop.  */
++
++	/* k1 has the NOT of the matches with VEC1.  */
++	kmovd	%k1, %eax
++# ifdef USE_AS_WMEMCHR
++	subl	$((1 << CHAR_PER_VEC) - 1), %eax
++# else
++	incl	%eax
++# endif
++	jnz	L(last_vec_x1_return)
++
++	VPCMP	$0, %YMM2, %YMMZERO, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2_return)
++
++	kmovd	%k2, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3_return)
++
++	kmovd	%k3, %eax
++	tzcntl	%eax, %eax
++# ifdef USE_AS_RAWMEMCHR
++	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
++# else
++	leaq	(VEC_SIZE * 7)(%rdi, %rax, CHAR_SIZE), %rax
++# endif
++	ret
++
++	.p2align 4
++L(last_vec_x1_return):
++	tzcntl	%eax, %eax
++# ifdef USE_AS_RAWMEMCHR
++#  ifdef USE_AS_WMEMCHR
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(%rdi, %rax, CHAR_SIZE), %rax
++#  else
++	addq	%rdi, %rax
++#  endif
++# else
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
++# endif
++	ret
++
++	.p2align 4
++L(last_vec_x2_return):
++	tzcntl	%eax, %eax
++# ifdef USE_AS_RAWMEMCHR
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax
++# else
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(VEC_SIZE * 5)(%rdi, %rax, CHAR_SIZE), %rax
++# endif
++	ret
++
++	.p2align 4
++L(last_vec_x3_return):
++	tzcntl	%eax, %eax
++# ifdef USE_AS_RAWMEMCHR
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
++# else
++	/* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count.  */
++	leaq	(VEC_SIZE * 6)(%rdi, %rax, CHAR_SIZE), %rax
++# endif
++	ret
++
++
++# ifndef USE_AS_RAWMEMCHR
++L(last_4x_vec_or_less_cmpeq):
++	VPCMP	$0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	subq	$-(VEC_SIZE * 4), %rdi
++	/* Check first VEC regardless.  */
++	testl	%eax, %eax
++	jnz	L(first_vec_x1_check)
++
++	/* If remaining length <= CHAR_PER_VEC * 2.  */
++	addl	$(CHAR_PER_VEC * 2), %edx
++	jle	L(last_2x_vec)
++
++	.p2align 4
++L(last_4x_vec):
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++
++	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	/* Create mask for possible matches within remaining length.  */
++#  ifdef USE_AS_WMEMCHR
++	movl	$((1 << (CHAR_PER_VEC * 2)) - 1), %ecx
++	bzhil	%edx, %ecx, %ecx
++#  else
++	movq	$-1, %rcx
++	bzhiq	%rdx, %rcx, %rcx
++#  endif
++	/* Test matches in the data against the length mask.  */
++	andl	%ecx, %eax
++	jnz	L(last_vec_x3)
++
++	/* If remaining length <= CHAR_PER_VEC * 3 (note this is after
++	   the remaining length was found to be > CHAR_PER_VEC * 2).  */
++	subl	$CHAR_PER_VEC, %edx
++	jbe	L(zero_end2)
++
++
++	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0
++	kmovd	%k0, %eax
++	/* Shift remaining length mask for last VEC.  */
++#  ifdef USE_AS_WMEMCHR
++	shrl	$CHAR_PER_VEC, %ecx
++#  else
++	shrq	$CHAR_PER_VEC, %rcx
++#  endif
++	andl	%ecx, %eax
++	jz	L(zero_end2)
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax
++L(zero_end2):
++	ret
++
++L(last_vec_x2):
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++
++	.p2align 4
++L(last_vec_x3):
++	tzcntl	%eax, %eax
++	leaq	(VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax
++	ret
++# endif
++
++END (MEMCHR)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe-rtm.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe-rtm.S
+new file mode 100644
+index 0000000000..cf4eff5d4a
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef MEMCMP
++# define MEMCMP __memcmp_avx2_movbe_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "memcmp-avx2-movbe.S"
+diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+index 67fc575b59..87f9478eaf 100644
+--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
++++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+@@ -47,6 +47,10 @@
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ # define VEC_SIZE 32
+ # define VEC_MASK ((1 << VEC_SIZE) - 1)
+ 
+@@ -55,7 +59,7 @@
+            memcmp has to use UNSIGNED comparison for elemnts.
+ */
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (MEMCMP)
+ # ifdef USE_AS_WMEMCMP
+ 	shl	$2, %RDX_LP
+@@ -123,8 +127,8 @@ ENTRY (MEMCMP)
+ 	vptest	%ymm0, %ymm5
+ 	jnc	L(4x_vec_end)
+ 	xorl	%eax, %eax
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(last_2x_vec):
+@@ -144,8 +148,7 @@ L(last_vec):
+ 	vpmovmskb %ymm2, %eax
+ 	subl    $VEC_MASK, %eax
+ 	jnz	L(first_vec)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(first_vec):
+@@ -164,8 +167,7 @@ L(wmemcmp_return):
+ 	movzbl	(%rsi, %rcx), %edx
+ 	sub	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ # ifdef USE_AS_WMEMCMP
+ 	.p2align 4
+@@ -367,8 +369,7 @@ L(last_4x_vec):
+ 	vpmovmskb %ymm2, %eax
+ 	subl    $VEC_MASK, %eax
+ 	jnz	L(first_vec)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(4x_vec_end):
+@@ -394,8 +395,7 @@ L(4x_vec_end):
+ 	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %edx
+ 	sub	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(first_vec_x1):
+@@ -410,8 +410,7 @@ L(first_vec_x1):
+ 	movzbl	VEC_SIZE(%rsi, %rcx), %edx
+ 	sub	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(first_vec_x2):
+@@ -426,7 +425,6 @@ L(first_vec_x2):
+ 	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
+ 	sub	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ END (MEMCMP)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/memcmp-evex-movbe.S b/sysdeps/x86_64/multiarch/memcmp-evex-movbe.S
+new file mode 100644
+index 0000000000..9c093972e1
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memcmp-evex-movbe.S
+@@ -0,0 +1,440 @@
++/* memcmp/wmemcmp optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++/* memcmp/wmemcmp is implemented as:
++   1. For size from 2 to 7 bytes, load as big endian with movbe and bswap
++      to avoid branches.
++   2. Use overlapping compare to avoid branch.
++   3. Use vector compare when size >= 4 bytes for memcmp or size >= 8
++      bytes for wmemcmp.
++   4. If size is 8 * VEC_SIZE or less, unroll the loop.
++   5. Compare 4 * VEC_SIZE at a time with the aligned first memory
++      area.
++   6. Use 2 vector compares when size is 2 * VEC_SIZE or less.
++   7. Use 4 vector compares when size is 4 * VEC_SIZE or less.
++   8. Use 8 vector compares when size is 8 * VEC_SIZE or less.  */
++
++# include <sysdep.h>
++
++# ifndef MEMCMP
++#  define MEMCMP	__memcmp_evex_movbe
++# endif
++
++# define VMOVU		vmovdqu64
++
++# ifdef USE_AS_WMEMCMP
++#  define VPCMPEQ	vpcmpeqd
++# else
++#  define VPCMPEQ	vpcmpeqb
++# endif
++
++# define XMM1		xmm17
++# define XMM2		xmm18
++# define YMM1		ymm17
++# define YMM2		ymm18
++# define YMM3		ymm19
++# define YMM4		ymm20
++# define YMM5		ymm21
++# define YMM6		ymm22
++
++# define VEC_SIZE 32
++# ifdef USE_AS_WMEMCMP
++#  define VEC_MASK 0xff
++#  define XMM_MASK 0xf
++# else
++#  define VEC_MASK 0xffffffff
++#  define XMM_MASK 0xffff
++# endif
++
++/* Warning!
++           wmemcmp has to use SIGNED comparison for elements.
++           memcmp has to use UNSIGNED comparison for elements.
++*/
++
++	.section .text.evex,"ax",@progbits
++ENTRY (MEMCMP)
++# ifdef USE_AS_WMEMCMP
++	shl	$2, %RDX_LP
++# elif defined __ILP32__
++	/* Clear the upper 32 bits.  */
++	movl	%edx, %edx
++# endif
++	cmp	$VEC_SIZE, %RDX_LP
++	jb	L(less_vec)
++
++	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k1
++	kmovd	%k1, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++	cmpq	$(VEC_SIZE * 2), %rdx
++	jbe	L(last_vec)
++
++	/* More than 2 * VEC.  */
++	cmpq	$(VEC_SIZE * 8), %rdx
++	ja	L(more_8x_vec)
++	cmpq	$(VEC_SIZE * 4), %rdx
++	jb	L(last_4x_vec)
++
++	/* From 4 * VEC to 8 * VEC, inclusively. */
++	VMOVU	(%rsi), %YMM1
++	VPCMPEQ (%rdi), %YMM1, %k1
++
++	VMOVU	VEC_SIZE(%rsi), %YMM2
++	VPCMPEQ VEC_SIZE(%rdi), %YMM2, %k2
++
++	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM3
++	VPCMPEQ (VEC_SIZE * 2)(%rdi), %YMM3, %k3
++
++	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM4
++	VPCMPEQ (VEC_SIZE * 3)(%rdi), %YMM4, %k4
++
++	kandd	%k1, %k2, %k5
++	kandd	%k3, %k4, %k6
++	kandd	%k5, %k6, %k6
++
++	kmovd	%k6, %eax
++	cmpl	$VEC_MASK, %eax
++	jne	L(4x_vec_end)
++
++	leaq	-(4 * VEC_SIZE)(%rdi, %rdx), %rdi
++	leaq	-(4 * VEC_SIZE)(%rsi, %rdx), %rsi
++	VMOVU	(%rsi), %YMM1
++	VPCMPEQ (%rdi), %YMM1, %k1
++
++	VMOVU	VEC_SIZE(%rsi), %YMM2
++	VPCMPEQ VEC_SIZE(%rdi), %YMM2, %k2
++	kandd	%k1, %k2, %k5
++
++	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM3
++	VPCMPEQ (VEC_SIZE * 2)(%rdi), %YMM3, %k3
++	kandd	%k3, %k5, %k5
++
++	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM4
++	VPCMPEQ (VEC_SIZE * 3)(%rdi), %YMM4, %k4
++	kandd	%k4, %k5, %k5
++
++	kmovd	%k5, %eax
++	cmpl	$VEC_MASK, %eax
++	jne	L(4x_vec_end)
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(last_2x_vec):
++	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++L(last_vec):
++	/* Use overlapping loads to avoid branches.  */
++	leaq	-VEC_SIZE(%rdi, %rdx), %rdi
++	leaq	-VEC_SIZE(%rsi, %rdx), %rsi
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++	ret
++
++	.p2align 4
++L(first_vec):
++	/* A byte or int32 is different within 16 or 32 bytes.  */
++	tzcntl	%eax, %ecx
++# ifdef USE_AS_WMEMCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rcx, 4), %edx
++	cmpl	(%rsi, %rcx, 4), %edx
++L(wmemcmp_return):
++	setl	%al
++	negl	%eax
++	orl	$1, %eax
++# else
++	movzbl	(%rdi, %rcx), %eax
++	movzbl	(%rsi, %rcx), %edx
++	sub	%edx, %eax
++# endif
++	ret
++
++# ifdef USE_AS_WMEMCMP
++	.p2align 4
++L(4):
++	xorl	%eax, %eax
++	movl	(%rdi), %edx
++	cmpl	(%rsi), %edx
++	jne	L(wmemcmp_return)
++	ret
++# else
++	.p2align 4
++L(between_4_7):
++	/* Load as big endian with overlapping movbe to avoid branches.  */
++	movbe	(%rdi), %eax
++	movbe	(%rsi), %ecx
++	shlq	$32, %rax
++	shlq	$32, %rcx
++	movbe	-4(%rdi, %rdx), %edi
++	movbe	-4(%rsi, %rdx), %esi
++	orq	%rdi, %rax
++	orq	%rsi, %rcx
++	subq	%rcx, %rax
++	je	L(exit)
++	sbbl	%eax, %eax
++	orl	$1, %eax
++	ret
++
++	.p2align 4
++L(exit):
++	ret
++
++	.p2align 4
++L(between_2_3):
++	/* Load as big endian to avoid branches.  */
++	movzwl	(%rdi), %eax
++	movzwl	(%rsi), %ecx
++	shll	$8, %eax
++	shll	$8, %ecx
++	bswap	%eax
++	bswap	%ecx
++	movb	-1(%rdi, %rdx), %al
++	movb	-1(%rsi, %rdx), %cl
++	/* Subtraction is okay because the upper 8 bits are zero.  */
++	subl	%ecx, %eax
++	ret
++
++	.p2align 4
++L(1):
++	movzbl	(%rdi), %eax
++	movzbl	(%rsi), %ecx
++	subl	%ecx, %eax
++	ret
++# endif
++
++	.p2align 4
++L(zero):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(less_vec):
++# ifdef USE_AS_WMEMCMP
++	/* It can only be 0, 4, 8, 12, 16, 20, 24, 28 bytes.  */
++	cmpb	$4, %dl
++	je	L(4)
++	jb	L(zero)
++# else
++	cmpb	$1, %dl
++	je	L(1)
++	jb	L(zero)
++	cmpb	$4, %dl
++	jb	L(between_2_3)
++	cmpb	$8, %dl
++	jb	L(between_4_7)
++# endif
++	cmpb	$16, %dl
++	jae	L(between_16_31)
++	/* It is between 8 and 15 bytes.  */
++	vmovq	(%rdi), %XMM1
++	vmovq	(%rsi), %XMM2
++	VPCMPEQ %XMM1, %XMM2, %k2
++	kmovw	%k2, %eax
++	subl    $XMM_MASK, %eax
++	jnz	L(first_vec)
++	/* Use overlapping loads to avoid branches.  */
++	leaq	-8(%rdi, %rdx), %rdi
++	leaq	-8(%rsi, %rdx), %rsi
++	vmovq	(%rdi), %XMM1
++	vmovq	(%rsi), %XMM2
++	VPCMPEQ %XMM1, %XMM2, %k2
++	kmovw	%k2, %eax
++	subl    $XMM_MASK, %eax
++	jnz	L(first_vec)
++	ret
++
++	.p2align 4
++L(between_16_31):
++	/* From 16 to 31 bytes.  No branch when size == 16.  */
++	VMOVU	(%rsi), %XMM2
++	VPCMPEQ (%rdi), %XMM2, %k2
++	kmovw	%k2, %eax
++	subl    $XMM_MASK, %eax
++	jnz	L(first_vec)
++
++	/* Use overlapping loads to avoid branches.  */
++	leaq	-16(%rdi, %rdx), %rdi
++	leaq	-16(%rsi, %rdx), %rsi
++	VMOVU	(%rsi), %XMM2
++	VPCMPEQ (%rdi), %XMM2, %k2
++	kmovw	%k2, %eax
++	subl    $XMM_MASK, %eax
++	jnz	L(first_vec)
++	ret
++
++	.p2align 4
++L(more_8x_vec):
++	/* More than 8 * VEC.  Check the first VEC.  */
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++	/* Align the first memory area for aligned loads in the loop.
++	   Compute how much the first memory area is misaligned.  */
++	movq	%rdi, %rcx
++	andl	$(VEC_SIZE - 1), %ecx
++	/* Get the negative of offset for alignment.  */
++	subq	$VEC_SIZE, %rcx
++	/* Adjust the second memory area.  */
++	subq	%rcx, %rsi
++	/* Adjust the first memory area which should be aligned now.  */
++	subq	%rcx, %rdi
++	/* Adjust length.  */
++	addq	%rcx, %rdx
++
++L(loop_4x_vec):
++	/* Compare 4 * VEC at a time forward.  */
++	VMOVU	(%rsi), %YMM1
++	VPCMPEQ (%rdi), %YMM1, %k1
++
++	VMOVU	VEC_SIZE(%rsi), %YMM2
++	VPCMPEQ VEC_SIZE(%rdi), %YMM2, %k2
++	kandd	%k2, %k1, %k5
++
++	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM3
++	VPCMPEQ (VEC_SIZE * 2)(%rdi), %YMM3, %k3
++	kandd	%k3, %k5, %k5
++
++	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM4
++	VPCMPEQ (VEC_SIZE * 3)(%rdi), %YMM4, %k4
++	kandd	%k4, %k5, %k5
++
++	kmovd	%k5, %eax
++	cmpl	$VEC_MASK, %eax
++	jne	L(4x_vec_end)
++
++	addq	$(VEC_SIZE * 4), %rdi
++	addq	$(VEC_SIZE * 4), %rsi
++
++	subq	$(VEC_SIZE * 4), %rdx
++	cmpq	$(VEC_SIZE * 4), %rdx
++	jae	L(loop_4x_vec)
++
++	/* Less than 4 * VEC.  */
++	cmpq	$VEC_SIZE, %rdx
++	jbe	L(last_vec)
++	cmpq	$(VEC_SIZE * 2), %rdx
++	jbe	L(last_2x_vec)
++
++L(last_4x_vec):
++	/* From 2 * VEC to 4 * VEC. */
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++	addq	$VEC_SIZE, %rdi
++	addq	$VEC_SIZE, %rsi
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++	/* Use overlapping loads to avoid branches.  */
++	leaq	-(3 * VEC_SIZE)(%rdi, %rdx), %rdi
++	leaq	-(3 * VEC_SIZE)(%rsi, %rdx), %rsi
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++
++	addq	$VEC_SIZE, %rdi
++	addq	$VEC_SIZE, %rsi
++	VMOVU	(%rsi), %YMM2
++	VPCMPEQ (%rdi), %YMM2, %k2
++	kmovd	%k2, %eax
++	subl    $VEC_MASK, %eax
++	jnz	L(first_vec)
++	ret
++
++	.p2align 4
++L(4x_vec_end):
++	kmovd	%k1, %eax
++	subl	$VEC_MASK, %eax
++	jnz	L(first_vec)
++	kmovd	%k2, %eax
++	subl	$VEC_MASK, %eax
++	jnz	L(first_vec_x1)
++	kmovd	%k3, %eax
++	subl	$VEC_MASK, %eax
++	jnz	L(first_vec_x2)
++	kmovd	%k4, %eax
++	subl	$VEC_MASK, %eax
++	tzcntl	%eax, %ecx
++# ifdef USE_AS_WMEMCMP
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 3)(%rdi, %rcx, 4), %edx
++	cmpl	(VEC_SIZE * 3)(%rsi, %rcx, 4), %edx
++	jmp	L(wmemcmp_return)
++# else
++	movzbl	(VEC_SIZE * 3)(%rdi, %rcx), %eax
++	movzbl	(VEC_SIZE * 3)(%rsi, %rcx), %edx
++	sub	%edx, %eax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x1):
++	tzcntl	%eax, %ecx
++# ifdef USE_AS_WMEMCMP
++	xorl	%eax, %eax
++	movl	VEC_SIZE(%rdi, %rcx, 4), %edx
++	cmpl	VEC_SIZE(%rsi, %rcx, 4), %edx
++	jmp	L(wmemcmp_return)
++# else
++	movzbl	VEC_SIZE(%rdi, %rcx), %eax
++	movzbl	VEC_SIZE(%rsi, %rcx), %edx
++	sub	%edx, %eax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x2):
++	tzcntl	%eax, %ecx
++# ifdef USE_AS_WMEMCMP
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 2)(%rdi, %rcx, 4), %edx
++	cmpl	(VEC_SIZE * 2)(%rsi, %rcx, 4), %edx
++	jmp	L(wmemcmp_return)
++# else
++	movzbl	(VEC_SIZE * 2)(%rdi, %rcx), %eax
++	movzbl	(VEC_SIZE * 2)(%rsi, %rcx), %edx
++	sub	%edx, %eax
++# endif
++	ret
++END (MEMCMP)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
+new file mode 100644
+index 0000000000..1ec1962e86
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms-rtm.S
+@@ -0,0 +1,17 @@
++#if IS_IN (libc)
++# define VEC_SIZE	32
++# define VEC(i)		ymm##i
++# define VMOVNT		vmovntdq
++# define VMOVU		vmovdqu
++# define VMOVA		vmovdqa
++
++# define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++# define VZEROUPPER_RETURN jmp	 L(return)
++
++# define SECTION(p)		p##.avx.rtm
++# define MEMMOVE_SYMBOL(p,s)	p##_avx_##s##_rtm
++
++# include "memmove-vec-unaligned-erms.S"
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
+index aac1515cf6..7dad1ad74c 100644
+--- a/sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
++++ b/sysdeps/x86_64/multiarch/memmove-avx512-unaligned-erms.S
+@@ -1,11 +1,25 @@
+ #if IS_IN (libc)
+ # define VEC_SIZE	64
+-# define VEC(i)		zmm##i
++# define XMM0		xmm16
++# define XMM1		xmm17
++# define YMM0		ymm16
++# define YMM1		ymm17
++# define VEC0		zmm16
++# define VEC1		zmm17
++# define VEC2		zmm18
++# define VEC3		zmm19
++# define VEC4		zmm20
++# define VEC5		zmm21
++# define VEC6		zmm22
++# define VEC7		zmm23
++# define VEC8		zmm24
++# define VEC(i)		VEC##i
+ # define VMOVNT		vmovntdq
+ # define VMOVU		vmovdqu64
+ # define VMOVA		vmovdqa64
++# define VZEROUPPER
+ 
+-# define SECTION(p)		p##.avx512
++# define SECTION(p)		p##.evex512
+ # define MEMMOVE_SYMBOL(p,s)	p##_avx512_##s
+ 
+ # include "memmove-vec-unaligned-erms.S"
+diff --git a/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms.S
+new file mode 100644
+index 0000000000..b879007e89
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memmove-evex-unaligned-erms.S
+@@ -0,0 +1,26 @@
++#if IS_IN (libc)
++# define VEC_SIZE	32
++# define XMM0		xmm16
++# define XMM1		xmm17
++# define YMM0		ymm16
++# define YMM1		ymm17
++# define VEC0		ymm16
++# define VEC1		ymm17
++# define VEC2		ymm18
++# define VEC3		ymm19
++# define VEC4		ymm20
++# define VEC5		ymm21
++# define VEC6		ymm22
++# define VEC7		ymm23
++# define VEC8		ymm24
++# define VEC(i)		VEC##i
++# define VMOVNT		vmovntdq
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++# define VZEROUPPER
++
++# define SECTION(p)		p##.evex
++# define MEMMOVE_SYMBOL(p,s)	p##_evex_##s
++
++# include "memmove-vec-unaligned-erms.S"
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+index c763b7d871..d13d23d6ce 100644
+--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
++++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+@@ -48,6 +48,14 @@
+ # define MEMMOVE_CHK_SYMBOL(p,s)	MEMMOVE_SYMBOL(p, s)
+ #endif
+ 
++#ifndef XMM0
++# define XMM0				xmm0
++#endif
++
++#ifndef YMM0
++# define YMM0				ymm0
++#endif
++
+ #ifndef VZEROUPPER
+ # if VEC_SIZE > 16
+ #  define VZEROUPPER vzeroupper
+@@ -67,6 +75,13 @@
+ # define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
+ #endif
+ 
++/* Avoid short distance rep movsb only with non-SSE vector.  */
++#ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
++# define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
++#else
++# define AVOID_SHORT_DISTANCE_REP_MOVSB 0
++#endif
++
+ #ifndef PREFETCH
+ # define PREFETCH(addr) prefetcht0 addr
+ #endif
+@@ -143,11 +158,12 @@ L(last_2x_vec):
+ 	VMOVU	-VEC_SIZE(%rsi,%rdx), %VEC(1)
+ 	VMOVU	%VEC(0), (%rdi)
+ 	VMOVU	%VEC(1), -VEC_SIZE(%rdi,%rdx)
+-	VZEROUPPER
+ #if !defined USE_MULTIARCH || !IS_IN (libc)
+ L(nop):
+-#endif
+ 	ret
++#else
++	VZEROUPPER_RETURN
++#endif
+ #if defined USE_MULTIARCH && IS_IN (libc)
+ END (MEMMOVE_SYMBOL (__memmove, unaligned))
+ 
+@@ -240,11 +256,14 @@ L(last_2x_vec):
+ 	VMOVU	%VEC(0), (%rdi)
+ 	VMOVU	%VEC(1), -VEC_SIZE(%rdi,%rdx)
+ L(return):
+-	VZEROUPPER
++#if VEC_SIZE > 16
++	ZERO_UPPER_VEC_REGISTERS_RETURN
++#else
+ 	ret
++#endif
+ 
+ L(movsb):
+-	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
++	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ 	jae	L(more_8x_vec)
+ 	cmpq	%rsi, %rdi
+ 	jb	1f
+@@ -257,7 +276,21 @@ L(movsb):
+ #  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
+ # endif
+ 	jb	L(more_8x_vec_backward)
++# if AVOID_SHORT_DISTANCE_REP_MOVSB
++	movq	%rdi, %rcx
++	subq	%rsi, %rcx
++	jmp	2f
++# endif
+ 1:
++# if AVOID_SHORT_DISTANCE_REP_MOVSB
++	movq	%rsi, %rcx
++	subq	%rdi, %rcx
++2:
++/* Avoid "rep movsb" if RCX, the distance between source and destination,
++   is N*4GB + [1..63] with N >= 0.  */
++	cmpl	$63, %ecx
++	jbe	L(more_2x_vec)	/* Avoid "rep movsb" if ECX <= 63.  */
++# endif
+ 	mov	%RDX_LP, %RCX_LP
+ 	rep movsb
+ L(nop):
+@@ -291,21 +324,20 @@ L(less_vec):
+ #if VEC_SIZE > 32
+ L(between_32_63):
+ 	/* From 32 to 63.  No branch when size == 32.  */
+-	vmovdqu	(%rsi), %ymm0
+-	vmovdqu	-32(%rsi,%rdx), %ymm1
+-	vmovdqu	%ymm0, (%rdi)
+-	vmovdqu	%ymm1, -32(%rdi,%rdx)
+-	VZEROUPPER
+-	ret
++	VMOVU	(%rsi), %YMM0
++	VMOVU	-32(%rsi,%rdx), %YMM1
++	VMOVU	%YMM0, (%rdi)
++	VMOVU	%YMM1, -32(%rdi,%rdx)
++	VZEROUPPER_RETURN
+ #endif
+ #if VEC_SIZE > 16
+ 	/* From 16 to 31.  No branch when size == 16.  */
+ L(between_16_31):
+-	vmovdqu	(%rsi), %xmm0
+-	vmovdqu	-16(%rsi,%rdx), %xmm1
+-	vmovdqu	%xmm0, (%rdi)
+-	vmovdqu	%xmm1, -16(%rdi,%rdx)
+-	ret
++	VMOVU	(%rsi), %XMM0
++	VMOVU	-16(%rsi,%rdx), %XMM1
++	VMOVU	%XMM0, (%rdi)
++	VMOVU	%XMM1, -16(%rdi,%rdx)
++	VZEROUPPER_RETURN
+ #endif
+ L(between_8_15):
+ 	/* From 8 to 15.  No branch when size == 8.  */
+@@ -358,8 +390,7 @@ L(more_2x_vec):
+ 	VMOVU	%VEC(5), -(VEC_SIZE * 2)(%rdi,%rdx)
+ 	VMOVU	%VEC(6), -(VEC_SIZE * 3)(%rdi,%rdx)
+ 	VMOVU	%VEC(7), -(VEC_SIZE * 4)(%rdi,%rdx)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ L(last_4x_vec):
+ 	/* Copy from 2 * VEC to 4 * VEC. */
+ 	VMOVU	(%rsi), %VEC(0)
+@@ -370,8 +401,7 @@ L(last_4x_vec):
+ 	VMOVU	%VEC(1), VEC_SIZE(%rdi)
+ 	VMOVU	%VEC(2), -VEC_SIZE(%rdi,%rdx)
+ 	VMOVU	%VEC(3), -(VEC_SIZE * 2)(%rdi,%rdx)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ L(more_8x_vec):
+ 	cmpq	%rsi, %rdi
+@@ -402,7 +432,7 @@ L(more_8x_vec):
+ 	addq	%r8, %rdx
+ #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+ 	/* Check non-temporal store threshold.  */
+-	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
++	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ 	ja	L(large_forward)
+ #endif
+ L(loop_4x_vec_forward):
+@@ -427,8 +457,7 @@ L(loop_4x_vec_forward):
+ 	VMOVU	%VEC(8), -(VEC_SIZE * 3)(%rcx)
+ 	/* Store the first VEC.  */
+ 	VMOVU	%VEC(4), (%r11)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ L(more_8x_vec_backward):
+ 	/* Load the first 4 * VEC and last VEC to support overlapping
+@@ -454,7 +483,7 @@ L(more_8x_vec_backward):
+ 	subq	%r8, %rdx
+ #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+ 	/* Check non-temporal store threshold.  */
+-	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
++	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+ 	ja	L(large_backward)
+ #endif
+ L(loop_4x_vec_backward):
+@@ -479,8 +508,7 @@ L(loop_4x_vec_backward):
+ 	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
+ 	/* Store the last VEC.  */
+ 	VMOVU	%VEC(8), (%r11)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
+ L(large_forward):
+@@ -515,8 +543,7 @@ L(loop_large_forward):
+ 	VMOVU	%VEC(8), -(VEC_SIZE * 3)(%rcx)
+ 	/* Store the first VEC.  */
+ 	VMOVU	%VEC(4), (%r11)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ L(large_backward):
+ 	/* Don't use non-temporal store if there is overlap between
+@@ -550,8 +577,7 @@ L(loop_large_backward):
+ 	VMOVU	%VEC(7), (VEC_SIZE * 3)(%rdi)
+ 	/* Store the last VEC.  */
+ 	VMOVU	%VEC(8), (%r11)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ #endif
+ END (MEMMOVE_SYMBOL (__memmove, unaligned_erms))
+ 
+diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..cea2d2a72d
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memrchr-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef MEMRCHR
++# define MEMRCHR __memrchr_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "memrchr-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/memrchr-avx2.S b/sysdeps/x86_64/multiarch/memrchr-avx2.S
+index f5437b54de..c8d54c08d6 100644
+--- a/sysdeps/x86_64/multiarch/memrchr-avx2.S
++++ b/sysdeps/x86_64/multiarch/memrchr-avx2.S
+@@ -20,14 +20,22 @@
+ 
+ # include <sysdep.h>
+ 
++# ifndef MEMRCHR
++#  define MEMRCHR	__memrchr_avx2
++# endif
++
+ # ifndef VZEROUPPER
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ # define VEC_SIZE 32
+ 
+-	.section .text.avx,"ax",@progbits
+-ENTRY (__memrchr_avx2)
++	.section SECTION(.text),"ax",@progbits
++ENTRY (MEMRCHR)
+ 	/* Broadcast CHAR to YMM0.  */
+ 	vmovd	%esi, %xmm0
+ 	vpbroadcastb %xmm0, %ymm0
+@@ -134,8 +142,8 @@ L(loop_4x_vec):
+ 	vpmovmskb %ymm1, %eax
+ 	bsrl	%eax, %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(last_4x_vec_or_less):
+@@ -169,8 +177,7 @@ L(last_4x_vec_or_less):
+ 	addq	%rax, %rdx
+ 	jl	L(zero)
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_2x_vec):
+@@ -191,31 +198,27 @@ L(last_2x_vec):
+ 	jl	L(zero)
+ 	addl	$(VEC_SIZE * 2), %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_x0):
+ 	bsrl	%eax, %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_x1):
+ 	bsrl	%eax, %eax
+ 	addl	$VEC_SIZE, %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_x2):
+ 	bsrl	%eax, %eax
+ 	addl	$(VEC_SIZE * 2), %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_x3):
+@@ -232,8 +235,7 @@ L(last_vec_x1_check):
+ 	jl	L(zero)
+ 	addl	$VEC_SIZE, %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_x3_check):
+@@ -243,12 +245,14 @@ L(last_vec_x3_check):
+ 	jl	L(zero)
+ 	addl	$(VEC_SIZE * 3), %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(zero):
+-	VZEROUPPER
++	xorl	%eax, %eax
++	VZEROUPPER_RETURN
++
++	.p2align 4
+ L(null):
+ 	xorl	%eax, %eax
+ 	ret
+@@ -273,8 +277,7 @@ L(last_vec_or_less_aligned):
+ 
+ 	bsrl	%eax, %eax
+ 	addq	%rdi, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_or_less):
+@@ -315,8 +318,7 @@ L(last_vec_or_less):
+ 	bsrl	%eax, %eax
+ 	addq	%rdi, %rax
+ 	addq	%r8, %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(last_vec_2x_aligned):
+@@ -353,7 +355,6 @@ L(last_vec_2x_aligned):
+ 	bsrl	%eax, %eax
+ 	addq	%rdi, %rax
+ 	addq	%r8, %rax
+-	VZEROUPPER
+-	ret
+-END (__memrchr_avx2)
++	VZEROUPPER_RETURN
++END (MEMRCHR)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S
+new file mode 100644
+index 0000000000..16bf8e02b1
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memrchr-evex.S
+@@ -0,0 +1,337 @@
++/* memrchr optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# define VMOVA		vmovdqa64
++
++# define YMMMATCH	ymm16
++
++# define VEC_SIZE 32
++
++	.section .text.evex,"ax",@progbits
++ENTRY (__memrchr_evex)
++	/* Broadcast CHAR to YMMMATCH.  */
++	vpbroadcastb %esi, %YMMMATCH
++
++	sub	$VEC_SIZE, %RDX_LP
++	jbe	L(last_vec_or_less)
++
++	add	%RDX_LP, %RDI_LP
++
++	/* Check the last VEC_SIZE bytes.  */
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x0)
++
++	subq	$(VEC_SIZE * 4), %rdi
++	movl	%edi, %ecx
++	andl	$(VEC_SIZE - 1), %ecx
++	jz	L(aligned_more)
++
++	/* Align data for aligned loads in the loop.  */
++	addq	$VEC_SIZE, %rdi
++	addq	$VEC_SIZE, %rdx
++	andq	$-VEC_SIZE, %rdi
++	subq	%rcx, %rdx
++
++	.p2align 4
++L(aligned_more):
++	subq	$(VEC_SIZE * 4), %rdx
++	jbe	L(last_4x_vec_or_less)
++
++	/* Check the last 4 * VEC_SIZE.  Only one VEC_SIZE at a time
++	   since data is only aligned to VEC_SIZE.  */
++	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2
++	kmovd	%k2, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k3
++	kmovd	%k3, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x1)
++
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k4
++	kmovd	%k4, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x0)
++
++	/* Align data to 4 * VEC_SIZE for loop with fewer branches.
++	   There are some overlaps with above if data isn't aligned
++	   to 4 * VEC_SIZE.  */
++	movl	%edi, %ecx
++	andl	$(VEC_SIZE * 4 - 1), %ecx
++	jz	L(loop_4x_vec)
++
++	addq	$(VEC_SIZE * 4), %rdi
++	addq	$(VEC_SIZE * 4), %rdx
++	andq	$-(VEC_SIZE * 4), %rdi
++	subq	%rcx, %rdx
++
++	.p2align 4
++L(loop_4x_vec):
++	/* Compare 4 * VEC at a time forward.  */
++	subq	$(VEC_SIZE * 4), %rdi
++	subq	$(VEC_SIZE * 4), %rdx
++	jbe	L(last_4x_vec_or_less)
++
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
++	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k2
++	kord	%k1, %k2, %k5
++	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k3
++	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k4
++
++	kord	%k3, %k4, %k6
++	kortestd %k5, %k6
++	jz	L(loop_4x_vec)
++
++	/* There is a match.  */
++	kmovd	%k4, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3)
++
++	kmovd	%k3, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++	kmovd	%k2, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x1)
++
++	kmovd	%k1, %eax
++	bsrl	%eax, %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_4x_vec_or_less):
++	addl	$(VEC_SIZE * 4), %edx
++	cmpl	$(VEC_SIZE * 2), %edx
++	jbe	L(last_2x_vec)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k2
++	kmovd	%k2, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k3
++	kmovd	%k3, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x1_check)
++	cmpl	$(VEC_SIZE * 3), %edx
++	jbe	L(zero)
++
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k4
++	kmovd	%k4, %eax
++	testl	%eax, %eax
++	jz	L(zero)
++	bsrl	%eax, %eax
++	subq	$(VEC_SIZE * 4), %rdx
++	addq	%rax, %rdx
++	jl	L(zero)
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_2x_vec):
++	vpcmpb	$0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3_check)
++	cmpl	$VEC_SIZE, %edx
++	jbe	L(zero)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jz	L(zero)
++	bsrl	%eax, %eax
++	subq	$(VEC_SIZE * 2), %rdx
++	addq	%rax, %rdx
++	jl	L(zero)
++	addl	$(VEC_SIZE * 2), %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x0):
++	bsrl	%eax, %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x1):
++	bsrl	%eax, %eax
++	addl	$VEC_SIZE, %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x2):
++	bsrl	%eax, %eax
++	addl	$(VEC_SIZE * 2), %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x3):
++	bsrl	%eax, %eax
++	addl	$(VEC_SIZE * 3), %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x1_check):
++	bsrl	%eax, %eax
++	subq	$(VEC_SIZE * 3), %rdx
++	addq	%rax, %rdx
++	jl	L(zero)
++	addl	$VEC_SIZE, %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_x3_check):
++	bsrl	%eax, %eax
++	subq	$VEC_SIZE, %rdx
++	addq	%rax, %rdx
++	jl	L(zero)
++	addl	$(VEC_SIZE * 3), %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(zero):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(last_vec_or_less_aligned):
++	movl	%edx, %ecx
++
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
++
++	movl	$1, %edx
++	/* Support rdx << 32.  */
++	salq	%cl, %rdx
++	subq	$1, %rdx
++
++	kmovd	%k1, %eax
++
++	/* Remove the trailing bytes.  */
++	andl	%edx, %eax
++	testl	%eax, %eax
++	jz	L(zero)
++
++	bsrl	%eax, %eax
++	addq	%rdi, %rax
++	ret
++
++	.p2align 4
++L(last_vec_or_less):
++	addl	$VEC_SIZE, %edx
++
++	/* Check for zero length.  */
++	testl	%edx, %edx
++	jz	L(zero)
++
++	movl	%edi, %ecx
++	andl	$(VEC_SIZE - 1), %ecx
++	jz	L(last_vec_or_less_aligned)
++
++	movl	%ecx, %esi
++	movl	%ecx, %r8d
++	addl	%edx, %esi
++	andq	$-VEC_SIZE, %rdi
++
++	subl	$VEC_SIZE, %esi
++	ja	L(last_vec_2x_aligned)
++
++	/* Check the last VEC.  */
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
++	kmovd	%k1, %eax
++
++	/* Remove the leading and trailing bytes.  */
++	sarl	%cl, %eax
++	movl	%edx, %ecx
++
++	movl	$1, %edx
++	sall	%cl, %edx
++	subl	$1, %edx
++
++	andl	%edx, %eax
++	testl	%eax, %eax
++	jz	L(zero)
++
++	bsrl	%eax, %eax
++	addq	%rdi, %rax
++	addq	%r8, %rax
++	ret
++
++	.p2align 4
++L(last_vec_2x_aligned):
++	movl	%esi, %ecx
++
++	/* Check the last VEC.  */
++	vpcmpb	$0, VEC_SIZE(%rdi), %YMMMATCH, %k1
++
++	movl	$1, %edx
++	sall	%cl, %edx
++	subl	$1, %edx
++
++	kmovd	%k1, %eax
++
++	/* Remove the trailing bytes.  */
++	andl	%edx, %eax
++
++	testl	%eax, %eax
++	jnz	L(last_vec_x1)
++
++	/* Check the second last VEC.  */
++	vpcmpb	$0, (%rdi), %YMMMATCH, %k1
++
++	movl	%r8d, %ecx
++
++	kmovd	%k1, %eax
++
++	/* Remove the leading bytes.  Must use unsigned right shift for
++	   bsrl below.  */
++	shrl	%cl, %eax
++	testl	%eax, %eax
++	jz	L(zero)
++
++	bsrl	%eax, %eax
++	addq	%rdi, %rax
++	addq	%r8, %rax
++	ret
++END (__memrchr_evex)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms-rtm.S b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms-rtm.S
+new file mode 100644
+index 0000000000..8ac3e479bb
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms-rtm.S
+@@ -0,0 +1,10 @@
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return)
++
++#define SECTION(p) p##.avx.rtm
++#define MEMSET_SYMBOL(p,s)	p##_avx2_##s##_rtm
++#define WMEMSET_SYMBOL(p,s)	p##_avx2_##s##_rtm
++
++#include "memset-avx2-unaligned-erms.S"
+diff --git a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
+index 7ab3d89849..ae0860f36a 100644
+--- a/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
++++ b/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S
+@@ -14,9 +14,15 @@
+   movq r, %rax; \
+   vpbroadcastd %xmm0, %ymm0
+ 
+-# define SECTION(p)		p##.avx
+-# define MEMSET_SYMBOL(p,s)	p##_avx2_##s
+-# define WMEMSET_SYMBOL(p,s)	p##_avx2_##s
++# ifndef SECTION
++#  define SECTION(p)		p##.avx
++# endif
++# ifndef MEMSET_SYMBOL
++#  define MEMSET_SYMBOL(p,s)	p##_avx2_##s
++# endif
++# ifndef WMEMSET_SYMBOL
++#  define WMEMSET_SYMBOL(p,s)	p##_avx2_##s
++# endif
+ 
+ # include "memset-vec-unaligned-erms.S"
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
+index 0783979ca5..22e7b187c8 100644
+--- a/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
++++ b/sysdeps/x86_64/multiarch/memset-avx512-unaligned-erms.S
+@@ -1,22 +1,22 @@
+ #if IS_IN (libc)
+ # define VEC_SIZE	64
+-# define VEC(i)		zmm##i
++# define XMM0		xmm16
++# define YMM0		ymm16
++# define VEC0		zmm16
++# define VEC(i)		VEC##i
+ # define VMOVU		vmovdqu64
+ # define VMOVA		vmovdqa64
++# define VZEROUPPER
+ 
+ # define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
+-  vmovd d, %xmm0; \
+   movq r, %rax; \
+-  vpbroadcastb %xmm0, %xmm0; \
+-  vpbroadcastq %xmm0, %zmm0
++  vpbroadcastb d, %VEC0
+ 
+ # define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
+-  vmovd d, %xmm0; \
+   movq r, %rax; \
+-  vpbroadcastd %xmm0, %xmm0; \
+-  vpbroadcastq %xmm0, %zmm0
++  vpbroadcastd d, %VEC0
+ 
+-# define SECTION(p)		p##.avx512
++# define SECTION(p)		p##.evex512
+ # define MEMSET_SYMBOL(p,s)	p##_avx512_##s
+ # define WMEMSET_SYMBOL(p,s)	p##_avx512_##s
+ 
+diff --git a/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
+new file mode 100644
+index 0000000000..ae0a4d6e46
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/memset-evex-unaligned-erms.S
+@@ -0,0 +1,24 @@
++#if IS_IN (libc)
++# define VEC_SIZE	32
++# define XMM0		xmm16
++# define YMM0		ymm16
++# define VEC0		ymm16
++# define VEC(i)		VEC##i
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++# define VZEROUPPER
++
++# define MEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
++  movq r, %rax; \
++  vpbroadcastb d, %VEC0
++
++# define WMEMSET_VDUP_TO_VEC0_AND_SET_RETURN(d, r) \
++  movq r, %rax; \
++  vpbroadcastd d, %VEC0
++
++# define SECTION(p)		p##.evex
++# define MEMSET_SYMBOL(p,s)	p##_evex_##s
++# define WMEMSET_SYMBOL(p,s)	p##_evex_##s
++
++# include "memset-vec-unaligned-erms.S"
++#endif
+diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+index af2299709c..16bed6ec11 100644
+--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
++++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+@@ -34,20 +34,25 @@
+ # define WMEMSET_CHK_SYMBOL(p,s)	WMEMSET_SYMBOL(p, s)
+ #endif
+ 
++#ifndef XMM0
++# define XMM0				xmm0
++#endif
++
++#ifndef YMM0
++# define YMM0				ymm0
++#endif
++
+ #ifndef VZEROUPPER
+ # if VEC_SIZE > 16
+ #  define VZEROUPPER			vzeroupper
++#  define VZEROUPPER_SHORT_RETURN	vzeroupper; ret
+ # else
+ #  define VZEROUPPER
+ # endif
+ #endif
+ 
+ #ifndef VZEROUPPER_SHORT_RETURN
+-# if VEC_SIZE > 16
+-#  define VZEROUPPER_SHORT_RETURN	vzeroupper
+-# else
+-#  define VZEROUPPER_SHORT_RETURN	rep
+-# endif
++# define VZEROUPPER_SHORT_RETURN	rep; ret
+ #endif
+ 
+ #ifndef MOVQ
+@@ -77,7 +82,7 @@
+ ENTRY (__bzero)
+ 	mov	%RDI_LP, %RAX_LP /* Set return value.  */
+ 	mov	%RSI_LP, %RDX_LP /* Set n.  */
+-	pxor	%xmm0, %xmm0
++	pxor	%XMM0, %XMM0
+ 	jmp	L(entry_from_bzero)
+ END (__bzero)
+ weak_alias (__bzero, bzero)
+@@ -119,8 +124,7 @@ L(entry_from_bzero):
+ 	/* From VEC and to 2 * VEC.  No branch when size == VEC_SIZE.  */
+ 	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+ 	VMOVU	%VEC(0), (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ #if defined USE_MULTIARCH && IS_IN (libc)
+ END (MEMSET_SYMBOL (__memset, unaligned))
+ 
+@@ -143,14 +147,12 @@ ENTRY (__memset_erms)
+ ENTRY (MEMSET_SYMBOL (__memset, erms))
+ # endif
+ L(stosb):
+-	/* Issue vzeroupper before rep stosb.  */
+-	VZEROUPPER
+ 	mov	%RDX_LP, %RCX_LP
+ 	movzbl	%sil, %eax
+ 	mov	%RDI_LP, %RDX_LP
+ 	rep stosb
+ 	mov	%RDX_LP, %RAX_LP
+-	ret
++	VZEROUPPER_RETURN
+ # if VEC_SIZE == 16
+ END (__memset_erms)
+ # else
+@@ -177,8 +179,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
+ 	/* From VEC and to 2 * VEC.  No branch when size == VEC_SIZE.  */
+ 	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+ 	VMOVU	%VEC(0), (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ L(stosb_more_2x_vec):
+ 	cmpq	$REP_STOSB_THRESHOLD, %rdx
+@@ -192,8 +193,11 @@ L(more_2x_vec):
+ 	VMOVU	%VEC(0), -VEC_SIZE(%rdi,%rdx)
+ 	VMOVU	%VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx)
+ L(return):
+-	VZEROUPPER
++#if VEC_SIZE > 16
++	ZERO_UPPER_VEC_REGISTERS_RETURN
++#else
+ 	ret
++#endif
+ 
+ L(loop_start):
+ 	leaq	(VEC_SIZE * 4)(%rdi), %rcx
+@@ -219,7 +223,6 @@ L(loop):
+ 	cmpq	%rcx, %rdx
+ 	jne	L(loop)
+ 	VZEROUPPER_SHORT_RETURN
+-	ret
+ L(less_vec):
+ 	/* Less than 1 VEC.  */
+ # if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+@@ -233,7 +236,7 @@ L(less_vec):
+ 	cmpb	$16, %dl
+ 	jae	L(between_16_31)
+ # endif
+-	MOVQ	%xmm0, %rcx
++	MOVQ	%XMM0, %rcx
+ 	cmpb	$8, %dl
+ 	jae	L(between_8_15)
+ 	cmpb	$4, %dl
+@@ -243,40 +246,34 @@ L(less_vec):
+ 	jb	1f
+ 	movb	%cl, (%rdi)
+ 1:
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ # if VEC_SIZE > 32
+ 	/* From 32 to 63.  No branch when size == 32.  */
+ L(between_32_63):
+-	vmovdqu	%ymm0, -32(%rdi,%rdx)
+-	vmovdqu	%ymm0, (%rdi)
+-	VZEROUPPER
+-	ret
++	VMOVU	%YMM0, -32(%rdi,%rdx)
++	VMOVU	%YMM0, (%rdi)
++	VZEROUPPER_RETURN
+ # endif
+ # if VEC_SIZE > 16
+ 	/* From 16 to 31.  No branch when size == 16.  */
+ L(between_16_31):
+-	vmovdqu	%xmm0, -16(%rdi,%rdx)
+-	vmovdqu	%xmm0, (%rdi)
+-	VZEROUPPER
+-	ret
++	VMOVU	%XMM0, -16(%rdi,%rdx)
++	VMOVU	%XMM0, (%rdi)
++	VZEROUPPER_RETURN
+ # endif
+ 	/* From 8 to 15.  No branch when size == 8.  */
+ L(between_8_15):
+ 	movq	%rcx, -8(%rdi,%rdx)
+ 	movq	%rcx, (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ L(between_4_7):
+ 	/* From 4 to 7.  No branch when size == 4.  */
+ 	movl	%ecx, -4(%rdi,%rdx)
+ 	movl	%ecx, (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ L(between_2_3):
+ 	/* From 2 to 3.  No branch when size == 2.  */
+ 	movw	%cx, -2(%rdi,%rdx)
+ 	movw	%cx, (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ END (MEMSET_SYMBOL (__memset, unaligned_erms))
+diff --git a/sysdeps/x86_64/multiarch/rawmemchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/rawmemchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..acc5f6e2fb
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/rawmemchr-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define MEMCHR __rawmemchr_avx2_rtm
++#define USE_AS_RAWMEMCHR 1
++
++#include "memchr-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex.S b/sysdeps/x86_64/multiarch/rawmemchr-evex.S
+new file mode 100644
+index 0000000000..ec942b77ba
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/rawmemchr-evex.S
+@@ -0,0 +1,4 @@
++#define MEMCHR __rawmemchr_evex
++#define USE_AS_RAWMEMCHR 1
++
++#include "memchr-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S
+new file mode 100644
+index 0000000000..2b9c07a59f
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/stpcpy-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STPCPY
++#define STRCPY __stpcpy_avx2_rtm
++#include "strcpy-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/stpcpy-evex.S b/sysdeps/x86_64/multiarch/stpcpy-evex.S
+new file mode 100644
+index 0000000000..7c6f26cd98
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/stpcpy-evex.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STPCPY
++#define STRCPY __stpcpy_evex
++#include "strcpy-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S
+new file mode 100644
+index 0000000000..60a2ccfe53
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/stpncpy-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define USE_AS_STPCPY
++#define USE_AS_STRNCPY
++#define STRCPY __stpncpy_avx2_rtm
++#include "strcpy-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/stpncpy-evex.S b/sysdeps/x86_64/multiarch/stpncpy-evex.S
+new file mode 100644
+index 0000000000..1570014d1c
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/stpncpy-evex.S
+@@ -0,0 +1,4 @@
++#define USE_AS_STPCPY
++#define USE_AS_STRNCPY
++#define STRCPY __stpncpy_evex
++#include "strcpy-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S b/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S
+new file mode 100644
+index 0000000000..637fb557c4
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcat-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRCAT
++# define STRCAT __strcat_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strcat-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strcat-avx2.S b/sysdeps/x86_64/multiarch/strcat-avx2.S
+index a4143bf8f5..1e6d4827ee 100644
+--- a/sysdeps/x86_64/multiarch/strcat-avx2.S
++++ b/sysdeps/x86_64/multiarch/strcat-avx2.S
+@@ -30,7 +30,11 @@
+ /* Number of bytes in a vector register */
+ # define VEC_SIZE	32
+ 
+-	.section .text.avx,"ax",@progbits
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRCAT)
+ 	mov	%rdi, %r9
+ # ifdef USE_AS_STRNCAT
+diff --git a/sysdeps/x86_64/multiarch/strcat-evex.S b/sysdeps/x86_64/multiarch/strcat-evex.S
+new file mode 100644
+index 0000000000..97c3d85b6d
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcat-evex.S
+@@ -0,0 +1,283 @@
++/* strcat with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef STRCAT
++#  define STRCAT  __strcat_evex
++# endif
++
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++
++/* zero register */
++# define XMMZERO	xmm16
++# define YMMZERO	ymm16
++# define YMM0		ymm17
++# define YMM1		ymm18
++
++# define USE_AS_STRCAT
++
++/* Number of bytes in a vector register */
++# define VEC_SIZE	32
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRCAT)
++	mov	%rdi, %r9
++# ifdef USE_AS_STRNCAT
++	mov	%rdx, %r8
++# endif
++
++	xor	%eax, %eax
++	mov	%edi, %ecx
++	and	$((VEC_SIZE * 4) - 1), %ecx
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++	cmp	$(VEC_SIZE * 3), %ecx
++	ja	L(fourth_vector_boundary)
++	vpcmpb	$0, (%rdi), %YMMZERO, %k0
++	kmovd	%k0, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_first_vector)
++	mov	%rdi, %rax
++	and	$-VEC_SIZE, %rax
++	jmp	L(align_vec_size_start)
++L(fourth_vector_boundary):
++	mov	%rdi, %rax
++	and	$-VEC_SIZE, %rax
++	vpcmpb	$0, (%rax), %YMMZERO, %k0
++	mov	$-1, %r10d
++	sub	%rax, %rcx
++	shl	%cl, %r10d
++	kmovd	%k0, %edx
++	and	%r10d, %edx
++	jnz	L(exit)
++
++L(align_vec_size_start):
++	vpcmpb	$0, VEC_SIZE(%rax), %YMMZERO, %k0
++	kmovd	%k0, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_second_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rax), %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_third_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rax), %YMMZERO, %k2
++	kmovd	%k2, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fourth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 4)(%rax), %YMMZERO, %k3
++	kmovd	%k3, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fifth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 5)(%rax), %YMMZERO, %k4
++	add	$(VEC_SIZE * 4), %rax
++	kmovd	%k4, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_second_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rax), %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_third_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rax), %YMMZERO, %k2
++	kmovd	%k2, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fourth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 4)(%rax), %YMMZERO, %k3
++	kmovd	%k3, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fifth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 5)(%rax), %YMMZERO, %k4
++	kmovd	%k4, %edx
++	add	$(VEC_SIZE * 4), %rax
++	test	%edx, %edx
++	jnz	L(exit_null_on_second_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rax), %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_third_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rax), %YMMZERO, %k2
++	kmovd	%k2, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fourth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 4)(%rax), %YMMZERO, %k3
++	kmovd	%k3, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fifth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 5)(%rax), %YMMZERO, %k4
++	add	$(VEC_SIZE * 4), %rax
++	kmovd	%k4, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_second_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rax), %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_third_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rax), %YMMZERO, %k2
++	kmovd	%k2, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fourth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 4)(%rax), %YMMZERO, %k3
++	kmovd	%k3, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fifth_vector)
++
++	test	$((VEC_SIZE * 4) - 1), %rax
++	jz	L(align_four_vec_loop)
++
++	vpcmpb	$0, (VEC_SIZE * 5)(%rax), %YMMZERO, %k4
++	add	$(VEC_SIZE * 5), %rax
++	kmovd	%k4, %edx
++	test	%edx, %edx
++	jnz	L(exit)
++
++	test	$((VEC_SIZE * 4) - 1), %rax
++	jz	L(align_four_vec_loop)
++
++	vpcmpb	$0, VEC_SIZE(%rax), %YMMZERO, %k0
++	add	$VEC_SIZE, %rax
++	kmovd	%k0, %edx
++	test	%edx, %edx
++	jnz	L(exit)
++
++	test	$((VEC_SIZE * 4) - 1), %rax
++	jz	L(align_four_vec_loop)
++
++	vpcmpb	$0, VEC_SIZE(%rax), %YMMZERO, %k0
++	add	$VEC_SIZE, %rax
++	kmovd	%k0, %edx
++	test	%edx, %edx
++	jnz	L(exit)
++
++	test	$((VEC_SIZE * 4) - 1), %rax
++	jz	L(align_four_vec_loop)
++
++	vpcmpb	$0, VEC_SIZE(%rax), %YMMZERO, %k1
++	add	$VEC_SIZE, %rax
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit)
++
++	add	$VEC_SIZE, %rax
++
++	.p2align 4
++L(align_four_vec_loop):
++	VMOVA	(%rax), %YMM0
++	VMOVA	(VEC_SIZE * 2)(%rax), %YMM1
++	vpminub	VEC_SIZE(%rax), %YMM0, %YMM0
++	vpminub	(VEC_SIZE * 3)(%rax), %YMM1, %YMM1
++	vpminub	%YMM0, %YMM1, %YMM0
++	/* If K0 != 0, there is a null byte.  */
++	vpcmpb	$0, %YMM0, %YMMZERO, %k0
++	add	$(VEC_SIZE * 4), %rax
++	ktestd	%k0, %k0
++	jz	L(align_four_vec_loop)
++
++	vpcmpb	$0, -(VEC_SIZE * 4)(%rax), %YMMZERO, %k0
++	sub	$(VEC_SIZE * 5), %rax
++	kmovd	%k0, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_second_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 2)(%rax), %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_third_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 3)(%rax), %YMMZERO, %k2
++	kmovd	%k2, %edx
++	test	%edx, %edx
++	jnz	L(exit_null_on_fourth_vector)
++
++	vpcmpb	$0, (VEC_SIZE * 4)(%rax), %YMMZERO, %k3
++	kmovd	%k3, %edx
++	sub	%rdi, %rax
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	add	$(VEC_SIZE * 4), %rax
++	jmp	L(StartStrcpyPart)
++
++	.p2align 4
++L(exit):
++	sub	%rdi, %rax
++L(exit_null_on_first_vector):
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	jmp	L(StartStrcpyPart)
++
++	.p2align 4
++L(exit_null_on_second_vector):
++	sub	%rdi, %rax
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	add	$VEC_SIZE, %rax
++	jmp	L(StartStrcpyPart)
++
++	.p2align 4
++L(exit_null_on_third_vector):
++	sub	%rdi, %rax
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	add	$(VEC_SIZE * 2), %rax
++	jmp	L(StartStrcpyPart)
++
++	.p2align 4
++L(exit_null_on_fourth_vector):
++	sub	%rdi, %rax
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	add	$(VEC_SIZE * 3), %rax
++	jmp	L(StartStrcpyPart)
++
++	.p2align 4
++L(exit_null_on_fifth_vector):
++	sub	%rdi, %rax
++	bsf	%rdx, %rdx
++	add	%rdx, %rax
++	add	$(VEC_SIZE * 4), %rax
++
++	.p2align 4
++L(StartStrcpyPart):
++	lea	(%r9, %rax), %rdi
++	mov	%rsi, %rcx
++	mov	%r9, %rax      /* save result */
++
++# ifdef USE_AS_STRNCAT
++	test	%r8, %r8
++	jz	L(ExitZero)
++#  define USE_AS_STRNCPY
++# endif
++
++# include "strcpy-evex.S"
++#endif
+diff --git a/sysdeps/x86_64/multiarch/strchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/strchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..81f20d1d8e
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strchr-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRCHR
++# define STRCHR __strchr_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strchr-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
+index 39fc69da7b..0a5217514a 100644
+--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
++++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
+@@ -38,9 +38,13 @@
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ # define VEC_SIZE 32
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRCHR)
+ 	movl	%edi, %ecx
+ 	/* Broadcast CHAR to YMM0.  */
+@@ -93,8 +97,8 @@ L(cros_page_boundary):
+ 	cmp	(%rax), %CHAR_REG
+ 	cmovne	%rdx, %rax
+ # endif
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(aligned_more):
+@@ -190,8 +194,7 @@ L(first_vec_x0):
+ 	cmp	(%rax), %CHAR_REG
+ 	cmovne	%rdx, %rax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(first_vec_x1):
+@@ -205,8 +208,7 @@ L(first_vec_x1):
+ 	cmp	(%rax), %CHAR_REG
+ 	cmovne	%rdx, %rax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(first_vec_x2):
+@@ -220,8 +222,7 @@ L(first_vec_x2):
+ 	cmp	(%rax), %CHAR_REG
+ 	cmovne	%rdx, %rax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(4x_vec_end):
+@@ -247,8 +248,7 @@ L(first_vec_x3):
+ 	cmp	(%rax), %CHAR_REG
+ 	cmovne	%rdx, %rax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ END (STRCHR)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S
+new file mode 100644
+index 0000000000..ddc86a7058
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strchr-evex.S
+@@ -0,0 +1,335 @@
++/* strchr/strchrnul optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef STRCHR
++#  define STRCHR	__strchr_evex
++# endif
++
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++
++# ifdef USE_AS_WCSCHR
++#  define VPBROADCAST	vpbroadcastd
++#  define VPCMP		vpcmpd
++#  define VPMINU	vpminud
++#  define CHAR_REG	esi
++#  define SHIFT_REG	r8d
++# else
++#  define VPBROADCAST	vpbroadcastb
++#  define VPCMP		vpcmpb
++#  define VPMINU	vpminub
++#  define CHAR_REG	sil
++#  define SHIFT_REG	ecx
++# endif
++
++# define XMMZERO	xmm16
++
++# define YMMZERO	ymm16
++# define YMM0		ymm17
++# define YMM1		ymm18
++# define YMM2		ymm19
++# define YMM3		ymm20
++# define YMM4		ymm21
++# define YMM5		ymm22
++# define YMM6		ymm23
++# define YMM7		ymm24
++# define YMM8		ymm25
++
++# define VEC_SIZE 32
++# define PAGE_SIZE 4096
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRCHR)
++	movl	%edi, %ecx
++# ifndef USE_AS_STRCHRNUL
++	xorl	%edx, %edx
++# endif
++
++	/* Broadcast CHAR to YMM0.	*/
++	VPBROADCAST %esi, %YMM0
++
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++
++	/* Check if we cross page boundary with one vector load.  */
++	andl	$(PAGE_SIZE - 1), %ecx
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
++	ja  L(cross_page_boundary)
++
++	/* Check the first VEC_SIZE bytes. Search for both CHAR and the
++	   null bytes.  */
++	VMOVU	(%rdi), %YMM1
++
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	ktestd	%k0, %k0
++	jz	L(more_vecs)
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++	/* Found CHAR or the null byte.	 */
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(%rdi, %rax, 4), %rax
++# else
++	addq	%rdi, %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++	.p2align 4
++L(more_vecs):
++	/* Align data for aligned loads in the loop.  */
++	andq	$-VEC_SIZE, %rdi
++L(aligned_more):
++
++	/* Check the next 4 * VEC_SIZE.	 Only one VEC_SIZE at a time
++	   since data is only aligned to VEC_SIZE.	*/
++	VMOVA	VEC_SIZE(%rdi), %YMM1
++	addq	$VEC_SIZE, %rdi
++
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x0)
++
++	VMOVA	VEC_SIZE(%rdi), %YMM1
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x1)
++
++	VMOVA	(VEC_SIZE * 2)(%rdi), %YMM1
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x2)
++
++	VMOVA	(VEC_SIZE * 3)(%rdi), %YMM1
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	ktestd	%k0, %k0
++	jz	L(prep_loop_4x)
++
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++	/* Found CHAR or the null byte.	 */
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(VEC_SIZE * 3)(%rdi, %rax, 4), %rax
++# else
++	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x0):
++	tzcntl	%eax, %eax
++	/* Found CHAR or the null byte.	 */
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(%rdi, %rax, 4), %rax
++# else
++	addq	%rdi, %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x1):
++	tzcntl	%eax, %eax
++	/* Found CHAR or the null byte.	 */
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	VEC_SIZE(%rdi, %rax, 4), %rax
++# else
++	leaq	VEC_SIZE(%rdi, %rax), %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x2):
++	tzcntl	%eax, %eax
++	/* Found CHAR or the null byte.	 */
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, 4), %rax
++# else
++	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++L(prep_loop_4x):
++	/* Align data to 4 * VEC_SIZE.	*/
++	andq	$-(VEC_SIZE * 4), %rdi
++
++	.p2align 4
++L(loop_4x_vec):
++	/* Compare 4 * VEC at a time forward.  */
++	VMOVA	(VEC_SIZE * 4)(%rdi), %YMM1
++	VMOVA	(VEC_SIZE * 5)(%rdi), %YMM2
++	VMOVA	(VEC_SIZE * 6)(%rdi), %YMM3
++	VMOVA	(VEC_SIZE * 7)(%rdi), %YMM4
++
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM5
++	vpxorq	%YMM2, %YMM0, %YMM6
++	vpxorq	%YMM3, %YMM0, %YMM7
++	vpxorq	%YMM4, %YMM0, %YMM8
++
++	VPMINU	%YMM5, %YMM1, %YMM5
++	VPMINU	%YMM6, %YMM2, %YMM6
++	VPMINU	%YMM7, %YMM3, %YMM7
++	VPMINU	%YMM8, %YMM4, %YMM8
++
++	VPMINU	%YMM5, %YMM6, %YMM1
++	VPMINU	%YMM7, %YMM8, %YMM2
++
++	VPMINU	%YMM1, %YMM2, %YMM1
++
++	/* Each bit in K0 represents a CHAR or a null byte.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++
++	addq	$(VEC_SIZE * 4), %rdi
++
++	ktestd	%k0, %k0
++	jz	L(loop_4x_vec)
++
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM5, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x0)
++
++	/* Each bit in K1 represents a CHAR or a null byte in YMM2.  */
++	VPCMP	$0, %YMMZERO, %YMM6, %k1
++	kmovd	%k1, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x1)
++
++	/* Each bit in K2 represents a CHAR or a null byte in YMM3.  */
++	VPCMP	$0, %YMMZERO, %YMM7, %k2
++	/* Each bit in K3 represents a CHAR or a null byte in YMM4.  */
++	VPCMP	$0, %YMMZERO, %YMM8, %k3
++
++# ifdef USE_AS_WCSCHR
++	/* NB: Each bit in K2/K3 represents 4-byte element.  */
++	kshiftlw $8, %k3, %k1
++# else
++	kshiftlq $32, %k3, %k1
++# endif
++
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	korq	%k1, %k2, %k1
++	kmovq	%k1, %rax
++
++	tzcntq  %rax, %rax
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(VEC_SIZE * 2)(%rdi, %rax, 4), %rax
++# else
++	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++	/* Cold case for crossing page with first load.	 */
++	.p2align 4
++L(cross_page_boundary):
++	andq	$-VEC_SIZE, %rdi
++	andl	$(VEC_SIZE - 1), %ecx
++
++	VMOVA	(%rdi), %YMM1
++
++	/* Leaves only CHARS matching esi as 0.  */
++	vpxorq	%YMM1, %YMM0, %YMM2
++	VPMINU	%YMM2, %YMM1, %YMM2
++	/* Each bit in K0 represents a CHAR or a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM2, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++
++# ifdef USE_AS_WCSCHR
++	/* NB: Divide shift count by 4 since each bit in K1 represents 4
++	   bytes.  */
++	movl	%ecx, %SHIFT_REG
++	sarl    $2, %SHIFT_REG
++# endif
++
++	/* Remove the leading bits.	 */
++	sarxl	%SHIFT_REG, %eax, %eax
++	testl	%eax, %eax
++
++	jz	L(aligned_more)
++	tzcntl	%eax, %eax
++	addq	%rcx, %rdi
++# ifdef USE_AS_WCSCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	(%rdi, %rax, 4), %rax
++# else
++	addq	%rdi, %rax
++# endif
++# ifndef USE_AS_STRCHRNUL
++	cmp (%rax), %CHAR_REG
++	cmovne	%rdx, %rax
++# endif
++	ret
++
++END (STRCHR)
++# endif
+diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
+index f27980dd36..a04ac8eb1d 100644
+--- a/sysdeps/x86_64/multiarch/strchr.c
++++ b/sysdeps/x86_64/multiarch/strchr.c
+@@ -29,16 +29,28 @@
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_no_bsf) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable)
++	  && CPU_FEATURES_CPU_P (cpu_features, BMI2))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
+ 
+   if (CPU_FEATURES_ARCH_P (cpu_features, Slow_BSF))
+     return OPTIMIZE (sse2_no_bsf);
+diff --git a/sysdeps/x86_64/multiarch/strchrnul-avx2-rtm.S b/sysdeps/x86_64/multiarch/strchrnul-avx2-rtm.S
+new file mode 100644
+index 0000000000..cdcf818b91
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strchrnul-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define STRCHR __strchrnul_avx2_rtm
++#define USE_AS_STRCHRNUL 1
++#include "strchr-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/strchrnul-evex.S b/sysdeps/x86_64/multiarch/strchrnul-evex.S
+new file mode 100644
+index 0000000000..064fe7ca9e
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strchrnul-evex.S
+@@ -0,0 +1,3 @@
++#define STRCHR __strchrnul_evex
++#define USE_AS_STRCHRNUL 1
++#include "strchr-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/strcmp-avx2-rtm.S
+new file mode 100644
+index 0000000000..aecd30d97f
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcmp-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRCMP
++# define STRCMP __strcmp_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strcmp-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
+index 48d03a9f46..4d434fd14e 100644
+--- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
++++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
+@@ -55,6 +55,10 @@
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ /* Warning!
+            wcscmp/wcsncmp have to use SIGNED comparison for elements.
+            strcmp/strncmp have to use UNSIGNED comparison for elements.
+@@ -75,7 +79,7 @@
+    the maximum offset is reached before a difference is found, zero is
+    returned.  */
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRCMP)
+ # ifdef USE_AS_STRNCMP
+ 	/* Check for simple cases (0 or 1) in offset.  */
+@@ -83,6 +87,16 @@ ENTRY (STRCMP)
+ 	je	L(char0)
+ 	jb	L(zero)
+ #  ifdef USE_AS_WCSCMP
++#  ifndef __ILP32__
++	movq	%rdx, %rcx
++	/* Check if length could overflow when multiplied by
++	   sizeof(wchar_t). Checking top 8 bits will cover all potential
++	   overflow cases as well as redirect cases where it is impossible for
++	   the length to bound a valid memory region. In these cases just use
++	   'wcscmp'.  */
++	shrq	$56, %rcx
++	jnz	OVERFLOW_STRCMP
++#  endif
+ 	/* Convert units: from wide to byte char.  */
+ 	shl	$2, %RDX_LP
+ #  endif
+@@ -127,8 +141,8 @@ L(return):
+ 	movzbl	(%rsi, %rdx), %edx
+ 	subl	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(return_vec_size):
+@@ -161,8 +175,7 @@ L(return_vec_size):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(return_2_vec_size):
+@@ -195,8 +208,7 @@ L(return_2_vec_size):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(return_3_vec_size):
+@@ -229,8 +241,7 @@ L(return_3_vec_size):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(next_3_vectors):
+@@ -356,8 +367,7 @@ L(back_to_loop):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(test_vec):
+@@ -400,8 +410,7 @@ L(test_vec):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(test_2_vec):
+@@ -444,8 +453,7 @@ L(test_2_vec):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(test_3_vec):
+@@ -486,8 +494,7 @@ L(test_3_vec):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(loop_cross_page):
+@@ -556,8 +563,7 @@ L(loop_cross_page):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(loop_cross_page_2_vec):
+@@ -591,7 +597,14 @@ L(loop_cross_page_2_vec):
+ 	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
+ 
+ 	testq	%rdi, %rdi
++# ifdef USE_AS_STRNCMP
++	/* At this point, if %rdi value is 0, it already tested
++	   VEC_SIZE*4+%r10 bytes starting from %rax. This label
++	   checks whether the strncmp maximum offset has been reached.  */
++	je	L(string_nbyte_offset_check)
++# else
+ 	je	L(back_to_loop)
++# endif
+ 	tzcntq	%rdi, %rcx
+ 	addq	%r10, %rcx
+ 	/* Adjust for number of bytes skipped.  */
+@@ -624,8 +637,15 @@ L(loop_cross_page_2_vec):
+ 	subl	%edx, %eax
+ #  endif
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
++
++# ifdef USE_AS_STRNCMP
++L(string_nbyte_offset_check):
++	leaq	(VEC_SIZE * 4)(%r10), %r10
++	cmpq	%r10, %r11
++	jbe	L(zero)
++	jmp	L(back_to_loop)
++# endif
+ 
+ 	.p2align 4
+ L(cross_page_loop):
+@@ -659,8 +679,7 @@ L(cross_page_loop):
+ # ifndef USE_AS_WCSCMP
+ L(different):
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ # ifdef USE_AS_WCSCMP
+ 	.p2align 4
+@@ -670,16 +689,14 @@ L(different):
+ 	setl	%al
+ 	negl	%eax
+ 	orl	$1, %eax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ # endif
+ 
+ # ifdef USE_AS_STRNCMP
+ 	.p2align 4
+ L(zero):
+ 	xorl	%eax, %eax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(char0):
+@@ -693,8 +710,7 @@ L(char0):
+ 	movzbl	(%rdi), %eax
+ 	subl	%ecx, %eax
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ # endif
+ 
+ 	.p2align 4
+@@ -719,8 +735,7 @@ L(last_vector):
+ 	movzbl	(%rsi, %rdx), %edx
+ 	subl	%edx, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	/* Comparing on page boundary region requires special treatment:
+ 	   It must done one vector at the time, starting with the wider
+@@ -841,7 +856,6 @@ L(cross_page_4bytes):
+ 	testl	%eax, %eax
+ 	jne	L(cross_page_loop)
+ 	subl	%ecx, %eax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ END (STRCMP)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
+new file mode 100644
+index 0000000000..459eeed09f
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
+@@ -0,0 +1,1043 @@
++/* strcmp/wcscmp/strncmp/wcsncmp optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef STRCMP
++#  define STRCMP	__strcmp_evex
++# endif
++
++# define PAGE_SIZE	4096
++
++/* VEC_SIZE = Number of bytes in a ymm register */
++# define VEC_SIZE	32
++
++/* Shift for dividing by (VEC_SIZE * 4).  */
++# define DIVIDE_BY_VEC_4_SHIFT	7
++# if (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
++#  error (VEC_SIZE * 4) != (1 << DIVIDE_BY_VEC_4_SHIFT)
++# endif
++
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++
++# ifdef USE_AS_WCSCMP
++/* Compare packed dwords.  */
++#  define VPCMP		vpcmpd
++#  define SHIFT_REG32	r8d
++#  define SHIFT_REG64	r8
++/* 1 dword char == 4 bytes.  */
++#  define SIZE_OF_CHAR	4
++# else
++/* Compare packed bytes.  */
++#  define VPCMP		vpcmpb
++#  define SHIFT_REG32	ecx
++#  define SHIFT_REG64	rcx
++/* 1 byte char == 1 byte.  */
++#  define SIZE_OF_CHAR	1
++# endif
++
++# define XMMZERO	xmm16
++# define XMM0		xmm17
++# define XMM1		xmm18
++
++# define YMMZERO	ymm16
++# define YMM0		ymm17
++# define YMM1		ymm18
++# define YMM2		ymm19
++# define YMM3		ymm20
++# define YMM4		ymm21
++# define YMM5		ymm22
++# define YMM6		ymm23
++# define YMM7		ymm24
++
++/* Warning!
++           wcscmp/wcsncmp have to use SIGNED comparison for elements.
++           strcmp/strncmp have to use UNSIGNED comparison for elements.
++*/
++
++/* The main idea of the string comparison (byte or dword) using 256-bit
++   EVEX instructions consists of comparing (VPCMP) two ymm vectors. The
++   latter can operate on either packed bytes or dwords depending on
++   USE_AS_WCSCMP. In order to check the null char, the algorithm keeps the
++   matched bytes/dwords, requiring 5 EVEX instructions (3 VPCMP and 2
++   KORD). In general, the costs of comparing VEC_SIZE bytes (32-bytes)
++   are 3 VPCMP and 2 KORD instructions, together with VMOVU and ktestd
++   instructions.  The main loop (away from the page boundary) compares 4
++   vectors at a time, effectively comparing 4 x VEC_SIZE bytes (128
++   bytes) on each loop.
++
++   The routine strncmp/wcsncmp (enabled by defining USE_AS_STRNCMP) logic
++   is the same as strcmp, except that a maximum offset is tracked.  If
++   the maximum offset is reached before a difference is found, zero is
++   returned.  */
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRCMP)
++# ifdef USE_AS_STRNCMP
++	/* Check for simple cases (0 or 1) in offset.  */
++	cmp	$1, %RDX_LP
++	je	L(char0)
++	jb	L(zero)
++#  ifdef USE_AS_WCSCMP
++	/* Convert units: from wide to byte char.  */
++	shl	$2, %RDX_LP
++#  endif
++	/* Register %r11 tracks the maximum offset.  */
++	mov	%RDX_LP, %R11_LP
++# endif
++	movl	%edi, %eax
++	xorl	%edx, %edx
++	/* Make %XMMZERO (%YMMZERO) all zeros in this function.  */
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++	orl	%esi, %eax
++	andl	$(PAGE_SIZE - 1), %eax
++	cmpl	$(PAGE_SIZE - (VEC_SIZE * 4)), %eax
++	jg	L(cross_page)
++	/* Start comparing 4 vectors.  */
++	VMOVU	(%rdi), %YMM0
++	VMOVU	(%rsi), %YMM1
++
++	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
++	VPCMP	$4, %YMM0, %YMM1, %k0
++
++	/* Check for NULL in YMM0.  */
++	VPCMP	$0, %YMMZERO, %YMM0, %k1
++	/* Check for NULL in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k2
++	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
++	kord	%k1, %k2, %k1
++
++	/* Each bit in K1 represents:
++	   1. A mismatch in YMM0 and YMM1.  Or
++	   2. A NULL in YMM0 or YMM1.
++	 */
++	kord	%k0, %k1, %k1
++
++	ktestd	%k1, %k1
++	je	L(next_3_vectors)
++	kmovd	%k1, %ecx
++	tzcntl	%ecx, %edx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edx
++# endif
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the mismatched index (%rdx) is after the maximum
++	   offset (%r11).   */
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++# ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rdx), %ecx
++	cmpl	(%rsi, %rdx), %ecx
++	je	L(return)
++L(wcscmp_return):
++	setl	%al
++	negl	%eax
++	orl	$1, %eax
++L(return):
++# else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %edx
++	subl	%edx, %eax
++# endif
++	ret
++
++	.p2align 4
++L(return_vec_size):
++	kmovd	%k1, %ecx
++	tzcntl	%ecx, %edx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edx
++# endif
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the mismatched index (%rdx + VEC_SIZE) is after
++	   the maximum offset (%r11).  */
++	addq	$VEC_SIZE, %rdx
++	cmpq	%r11, %rdx
++	jae	L(zero)
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rdx), %ecx
++	cmpl	(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	VEC_SIZE(%rdi, %rdx), %ecx
++	cmpl	VEC_SIZE(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	VEC_SIZE(%rdi, %rdx), %eax
++	movzbl	VEC_SIZE(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(return_2_vec_size):
++	kmovd	%k1, %ecx
++	tzcntl	%ecx, %edx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edx
++# endif
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the mismatched index (%rdx + 2 * VEC_SIZE) is
++	   after the maximum offset (%r11).  */
++	addq	$(VEC_SIZE * 2), %rdx
++	cmpq	%r11, %rdx
++	jae	L(zero)
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rdx), %ecx
++	cmpl	(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 2)(%rdi, %rdx), %ecx
++	cmpl	(VEC_SIZE * 2)(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(VEC_SIZE * 2)(%rdi, %rdx), %eax
++	movzbl	(VEC_SIZE * 2)(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(return_3_vec_size):
++	kmovd	%k1, %ecx
++	tzcntl	%ecx, %edx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edx
++# endif
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the mismatched index (%rdx + 3 * VEC_SIZE) is
++	   after the maximum offset (%r11).  */
++	addq	$(VEC_SIZE * 3), %rdx
++	cmpq	%r11, %rdx
++	jae	L(zero)
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rdx), %ecx
++	cmpl	(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 3)(%rdi, %rdx), %ecx
++	cmpl	(VEC_SIZE * 3)(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(VEC_SIZE * 3)(%rdi, %rdx), %eax
++	movzbl	(VEC_SIZE * 3)(%rsi, %rdx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(next_3_vectors):
++	VMOVU	VEC_SIZE(%rdi), %YMM0
++	VMOVU	VEC_SIZE(%rsi), %YMM1
++	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
++	VPCMP	$4, %YMM0, %YMM1, %k0
++	VPCMP	$0, %YMMZERO, %YMM0, %k1
++	VPCMP	$0, %YMMZERO, %YMM1, %k2
++	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	ktestd	%k1, %k1
++	jne	L(return_vec_size)
++
++	VMOVU	(VEC_SIZE * 2)(%rdi), %YMM2
++	VMOVU	(VEC_SIZE * 3)(%rdi), %YMM3
++	VMOVU	(VEC_SIZE * 2)(%rsi), %YMM4
++	VMOVU	(VEC_SIZE * 3)(%rsi), %YMM5
++
++	/* Each bit in K0 represents a mismatch in YMM2 and YMM4.  */
++	VPCMP	$4, %YMM2, %YMM4, %k0
++	VPCMP	$0, %YMMZERO, %YMM2, %k1
++	VPCMP	$0, %YMMZERO, %YMM4, %k2
++	/* Each bit in K1 represents a NULL in YMM2 or YMM4.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	ktestd	%k1, %k1
++	jne	L(return_2_vec_size)
++
++	/* Each bit in K0 represents a mismatch in YMM3 and YMM5.  */
++	VPCMP	$4, %YMM3, %YMM5, %k0
++	VPCMP	$0, %YMMZERO, %YMM3, %k1
++	VPCMP	$0, %YMMZERO, %YMM5, %k2
++	/* Each bit in K1 represents a NULL in YMM3 or YMM5.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	ktestd	%k1, %k1
++	jne	L(return_3_vec_size)
++L(main_loop_header):
++	leaq	(VEC_SIZE * 4)(%rdi), %rdx
++	movl	$PAGE_SIZE, %ecx
++	/* Align load via RAX.  */
++	andq	$-(VEC_SIZE * 4), %rdx
++	subq	%rdi, %rdx
++	leaq	(%rdi, %rdx), %rax
++# ifdef USE_AS_STRNCMP
++	/* Starting from this point, the maximum offset, or simply the
++	   'offset', DECREASES by the same amount when base pointers are
++	   moved forward.  Return 0 when:
++	     1) On match: offset <= the matched vector index.
++	     2) On mismatch, offset is before the mismatched index.
++	 */
++	subq	%rdx, %r11
++	jbe	L(zero)
++# endif
++	addq	%rsi, %rdx
++	movq	%rdx, %rsi
++	andl	$(PAGE_SIZE - 1), %esi
++	/* Number of bytes before page crossing.  */
++	subq	%rsi, %rcx
++	/* Number of VEC_SIZE * 4 blocks before page crossing.  */
++	shrq	$DIVIDE_BY_VEC_4_SHIFT, %rcx
++	/* ESI: Number of VEC_SIZE * 4 blocks before page crossing.   */
++	movl	%ecx, %esi
++	jmp	L(loop_start)
++
++	.p2align 4
++L(loop):
++# ifdef USE_AS_STRNCMP
++	/* Base pointers are moved forward by 4 * VEC_SIZE.  Decrease
++	   the maximum offset (%r11) by the same amount.  */
++	subq	$(VEC_SIZE * 4), %r11
++	jbe	L(zero)
++# endif
++	addq	$(VEC_SIZE * 4), %rax
++	addq	$(VEC_SIZE * 4), %rdx
++L(loop_start):
++	testl	%esi, %esi
++	leal	-1(%esi), %esi
++	je	L(loop_cross_page)
++L(back_to_loop):
++	/* Main loop, comparing 4 vectors at a time.  */
++	VMOVA	(%rax), %YMM0
++	VMOVA	VEC_SIZE(%rax), %YMM2
++	VMOVA	(VEC_SIZE * 2)(%rax), %YMM4
++	VMOVA	(VEC_SIZE * 3)(%rax), %YMM6
++	VMOVU	(%rdx), %YMM1
++	VMOVU	VEC_SIZE(%rdx), %YMM3
++	VMOVU	(VEC_SIZE * 2)(%rdx), %YMM5
++	VMOVU	(VEC_SIZE * 3)(%rdx), %YMM7
++
++	VPCMP	$4, %YMM0, %YMM1, %k0
++	VPCMP	$0, %YMMZERO, %YMM0, %k1
++	VPCMP	$0, %YMMZERO, %YMM1, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K4 represents a NULL or a mismatch in YMM0 and
++	   YMM1.  */
++	kord	%k0, %k1, %k4
++
++	VPCMP	$4, %YMM2, %YMM3, %k0
++	VPCMP	$0, %YMMZERO, %YMM2, %k1
++	VPCMP	$0, %YMMZERO, %YMM3, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K5 represents a NULL or a mismatch in YMM2 and
++	   YMM3.  */
++	kord	%k0, %k1, %k5
++
++	VPCMP	$4, %YMM4, %YMM5, %k0
++	VPCMP	$0, %YMMZERO, %YMM4, %k1
++	VPCMP	$0, %YMMZERO, %YMM5, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K6 represents a NULL or a mismatch in YMM4 and
++	   YMM5.  */
++	kord	%k0, %k1, %k6
++
++	VPCMP	$4, %YMM6, %YMM7, %k0
++	VPCMP	$0, %YMMZERO, %YMM6, %k1
++	VPCMP	$0, %YMMZERO, %YMM7, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K7 represents a NULL or a mismatch in YMM6 and
++	   YMM7.  */
++	kord	%k0, %k1, %k7
++
++	kord	%k4, %k5, %k0
++	kord	%k6, %k7, %k1
++
++	/* Test each mask (32 bits) individually because for VEC_SIZE
++	   == 32 it is not possible to OR the four masks and keep all bits
++	   in a 64-bit integer register, differing from SSE2 strcmp
++	   where ORing is possible.  */
++	kortestd %k0, %k1
++	je	L(loop)
++	ktestd	%k4, %k4
++	je	L(test_vec)
++	kmovd	%k4, %edi
++	tzcntl	%edi, %ecx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %ecx
++# endif
++# ifdef USE_AS_STRNCMP
++	cmpq	%rcx, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %edi
++	cmpl	(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %edi
++	cmpl	(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(test_vec):
++# ifdef USE_AS_STRNCMP
++	/* The first vector matched.  Return 0 if the maximum offset
++	   (%r11) <= VEC_SIZE.  */
++	cmpq	$VEC_SIZE, %r11
++	jbe	L(zero)
++# endif
++	ktestd	%k5, %k5
++	je	L(test_2_vec)
++	kmovd	%k5, %ecx
++	tzcntl	%ecx, %edi
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edi
++# endif
++# ifdef USE_AS_STRNCMP
++	addq	$VEC_SIZE, %rdi
++	cmpq	%rdi, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rdi), %ecx
++	cmpl	(%rdx, %rdi), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rdi), %eax
++	movzbl	(%rdx, %rdi), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	VEC_SIZE(%rsi, %rdi), %ecx
++	cmpl	VEC_SIZE(%rdx, %rdi), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	VEC_SIZE(%rax, %rdi), %eax
++	movzbl	VEC_SIZE(%rdx, %rdi), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(test_2_vec):
++# ifdef USE_AS_STRNCMP
++	/* The first 2 vectors matched.  Return 0 if the maximum offset
++	   (%r11) <= 2 * VEC_SIZE.  */
++	cmpq	$(VEC_SIZE * 2), %r11
++	jbe	L(zero)
++# endif
++	ktestd	%k6, %k6
++	je	L(test_3_vec)
++	kmovd	%k6, %ecx
++	tzcntl	%ecx, %edi
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edi
++# endif
++# ifdef USE_AS_STRNCMP
++	addq	$(VEC_SIZE * 2), %rdi
++	cmpq	%rdi, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rdi), %ecx
++	cmpl	(%rdx, %rdi), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rdi), %eax
++	movzbl	(%rdx, %rdi), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 2)(%rsi, %rdi), %ecx
++	cmpl	(VEC_SIZE * 2)(%rdx, %rdi), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(VEC_SIZE * 2)(%rax, %rdi), %eax
++	movzbl	(VEC_SIZE * 2)(%rdx, %rdi), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(test_3_vec):
++# ifdef USE_AS_STRNCMP
++	/* The first 3 vectors matched.  Return 0 if the maximum offset
++	   (%r11) <= 3 * VEC_SIZE.  */
++	cmpq	$(VEC_SIZE * 3), %r11
++	jbe	L(zero)
++# endif
++	kmovd	%k7, %esi
++	tzcntl	%esi, %ecx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %ecx
++# endif
++# ifdef USE_AS_STRNCMP
++	addq	$(VEC_SIZE * 3), %rcx
++	cmpq	%rcx, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %esi
++	cmpl	(%rdx, %rcx), %esi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 3)(%rsi, %rcx), %esi
++	cmpl	(VEC_SIZE * 3)(%rdx, %rcx), %esi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(VEC_SIZE * 3)(%rax, %rcx), %eax
++	movzbl	(VEC_SIZE * 3)(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(loop_cross_page):
++	xorl	%r10d, %r10d
++	movq	%rdx, %rcx
++	/* Align load via RDX.  We load the extra ECX bytes which should
++	   be ignored.  */
++	andl	$((VEC_SIZE * 4) - 1), %ecx
++	/* R10 is -RCX.  */
++	subq	%rcx, %r10
++
++	/* This works only if VEC_SIZE * 2 == 64. */
++# if (VEC_SIZE * 2) != 64
++#  error (VEC_SIZE * 2) != 64
++# endif
++
++	/* Check if the first VEC_SIZE * 2 bytes should be ignored.  */
++	cmpl	$(VEC_SIZE * 2), %ecx
++	jge	L(loop_cross_page_2_vec)
++
++	VMOVU	(%rax, %r10), %YMM2
++	VMOVU	VEC_SIZE(%rax, %r10), %YMM3
++	VMOVU	(%rdx, %r10), %YMM4
++	VMOVU	VEC_SIZE(%rdx, %r10), %YMM5
++
++	VPCMP	$4, %YMM4, %YMM2, %k0
++	VPCMP	$0, %YMMZERO, %YMM2, %k1
++	VPCMP	$0, %YMMZERO, %YMM4, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch in YMM2 and
++	   YMM4.  */
++	kord	%k0, %k1, %k1
++
++	VPCMP	$4, %YMM5, %YMM3, %k3
++	VPCMP	$0, %YMMZERO, %YMM3, %k4
++	VPCMP	$0, %YMMZERO, %YMM5, %k5
++	kord	%k4, %k5, %k4
++	/* Each bit in K3 represents a NULL or a mismatch in YMM3 and
++	   YMM5.  */
++	kord	%k3, %k4, %k3
++
++# ifdef USE_AS_WCSCMP
++	/* NB: Each bit in K1/K3 represents 4-byte element.  */
++	kshiftlw $8, %k3, %k2
++	/* NB: Divide shift count by 4 since each bit in K1 represents 4
++	   bytes.  */
++	movl	%ecx, %SHIFT_REG32
++	sarl	$2, %SHIFT_REG32
++# else
++	kshiftlq $32, %k3, %k2
++# endif
++
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	korq	%k1, %k2, %k1
++	kmovq	%k1, %rdi
++
++	/* Since ECX < VEC_SIZE * 2, simply skip the first ECX bytes.  */
++	shrxq	%SHIFT_REG64, %rdi, %rdi
++	testq	%rdi, %rdi
++	je	L(loop_cross_page_2_vec)
++	tzcntq	%rdi, %rcx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %ecx
++# endif
++# ifdef USE_AS_STRNCMP
++	cmpq	%rcx, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %edi
++	cmpl	(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %edi
++	cmpl	(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++	.p2align 4
++L(loop_cross_page_2_vec):
++	/* The first VEC_SIZE * 2 bytes match or are ignored.  */
++	VMOVU	(VEC_SIZE * 2)(%rax, %r10), %YMM0
++	VMOVU	(VEC_SIZE * 3)(%rax, %r10), %YMM1
++	VMOVU	(VEC_SIZE * 2)(%rdx, %r10), %YMM2
++	VMOVU	(VEC_SIZE * 3)(%rdx, %r10), %YMM3
++
++	VPCMP	$4, %YMM0, %YMM2, %k0
++	VPCMP	$0, %YMMZERO, %YMM0, %k1
++	VPCMP	$0, %YMMZERO, %YMM2, %k2
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch in YMM0 and
++	   YMM2.  */
++	kord	%k0, %k1, %k1
++
++	VPCMP	$4, %YMM1, %YMM3, %k3
++	VPCMP	$0, %YMMZERO, %YMM1, %k4
++	VPCMP	$0, %YMMZERO, %YMM3, %k5
++	kord	%k4, %k5, %k4
++	/* Each bit in K3 represents a NULL or a mismatch in YMM1 and
++	   YMM3.  */
++	kord	%k3, %k4, %k3
++
++# ifdef USE_AS_WCSCMP
++	/* NB: Each bit in K1/K3 represents 4-byte element.  */
++	kshiftlw $8, %k3, %k2
++# else
++	kshiftlq $32, %k3, %k2
++# endif
++
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	korq	%k1, %k2, %k1
++	kmovq	%k1, %rdi
++
++	xorl	%r8d, %r8d
++	/* If ECX > VEC_SIZE * 2, skip ECX - (VEC_SIZE * 2) bytes.  */
++	subl	$(VEC_SIZE * 2), %ecx
++	jle	1f
++	/* R8 has number of bytes skipped.  */
++	movl	%ecx, %r8d
++# ifdef USE_AS_WCSCMP
++	/* NB: Divide shift count by 4 since each bit in K1 represents 4
++	   bytes.  */
++	sarl	$2, %ecx
++# endif
++	/* Skip ECX bytes.  */
++	shrq	%cl, %rdi
++1:
++	/* Before jumping back to the loop, set ESI to the number of
++	   VEC_SIZE * 4 blocks before page crossing.  */
++	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
++
++	testq	%rdi, %rdi
++# ifdef USE_AS_STRNCMP
++	/* At this point, if %rdi value is 0, it already tested
++	   VEC_SIZE*4+%r10 bytes starting from %rax. This label
++	   checks whether the strncmp maximum offset has been reached.  */
++	je	L(string_nbyte_offset_check)
++# else
++	je	L(back_to_loop)
++# endif
++	tzcntq	%rdi, %rcx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %ecx
++# endif
++	addq	%r10, %rcx
++	/* Adjust for number of bytes skipped.  */
++	addq	%r8, %rcx
++# ifdef USE_AS_STRNCMP
++	addq	$(VEC_SIZE * 2), %rcx
++	subq	%rcx, %r11
++	jbe	L(zero)
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(%rsi, %rcx), %edi
++	cmpl	(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rax, %rcx), %eax
++	movzbl	(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# else
++#  ifdef USE_AS_WCSCMP
++	movq	%rax, %rsi
++	xorl	%eax, %eax
++	movl	(VEC_SIZE * 2)(%rsi, %rcx), %edi
++	cmpl	(VEC_SIZE * 2)(%rdx, %rcx), %edi
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(VEC_SIZE * 2)(%rax, %rcx), %eax
++	movzbl	(VEC_SIZE * 2)(%rdx, %rcx), %edx
++	subl	%edx, %eax
++#  endif
++# endif
++	ret
++
++# ifdef USE_AS_STRNCMP
++L(string_nbyte_offset_check):
++	leaq	(VEC_SIZE * 4)(%r10), %r10
++	cmpq	%r10, %r11
++	jbe	L(zero)
++	jmp	L(back_to_loop)
++# endif
++
++	.p2align 4
++L(cross_page_loop):
++	/* Check one byte/dword at a time.  */
++# ifdef USE_AS_WCSCMP
++	cmpl	%ecx, %eax
++# else
++	subl	%ecx, %eax
++# endif
++	jne	L(different)
++	addl	$SIZE_OF_CHAR, %edx
++	cmpl	$(VEC_SIZE * 4), %edx
++	je	L(main_loop_header)
++# ifdef USE_AS_STRNCMP
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++# ifdef USE_AS_WCSCMP
++	movl	(%rdi, %rdx), %eax
++	movl	(%rsi, %rdx), %ecx
++# else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %ecx
++# endif
++	/* Check null char.  */
++	testl	%eax, %eax
++	jne	L(cross_page_loop)
++	/* Since %eax == 0, subtract is OK for both SIGNED and UNSIGNED
++	   comparisons.  */
++	subl	%ecx, %eax
++# ifndef USE_AS_WCSCMP
++L(different):
++# endif
++	ret
++
++# ifdef USE_AS_WCSCMP
++	.p2align 4
++L(different):
++	/* Use movl to avoid modifying EFLAGS.  */
++	movl	$0, %eax
++	setl	%al
++	negl	%eax
++	orl	$1, %eax
++	ret
++# endif
++
++# ifdef USE_AS_STRNCMP
++	.p2align 4
++L(zero):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(char0):
++#  ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi), %ecx
++	cmpl	(%rsi), %ecx
++	jne	L(wcscmp_return)
++#  else
++	movzbl	(%rsi), %ecx
++	movzbl	(%rdi), %eax
++	subl	%ecx, %eax
++#  endif
++	ret
++# endif
++
++	.p2align 4
++L(last_vector):
++	addq	%rdx, %rdi
++	addq	%rdx, %rsi
++# ifdef USE_AS_STRNCMP
++	subq	%rdx, %r11
++# endif
++	tzcntl	%ecx, %edx
++# ifdef USE_AS_WCSCMP
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	sall	$2, %edx
++# endif
++# ifdef USE_AS_STRNCMP
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++# ifdef USE_AS_WCSCMP
++	xorl	%eax, %eax
++	movl	(%rdi, %rdx), %ecx
++	cmpl	(%rsi, %rdx), %ecx
++	jne	L(wcscmp_return)
++# else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %edx
++	subl	%edx, %eax
++# endif
++	ret
++
++	/* Comparing on page boundary region requires special treatment:
++	   It must be done one vector at a time, starting with the wider
++	   ymm vector if possible, if not, with xmm. If fetching 16 bytes
++	   (xmm) still passes the boundary, byte comparison must be done.
++	 */
++	.p2align 4
++L(cross_page):
++	/* Try one ymm vector at a time.  */
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	jg	L(cross_page_1_vector)
++L(loop_1_vector):
++	VMOVU	(%rdi, %rdx), %YMM0
++	VMOVU	(%rsi, %rdx), %YMM1
++
++	/* Each bit in K0 represents a mismatch in YMM0 and YMM1.  */
++	VPCMP	$4, %YMM0, %YMM1, %k0
++	VPCMP	$0, %YMMZERO, %YMM0, %k1
++	VPCMP	$0, %YMMZERO, %YMM1, %k2
++	/* Each bit in K1 represents a NULL in YMM0 or YMM1.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	kmovd	%k1, %ecx
++	testl	%ecx, %ecx
++	jne	L(last_vector)
++
++	addl	$VEC_SIZE, %edx
++
++	addl	$VEC_SIZE, %eax
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the current offset (%rdx) >= the maximum offset
++	   (%r11).  */
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	jle	L(loop_1_vector)
++L(cross_page_1_vector):
++	/* Less than 32 bytes to check, try one xmm vector.  */
++	cmpl	$(PAGE_SIZE - 16), %eax
++	jg	L(cross_page_1_xmm)
++	VMOVU	(%rdi, %rdx), %XMM0
++	VMOVU	(%rsi, %rdx), %XMM1
++
++	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
++	VPCMP	$4, %XMM0, %XMM1, %k0
++	VPCMP	$0, %XMMZERO, %XMM0, %k1
++	VPCMP	$0, %XMMZERO, %XMM1, %k2
++	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
++	korw	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	korw	%k0, %k1, %k1
++	kmovw	%k1, %ecx
++	testl	%ecx, %ecx
++	jne	L(last_vector)
++
++	addl	$16, %edx
++# ifndef USE_AS_WCSCMP
++	addl	$16, %eax
++# endif
++# ifdef USE_AS_STRNCMP
++	/* Return 0 if the current offset (%rdx) >= the maximum offset
++	   (%r11).  */
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++
++L(cross_page_1_xmm):
++# ifndef USE_AS_WCSCMP
++	/* Less than 16 bytes to check, try 8 byte vector.  NB: No need
++	   for wcscmp nor wcsncmp since wide char is 4 bytes.   */
++	cmpl	$(PAGE_SIZE - 8), %eax
++	jg	L(cross_page_8bytes)
++	vmovq	(%rdi, %rdx), %XMM0
++	vmovq	(%rsi, %rdx), %XMM1
++
++	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
++	VPCMP	$4, %XMM0, %XMM1, %k0
++	VPCMP	$0, %XMMZERO, %XMM0, %k1
++	VPCMP	$0, %XMMZERO, %XMM1, %k2
++	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	kmovd	%k1, %ecx
++
++# ifdef USE_AS_WCSCMP
++	/* Only last 2 bits are valid.  */
++	andl	$0x3, %ecx
++# else
++	/* Only last 8 bits are valid.  */
++	andl	$0xff, %ecx
++# endif
++
++	testl	%ecx, %ecx
++	jne	L(last_vector)
++
++	addl	$8, %edx
++	addl	$8, %eax
++#  ifdef USE_AS_STRNCMP
++	/* Return 0 if the current offset (%rdx) >= the maximum offset
++	   (%r11).  */
++	cmpq	%r11, %rdx
++	jae	L(zero)
++#  endif
++
++L(cross_page_8bytes):
++	/* Less than 8 bytes to check, try 4 byte vector.  */
++	cmpl	$(PAGE_SIZE - 4), %eax
++	jg	L(cross_page_4bytes)
++	vmovd	(%rdi, %rdx), %XMM0
++	vmovd	(%rsi, %rdx), %XMM1
++
++	/* Each bit in K0 represents a mismatch in XMM0 and XMM1.  */
++	VPCMP	$4, %XMM0, %XMM1, %k0
++	VPCMP	$0, %XMMZERO, %XMM0, %k1
++	VPCMP	$0, %XMMZERO, %XMM1, %k2
++	/* Each bit in K1 represents a NULL in XMM0 or XMM1.  */
++	kord	%k1, %k2, %k1
++	/* Each bit in K1 represents a NULL or a mismatch.  */
++	kord	%k0, %k1, %k1
++	kmovd	%k1, %ecx
++
++# ifdef USE_AS_WCSCMP
++	/* Only the last bit is valid.  */
++	andl	$0x1, %ecx
++# else
++	/* Only last 4 bits are valid.  */
++	andl	$0xf, %ecx
++# endif
++
++	testl	%ecx, %ecx
++	jne	L(last_vector)
++
++	addl	$4, %edx
++#  ifdef USE_AS_STRNCMP
++	/* Return 0 if the current offset (%rdx) >= the maximum offset
++	   (%r11).  */
++	cmpq	%r11, %rdx
++	jae	L(zero)
++#  endif
++
++L(cross_page_4bytes):
++# endif
++	/* Less than 4 bytes to check, try one byte/dword at a time.  */
++# ifdef USE_AS_STRNCMP
++	cmpq	%r11, %rdx
++	jae	L(zero)
++# endif
++# ifdef USE_AS_WCSCMP
++	movl	(%rdi, %rdx), %eax
++	movl	(%rsi, %rdx), %ecx
++# else
++	movzbl	(%rdi, %rdx), %eax
++	movzbl	(%rsi, %rdx), %ecx
++# endif
++	testl	%eax, %eax
++	jne	L(cross_page_loop)
++	subl	%ecx, %eax
++	ret
++END (STRCMP)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/strcmp.c b/sysdeps/x86_64/multiarch/strcmp.c
+index 4db7332ac1..358fa90152 100644
+--- a/sysdeps/x86_64/multiarch/strcmp.c
++++ b/sysdeps/x86_64/multiarch/strcmp.c
+@@ -30,16 +30,29 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
+ 
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable)
++	  && CPU_FEATURES_CPU_P (cpu_features, BMI2)
++	  && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
+ 
+   if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load))
+     return OPTIMIZE (sse2_unaligned);
+diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S
+new file mode 100644
+index 0000000000..c2c581ecf7
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcpy-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRCPY
++# define STRCPY __strcpy_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strcpy-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
+index 3f2f9e8170..1ce17253ab 100644
+--- a/sysdeps/x86_64/multiarch/strcpy-avx2.S
++++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
+@@ -37,6 +37,10 @@
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ /* zero register */
+ #define xmmZ	xmm0
+ #define ymmZ	ymm0
+@@ -46,7 +50,7 @@
+ 
+ # ifndef USE_AS_STRCAT
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRCPY)
+ #  ifdef USE_AS_STRNCPY
+ 	mov	%RDX_LP, %R8_LP
+@@ -369,8 +373,8 @@ L(CopyVecSizeExit):
+ 	lea	1(%rdi), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(CopyTwoVecSize1):
+@@ -553,8 +557,7 @@ L(Exit1):
+ 	lea	2(%rdi), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit2):
+@@ -569,8 +572,7 @@ L(Exit2):
+ 	lea	3(%rdi), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit3):
+@@ -584,8 +586,7 @@ L(Exit3):
+ 	lea	4(%rdi), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit4_7):
+@@ -602,8 +603,7 @@ L(Exit4_7):
+ 	lea	1(%rdi, %rdx), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit8_15):
+@@ -620,8 +620,7 @@ L(Exit8_15):
+ 	lea	1(%rdi, %rdx), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit16_31):
+@@ -638,8 +637,7 @@ L(Exit16_31):
+ 	lea 1(%rdi, %rdx), %rdi
+ 	jnz L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Exit32_63):
+@@ -656,8 +654,7 @@ L(Exit32_63):
+ 	lea	1(%rdi, %rdx), %rdi
+ 	jnz	L(StrncpyFillTailWithZero)
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ # ifdef USE_AS_STRNCPY
+ 
+@@ -671,8 +668,7 @@ L(StrncpyExit1):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, 1(%rdi)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit2):
+@@ -684,8 +680,7 @@ L(StrncpyExit2):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, 2(%rdi)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit3_4):
+@@ -699,8 +694,7 @@ L(StrncpyExit3_4):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi, %r8)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit5_8):
+@@ -714,8 +708,7 @@ L(StrncpyExit5_8):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi, %r8)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit9_16):
+@@ -729,8 +722,7 @@ L(StrncpyExit9_16):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi, %r8)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit17_32):
+@@ -744,8 +736,7 @@ L(StrncpyExit17_32):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi, %r8)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit33_64):
+@@ -760,8 +751,7 @@ L(StrncpyExit33_64):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi, %r8)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(StrncpyExit65):
+@@ -778,50 +768,43 @@ L(StrncpyExit65):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, 65(%rdi)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ #  ifndef USE_AS_STRCAT
+ 
+ 	.p2align 4
+ L(Fill1):
+ 	mov	%dl, (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Fill2):
+ 	mov	%dx, (%rdi)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Fill3_4):
+ 	mov	%dx, (%rdi)
+ 	mov     %dx, -2(%rdi, %r8)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Fill5_8):
+ 	mov	%edx, (%rdi)
+ 	mov     %edx, -4(%rdi, %r8)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Fill9_16):
+ 	mov	%rdx, (%rdi)
+ 	mov	%rdx, -8(%rdi, %r8)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(Fill17_32):
+ 	vmovdqu %xmmZ, (%rdi)
+ 	vmovdqu %xmmZ, -16(%rdi, %r8)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(CopyVecSizeUnalignedVec2):
+@@ -898,8 +881,7 @@ L(Fill):
+ 	cmp	$1, %r8d
+ 	ja	L(Fill2)
+ 	je	L(Fill1)
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ /* end of ifndef USE_AS_STRCAT */
+ #  endif
+@@ -929,8 +911,7 @@ L(UnalignedFourVecSizeLeaveCase3):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (VEC_SIZE * 4)(%rdi)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(UnalignedFourVecSizeLeaveCase2):
+@@ -1001,16 +982,14 @@ L(StrncpyExit):
+ #  ifdef USE_AS_STRCAT
+ 	movb	$0, (%rdi)
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(ExitZero):
+ #  ifndef USE_AS_STRCAT
+ 	mov	%rdi, %rax
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ # endif
+ 
+diff --git a/sysdeps/x86_64/multiarch/strcpy-evex.S b/sysdeps/x86_64/multiarch/strcpy-evex.S
+new file mode 100644
+index 0000000000..a343a1a692
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strcpy-evex.S
+@@ -0,0 +1,1003 @@
++/* strcpy with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# ifndef USE_AS_STRCAT
++#  include <sysdep.h>
++
++#  ifndef STRCPY
++#   define STRCPY  __strcpy_evex
++#  endif
++
++# endif
++
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++
++/* Number of bytes in a vector register */
++# ifndef VEC_SIZE
++#  define VEC_SIZE	32
++# endif
++
++# define XMM2		xmm18
++# define XMM3		xmm19
++
++# define YMM2		ymm18
++# define YMM3		ymm19
++# define YMM4		ymm20
++# define YMM5		ymm21
++# define YMM6		ymm22
++# define YMM7		ymm23
++
++# ifndef USE_AS_STRCAT
++
++/* zero register */
++#  define XMMZERO	xmm16
++#  define YMMZERO	ymm16
++#  define YMM1		ymm17
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRCPY)
++#  ifdef USE_AS_STRNCPY
++	mov	%RDX_LP, %R8_LP
++	test	%R8_LP, %R8_LP
++	jz	L(ExitZero)
++#  endif
++	mov	%rsi, %rcx
++#  ifndef USE_AS_STPCPY
++	mov	%rdi, %rax      /* save result */
++#  endif
++
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++# endif
++
++	and	$((VEC_SIZE * 4) - 1), %ecx
++	cmp	$(VEC_SIZE * 2), %ecx
++	jbe	L(SourceStringAlignmentLessTwoVecSize)
++
++	and	$-VEC_SIZE, %rsi
++	and	$(VEC_SIZE - 1), %ecx
++
++	vpcmpb	$0, (%rsi), %YMMZERO, %k0
++	kmovd	%k0, %edx
++	shr	%cl, %rdx
++
++# ifdef USE_AS_STRNCPY
++#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
++	mov	$VEC_SIZE, %r10
++	sub	%rcx, %r10
++	cmp	%r10, %r8
++#  else
++	mov	$(VEC_SIZE + 1), %r10
++	sub	%rcx, %r10
++	cmp	%r10, %r8
++#  endif
++	jbe	L(CopyVecSizeTailCase2OrCase3)
++# endif
++	test	%edx, %edx
++	jnz	L(CopyVecSizeTail)
++
++	vpcmpb	$0, VEC_SIZE(%rsi), %YMMZERO, %k1
++	kmovd	%k1, %edx
++
++# ifdef USE_AS_STRNCPY
++	add	$VEC_SIZE, %r10
++	cmp	%r10, %r8
++	jbe	L(CopyTwoVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++	jnz	L(CopyTwoVecSize)
++
++	VMOVU	(%rsi, %rcx), %YMM2   /* copy VEC_SIZE bytes */
++	VMOVU	%YMM2, (%rdi)
++
++/* If source address alignment != destination address alignment */
++	.p2align 4
++L(UnalignVecSizeBoth):
++	sub	%rcx, %rdi
++# ifdef USE_AS_STRNCPY
++	add	%rcx, %r8
++	sbb	%rcx, %rcx
++	or	%rcx, %r8
++# endif
++	mov	$VEC_SIZE, %rcx
++	VMOVA	(%rsi, %rcx), %YMM2
++	VMOVU	%YMM2, (%rdi, %rcx)
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM2
++	vpcmpb	$0, %YMM2, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$(VEC_SIZE * 3), %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec2)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVU	%YMM2, (%rdi, %rcx)
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM3
++	vpcmpb	$0, %YMM3, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec3)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVU	%YMM3, (%rdi, %rcx)
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM4
++	vpcmpb	$0, %YMM4, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec4)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVU	%YMM4, (%rdi, %rcx)
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM2
++	vpcmpb	$0, %YMM2, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec2)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVU	%YMM2, (%rdi, %rcx)
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM2
++	vpcmpb	$0, %YMM2, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec2)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVA	VEC_SIZE(%rsi, %rcx), %YMM3
++	VMOVU	%YMM2, (%rdi, %rcx)
++	vpcmpb	$0, %YMM3, %YMMZERO, %k0
++	kmovd	%k0, %edx
++	add	$VEC_SIZE, %rcx
++# ifdef USE_AS_STRNCPY
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++# endif
++	test	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec3)
++# else
++	jnz	L(CopyVecSize)
++# endif
++
++	VMOVU	%YMM3, (%rdi, %rcx)
++	mov	%rsi, %rdx
++	lea	VEC_SIZE(%rsi, %rcx), %rsi
++	and	$-(VEC_SIZE * 4), %rsi
++	sub	%rsi, %rdx
++	sub	%rdx, %rdi
++# ifdef USE_AS_STRNCPY
++	lea	(VEC_SIZE * 8)(%r8, %rdx), %r8
++# endif
++L(UnalignedFourVecSizeLoop):
++	VMOVA	(%rsi), %YMM4
++	VMOVA	VEC_SIZE(%rsi), %YMM5
++	VMOVA	(VEC_SIZE * 2)(%rsi), %YMM6
++	VMOVA	(VEC_SIZE * 3)(%rsi), %YMM7
++	vpminub	%YMM5, %YMM4, %YMM2
++	vpminub	%YMM7, %YMM6, %YMM3
++	vpminub	%YMM2, %YMM3, %YMM2
++	/* If K7 != 0, there is a null byte.  */
++	vpcmpb	$0, %YMM2, %YMMZERO, %k7
++	kmovd	%k7, %edx
++# ifdef USE_AS_STRNCPY
++	sub	$(VEC_SIZE * 4), %r8
++	jbe	L(UnalignedLeaveCase2OrCase3)
++# endif
++	test	%edx, %edx
++	jnz	L(UnalignedFourVecSizeLeave)
++
++L(UnalignedFourVecSizeLoop_start):
++	add	$(VEC_SIZE * 4), %rdi
++	add	$(VEC_SIZE * 4), %rsi
++	VMOVU	%YMM4, -(VEC_SIZE * 4)(%rdi)
++	VMOVA	(%rsi), %YMM4
++	VMOVU	%YMM5, -(VEC_SIZE * 3)(%rdi)
++	VMOVA	VEC_SIZE(%rsi), %YMM5
++	vpminub	%YMM5, %YMM4, %YMM2
++	VMOVU	%YMM6, -(VEC_SIZE * 2)(%rdi)
++	VMOVA	(VEC_SIZE * 2)(%rsi), %YMM6
++	VMOVU	%YMM7, -VEC_SIZE(%rdi)
++	VMOVA	(VEC_SIZE * 3)(%rsi), %YMM7
++	vpminub	%YMM7, %YMM6, %YMM3
++	vpminub	%YMM2, %YMM3, %YMM2
++	/* If K7 != 0, there is a null byte.  */
++	vpcmpb	$0, %YMM2, %YMMZERO, %k7
++	kmovd	%k7, %edx
++# ifdef USE_AS_STRNCPY
++	sub	$(VEC_SIZE * 4), %r8
++	jbe	L(UnalignedLeaveCase2OrCase3)
++# endif
++	test	%edx, %edx
++	jz	L(UnalignedFourVecSizeLoop_start)
++
++L(UnalignedFourVecSizeLeave):
++	vpcmpb	$0, %YMM4, %YMMZERO, %k1
++	kmovd	%k1, %edx
++	test	%edx, %edx
++	jnz	L(CopyVecSizeUnaligned_0)
++
++	vpcmpb	$0, %YMM5, %YMMZERO, %k2
++	kmovd	%k2, %ecx
++	test	%ecx, %ecx
++	jnz	L(CopyVecSizeUnaligned_16)
++
++	vpcmpb	$0, %YMM6, %YMMZERO, %k3
++	kmovd	%k3, %edx
++	test	%edx, %edx
++	jnz	L(CopyVecSizeUnaligned_32)
++
++	vpcmpb	$0, %YMM7, %YMMZERO, %k4
++	kmovd	%k4, %ecx
++	bsf	%ecx, %edx
++	VMOVU	%YMM4, (%rdi)
++	VMOVU	%YMM5, VEC_SIZE(%rdi)
++	VMOVU	%YMM6, (VEC_SIZE * 2)(%rdi)
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++# ifdef USE_AS_STPCPY
++	lea	(VEC_SIZE * 3)(%rdi, %rdx), %rax
++# endif
++	VMOVU	%YMM7, (VEC_SIZE * 3)(%rdi)
++	add	$(VEC_SIZE - 1), %r8
++	sub	%rdx, %r8
++	lea	((VEC_SIZE * 3) + 1)(%rdi, %rdx), %rdi
++	jmp	L(StrncpyFillTailWithZero)
++# else
++	add	$(VEC_SIZE * 3), %rsi
++	add	$(VEC_SIZE * 3), %rdi
++	jmp	L(CopyVecSizeExit)
++# endif
++
++/* If source address alignment == destination address alignment */
++
++L(SourceStringAlignmentLessTwoVecSize):
++	VMOVU	(%rsi), %YMM3
++	VMOVU	VEC_SIZE(%rsi), %YMM2
++	vpcmpb	$0, %YMM3, %YMMZERO, %k0
++	kmovd	%k0, %edx
++
++# ifdef USE_AS_STRNCPY
++#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
++	cmp	$VEC_SIZE, %r8
++#  else
++	cmp	$(VEC_SIZE + 1), %r8
++#  endif
++	jbe	L(CopyVecSizeTail1Case2OrCase3)
++# endif
++	test	%edx, %edx
++	jnz	L(CopyVecSizeTail1)
++
++	VMOVU	%YMM3, (%rdi)
++	vpcmpb	$0, %YMM2, %YMMZERO, %k0
++	kmovd	%k0, %edx
++
++# ifdef USE_AS_STRNCPY
++#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
++	cmp	$(VEC_SIZE * 2), %r8
++#  else
++	cmp	$((VEC_SIZE * 2) + 1), %r8
++#  endif
++	jbe	L(CopyTwoVecSize1Case2OrCase3)
++# endif
++	test	%edx, %edx
++	jnz	L(CopyTwoVecSize1)
++
++	and	$-VEC_SIZE, %rsi
++	and	$(VEC_SIZE - 1), %ecx
++	jmp	L(UnalignVecSizeBoth)
++
++/*------End of main part with loops---------------------*/
++
++/* Case1 */
++
++# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
++	.p2align 4
++L(CopyVecSize):
++	add	%rcx, %rdi
++# endif
++L(CopyVecSizeTail):
++	add	%rcx, %rsi
++L(CopyVecSizeTail1):
++	bsf	%edx, %edx
++L(CopyVecSizeExit):
++	cmp	$32, %edx
++	jae	L(Exit32_63)
++	cmp	$16, %edx
++	jae	L(Exit16_31)
++	cmp	$8, %edx
++	jae	L(Exit8_15)
++	cmp	$4, %edx
++	jae	L(Exit4_7)
++	cmp	$3, %edx
++	je	L(Exit3)
++	cmp	$1, %edx
++	ja	L(Exit2)
++	je	L(Exit1)
++	movb	$0, (%rdi)
++# ifdef USE_AS_STPCPY
++	lea	(%rdi), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	$1, %r8
++	lea	1(%rdi), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(CopyTwoVecSize1):
++	add	$VEC_SIZE, %rsi
++	add	$VEC_SIZE, %rdi
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	$VEC_SIZE, %r8
++# endif
++	jmp	L(CopyVecSizeTail1)
++
++	.p2align 4
++L(CopyTwoVecSize):
++	bsf	%edx, %edx
++	add	%rcx, %rsi
++	add	$VEC_SIZE, %edx
++	sub	%ecx, %edx
++	jmp	L(CopyVecSizeExit)
++
++	.p2align 4
++L(CopyVecSizeUnaligned_0):
++	bsf	%edx, %edx
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++# ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++# endif
++	VMOVU	%YMM4, (%rdi)
++	add	$((VEC_SIZE * 4) - 1), %r8
++	sub	%rdx, %r8
++	lea	1(%rdi, %rdx), %rdi
++	jmp	L(StrncpyFillTailWithZero)
++# else
++	jmp	L(CopyVecSizeExit)
++# endif
++
++	.p2align 4
++L(CopyVecSizeUnaligned_16):
++	bsf	%ecx, %edx
++	VMOVU	%YMM4, (%rdi)
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++# ifdef USE_AS_STPCPY
++	lea	VEC_SIZE(%rdi, %rdx), %rax
++# endif
++	VMOVU	%YMM5, VEC_SIZE(%rdi)
++	add	$((VEC_SIZE * 3) - 1), %r8
++	sub	%rdx, %r8
++	lea	(VEC_SIZE + 1)(%rdi, %rdx), %rdi
++	jmp	L(StrncpyFillTailWithZero)
++# else
++	add	$VEC_SIZE, %rsi
++	add	$VEC_SIZE, %rdi
++	jmp	L(CopyVecSizeExit)
++# endif
++
++	.p2align 4
++L(CopyVecSizeUnaligned_32):
++	bsf	%edx, %edx
++	VMOVU	%YMM4, (%rdi)
++	VMOVU	%YMM5, VEC_SIZE(%rdi)
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++# ifdef USE_AS_STPCPY
++	lea	(VEC_SIZE * 2)(%rdi, %rdx), %rax
++# endif
++	VMOVU	%YMM6, (VEC_SIZE * 2)(%rdi)
++	add	$((VEC_SIZE * 2) - 1), %r8
++	sub	%rdx, %r8
++	lea	((VEC_SIZE * 2) + 1)(%rdi, %rdx), %rdi
++	jmp	L(StrncpyFillTailWithZero)
++# else
++	add	$(VEC_SIZE * 2), %rsi
++	add	$(VEC_SIZE * 2), %rdi
++	jmp	L(CopyVecSizeExit)
++# endif
++
++# ifdef USE_AS_STRNCPY
++#  ifndef USE_AS_STRCAT
++	.p2align 4
++L(CopyVecSizeUnalignedVec6):
++	VMOVU	%YMM6, (%rdi, %rcx)
++	jmp	L(CopyVecSizeVecExit)
++
++	.p2align 4
++L(CopyVecSizeUnalignedVec5):
++	VMOVU	%YMM5, (%rdi, %rcx)
++	jmp	L(CopyVecSizeVecExit)
++
++	.p2align 4
++L(CopyVecSizeUnalignedVec4):
++	VMOVU	%YMM4, (%rdi, %rcx)
++	jmp	L(CopyVecSizeVecExit)
++
++	.p2align 4
++L(CopyVecSizeUnalignedVec3):
++	VMOVU	%YMM3, (%rdi, %rcx)
++	jmp	L(CopyVecSizeVecExit)
++#  endif
++
++/* Case2 */
++
++	.p2align 4
++L(CopyVecSizeCase2):
++	add	$VEC_SIZE, %r8
++	add	%rcx, %rdi
++	add	%rcx, %rsi
++	bsf	%edx, %edx
++	cmp	%r8d, %edx
++	jb	L(CopyVecSizeExit)
++	jmp	L(StrncpyExit)
++
++	.p2align 4
++L(CopyTwoVecSizeCase2):
++	add	%rcx, %rsi
++	bsf	%edx, %edx
++	add	$VEC_SIZE, %edx
++	sub	%ecx, %edx
++	cmp	%r8d, %edx
++	jb	L(CopyVecSizeExit)
++	jmp	L(StrncpyExit)
++
++L(CopyVecSizeTailCase2):
++	add	%rcx, %rsi
++	bsf	%edx, %edx
++	cmp	%r8d, %edx
++	jb	L(CopyVecSizeExit)
++	jmp	L(StrncpyExit)
++
++L(CopyVecSizeTail1Case2):
++	bsf	%edx, %edx
++	cmp	%r8d, %edx
++	jb	L(CopyVecSizeExit)
++	jmp	L(StrncpyExit)
++
++/* Case2 or Case3,  Case3 */
++
++	.p2align 4
++L(CopyVecSizeCase2OrCase3):
++	test	%rdx, %rdx
++	jnz	L(CopyVecSizeCase2)
++L(CopyVecSizeCase3):
++	add	$VEC_SIZE, %r8
++	add	%rcx, %rdi
++	add	%rcx, %rsi
++	jmp	L(StrncpyExit)
++
++	.p2align 4
++L(CopyTwoVecSizeCase2OrCase3):
++	test	%rdx, %rdx
++	jnz	L(CopyTwoVecSizeCase2)
++	add	%rcx, %rsi
++	jmp	L(StrncpyExit)
++
++	.p2align 4
++L(CopyVecSizeTailCase2OrCase3):
++	test	%rdx, %rdx
++	jnz	L(CopyVecSizeTailCase2)
++	add	%rcx, %rsi
++	jmp	L(StrncpyExit)
++
++	.p2align 4
++L(CopyTwoVecSize1Case2OrCase3):
++	add	$VEC_SIZE, %rdi
++	add	$VEC_SIZE, %rsi
++	sub	$VEC_SIZE, %r8
++L(CopyVecSizeTail1Case2OrCase3):
++	test	%rdx, %rdx
++	jnz	L(CopyVecSizeTail1Case2)
++	jmp	L(StrncpyExit)
++# endif
++
++/*------------End of labels for copying 1-VEC_SIZE bytes and 1-(VEC_SIZE*2) bytes----*/
++
++	.p2align 4
++L(Exit1):
++	movzwl	(%rsi), %edx
++	mov	%dx, (%rdi)
++# ifdef USE_AS_STPCPY
++	lea	1(%rdi), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	$2, %r8
++	lea	2(%rdi), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit2):
++	movzwl	(%rsi), %ecx
++	mov	%cx, (%rdi)
++	movb	$0, 2(%rdi)
++# ifdef USE_AS_STPCPY
++	lea	2(%rdi), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	$3, %r8
++	lea	3(%rdi), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit3):
++	mov	(%rsi), %edx
++	mov	%edx, (%rdi)
++# ifdef USE_AS_STPCPY
++	lea	3(%rdi), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	$4, %r8
++	lea	4(%rdi), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit4_7):
++	mov	(%rsi), %ecx
++	mov	%ecx, (%rdi)
++	mov	-3(%rsi, %rdx), %ecx
++	mov	%ecx, -3(%rdi, %rdx)
++# ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	%rdx, %r8
++	sub	$1, %r8
++	lea	1(%rdi, %rdx), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit8_15):
++	mov	(%rsi), %rcx
++	mov	-7(%rsi, %rdx), %r9
++	mov	%rcx, (%rdi)
++	mov	%r9, -7(%rdi, %rdx)
++# ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	%rdx, %r8
++	sub	$1, %r8
++	lea	1(%rdi, %rdx), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit16_31):
++	VMOVU	(%rsi), %XMM2
++	VMOVU	-15(%rsi, %rdx), %XMM3
++	VMOVU	%XMM2, (%rdi)
++	VMOVU	%XMM3, -15(%rdi, %rdx)
++# ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub %rdx, %r8
++	sub $1, %r8
++	lea 1(%rdi, %rdx), %rdi
++	jnz L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++	.p2align 4
++L(Exit32_63):
++	VMOVU	(%rsi), %YMM2
++	VMOVU	-31(%rsi, %rdx), %YMM3
++	VMOVU	%YMM2, (%rdi)
++	VMOVU	%YMM3, -31(%rdi, %rdx)
++# ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++# endif
++# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
++	sub	%rdx, %r8
++	sub	$1, %r8
++	lea	1(%rdi, %rdx), %rdi
++	jnz	L(StrncpyFillTailWithZero)
++# endif
++	ret
++
++# ifdef USE_AS_STRNCPY
++
++	.p2align 4
++L(StrncpyExit1):
++	movzbl	(%rsi), %edx
++	mov	%dl, (%rdi)
++#  ifdef USE_AS_STPCPY
++	lea	1(%rdi), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, 1(%rdi)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit2):
++	movzwl	(%rsi), %edx
++	mov	%dx, (%rdi)
++#  ifdef USE_AS_STPCPY
++	lea	2(%rdi), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, 2(%rdi)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit3_4):
++	movzwl	(%rsi), %ecx
++	movzwl	-2(%rsi, %r8), %edx
++	mov	%cx, (%rdi)
++	mov	%dx, -2(%rdi, %r8)
++#  ifdef USE_AS_STPCPY
++	lea	(%rdi, %r8), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi, %r8)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit5_8):
++	mov	(%rsi), %ecx
++	mov	-4(%rsi, %r8), %edx
++	mov	%ecx, (%rdi)
++	mov	%edx, -4(%rdi, %r8)
++#  ifdef USE_AS_STPCPY
++	lea	(%rdi, %r8), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi, %r8)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit9_16):
++	mov	(%rsi), %rcx
++	mov	-8(%rsi, %r8), %rdx
++	mov	%rcx, (%rdi)
++	mov	%rdx, -8(%rdi, %r8)
++#  ifdef USE_AS_STPCPY
++	lea	(%rdi, %r8), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi, %r8)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit17_32):
++	VMOVU	(%rsi), %XMM2
++	VMOVU	-16(%rsi, %r8), %XMM3
++	VMOVU	%XMM2, (%rdi)
++	VMOVU	%XMM3, -16(%rdi, %r8)
++#  ifdef USE_AS_STPCPY
++	lea	(%rdi, %r8), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi, %r8)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit33_64):
++	/*  0/32, 31/16 */
++	VMOVU	(%rsi), %YMM2
++	VMOVU	-VEC_SIZE(%rsi, %r8), %YMM3
++	VMOVU	%YMM2, (%rdi)
++	VMOVU	%YMM3, -VEC_SIZE(%rdi, %r8)
++#  ifdef USE_AS_STPCPY
++	lea	(%rdi, %r8), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi, %r8)
++#  endif
++	ret
++
++	.p2align 4
++L(StrncpyExit65):
++	/* 0/32, 32/32, 64/1 */
++	VMOVU	(%rsi), %YMM2
++	VMOVU	32(%rsi), %YMM3
++	mov	64(%rsi), %cl
++	VMOVU	%YMM2, (%rdi)
++	VMOVU	%YMM3, 32(%rdi)
++	mov	%cl, 64(%rdi)
++#  ifdef USE_AS_STPCPY
++	lea	65(%rdi), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, 65(%rdi)
++#  endif
++	ret
++
++#  ifndef USE_AS_STRCAT
++
++	.p2align 4
++L(Fill1):
++	mov	%dl, (%rdi)
++	ret
++
++	.p2align 4
++L(Fill2):
++	mov	%dx, (%rdi)
++	ret
++
++	.p2align 4
++L(Fill3_4):
++	mov	%dx, (%rdi)
++	mov     %dx, -2(%rdi, %r8)
++	ret
++
++	.p2align 4
++L(Fill5_8):
++	mov	%edx, (%rdi)
++	mov     %edx, -4(%rdi, %r8)
++	ret
++
++	.p2align 4
++L(Fill9_16):
++	mov	%rdx, (%rdi)
++	mov	%rdx, -8(%rdi, %r8)
++	ret
++
++	.p2align 4
++L(Fill17_32):
++	VMOVU	%XMMZERO, (%rdi)
++	VMOVU	%XMMZERO, -16(%rdi, %r8)
++	ret
++
++	.p2align 4
++L(CopyVecSizeUnalignedVec2):
++	VMOVU	%YMM2, (%rdi, %rcx)
++
++	.p2align 4
++L(CopyVecSizeVecExit):
++	bsf	%edx, %edx
++	add	$(VEC_SIZE - 1), %r8
++	add	%rcx, %rdi
++#   ifdef USE_AS_STPCPY
++	lea	(%rdi, %rdx), %rax
++#   endif
++	sub	%rdx, %r8
++	lea	1(%rdi, %rdx), %rdi
++
++	.p2align 4
++L(StrncpyFillTailWithZero):
++	xor	%edx, %edx
++	sub	$VEC_SIZE, %r8
++	jbe	L(StrncpyFillExit)
++
++	VMOVU	%YMMZERO, (%rdi)
++	add	$VEC_SIZE, %rdi
++
++	mov	%rdi, %rsi
++	and	$(VEC_SIZE - 1), %esi
++	sub	%rsi, %rdi
++	add	%rsi, %r8
++	sub	$(VEC_SIZE * 4), %r8
++	jb	L(StrncpyFillLessFourVecSize)
++
++L(StrncpyFillLoopVmovdqa):
++	VMOVA	%YMMZERO, (%rdi)
++	VMOVA	%YMMZERO, VEC_SIZE(%rdi)
++	VMOVA	%YMMZERO, (VEC_SIZE * 2)(%rdi)
++	VMOVA	%YMMZERO, (VEC_SIZE * 3)(%rdi)
++	add	$(VEC_SIZE * 4), %rdi
++	sub	$(VEC_SIZE * 4), %r8
++	jae	L(StrncpyFillLoopVmovdqa)
++
++L(StrncpyFillLessFourVecSize):
++	add	$(VEC_SIZE * 2), %r8
++	jl	L(StrncpyFillLessTwoVecSize)
++	VMOVA	%YMMZERO, (%rdi)
++	VMOVA	%YMMZERO, VEC_SIZE(%rdi)
++	add	$(VEC_SIZE * 2), %rdi
++	sub	$VEC_SIZE, %r8
++	jl	L(StrncpyFillExit)
++	VMOVA	%YMMZERO, (%rdi)
++	add	$VEC_SIZE, %rdi
++	jmp	L(Fill)
++
++	.p2align 4
++L(StrncpyFillLessTwoVecSize):
++	add	$VEC_SIZE, %r8
++	jl	L(StrncpyFillExit)
++	VMOVA	%YMMZERO, (%rdi)
++	add	$VEC_SIZE, %rdi
++	jmp	L(Fill)
++
++	.p2align 4
++L(StrncpyFillExit):
++	add	$VEC_SIZE, %r8
++L(Fill):
++	cmp	$17, %r8d
++	jae	L(Fill17_32)
++	cmp	$9, %r8d
++	jae	L(Fill9_16)
++	cmp	$5, %r8d
++	jae	L(Fill5_8)
++	cmp	$3, %r8d
++	jae	L(Fill3_4)
++	cmp	$1, %r8d
++	ja	L(Fill2)
++	je	L(Fill1)
++	ret
++
++/* end of ifndef USE_AS_STRCAT */
++#  endif
++
++	.p2align 4
++L(UnalignedLeaveCase2OrCase3):
++	test	%rdx, %rdx
++	jnz	L(UnalignedFourVecSizeLeaveCase2)
++L(UnalignedFourVecSizeLeaveCase3):
++	lea	(VEC_SIZE * 4)(%r8), %rcx
++	and	$-VEC_SIZE, %rcx
++	add	$(VEC_SIZE * 3), %r8
++	jl	L(CopyVecSizeCase3)
++	VMOVU	%YMM4, (%rdi)
++	sub	$VEC_SIZE, %r8
++	jb	L(CopyVecSizeCase3)
++	VMOVU	%YMM5, VEC_SIZE(%rdi)
++	sub	$VEC_SIZE, %r8
++	jb	L(CopyVecSizeCase3)
++	VMOVU	%YMM6, (VEC_SIZE * 2)(%rdi)
++	sub	$VEC_SIZE, %r8
++	jb	L(CopyVecSizeCase3)
++	VMOVU	%YMM7, (VEC_SIZE * 3)(%rdi)
++#  ifdef USE_AS_STPCPY
++	lea	(VEC_SIZE * 4)(%rdi), %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (VEC_SIZE * 4)(%rdi)
++#  endif
++	ret
++
++	.p2align 4
++L(UnalignedFourVecSizeLeaveCase2):
++	xor	%ecx, %ecx
++	vpcmpb	$0, %YMM4, %YMMZERO, %k1
++	kmovd	%k1, %edx
++	add	$(VEC_SIZE * 3), %r8
++	jle	L(CopyVecSizeCase2OrCase3)
++	test	%edx, %edx
++#  ifndef USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec4)
++#  else
++	jnz	L(CopyVecSize)
++#  endif
++	vpcmpb	$0, %YMM5, %YMMZERO, %k2
++	kmovd	%k2, %edx
++	VMOVU	%YMM4, (%rdi)
++	add	$VEC_SIZE, %rcx
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++	test	%edx, %edx
++#  ifndef USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec5)
++#  else
++	jnz	L(CopyVecSize)
++#  endif
++
++	vpcmpb	$0, %YMM6, %YMMZERO, %k3
++	kmovd	%k3, %edx
++	VMOVU	%YMM5, VEC_SIZE(%rdi)
++	add	$VEC_SIZE, %rcx
++	sub	$VEC_SIZE, %r8
++	jbe	L(CopyVecSizeCase2OrCase3)
++	test	%edx, %edx
++#  ifndef USE_AS_STRCAT
++	jnz	L(CopyVecSizeUnalignedVec6)
++#  else
++	jnz	L(CopyVecSize)
++#  endif
++
++	vpcmpb	$0, %YMM7, %YMMZERO, %k4
++	kmovd	%k4, %edx
++	VMOVU	%YMM6, (VEC_SIZE * 2)(%rdi)
++	lea	VEC_SIZE(%rdi, %rcx), %rdi
++	lea	VEC_SIZE(%rsi, %rcx), %rsi
++	bsf	%edx, %edx
++	cmp	%r8d, %edx
++	jb	L(CopyVecSizeExit)
++L(StrncpyExit):
++	cmp	$65, %r8d
++	je	L(StrncpyExit65)
++	cmp	$33, %r8d
++	jae	L(StrncpyExit33_64)
++	cmp	$17, %r8d
++	jae	L(StrncpyExit17_32)
++	cmp	$9, %r8d
++	jae	L(StrncpyExit9_16)
++	cmp	$5, %r8d
++	jae	L(StrncpyExit5_8)
++	cmp	$3, %r8d
++	jae	L(StrncpyExit3_4)
++	cmp	$1, %r8d
++	ja	L(StrncpyExit2)
++	je	L(StrncpyExit1)
++#  ifdef USE_AS_STPCPY
++	mov	%rdi, %rax
++#  endif
++#  ifdef USE_AS_STRCAT
++	movb	$0, (%rdi)
++#  endif
++	ret
++
++	.p2align 4
++L(ExitZero):
++#  ifndef USE_AS_STRCAT
++	mov	%rdi, %rax
++#  endif
++	ret
++
++# endif
++
++# ifndef USE_AS_STRCAT
++END (STRCPY)
++# else
++END (STRCAT)
++# endif
++#endif
+diff --git a/sysdeps/x86_64/multiarch/strlen-avx2-rtm.S b/sysdeps/x86_64/multiarch/strlen-avx2-rtm.S
+new file mode 100644
+index 0000000000..75b4b7612c
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strlen-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRLEN
++# define STRLEN __strlen_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strlen-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strlen-avx2.S b/sysdeps/x86_64/multiarch/strlen-avx2.S
+index 73421ec1b2..45e08e64d6 100644
+--- a/sysdeps/x86_64/multiarch/strlen-avx2.S
++++ b/sysdeps/x86_64/multiarch/strlen-avx2.S
+@@ -27,370 +27,531 @@
+ # ifdef USE_AS_WCSLEN
+ #  define VPCMPEQ	vpcmpeqd
+ #  define VPMINU	vpminud
++#  define CHAR_SIZE	4
+ # else
+ #  define VPCMPEQ	vpcmpeqb
+ #  define VPMINU	vpminub
++#  define CHAR_SIZE	1
+ # endif
+ 
+ # ifndef VZEROUPPER
+ #  define VZEROUPPER	vzeroupper
+ # endif
+ 
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
++
+ # define VEC_SIZE 32
++# define PAGE_SIZE 4096
++# define CHAR_PER_VEC	(VEC_SIZE / CHAR_SIZE)
+ 
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRLEN)
+ # ifdef USE_AS_STRNLEN
+-	/* Check for zero length.  */
++	/* Check zero length.  */
++#  ifdef __ILP32__
++	/* Clear upper bits.  */
++	and	%RSI_LP, %RSI_LP
++#  else
+ 	test	%RSI_LP, %RSI_LP
+-	jz	L(zero)
+-#  ifdef USE_AS_WCSLEN
+-	shl	$2, %RSI_LP
+-#  elif defined __ILP32__
+-	/* Clear the upper 32 bits.  */
+-	movl	%esi, %esi
+ #  endif
++	jz	L(zero)
++	/* Store max len in R8_LP before adjusting if using WCSLEN.  */
+ 	mov	%RSI_LP, %R8_LP
+ # endif
+-	movl	%edi, %ecx
++	movl	%edi, %eax
+ 	movq	%rdi, %rdx
+ 	vpxor	%xmm0, %xmm0, %xmm0
+-
++	/* Clear high bits from edi. Only keeping bits relevant to page
++	   cross check.  */
++	andl	$(PAGE_SIZE - 1), %eax
+ 	/* Check if we may cross page boundary with one vector load.  */
+-	andl	$(2 * VEC_SIZE - 1), %ecx
+-	cmpl	$VEC_SIZE, %ecx
+-	ja	L(cros_page_boundary)
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	ja	L(cross_page_boundary)
+ 
+ 	/* Check the first VEC_SIZE bytes.  */
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
++	VPCMPEQ	(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-
+ # ifdef USE_AS_STRNLEN
+-	jnz	L(first_vec_x0_check)
+-	/* Adjust length and check the end of data.  */
+-	subq	$VEC_SIZE, %rsi
+-	jbe	L(max)
+-# else
+-	jnz	L(first_vec_x0)
++	/* If length < VEC_SIZE handle special.  */
++	cmpq	$CHAR_PER_VEC, %rsi
++	jbe	L(first_vec_x0)
+ # endif
+-
+-	/* Align data for aligned loads in the loop.  */
+-	addq	$VEC_SIZE, %rdi
+-	andl	$(VEC_SIZE - 1), %ecx
+-	andq	$-VEC_SIZE, %rdi
++	/* If empty continue to aligned_more. Otherwise return bit
++	   position of first match.  */
++	testl	%eax, %eax
++	jz	L(aligned_more)
++	tzcntl	%eax, %eax
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++# endif
++	VZEROUPPER_RETURN
+ 
+ # ifdef USE_AS_STRNLEN
+-	/* Adjust length.  */
+-	addq	%rcx, %rsi
++L(zero):
++	xorl	%eax, %eax
++	ret
+ 
+-	subq	$(VEC_SIZE * 4), %rsi
+-	jbe	L(last_4x_vec_or_less)
++	.p2align 4
++L(first_vec_x0):
++	/* Set bit for max len so that tzcnt will return min of max len
++	   and position of first match.  */
++#  ifdef USE_AS_WCSLEN
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %esi
++#  endif
++	btsq	%rsi, %rax
++	tzcntl	%eax, %eax
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++#  endif
++	VZEROUPPER_RETURN
+ # endif
+-	jmp	L(more_4x_vec)
+ 
+ 	.p2align 4
+-L(cros_page_boundary):
+-	andl	$(VEC_SIZE - 1), %ecx
+-	andq	$-VEC_SIZE, %rdi
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	/* Remove the leading bytes.  */
+-	sarl	%cl, %eax
+-	testl	%eax, %eax
+-	jz	L(aligned_more)
++L(first_vec_x1):
+ 	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
+ # ifdef USE_AS_STRNLEN
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rsi
+-	jbe	L(max)
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++#  ifdef USE_AS_WCSLEN
++	leal	-(VEC_SIZE * 4 + 1)(%rax, %rcx, 4), %eax
++#  else
++	subl	$(VEC_SIZE * 4 + 1), %ecx
++	addl	%ecx, %eax
++#  endif
++# else
++	subl	%edx, %edi
++	incl	%edi
++	addl	%edi, %eax
+ # endif
+-	addq	%rdi, %rax
+-	addq	%rcx, %rax
+-	subq	%rdx, %rax
+ # ifdef USE_AS_WCSLEN
+-	shrq	$2, %rax
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
+ # endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(aligned_more):
++L(first_vec_x2):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
+ # ifdef USE_AS_STRNLEN
+-        /* "rcx" is less than VEC_SIZE.  Calculate "rdx + rcx - VEC_SIZE"
+-	    with "rdx - (VEC_SIZE - rcx)" instead of "(rdx + rcx) - VEC_SIZE"
+-	    to void possible addition overflow.  */
+-	negq	%rcx
+-	addq	$VEC_SIZE, %rcx
+-
+-	/* Check the end of data.  */
+-	subq	%rcx, %rsi
+-	jbe	L(max)
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++#  ifdef USE_AS_WCSLEN
++	leal	-(VEC_SIZE * 3 + 1)(%rax, %rcx, 4), %eax
++#  else
++	subl	$(VEC_SIZE * 3 + 1), %ecx
++	addl	%ecx, %eax
++#  endif
++# else
++	subl	%edx, %edi
++	addl	$(VEC_SIZE + 1), %edi
++	addl	%edi, %eax
+ # endif
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++# endif
++	VZEROUPPER_RETURN
+ 
+-	addq	$VEC_SIZE, %rdi
++	.p2align 4
++L(first_vec_x3):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
++# ifdef USE_AS_STRNLEN
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++#  ifdef USE_AS_WCSLEN
++	leal	-(VEC_SIZE * 2 + 1)(%rax, %rcx, 4), %eax
++#  else
++	subl	$(VEC_SIZE * 2 + 1), %ecx
++	addl	%ecx, %eax
++#  endif
++# else
++	subl	%edx, %edi
++	addl	$(VEC_SIZE * 2 + 1), %edi
++	addl	%edi, %eax
++# endif
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++# endif
++	VZEROUPPER_RETURN
+ 
++	.p2align 4
++L(first_vec_x4):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
+ # ifdef USE_AS_STRNLEN
+-	subq	$(VEC_SIZE * 4), %rsi
+-	jbe	L(last_4x_vec_or_less)
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++#  ifdef USE_AS_WCSLEN
++	leal	-(VEC_SIZE * 1 + 1)(%rax, %rcx, 4), %eax
++#  else
++	subl	$(VEC_SIZE + 1), %ecx
++	addl	%ecx, %eax
++#  endif
++# else
++	subl	%edx, %edi
++	addl	$(VEC_SIZE * 3 + 1), %edi
++	addl	%edi, %eax
+ # endif
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++# endif
++	VZEROUPPER_RETURN
+ 
+-L(more_4x_vec):
++	.p2align 5
++L(aligned_more):
++	/* Align data to VEC_SIZE - 1. This is the same number of
++	   instructions as using andq with -VEC_SIZE but saves 4 bytes of
++	   code on the x4 check.  */
++	orq	$(VEC_SIZE - 1), %rdi
++L(cross_page_continue):
+ 	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+ 	   since data is only aligned to VEC_SIZE.  */
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
+-
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
++# ifdef USE_AS_STRNLEN
++	/* + 1 because rdi is aligned to VEC_SIZE - 1. + CHAR_SIZE
++	   because it simplifies the logic in last_4x_vec_or_less.  */
++	leaq	(VEC_SIZE * 4 + CHAR_SIZE + 1)(%rdi), %rcx
++	subq	%rdx, %rcx
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++#  endif
++# endif
++	/* Load first VEC regardless.  */
++	VPCMPEQ	1(%rdi), %ymm0, %ymm1
++# ifdef USE_AS_STRNLEN
++	/* Adjust length. If near end handle specially.  */
++	subq	%rcx, %rsi
++	jb	L(last_4x_vec_or_less)
++# endif
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x1)
+ 
+-	VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x2)
+ 
+-	VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+ 	testl	%eax, %eax
+ 	jnz	L(first_vec_x3)
+ 
+-	addq	$(VEC_SIZE * 4), %rdi
+-
+-# ifdef USE_AS_STRNLEN
+-	subq	$(VEC_SIZE * 4), %rsi
+-	jbe	L(last_4x_vec_or_less)
+-# endif
+-
+-	/* Align data to 4 * VEC_SIZE.  */
+-	movq	%rdi, %rcx
+-	andl	$(4 * VEC_SIZE - 1), %ecx
+-	andq	$-(4 * VEC_SIZE), %rdi
++	VPCMPEQ	(VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x4)
+ 
++	/* Align data to VEC_SIZE * 4 - 1.  */
+ # ifdef USE_AS_STRNLEN
+-	/* Adjust length.  */
++	/* Before adjusting length check if at last VEC_SIZE * 4.  */
++	cmpq	$(CHAR_PER_VEC * 4 - 1), %rsi
++	jbe	L(last_4x_vec_or_less_load)
++	incq	%rdi
++	movl	%edi, %ecx
++	orq	$(VEC_SIZE * 4 - 1), %rdi
++	andl	$(VEC_SIZE * 4 - 1), %ecx
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++#  endif
++	/* Readjust length.  */
+ 	addq	%rcx, %rsi
++# else
++	incq	%rdi
++	orq	$(VEC_SIZE * 4 - 1), %rdi
+ # endif
+-
++	/* Compare 4 * VEC at a time forward.  */
+ 	.p2align 4
+ L(loop_4x_vec):
+-	/* Compare 4 * VEC at a time forward.  */
+-	vmovdqa (%rdi), %ymm1
+-	vmovdqa	VEC_SIZE(%rdi), %ymm2
+-	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm3
+-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm4
+-	VPMINU	%ymm1, %ymm2, %ymm5
+-	VPMINU	%ymm3, %ymm4, %ymm6
+-	VPMINU	%ymm5, %ymm6, %ymm5
+-
++# ifdef USE_AS_STRNLEN
++	/* Break if at end of length.  */
++	subq	$(CHAR_PER_VEC * 4), %rsi
++	jb	L(last_4x_vec_or_less_cmpeq)
++# endif
++	/* Save some code size by microfusing VPMINU with the load.
++	   Since the matches in ymm2/ymm4 can only be returned if there
++	   were no matches in ymm1/ymm3 respectively, there is no issue
++	   with overlap.  */
++	vmovdqa	1(%rdi), %ymm1
++	VPMINU	(VEC_SIZE + 1)(%rdi), %ymm1, %ymm2
++	vmovdqa	(VEC_SIZE * 2 + 1)(%rdi), %ymm3
++	VPMINU	(VEC_SIZE * 3 + 1)(%rdi), %ymm3, %ymm4
++
++	VPMINU	%ymm2, %ymm4, %ymm5
+ 	VPCMPEQ	%ymm5, %ymm0, %ymm5
+-	vpmovmskb %ymm5, %eax
+-	testl	%eax, %eax
+-	jnz	L(4x_vec_end)
++	vpmovmskb %ymm5, %ecx
+ 
+-	addq	$(VEC_SIZE * 4), %rdi
++	subq	$-(VEC_SIZE * 4), %rdi
++	testl	%ecx, %ecx
++	jz	L(loop_4x_vec)
+ 
+-# ifndef USE_AS_STRNLEN
+-	jmp	L(loop_4x_vec)
+-# else
+-	subq	$(VEC_SIZE * 4), %rsi
+-	ja	L(loop_4x_vec)
+ 
+-L(last_4x_vec_or_less):
+-	/* Less than 4 * VEC and aligned to VEC_SIZE.  */
+-	addl	$(VEC_SIZE * 2), %esi
+-	jle	L(last_2x_vec)
+-
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
+-
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
++	VPCMPEQ	%ymm1, %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++	subq	%rdx, %rdi
+ 	testl	%eax, %eax
+-	jnz	L(first_vec_x1)
++	jnz	L(last_vec_return_x0)
+ 
+-	VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
++	VPCMPEQ	%ymm2, %ymm0, %ymm2
++	vpmovmskb %ymm2, %eax
+ 	testl	%eax, %eax
++	jnz	L(last_vec_return_x1)
+ 
+-	jnz	L(first_vec_x2_check)
+-	subl	$VEC_SIZE, %esi
+-	jle	L(max)
++	/* Combine last 2 VEC.  */
++	VPCMPEQ	%ymm3, %ymm0, %ymm3
++	vpmovmskb %ymm3, %eax
++	/* rcx has combined result from all 4 VEC. It will only be used
++	   if the first 3 other VEC all did not contain a match.  */
++	salq	$32, %rcx
++	orq	%rcx, %rax
++	tzcntq	%rax, %rax
++	subq	$(VEC_SIZE * 2 - 1), %rdi
++	addq	%rdi, %rax
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrq	$2, %rax
++# endif
++	VZEROUPPER_RETURN
+ 
+-	VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm1
+-	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+ 
+-	jnz	L(first_vec_x3_check)
+-	movq	%r8, %rax
++# ifdef USE_AS_STRNLEN
++	.p2align 4
++L(last_4x_vec_or_less_load):
++	/* Depending on entry adjust rdi / prepare first VEC in ymm1.
++	 */
++	subq	$-(VEC_SIZE * 4), %rdi
++L(last_4x_vec_or_less_cmpeq):
++	VPCMPEQ	1(%rdi), %ymm0, %ymm1
++L(last_4x_vec_or_less):
+ #  ifdef USE_AS_WCSLEN
+-	shrq	$2, %rax
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %esi
+ #  endif
+-	VZEROUPPER
+-	ret
+-
+-	.p2align 4
+-L(last_2x_vec):
+-	addl	$(VEC_SIZE * 2), %esi
+-	VPCMPEQ (%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++	/* If remaining length > VEC_SIZE * 2. This works if esi is off
++	   by VEC_SIZE * 4.  */
++	testl	$(VEC_SIZE * 2), %esi
++	jnz	L(last_4x_vec)
++
++	/* length may have been negative or positive by an offset of
++	   VEC_SIZE * 4 depending on where this was called from. This fixes
++	   that.  */
++	andl	$(VEC_SIZE * 4 - 1), %esi
+ 	testl	%eax, %eax
++	jnz	L(last_vec_x1_check)
+ 
+-	jnz	L(first_vec_x0_check)
+ 	subl	$VEC_SIZE, %esi
+-	jle	L(max)
++	jb	L(max)
+ 
+-	VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm1
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x1_check)
+-	movq	%r8, %rax
+-#  ifdef USE_AS_WCSLEN
+-	shrq	$2, %rax
+-#  endif
+-	VZEROUPPER
+-	ret
+-
+-	.p2align 4
+-L(first_vec_x0_check):
+ 	tzcntl	%eax, %eax
+ 	/* Check the end of data.  */
+-	cmpq	%rax, %rsi
+-	jbe	L(max)
++	cmpl	%eax, %esi
++	jb	L(max)
++	subq	%rdx, %rdi
++	addl	$(VEC_SIZE + 1), %eax
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+ #  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
++# endif
+ 
+ 	.p2align 4
+-L(first_vec_x1_check):
++L(last_vec_return_x0):
+ 	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rsi
+-	jbe	L(max)
+-	addq	$VEC_SIZE, %rax
++	subq	$(VEC_SIZE * 4 - 1), %rdi
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-#  ifdef USE_AS_WCSLEN
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+-#  endif
+-	VZEROUPPER
+-	ret
++# endif
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(first_vec_x2_check):
++L(last_vec_return_x1):
+ 	tzcntl	%eax, %eax
+-	/* Check the end of data.  */
+-	cmpq	%rax, %rsi
+-	jbe	L(max)
+-	addq	$(VEC_SIZE * 2), %rax
++	subq	$(VEC_SIZE * 3 - 1), %rdi
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-#  ifdef USE_AS_WCSLEN
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+-#  endif
+-	VZEROUPPER
+-	ret
++# endif
++	VZEROUPPER_RETURN
+ 
++# ifdef USE_AS_STRNLEN
+ 	.p2align 4
+-L(first_vec_x3_check):
++L(last_vec_x1_check):
++
+ 	tzcntl	%eax, %eax
+ 	/* Check the end of data.  */
+-	cmpq	%rax, %rsi
+-	jbe	L(max)
+-	addq	$(VEC_SIZE * 3), %rax
++	cmpl	%eax, %esi
++	jb	L(max)
++	subq	%rdx, %rdi
++	incl	%eax
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+ #  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+-	.p2align 4
+ L(max):
+ 	movq	%r8, %rax
++	VZEROUPPER_RETURN
++
++	.p2align 4
++L(last_4x_vec):
++	/* Test first 2x VEC normally.  */
++	testl	%eax, %eax
++	jnz	L(last_vec_x1)
++
++	VPCMPEQ	(VEC_SIZE + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++	/* Normalize length.  */
++	andl	$(VEC_SIZE * 4 - 1), %esi
++	VPCMPEQ	(VEC_SIZE * 2 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3)
++
++	subl	$(VEC_SIZE * 3), %esi
++	jb	L(max)
++
++	VPCMPEQ	(VEC_SIZE * 3 + 1)(%rdi), %ymm0, %ymm1
++	vpmovmskb %ymm1, %eax
++	tzcntl	%eax, %eax
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max)
++	subq	%rdx, %rdi
++	addl	$(VEC_SIZE * 3 + 1), %eax
++	addq	%rdi, %rax
+ #  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+ #  endif
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+-	.p2align 4
+-L(zero):
+-	xorl	%eax, %eax
+-	ret
+-# endif
+ 
+ 	.p2align 4
+-L(first_vec_x0):
++L(last_vec_x1):
++	/* essentially duplicates of first_vec_x1 but use 64 bit
++	   instructions.  */
+ 	tzcntl	%eax, %eax
++	subq	%rdx, %rdi
++	incl	%eax
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-# ifdef USE_AS_WCSLEN
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+-# endif
+-	VZEROUPPER
+-	ret
++#  endif
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(first_vec_x1):
++L(last_vec_x2):
++	/* essentially duplicates of first_vec_x1 but use 64 bit
++	   instructions.  */
+ 	tzcntl	%eax, %eax
+-	addq	$VEC_SIZE, %rax
++	subq	%rdx, %rdi
++	addl	$(VEC_SIZE + 1), %eax
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-# ifdef USE_AS_WCSLEN
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
+-# endif
+-	VZEROUPPER
+-	ret
++#  endif
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+-L(first_vec_x2):
++L(last_vec_x3):
+ 	tzcntl	%eax, %eax
+-	addq	$(VEC_SIZE * 2), %rax
++	subl	$(VEC_SIZE * 2), %esi
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max_end)
++	subq	%rdx, %rdi
++	addl	$(VEC_SIZE * 2 + 1), %eax
+ 	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-# ifdef USE_AS_WCSLEN
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
+ 	shrq	$2, %rax
++#  endif
++	VZEROUPPER_RETURN
++L(max_end):
++	movq	%r8, %rax
++	VZEROUPPER_RETURN
+ # endif
+-	VZEROUPPER
+-	ret
+ 
++	/* Cold case for crossing page with first load.  */
+ 	.p2align 4
+-L(4x_vec_end):
+-	VPCMPEQ	%ymm1, %ymm0, %ymm1
++L(cross_page_boundary):
++	/* Align data to VEC_SIZE - 1.  */
++	orq	$(VEC_SIZE - 1), %rdi
++	VPCMPEQ	-(VEC_SIZE - 1)(%rdi), %ymm0, %ymm1
+ 	vpmovmskb %ymm1, %eax
++	/* Remove the leading bytes. sarxl only uses bits [5:0] of COUNT
++	   so no need to manually mod rdx.  */
++	sarxl	%edx, %eax, %eax
++# ifdef USE_AS_STRNLEN
+ 	testl	%eax, %eax
+-	jnz	L(first_vec_x0)
+-	VPCMPEQ %ymm2, %ymm0, %ymm2
+-	vpmovmskb %ymm2, %eax
+-	testl	%eax, %eax
+-	jnz	L(first_vec_x1)
+-	VPCMPEQ %ymm3, %ymm0, %ymm3
+-	vpmovmskb %ymm3, %eax
++	jnz	L(cross_page_less_vec)
++	leaq	1(%rdi), %rcx
++	subq	%rdx, %rcx
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get wchar_t count.  */
++	shrl	$2, %ecx
++#  endif
++	/* Check length.  */
++	cmpq	%rsi, %rcx
++	jb	L(cross_page_continue)
++	movq	%r8, %rax
++# else
+ 	testl	%eax, %eax
+-	jnz	L(first_vec_x2)
+-	VPCMPEQ %ymm4, %ymm0, %ymm4
+-	vpmovmskb %ymm4, %eax
+-L(first_vec_x3):
++	jz	L(cross_page_continue)
+ 	tzcntl	%eax, %eax
+-	addq	$(VEC_SIZE * 3), %rax
+-	addq	%rdi, %rax
+-	subq	%rdx, %rax
+-# ifdef USE_AS_WCSLEN
+-	shrq	$2, %rax
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide length by 4 to get wchar_t count.  */
++	shrl	$2, %eax
++#  endif
++# endif
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
++
++# ifdef USE_AS_STRNLEN
++	.p2align 4
++L(cross_page_less_vec):
++	tzcntl	%eax, %eax
++#  ifdef USE_AS_WCSLEN
++	/* NB: Multiply length by 4 to get byte count.  */
++	sall	$2, %esi
++#  endif
++	cmpq	%rax, %rsi
++	cmovb	%esi, %eax
++#  ifdef USE_AS_WCSLEN
++	shrl	$2, %eax
++#  endif
++	VZEROUPPER_RETURN
+ # endif
+-	VZEROUPPER
+-	ret
+ 
+ END (STRLEN)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S
+new file mode 100644
+index 0000000000..4bf6874b82
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strlen-evex.S
+@@ -0,0 +1,489 @@
++/* strlen/strnlen/wcslen/wcsnlen optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef STRLEN
++#  define STRLEN	__strlen_evex
++# endif
++
++# define VMOVA		vmovdqa64
++
++# ifdef USE_AS_WCSLEN
++#  define VPCMP		vpcmpd
++#  define VPMINU	vpminud
++#  define SHIFT_REG ecx
++#  define CHAR_SIZE	4
++# else
++#  define VPCMP		vpcmpb
++#  define VPMINU	vpminub
++#  define SHIFT_REG edx
++#  define CHAR_SIZE	1
++# endif
++
++# define XMMZERO	xmm16
++# define YMMZERO	ymm16
++# define YMM1		ymm17
++# define YMM2		ymm18
++# define YMM3		ymm19
++# define YMM4		ymm20
++# define YMM5		ymm21
++# define YMM6		ymm22
++
++# define VEC_SIZE 32
++# define PAGE_SIZE 4096
++# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE)
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRLEN)
++# ifdef USE_AS_STRNLEN
++	/* Check zero length.  */
++	test	%RSI_LP, %RSI_LP
++	jz	L(zero)
++#  ifdef __ILP32__
++	/* Clear the upper 32 bits.  */
++	movl	%esi, %esi
++#  endif
++	mov	%RSI_LP, %R8_LP
++# endif
++	movl	%edi, %eax
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++	/* Clear high bits from edi. Only keeping bits relevant to page
++	   cross check.  */
++	andl	$(PAGE_SIZE - 1), %eax
++	/* Check if we may cross page boundary with one vector load.  */
++	cmpl	$(PAGE_SIZE - VEC_SIZE), %eax
++	ja	L(cross_page_boundary)
++
++	/* Check the first VEC_SIZE bytes.  Each bit in K0 represents a
++	   null byte.  */
++	VPCMP	$0, (%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++# ifdef USE_AS_STRNLEN
++	/* If length < CHAR_PER_VEC handle special.  */
++	cmpq	$CHAR_PER_VEC, %rsi
++	jbe	L(first_vec_x0)
++# endif
++	testl	%eax, %eax
++	jz	L(aligned_more)
++	tzcntl	%eax, %eax
++	ret
++# ifdef USE_AS_STRNLEN
++L(zero):
++	xorl	%eax, %eax
++	ret
++
++	.p2align 4
++L(first_vec_x0):
++	/* Set bit for max len so that tzcnt will return min of max len
++	   and position of first match.  */
++	btsq	%rsi, %rax
++	tzcntl	%eax, %eax
++	ret
++# endif
++
++	.p2align 4
++L(first_vec_x1):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
++# ifdef USE_AS_STRNLEN
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++	leal	-(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax
++# else
++	subl	%edx, %edi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %edi
++#  endif
++	leal	CHAR_PER_VEC(%rdi, %rax), %eax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x2):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
++# ifdef USE_AS_STRNLEN
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++	leal	-(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax
++# else
++	subl	%edx, %edi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %edi
++#  endif
++	leal	(CHAR_PER_VEC * 2)(%rdi, %rax), %eax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x3):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
++# ifdef USE_AS_STRNLEN
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++	leal	-(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax
++# else
++	subl	%edx, %edi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %edi
++#  endif
++	leal	(CHAR_PER_VEC * 3)(%rdi, %rax), %eax
++# endif
++	ret
++
++	.p2align 4
++L(first_vec_x4):
++	tzcntl	%eax, %eax
++	/* Safe to use 32 bit instructions as these are only called for
++	   size = [1, 159].  */
++# ifdef USE_AS_STRNLEN
++	/* Use ecx which was computed earlier to compute correct value.
++	 */
++	leal	-(CHAR_PER_VEC + 1)(%rcx, %rax), %eax
++# else
++	subl	%edx, %edi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %edi
++#  endif
++	leal	(CHAR_PER_VEC * 4)(%rdi, %rax), %eax
++# endif
++	ret
++
++	.p2align 5
++L(aligned_more):
++	movq	%rdi, %rdx
++	/* Align data to VEC_SIZE.  */
++	andq	$-(VEC_SIZE), %rdi
++L(cross_page_continue):
++	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
++	   since data is only aligned to VEC_SIZE.  */
++# ifdef USE_AS_STRNLEN
++	/* + CHAR_SIZE because it simplifies the logic in
++	   last_4x_vec_or_less.  */
++	leaq	(VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx
++	subq	%rdx, %rcx
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++#  endif
++# endif
++	/* Load first VEC regardless.  */
++	VPCMP	$0, VEC_SIZE(%rdi), %YMMZERO, %k0
++# ifdef USE_AS_STRNLEN
++	/* Adjust length. If near end handle specially.  */
++	subq	%rcx, %rsi
++	jb	L(last_4x_vec_or_less)
++# endif
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x1)
++
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	test	%eax, %eax
++	jnz	L(first_vec_x2)
++
++	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x3)
++
++	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(first_vec_x4)
++
++	addq	$VEC_SIZE, %rdi
++# ifdef USE_AS_STRNLEN
++	/* Check if at last VEC_SIZE * 4 length.  */
++	cmpq	$(CHAR_PER_VEC * 4 - 1), %rsi
++	jbe	L(last_4x_vec_or_less_load)
++	movl	%edi, %ecx
++	andl	$(VEC_SIZE * 4 - 1), %ecx
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarl	$2, %ecx
++#  endif
++	/* Readjust length.  */
++	addq	%rcx, %rsi
++# endif
++	/* Align data to VEC_SIZE * 4.  */
++	andq	$-(VEC_SIZE * 4), %rdi
++
++	/* Compare 4 * VEC at a time forward.  */
++	.p2align 4
++L(loop_4x_vec):
++	/* Load first VEC regardless.  */
++	VMOVA	(VEC_SIZE * 4)(%rdi), %YMM1
++# ifdef USE_AS_STRNLEN
++	/* Break if at end of length.  */
++	subq	$(CHAR_PER_VEC * 4), %rsi
++	jb	L(last_4x_vec_or_less_cmpeq)
++# endif
++	/* Save some code size by microfusing VPMINU with the load. Since
++	   the matches in ymm2/ymm4 can only be returned if there were no
++	   matches in ymm1/ymm3 respectively, there is no issue with overlap.
++	 */
++	VPMINU	(VEC_SIZE * 5)(%rdi), %YMM1, %YMM2
++	VMOVA	(VEC_SIZE * 6)(%rdi), %YMM3
++	VPMINU	(VEC_SIZE * 7)(%rdi), %YMM3, %YMM4
++
++	VPCMP	$0, %YMM2, %YMMZERO, %k0
++	VPCMP	$0, %YMM4, %YMMZERO, %k1
++	subq	$-(VEC_SIZE * 4), %rdi
++	kortestd	%k0, %k1
++	jz	L(loop_4x_vec)
++
++	/* Check if end was in first half.  */
++	kmovd	%k0, %eax
++	subq	%rdx, %rdi
++# ifdef USE_AS_WCSLEN
++	shrq	$2, %rdi
++# endif
++	testl	%eax, %eax
++	jz	L(second_vec_return)
++
++	VPCMP	$0, %YMM1, %YMMZERO, %k2
++	kmovd	%k2, %edx
++	/* Combine VEC1 matches (edx) with VEC2 matches (eax).  */
++# ifdef USE_AS_WCSLEN
++	sall	$CHAR_PER_VEC, %eax
++	orl	%edx, %eax
++	tzcntl	%eax, %eax
++# else
++	salq	$CHAR_PER_VEC, %rax
++	orq	%rdx, %rax
++	tzcntq	%rax, %rax
++# endif
++	addq	%rdi, %rax
++	ret
++
++
++# ifdef USE_AS_STRNLEN
++
++L(last_4x_vec_or_less_load):
++	/* Depending on entry adjust rdi / prepare first VEC in YMM1.  */
++	VMOVA	(VEC_SIZE * 4)(%rdi), %YMM1
++L(last_4x_vec_or_less_cmpeq):
++	VPCMP	$0, %YMM1, %YMMZERO, %k0
++	addq	$(VEC_SIZE * 3), %rdi
++L(last_4x_vec_or_less):
++	kmovd	%k0, %eax
++	/* If remaining length > VEC_SIZE * 2. This works if esi is off by
++	   VEC_SIZE * 4.  */
++	testl	$(CHAR_PER_VEC * 2), %esi
++	jnz	L(last_4x_vec)
++
++	/* length may have been negative or positive by an offset of
++	   CHAR_PER_VEC * 4 depending on where this was called from. This
++	   fixes that.  */
++	andl	$(CHAR_PER_VEC * 4 - 1), %esi
++	testl	%eax, %eax
++	jnz	L(last_vec_x1_check)
++
++	/* Check the end of data.  */
++	subl	$CHAR_PER_VEC, %esi
++	jb	L(max)
++
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max)
++
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC * 2)(%rdi, %rax), %rax
++	ret
++L(max):
++	movq	%r8, %rax
++	ret
++# endif
++
++	/* Placed here in strnlen so that the jcc L(last_4x_vec_or_less)
++	   in the 4x VEC loop can use 2 byte encoding.  */
++	.p2align 4
++L(second_vec_return):
++	VPCMP	$0, %YMM3, %YMMZERO, %k0
++	/* Combine YMM3 matches (k0) with YMM4 matches (k1).  */
++# ifdef USE_AS_WCSLEN
++	kunpckbw	%k0, %k1, %k0
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++# else
++	kunpckdq	%k0, %k1, %k0
++	kmovq	%k0, %rax
++	tzcntq	%rax, %rax
++# endif
++	leaq	(CHAR_PER_VEC * 2)(%rdi, %rax), %rax
++	ret
++
++
++# ifdef USE_AS_STRNLEN
++L(last_vec_x1_check):
++	tzcntl	%eax, %eax
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max)
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC)(%rdi, %rax), %rax
++	ret
++
++	.p2align 4
++L(last_4x_vec):
++	/* Test first 2x VEC normally.  */
++	testl	%eax, %eax
++	jnz	L(last_vec_x1)
++
++	VPCMP	$0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x2)
++
++	/* Normalize length.  */
++	andl	$(CHAR_PER_VEC * 4 - 1), %esi
++	VPCMP	$0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	testl	%eax, %eax
++	jnz	L(last_vec_x3)
++
++	/* Check the end of data.  */
++	subl	$(CHAR_PER_VEC * 3), %esi
++	jb	L(max)
++
++	VPCMP	$0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	tzcntl	%eax, %eax
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max_end)
++
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC * 4)(%rdi, %rax), %rax
++	ret
++
++	.p2align 4
++L(last_vec_x1):
++	tzcntl	%eax, %eax
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC)(%rdi, %rax), %rax
++	ret
++
++	.p2align 4
++L(last_vec_x2):
++	tzcntl	%eax, %eax
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC * 2)(%rdi, %rax), %rax
++	ret
++
++	.p2align 4
++L(last_vec_x3):
++	tzcntl	%eax, %eax
++	subl	$(CHAR_PER_VEC * 2), %esi
++	/* Check the end of data.  */
++	cmpl	%eax, %esi
++	jb	L(max_end)
++	subq	%rdx, %rdi
++#  ifdef USE_AS_WCSLEN
++	/* NB: Divide bytes by 4 to get the wchar_t count.  */
++	sarq	$2, %rdi
++#  endif
++	leaq	(CHAR_PER_VEC * 3)(%rdi, %rax), %rax
++	ret
++L(max_end):
++	movq	%r8, %rax
++	ret
++# endif
++
++	/* Cold case for crossing page with first load.	 */
++	.p2align 4
++L(cross_page_boundary):
++	movq	%rdi, %rdx
++	/* Align data to VEC_SIZE.  */
++	andq	$-VEC_SIZE, %rdi
++	VPCMP	$0, (%rdi), %YMMZERO, %k0
++	kmovd	%k0, %eax
++	/* Remove the leading bytes.  */
++# ifdef USE_AS_WCSLEN
++	/* NB: Divide shift count by 4 since each bit in K0 represents 4
++	   bytes.  */
++	movl	%edx, %ecx
++	shrl	$2, %ecx
++	andl	$(CHAR_PER_VEC - 1), %ecx
++# endif
++	/* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise.  */
++	sarxl	%SHIFT_REG, %eax, %eax
++	testl	%eax, %eax
++# ifndef USE_AS_STRNLEN
++	jz	L(cross_page_continue)
++	tzcntl	%eax, %eax
++	ret
++# else
++	jnz	L(cross_page_less_vec)
++#  ifndef USE_AS_WCSLEN
++	movl	%edx, %ecx
++	andl	$(CHAR_PER_VEC - 1), %ecx
++#  endif
++	movl	$CHAR_PER_VEC, %eax
++	subl	%ecx, %eax
++	/* Check the end of data.  */
++	cmpq	%rax, %rsi
++	ja	L(cross_page_continue)
++	movl	%esi, %eax
++	ret
++L(cross_page_less_vec):
++	tzcntl	%eax, %eax
++	/* Select min of length and position of first null.  */
++	cmpq	%rax, %rsi
++	cmovb	%esi, %eax
++	ret
++# endif
++
++END (STRLEN)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/strlen-sse2.S b/sysdeps/x86_64/multiarch/strlen-sse2.S
+index 055fbbc690..812af73c13 100644
+--- a/sysdeps/x86_64/multiarch/strlen-sse2.S
++++ b/sysdeps/x86_64/multiarch/strlen-sse2.S
+@@ -20,4 +20,4 @@
+ # define strlen __strlen_sse2
+ #endif
  
+-#include "../strlen.S"
++#include "strlen-vec.S"
+diff --git a/sysdeps/x86_64/multiarch/strlen-vec.S b/sysdeps/x86_64/multiarch/strlen-vec.S
+new file mode 100644
+index 0000000000..439e486a43
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strlen-vec.S
+@@ -0,0 +1,270 @@
++/* SSE2 version of strlen and SSE4.1 version of wcslen.
++   Copyright (C) 2012-2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#include <sysdep.h>
++
++#ifdef AS_WCSLEN
++# define PMINU		pminud
++# define PCMPEQ		pcmpeqd
++# define SHIFT_RETURN	shrq $2, %rax
++#else
++# define PMINU		pminub
++# define PCMPEQ		pcmpeqb
++# define SHIFT_RETURN
++#endif
++
++/* Long lived registers in strlen(s), strnlen(s, n) are:
++
++	%xmm3 - zero
++	%rdi   - s
++	%r10  (s+n) & (~(64-1))
++	%r11   s+n
++*/
++
++
++.text
++ENTRY(strlen)
++
++/* Test 64 bytes from %rax for zero. Save result as bitmask in %rdx.  */
++#define FIND_ZERO	\
++	PCMPEQ	(%rax), %xmm0;	\
++	PCMPEQ	16(%rax), %xmm1;	\
++	PCMPEQ	32(%rax), %xmm2;	\
++	PCMPEQ	48(%rax), %xmm3;	\
++	pmovmskb	%xmm0, %esi;	\
++	pmovmskb	%xmm1, %edx;	\
++	pmovmskb	%xmm2, %r8d;	\
++	pmovmskb	%xmm3, %ecx;	\
++	salq	$16, %rdx;	\
++	salq	$16, %rcx;	\
++	orq	%rsi, %rdx;	\
++	orq	%r8, %rcx;	\
++	salq	$32, %rcx;	\
++	orq	%rcx, %rdx;
++
++#ifdef AS_STRNLEN
++/* Do not read anything when n==0.  */
++	test	%RSI_LP, %RSI_LP
++	jne	L(n_nonzero)
++	xor	%rax, %rax
++	ret
++L(n_nonzero):
++# ifdef AS_WCSLEN
++/* Check for overflow from maxlen * sizeof(wchar_t).  If it would
++   overflow, the only way this program doesn't have undefined behavior
++   is if there is a null terminator in valid memory, so wcslen will
++   suffice.  */
++	mov	%RSI_LP, %R10_LP
++	sar	$62, %R10_LP
++	test	%R10_LP, %R10_LP
++	jnz	__wcslen_sse4_1
++	sal	$2, %RSI_LP
++# endif
++
++
++/* Initialize long lived registers.  */
++
++	add	%RDI_LP, %RSI_LP
++# ifdef AS_WCSLEN
++/* Check for overflow again from s + maxlen * sizeof(wchar_t).  */
++	jbe	__wcslen_sse4_1
++# endif
++	mov	%RSI_LP, %R10_LP
++	and	$-64, %R10_LP
++	mov	%RSI_LP, %R11_LP
++#endif
++
++	pxor	%xmm0, %xmm0
++	pxor	%xmm1, %xmm1
++	pxor	%xmm2, %xmm2
++	pxor	%xmm3, %xmm3
++	movq	%rdi, %rax
++	movq	%rdi, %rcx
++	andq	$4095, %rcx
++/* Offsets 4032-4047 will be aligned into 4032 thus fit into page.  */
++	cmpq	$4047, %rcx
++/* We cannot unify this branching as it would be ~6 cycles slower.  */
++	ja	L(cross_page)
++
++#ifdef AS_STRNLEN
++/* Test if end is among first 64 bytes.  */
++# define STRNLEN_PROLOG	\
++	mov	%r11, %rsi;	\
++	subq	%rax, %rsi;	\
++	andq	$-64, %rax;	\
++	testq	$-64, %rsi;	\
++	je	L(strnlen_ret)
++#else
++# define STRNLEN_PROLOG  andq $-64, %rax;
++#endif
++
++/* Ignore bits in mask that come before start of string.  */
++#define PROLOG(lab)	\
++	movq	%rdi, %rcx;	\
++	xorq	%rax, %rcx;	\
++	STRNLEN_PROLOG;	\
++	sarq	%cl, %rdx;	\
++	test	%rdx, %rdx;	\
++	je	L(lab);	\
++	bsfq	%rdx, %rax;	\
++	SHIFT_RETURN;		\
++	ret
++
++#ifdef AS_STRNLEN
++	andq	$-16, %rax
++	FIND_ZERO
++#else
++	/* Test first 16 bytes unaligned.  */
++	movdqu	(%rax), %xmm4
++	PCMPEQ	%xmm0, %xmm4
++	pmovmskb	%xmm4, %edx
++	test	%edx, %edx
++	je 	L(next48_bytes)
++	bsf	%edx, %eax /* If eax is zeroed 16bit bsf can be used.  */
++	SHIFT_RETURN
++	ret
++
++L(next48_bytes):
++/* Same as FIND_ZERO except we do not check first 16 bytes.  */
++	andq	$-16, %rax
++	PCMPEQ 16(%rax), %xmm1
++	PCMPEQ 32(%rax), %xmm2
++	PCMPEQ 48(%rax), %xmm3
++	pmovmskb	%xmm1, %edx
++	pmovmskb	%xmm2, %r8d
++	pmovmskb	%xmm3, %ecx
++	salq	$16, %rdx
++	salq	$16, %rcx
++	orq	%r8, %rcx
++	salq	$32, %rcx
++	orq	%rcx, %rdx
++#endif
++
++	/* When no zero byte is found xmm1-3 are zero so we do not have to
++	   zero them.  */
++	PROLOG(loop)
++
++	.p2align 4
++L(cross_page):
++	andq	$-64, %rax
++	FIND_ZERO
++	PROLOG(loop_init)
++
++#ifdef AS_STRNLEN
++/* We must do this check to correctly handle strnlen (s, -1).  */
++L(strnlen_ret):
++	bts	%rsi, %rdx
++	sarq	%cl, %rdx
++	test	%rdx, %rdx
++	je	L(loop_init)
++	bsfq	%rdx, %rax
++	SHIFT_RETURN
++	ret
++#endif
++	.p2align 4
++L(loop_init):
++	pxor	%xmm1, %xmm1
++	pxor	%xmm2, %xmm2
++	pxor	%xmm3, %xmm3
++#ifdef AS_STRNLEN
++	.p2align 4
++L(loop):
++
++	addq	$64, %rax
++	cmpq	%rax, %r10
++	je	L(exit_end)
++
++	movdqa	(%rax), %xmm0
++	PMINU	16(%rax), %xmm0
++	PMINU	32(%rax), %xmm0
++	PMINU	48(%rax), %xmm0
++	PCMPEQ	%xmm3, %xmm0
++	pmovmskb	%xmm0, %edx
++	testl	%edx, %edx
++	jne	L(exit)
++	jmp	L(loop)
++
++	.p2align 4
++L(exit_end):
++	cmp	%rax, %r11
++	je	L(first) /* Do not read when end is at page boundary.  */
++	pxor	%xmm0, %xmm0
++	FIND_ZERO
++
++L(first):
++	bts	%r11, %rdx
++	bsfq	%rdx, %rdx
++	addq	%rdx, %rax
++	subq	%rdi, %rax
++	SHIFT_RETURN
++	ret
++
++	.p2align 4
++L(exit):
++	pxor	%xmm0, %xmm0
++	FIND_ZERO
++
++	bsfq	%rdx, %rdx
++	addq	%rdx, %rax
++	subq	%rdi, %rax
++	SHIFT_RETURN
++	ret
++
++#else
++
++	/* Main loop.  Unrolled twice to improve L2 cache performance on core2.  */
++	.p2align 4
++L(loop):
++
++	movdqa	64(%rax), %xmm0
++	PMINU	80(%rax), %xmm0
++	PMINU	96(%rax), %xmm0
++	PMINU	112(%rax), %xmm0
++	PCMPEQ	%xmm3, %xmm0
++	pmovmskb	%xmm0, %edx
++	testl	%edx, %edx
++	jne	L(exit64)
++
++	subq	$-128, %rax
++
++	movdqa	(%rax), %xmm0
++	PMINU	16(%rax), %xmm0
++	PMINU	32(%rax), %xmm0
++	PMINU	48(%rax), %xmm0
++	PCMPEQ	%xmm3, %xmm0
++	pmovmskb	%xmm0, %edx
++	testl	%edx, %edx
++	jne	L(exit0)
++	jmp	L(loop)
++
++	.p2align 4
++L(exit64):
++	addq	$64, %rax
++L(exit0):
++	pxor	%xmm0, %xmm0
++	FIND_ZERO
++
++	bsfq	%rdx, %rdx
++	addq	%rdx, %rax
++	subq	%rdi, %rax
++	SHIFT_RETURN
++	ret
++
++#endif
++
++END(strlen)
+diff --git a/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S b/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S
+new file mode 100644
+index 0000000000..0dcea18dbb
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncat-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STRNCAT
++#define STRCAT __strncat_avx2_rtm
++#include "strcat-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/strncat-evex.S b/sysdeps/x86_64/multiarch/strncat-evex.S
+new file mode 100644
+index 0000000000..8884f02371
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncat-evex.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STRNCAT
++#define STRCAT __strncat_evex
++#include "strcat-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strncmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/strncmp-avx2-rtm.S
+new file mode 100644
+index 0000000000..68bad365ba
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncmp-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define STRCMP	__strncmp_avx2_rtm
++#define USE_AS_STRNCMP 1
++#define OVERFLOW_STRCMP	__strcmp_avx2_rtm
++#include "strcmp-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/strncmp-avx2.S b/sysdeps/x86_64/multiarch/strncmp-avx2.S
+index 1678bcc235..f138e9f1fd 100644
+--- a/sysdeps/x86_64/multiarch/strncmp-avx2.S
++++ b/sysdeps/x86_64/multiarch/strncmp-avx2.S
+@@ -1,3 +1,4 @@
+ #define STRCMP	__strncmp_avx2
+ #define USE_AS_STRNCMP 1
++#define OVERFLOW_STRCMP __strcmp_avx2
+ #include "strcmp-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strncmp-evex.S b/sysdeps/x86_64/multiarch/strncmp-evex.S
+new file mode 100644
+index 0000000000..a1d53e8c9f
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncmp-evex.S
+@@ -0,0 +1,3 @@
++#define STRCMP	__strncmp_evex
++#define USE_AS_STRNCMP 1
++#include "strcmp-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strncmp.c b/sysdeps/x86_64/multiarch/strncmp.c
+index 6b63b0ac29..dee2a41b02 100644
+--- a/sysdeps/x86_64/multiarch/strncmp.c
++++ b/sysdeps/x86_64/multiarch/strncmp.c
+@@ -30,16 +30,29 @@ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (sse42) attribute_hidden;
+ extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2_rtm) attribute_hidden;
++extern __typeof (REDIRECT_NAME) OPTIMIZE (evex) attribute_hidden;
+ 
+ static inline void *
+ IFUNC_SELECTOR (void)
+ {
+   const struct cpu_features* cpu_features = __get_cpu_features ();
  
-diff --git a/sysdeps/x86_64/configure.ac b/sysdeps/x86_64/configure.ac
-index cdaba0c075..611a7d9ba3 100644
---- a/sysdeps/x86_64/configure.ac
-+++ b/sysdeps/x86_64/configure.ac
-@@ -53,31 +53,6 @@ if test x"$build_mathvec" = xnotset; then
-   build_mathvec=yes
- fi
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
++  if (CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
++    {
++      if (CPU_FEATURES_ARCH_P (cpu_features, AVX512VL_Usable)
++	  && CPU_FEATURES_ARCH_P (cpu_features, AVX512BW_Usable)
++	  && CPU_FEATURES_CPU_P (cpu_features, BMI2)
++	  && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_AVX2_STRCMP))
++	return OPTIMIZE (evex);
++
++      if (CPU_FEATURES_CPU_P (cpu_features, RTM))
++	return OPTIMIZE (avx2_rtm);
++
++      if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER))
++	return OPTIMIZE (avx2);
++    }
  
--dnl Check if linker supports static PIE with the fix for
--dnl
--dnl https://sourceware.org/bugzilla/show_bug.cgi?id=21782
--dnl
--if test "$static_pie" = yes; then
--  AC_CACHE_CHECK(for linker static PIE support, libc_cv_ld_static_pie, [dnl
--cat > conftest.s <<\EOF
--	.text
--	.global _start
--	.weak foo
--_start:
--	leaq	foo(%rip), %rax
--EOF
--  libc_cv_pie_option="-Wl,-pie"
--  if AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS $LDFLAGS -nostartfiles -nostdlib $no_ssp $libc_cv_pie_option -o conftest conftest.s 1>&AS_MESSAGE_LOG_FD); then
--    libc_cv_ld_static_pie=yes
--  else
--    libc_cv_ld_static_pie=no
--  fi
--rm -f conftest*])
--  if test "$libc_cv_ld_static_pie" != yes; then
--    AC_MSG_ERROR([linker support for static PIE needed])
--  fi
--fi
--
- dnl It is always possible to access static and hidden symbols in an
- dnl position independent way.
- AC_DEFINE(PI_STATIC_AND_HIDDEN)
-diff --git a/sysdeps/x86_64/dl-machine.h b/sysdeps/x86_64/dl-machine.h
-index 8e9baffeb4..74029871d8 100644
---- a/sysdeps/x86_64/dl-machine.h
-+++ b/sysdeps/x86_64/dl-machine.h
-@@ -315,16 +315,22 @@ elf_machine_rela (struct link_map *map, const ElfW(Rela) *reloc,
- 	{
- # ifndef RTLD_BOOTSTRAP
- 	  if (sym_map != map
--	      && sym_map->l_type != lt_executable
- 	      && !sym_map->l_relocated)
- 	    {
- 	      const char *strtab
- 		= (const char *) D_PTR (map, l_info[DT_STRTAB]);
--	      _dl_error_printf ("\
-+	      if (sym_map->l_type == lt_executable)
-+		_dl_fatal_printf ("\
-+%s: IFUNC symbol '%s' referenced in '%s' is defined in the executable \
-+and creates an unsatisfiable circular dependency.\n",
-+				  RTLD_PROGNAME, strtab + refsym->st_name,
-+				  map->l_name);
-+	      else
-+		_dl_error_printf ("\
- %s: Relink `%s' with `%s' for IFUNC symbol `%s'\n",
--				RTLD_PROGNAME, map->l_name,
--				sym_map->l_name,
--				strtab + refsym->st_name);
-+				  RTLD_PROGNAME, map->l_name,
-+				  sym_map->l_name,
-+				  strtab + refsym->st_name);
- 	    }
+   if (CPU_FEATURES_CPU_P (cpu_features, SSE4_2)
+       && !CPU_FEATURES_ARCH_P (cpu_features, Slow_SSE4_2))
+diff --git a/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S b/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S
+new file mode 100644
+index 0000000000..79e7083299
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncpy-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STRNCPY
++#define STRCPY __strncpy_avx2_rtm
++#include "strcpy-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/strncpy-evex.S b/sysdeps/x86_64/multiarch/strncpy-evex.S
+new file mode 100644
+index 0000000000..40e391f0da
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strncpy-evex.S
+@@ -0,0 +1,3 @@
++#define USE_AS_STRNCPY
++#define STRCPY __strncpy_evex
++#include "strcpy-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strnlen-avx2-rtm.S b/sysdeps/x86_64/multiarch/strnlen-avx2-rtm.S
+new file mode 100644
+index 0000000000..04f1626a5c
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strnlen-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define STRLEN __strnlen_avx2_rtm
++#define USE_AS_STRNLEN 1
++
++#include "strlen-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/strnlen-evex.S b/sysdeps/x86_64/multiarch/strnlen-evex.S
+new file mode 100644
+index 0000000000..722022f303
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strnlen-evex.S
+@@ -0,0 +1,4 @@
++#define STRLEN __strnlen_evex
++#define USE_AS_STRNLEN 1
++
++#include "strlen-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/strrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/strrchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..5def14ec1c
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strrchr-avx2-rtm.S
+@@ -0,0 +1,12 @@
++#ifndef STRRCHR
++# define STRRCHR __strrchr_avx2_rtm
++#endif
++
++#define ZERO_UPPER_VEC_REGISTERS_RETURN \
++  ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST
++
++#define VZEROUPPER_RETURN jmp	 L(return_vzeroupper)
++
++#define SECTION(p) p##.avx.rtm
++
++#include "strrchr-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/strrchr-avx2.S b/sysdeps/x86_64/multiarch/strrchr-avx2.S
+index 23077b4c45..bfb860ebba 100644
+--- a/sysdeps/x86_64/multiarch/strrchr-avx2.S
++++ b/sysdeps/x86_64/multiarch/strrchr-avx2.S
+@@ -36,9 +36,13 @@
+ #  define VZEROUPPER	vzeroupper
  # endif
- 	  value = ((ElfW(Addr) (*) (void)) value) ();
-diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
-index c763b7d871..06f70eb1b0 100644
---- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
-+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
-@@ -67,6 +67,13 @@
- # define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
- #endif
  
-+/* Avoid short distance rep movsb only with non-SSE vector.  */
-+#ifndef AVOID_SHORT_DISTANCE_REP_MOVSB
-+# define AVOID_SHORT_DISTANCE_REP_MOVSB (VEC_SIZE > 16)
-+#else
-+# define AVOID_SHORT_DISTANCE_REP_MOVSB 0
-+#endif
++# ifndef SECTION
++#  define SECTION(p)	p##.avx
++# endif
 +
- #ifndef PREFETCH
- # define PREFETCH(addr) prefetcht0 addr
- #endif
-@@ -244,7 +251,7 @@ L(return):
- 	ret
+ # define VEC_SIZE	32
  
- L(movsb):
--	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
-+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
- 	jae	L(more_8x_vec)
- 	cmpq	%rsi, %rdi
- 	jb	1f
-@@ -257,7 +264,21 @@ L(movsb):
- #  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
+-	.section .text.avx,"ax",@progbits
++	.section SECTION(.text),"ax",@progbits
+ ENTRY (STRRCHR)
+ 	movd	%esi, %xmm4
+ 	movl	%edi, %ecx
+@@ -166,8 +170,8 @@ L(return_value):
  # endif
- 	jb	L(more_8x_vec_backward)
-+# if AVOID_SHORT_DISTANCE_REP_MOVSB
-+	movq	%rdi, %rcx
-+	subq	%rsi, %rcx
-+	jmp	2f
+ 	bsrl	%eax, %eax
+ 	leaq	-VEC_SIZE(%rdi, %rax), %rax
+-	VZEROUPPER
+-	ret
++L(return_vzeroupper):
++	ZERO_UPPER_VEC_REGISTERS_RETURN
+ 
+ 	.p2align 4
+ L(match):
+@@ -198,8 +202,7 @@ L(find_nul):
+ 	jz	L(return_value)
+ 	bsrl	%eax, %eax
+ 	leaq	-VEC_SIZE(%rdi, %rax), %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(char_and_nul):
+@@ -222,14 +225,12 @@ L(char_and_nul_in_first_vec):
+ 	jz	L(return_null)
+ 	bsrl	%eax, %eax
+ 	leaq	-VEC_SIZE(%rdi, %rax), %rax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ 	.p2align 4
+ L(return_null):
+ 	xorl	%eax, %eax
+-	VZEROUPPER
+-	ret
++	VZEROUPPER_RETURN
+ 
+ END (STRRCHR)
+ #endif
+diff --git a/sysdeps/x86_64/multiarch/strrchr-evex.S b/sysdeps/x86_64/multiarch/strrchr-evex.S
+new file mode 100644
+index 0000000000..f920b5a584
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/strrchr-evex.S
+@@ -0,0 +1,265 @@
++/* strrchr/wcsrchr optimized with 256-bit EVEX instructions.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#if IS_IN (libc)
++
++# include <sysdep.h>
++
++# ifndef STRRCHR
++#  define STRRCHR	__strrchr_evex
++# endif
++
++# define VMOVU		vmovdqu64
++# define VMOVA		vmovdqa64
++
++# ifdef USE_AS_WCSRCHR
++#  define VPBROADCAST	vpbroadcastd
++#  define VPCMP		vpcmpd
++#  define SHIFT_REG	r8d
++# else
++#  define VPBROADCAST	vpbroadcastb
++#  define VPCMP		vpcmpb
++#  define SHIFT_REG	ecx
++# endif
++
++# define XMMZERO	xmm16
++# define YMMZERO	ymm16
++# define YMMMATCH	ymm17
++# define YMM1		ymm18
++
++# define VEC_SIZE	32
++
++	.section .text.evex,"ax",@progbits
++ENTRY (STRRCHR)
++	movl	%edi, %ecx
++	/* Broadcast CHAR to YMMMATCH.  */
++	VPBROADCAST %esi, %YMMMATCH
++
++	vpxorq	%XMMZERO, %XMMZERO, %XMMZERO
++
++	/* Check if we may cross page boundary with one vector load.  */
++	andl	$(2 * VEC_SIZE - 1), %ecx
++	cmpl	$VEC_SIZE, %ecx
++	ja	L(cros_page_boundary)
++
++	VMOVU	(%rdi), %YMM1
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %ecx
++	kmovd	%k1, %eax
++
++	addq	$VEC_SIZE, %rdi
++
++	testl	%eax, %eax
++	jnz	L(first_vec)
++
++	testl	%ecx, %ecx
++	jnz	L(return_null)
++
++	andq	$-VEC_SIZE, %rdi
++	xorl	%edx, %edx
++	jmp	L(aligned_loop)
++
++	.p2align 4
++L(first_vec):
++	/* Check if there is a null byte.  */
++	testl	%ecx, %ecx
++	jnz	L(char_and_nul_in_first_vec)
++
++	/* Remember the match and keep searching.  */
++	movl	%eax, %edx
++	movq	%rdi, %rsi
++	andq	$-VEC_SIZE, %rdi
++	jmp	L(aligned_loop)
++
++	.p2align 4
++L(cros_page_boundary):
++	andl	$(VEC_SIZE - 1), %ecx
++	andq	$-VEC_SIZE, %rdi
++
++# ifdef USE_AS_WCSRCHR
++	/* NB: Divide shift count by 4 since each bit in K1 represent 4
++	   bytes.  */
++	movl	%ecx, %SHIFT_REG
++	sarl	$2, %SHIFT_REG
++# endif
++
++	VMOVA	(%rdi), %YMM1
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %edx
++	kmovd	%k1, %eax
++
++	shrxl	%SHIFT_REG, %edx, %edx
++	shrxl	%SHIFT_REG, %eax, %eax
++	addq	$VEC_SIZE, %rdi
++
++	/* Check if there is a CHAR.  */
++	testl	%eax, %eax
++	jnz	L(found_char)
++
++	testl	%edx, %edx
++	jnz	L(return_null)
++
++	jmp	L(aligned_loop)
++
++	.p2align 4
++L(found_char):
++	testl	%edx, %edx
++	jnz	L(char_and_nul)
++
++	/* Remember the match and keep searching.  */
++	movl	%eax, %edx
++	leaq	(%rdi, %rcx), %rsi
++
++	.p2align 4
++L(aligned_loop):
++	VMOVA	(%rdi), %YMM1
++	addq	$VEC_SIZE, %rdi
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %ecx
++	kmovd	%k1, %eax
++	orl	%eax, %ecx
++	jnz	L(char_nor_null)
++
++	VMOVA	(%rdi), %YMM1
++	add	$VEC_SIZE, %rdi
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %ecx
++	kmovd	%k1, %eax
++	orl	%eax, %ecx
++	jnz	L(char_nor_null)
++
++	VMOVA	(%rdi), %YMM1
++	addq	$VEC_SIZE, %rdi
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %ecx
++	kmovd	%k1, %eax
++	orl	%eax, %ecx
++	jnz	L(char_nor_null)
++
++	VMOVA	(%rdi), %YMM1
++	addq	$VEC_SIZE, %rdi
++
++	/* Each bit in K0 represents a null byte in YMM1.  */
++	VPCMP	$0, %YMMZERO, %YMM1, %k0
++	/* Each bit in K1 represents a CHAR in YMM1.  */
++	VPCMP	$0, %YMMMATCH, %YMM1, %k1
++	kmovd	%k0, %ecx
++	kmovd	%k1, %eax
++	orl	%eax, %ecx
++	jz	L(aligned_loop)
++
++	.p2align 4
++L(char_nor_null):
++	/* Find a CHAR or a null byte in a loop.  */
++	testl	%eax, %eax
++	jnz	L(match)
++L(return_value):
++	testl	%edx, %edx
++	jz	L(return_null)
++	movl	%edx, %eax
++	movq	%rsi, %rdi
++	bsrl	%eax, %eax
++# ifdef USE_AS_WCSRCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	-VEC_SIZE(%rdi, %rax, 4), %rax
++# else
++	leaq	-VEC_SIZE(%rdi, %rax), %rax
 +# endif
- 1:
-+# if AVOID_SHORT_DISTANCE_REP_MOVSB
-+	movq	%rsi, %rcx
-+	subq	%rdi, %rcx
-+2:
-+/* Avoid "rep movsb" if RCX, the distance between source and destination,
-+   is N*4GB + [1..63] with N >= 0.  */
-+	cmpl	$63, %ecx
-+	jbe	L(more_2x_vec)	/* Avoid "rep movsb" if ECX <= 63.  */
++	ret
++
++	.p2align 4
++L(match):
++	/* Find a CHAR.  Check if there is a null byte.  */
++	kmovd	%k0, %ecx
++	testl	%ecx, %ecx
++	jnz	L(find_nul)
++
++	/* Remember the match and keep searching.  */
++	movl	%eax, %edx
++	movq	%rdi, %rsi
++	jmp	L(aligned_loop)
++
++	.p2align 4
++L(find_nul):
++	/* Mask out any matching bits after the null byte.  */
++	movl	%ecx, %r8d
++	subl	$1, %r8d
++	xorl	%ecx, %r8d
++	andl	%r8d, %eax
++	testl	%eax, %eax
++	/* If there is no CHAR here, return the remembered one.  */
++	jz	L(return_value)
++	bsrl	%eax, %eax
++# ifdef USE_AS_WCSRCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	-VEC_SIZE(%rdi, %rax, 4), %rax
++# else
++	leaq	-VEC_SIZE(%rdi, %rax), %rax
 +# endif
- 	mov	%RDX_LP, %RCX_LP
- 	rep movsb
- L(nop):
-@@ -402,7 +423,7 @@ L(more_8x_vec):
- 	addq	%r8, %rdx
- #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
- 	/* Check non-temporal store threshold.  */
--	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
-+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
- 	ja	L(large_forward)
- #endif
- L(loop_4x_vec_forward):
-@@ -454,7 +475,7 @@ L(more_8x_vec_backward):
- 	subq	%r8, %rdx
- #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc)
- 	/* Check non-temporal store threshold.  */
--	cmpq	__x86_shared_non_temporal_threshold(%rip), %rdx
-+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
- 	ja	L(large_backward)
- #endif
- L(loop_4x_vec_backward):
-diff --git a/sysdeps/x86_64/multiarch/strcmp-avx2.S b/sysdeps/x86_64/multiarch/strcmp-avx2.S
-index 48d03a9f46..ee82fa3e19 100644
---- a/sysdeps/x86_64/multiarch/strcmp-avx2.S
-+++ b/sysdeps/x86_64/multiarch/strcmp-avx2.S
-@@ -591,7 +591,14 @@ L(loop_cross_page_2_vec):
- 	movl	$(PAGE_SIZE / (VEC_SIZE * 4) - 1), %esi
- 
- 	testq	%rdi, %rdi
-+# ifdef USE_AS_STRNCMP
-+	/* At this point, if %rdi value is 0, it already tested
-+	   VEC_SIZE*4+%r10 byte starting from %rax. This label
-+	   checks whether strncmp maximum offset reached or not.  */
-+	je	L(string_nbyte_offset_check)
++	ret
++
++	.p2align 4
++L(char_and_nul):
++	/* Find both a CHAR and a null byte.  */
++	addq	%rcx, %rdi
++	movl	%edx, %ecx
++L(char_and_nul_in_first_vec):
++	/* Mask out any matching bits after the null byte.  */
++	movl	%ecx, %r8d
++	subl	$1, %r8d
++	xorl	%ecx, %r8d
++	andl	%r8d, %eax
++	testl	%eax, %eax
++	/* Return null pointer if the null byte comes first.  */
++	jz	L(return_null)
++	bsrl	%eax, %eax
++# ifdef USE_AS_WCSRCHR
++	/* NB: Multiply wchar_t count by 4 to get the number of bytes.  */
++	leaq	-VEC_SIZE(%rdi, %rax, 4), %rax
 +# else
- 	je	L(back_to_loop)
++	leaq	-VEC_SIZE(%rdi, %rax), %rax
 +# endif
- 	tzcntq	%rdi, %rcx
- 	addq	%r10, %rcx
- 	/* Adjust for number of bytes skipped.  */
-@@ -627,6 +634,14 @@ L(loop_cross_page_2_vec):
- 	VZEROUPPER
- 	ret
++	ret
++
++	.p2align 4
++L(return_null):
++	xorl	%eax, %eax
++	ret
++
++END (STRRCHR)
++#endif
+diff --git a/sysdeps/x86_64/multiarch/wcschr-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcschr-avx2-rtm.S
+new file mode 100644
+index 0000000000..d49dbbf0b4
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcschr-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define STRCHR __wcschr_avx2_rtm
++#define USE_AS_WCSCHR 1
++#include "strchr-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcschr-evex.S b/sysdeps/x86_64/multiarch/wcschr-evex.S
+new file mode 100644
+index 0000000000..7cb8f1e41a
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcschr-evex.S
+@@ -0,0 +1,3 @@
++#define STRCHR __wcschr_evex
++#define USE_AS_WCSCHR 1
++#include "strchr-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wcscmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcscmp-avx2-rtm.S
+new file mode 100644
+index 0000000000..d6ca2b8064
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcscmp-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define STRCMP __wcscmp_avx2_rtm
++#define USE_AS_WCSCMP 1
++
++#include "strcmp-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcscmp-evex.S b/sysdeps/x86_64/multiarch/wcscmp-evex.S
+new file mode 100644
+index 0000000000..42e73e51eb
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcscmp-evex.S
+@@ -0,0 +1,4 @@
++#define STRCMP __wcscmp_evex
++#define USE_AS_WCSCMP 1
++
++#include "strcmp-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wcslen-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcslen-avx2-rtm.S
+new file mode 100644
+index 0000000000..35658d7365
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcslen-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define STRLEN __wcslen_avx2_rtm
++#define USE_AS_WCSLEN 1
++
++#include "strlen-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcslen-evex.S b/sysdeps/x86_64/multiarch/wcslen-evex.S
+new file mode 100644
+index 0000000000..bdafa83bd5
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcslen-evex.S
+@@ -0,0 +1,4 @@
++#define STRLEN __wcslen_evex
++#define USE_AS_WCSLEN 1
++
++#include "strlen-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wcslen-sse4_1.S b/sysdeps/x86_64/multiarch/wcslen-sse4_1.S
+new file mode 100644
+index 0000000000..7e62621afc
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcslen-sse4_1.S
+@@ -0,0 +1,4 @@
++#define AS_WCSLEN
++#define strlen	__wcslen_sse4_1
++
++#include "strlen-vec.S"
+diff --git a/sysdeps/x86_64/multiarch/wcslen.c b/sysdeps/x86_64/multiarch/wcslen.c
+index bb97438c7f..26b5fdffd6 100644
+--- a/sysdeps/x86_64/multiarch/wcslen.c
++++ b/sysdeps/x86_64/multiarch/wcslen.c
+@@ -24,7 +24,7 @@
+ # undef __wcslen
+ 
+ # define SYMBOL_NAME wcslen
+-# include "ifunc-avx2.h"
++# include "ifunc-wcslen.h"
+ 
+ libc_ifunc_redirected (__redirect_wcslen, __wcslen, IFUNC_SELECTOR ());
+ weak_alias (__wcslen, wcslen);
+diff --git a/sysdeps/x86_64/multiarch/wcsncmp-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcsncmp-avx2-rtm.S
+new file mode 100644
+index 0000000000..f467582cbe
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsncmp-avx2-rtm.S
+@@ -0,0 +1,5 @@
++#define STRCMP __wcsncmp_avx2_rtm
++#define USE_AS_STRNCMP 1
++#define USE_AS_WCSCMP 1
++#define OVERFLOW_STRCMP	__wcscmp_avx2_rtm
++#include "strcmp-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsncmp-avx2.S b/sysdeps/x86_64/multiarch/wcsncmp-avx2.S
+index 4fa1de4d3f..e9ede522b8 100644
+--- a/sysdeps/x86_64/multiarch/wcsncmp-avx2.S
++++ b/sysdeps/x86_64/multiarch/wcsncmp-avx2.S
+@@ -1,5 +1,5 @@
+ #define STRCMP __wcsncmp_avx2
+ #define USE_AS_STRNCMP 1
+ #define USE_AS_WCSCMP 1
+-
++#define OVERFLOW_STRCMP	__wcscmp_avx2
+ #include "strcmp-avx2.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsncmp-evex.S b/sysdeps/x86_64/multiarch/wcsncmp-evex.S
+new file mode 100644
+index 0000000000..8a8e310713
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsncmp-evex.S
+@@ -0,0 +1,5 @@
++#define STRCMP __wcsncmp_evex
++#define USE_AS_STRNCMP 1
++#define USE_AS_WCSCMP 1
++
++#include "strcmp-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsnlen-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcsnlen-avx2-rtm.S
+new file mode 100644
+index 0000000000..7437ebee2d
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsnlen-avx2-rtm.S
+@@ -0,0 +1,5 @@
++#define STRLEN __wcsnlen_avx2_rtm
++#define USE_AS_WCSLEN 1
++#define USE_AS_STRNLEN 1
++
++#include "strlen-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsnlen-evex.S b/sysdeps/x86_64/multiarch/wcsnlen-evex.S
+new file mode 100644
+index 0000000000..24773bb4e2
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsnlen-evex.S
+@@ -0,0 +1,5 @@
++#define STRLEN __wcsnlen_evex
++#define USE_AS_WCSLEN 1
++#define USE_AS_STRNLEN 1
++
++#include "strlen-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S b/sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S
+index a8cab0cb00..5fa51fe07c 100644
+--- a/sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S
++++ b/sysdeps/x86_64/multiarch/wcsnlen-sse4_1.S
+@@ -2,4 +2,4 @@
+ #define AS_STRNLEN
+ #define strlen	__wcsnlen_sse4_1
+ 
+-#include "../strlen.S"
++#include "strlen-vec.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsnlen.c b/sysdeps/x86_64/multiarch/wcsnlen.c
+index 8c1fc1a574..f15c1b328b 100644
+--- a/sysdeps/x86_64/multiarch/wcsnlen.c
++++ b/sysdeps/x86_64/multiarch/wcsnlen.c
+@@ -24,27 +24,7 @@
+ # undef __wcsnlen
+ 
+ # define SYMBOL_NAME wcsnlen
+-# include <init-arch.h>
+-
+-extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2) attribute_hidden;
+-extern __typeof (REDIRECT_NAME) OPTIMIZE (sse4_1) attribute_hidden;
+-extern __typeof (REDIRECT_NAME) OPTIMIZE (avx2) attribute_hidden;
+-
+-static inline void *
+-IFUNC_SELECTOR (void)
+-{
+-  const struct cpu_features* cpu_features = __get_cpu_features ();
+-
+-  if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX2_Usable)
+-      && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
+-    return OPTIMIZE (avx2);
+-
+-  if (CPU_FEATURES_CPU_P (cpu_features, SSE4_1))
+-    return OPTIMIZE (sse4_1);
+-
+-  return OPTIMIZE (sse2);
+-}
++# include "ifunc-wcslen.h"
  
-+# ifdef USE_AS_STRNCMP
-+L(string_nbyte_offset_check):
-+	leaq	(VEC_SIZE * 4)(%r10), %r10
-+	cmpq	%r10, %r11
-+	jbe	L(zero)
-+	jmp	L(back_to_loop)
-+# endif
+ libc_ifunc_redirected (__redirect_wcsnlen, __wcsnlen, IFUNC_SELECTOR ());
+ weak_alias (__wcsnlen, wcsnlen);
+diff --git a/sysdeps/x86_64/multiarch/wcsrchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/wcsrchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..9bf760833f
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsrchr-avx2-rtm.S
+@@ -0,0 +1,3 @@
++#define STRRCHR __wcsrchr_avx2_rtm
++#define USE_AS_WCSRCHR 1
++#include "strrchr-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wcsrchr-evex.S b/sysdeps/x86_64/multiarch/wcsrchr-evex.S
+new file mode 100644
+index 0000000000..c64602f7dc
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wcsrchr-evex.S
+@@ -0,0 +1,3 @@
++#define STRRCHR __wcsrchr_evex
++#define USE_AS_WCSRCHR 1
++#include "strrchr-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wmemchr-avx2-rtm.S b/sysdeps/x86_64/multiarch/wmemchr-avx2-rtm.S
+new file mode 100644
+index 0000000000..58ed21db01
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wmemchr-avx2-rtm.S
+@@ -0,0 +1,4 @@
++#define MEMCHR __wmemchr_avx2_rtm
++#define USE_AS_WMEMCHR 1
 +
- 	.p2align 4
- L(cross_page_loop):
- 	/* Check one byte/dword at a time.  */
++#include "memchr-avx2-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wmemchr-evex.S b/sysdeps/x86_64/multiarch/wmemchr-evex.S
+new file mode 100644
+index 0000000000..06cd0f9f5a
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wmemchr-evex.S
+@@ -0,0 +1,4 @@
++#define MEMCHR __wmemchr_evex
++#define USE_AS_WMEMCHR 1
++
++#include "memchr-evex.S"
+diff --git a/sysdeps/x86_64/multiarch/wmemcmp-avx2-movbe-rtm.S b/sysdeps/x86_64/multiarch/wmemcmp-avx2-movbe-rtm.S
+new file mode 100644
+index 0000000000..31104d1215
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wmemcmp-avx2-movbe-rtm.S
+@@ -0,0 +1,4 @@
++#define MEMCMP __wmemcmp_avx2_movbe_rtm
++#define USE_AS_WMEMCMP 1
++
++#include "memcmp-avx2-movbe-rtm.S"
+diff --git a/sysdeps/x86_64/multiarch/wmemcmp-evex-movbe.S b/sysdeps/x86_64/multiarch/wmemcmp-evex-movbe.S
+new file mode 100644
+index 0000000000..4726d74aa1
+--- /dev/null
++++ b/sysdeps/x86_64/multiarch/wmemcmp-evex-movbe.S
+@@ -0,0 +1,4 @@
++#define MEMCMP __wmemcmp_evex_movbe
++#define USE_AS_WMEMCMP 1
++
++#include "memcmp-evex-movbe.S"
+diff --git a/sysdeps/x86_64/strlen.S b/sysdeps/x86_64/strlen.S
+index 2e226d0d55..8422c15cc8 100644
+--- a/sysdeps/x86_64/strlen.S
++++ b/sysdeps/x86_64/strlen.S
+@@ -1,5 +1,5 @@
+-/* SSE2 version of strlen/wcslen.
+-   Copyright (C) 2012-2020 Free Software Foundation, Inc.
++/* SSE2 version of strlen.
++   Copyright (C) 2021 Free Software Foundation, Inc.
+    This file is part of the GNU C Library.
+ 
+    The GNU C Library is free software; you can redistribute it and/or
+@@ -16,243 +16,6 @@
+    License along with the GNU C Library; if not, see
+    <https://www.gnu.org/licenses/>.  */
+ 
+-#include <sysdep.h>
++#include "multiarch/strlen-vec.S"
+ 
+-#ifdef AS_WCSLEN
+-# define PMINU		pminud
+-# define PCMPEQ		pcmpeqd
+-# define SHIFT_RETURN	shrq $2, %rax
+-#else
+-# define PMINU		pminub
+-# define PCMPEQ		pcmpeqb
+-# define SHIFT_RETURN
+-#endif
+-
+-/* Long lived register in strlen(s), strnlen(s, n) are:
+-
+-	%xmm3 - zero
+-	%rdi   - s
+-	%r10  (s+n) & (~(64-1))
+-	%r11   s+n
+-*/
+-
+-
+-.text
+-ENTRY(strlen)
+-
+-/* Test 64 bytes from %rax for zero. Save result as bitmask in %rdx.  */
+-#define FIND_ZERO	\
+-	PCMPEQ	(%rax), %xmm0;	\
+-	PCMPEQ	16(%rax), %xmm1;	\
+-	PCMPEQ	32(%rax), %xmm2;	\
+-	PCMPEQ	48(%rax), %xmm3;	\
+-	pmovmskb	%xmm0, %esi;	\
+-	pmovmskb	%xmm1, %edx;	\
+-	pmovmskb	%xmm2, %r8d;	\
+-	pmovmskb	%xmm3, %ecx;	\
+-	salq	$16, %rdx;	\
+-	salq	$16, %rcx;	\
+-	orq	%rsi, %rdx;	\
+-	orq	%r8, %rcx;	\
+-	salq	$32, %rcx;	\
+-	orq	%rcx, %rdx;
+-
+-#ifdef AS_STRNLEN
+-/* Do not read anything when n==0.  */
+-	test	%RSI_LP, %RSI_LP
+-	jne	L(n_nonzero)
+-	xor	%rax, %rax
+-	ret
+-L(n_nonzero):
+-# ifdef AS_WCSLEN
+-	shl	$2, %RSI_LP
+-# endif
+-
+-/* Initialize long lived registers.  */
+-
+-	add	%RDI_LP, %RSI_LP
+-	mov	%RSI_LP, %R10_LP
+-	and	$-64, %R10_LP
+-	mov	%RSI_LP, %R11_LP
+-#endif
+-
+-	pxor	%xmm0, %xmm0
+-	pxor	%xmm1, %xmm1
+-	pxor	%xmm2, %xmm2
+-	pxor	%xmm3, %xmm3
+-	movq	%rdi, %rax
+-	movq	%rdi, %rcx
+-	andq	$4095, %rcx
+-/* Offsets 4032-4047 will be aligned into 4032 thus fit into page.  */
+-	cmpq	$4047, %rcx
+-/* We cannot unify this branching as it would be ~6 cycles slower.  */
+-	ja	L(cross_page)
+-
+-#ifdef AS_STRNLEN
+-/* Test if end is among first 64 bytes.  */
+-# define STRNLEN_PROLOG	\
+-	mov	%r11, %rsi;	\
+-	subq	%rax, %rsi;	\
+-	andq	$-64, %rax;	\
+-	testq	$-64, %rsi;	\
+-	je	L(strnlen_ret)
+-#else
+-# define STRNLEN_PROLOG  andq $-64, %rax;
+-#endif
+-
+-/* Ignore bits in mask that come before start of string.  */
+-#define PROLOG(lab)	\
+-	movq	%rdi, %rcx;	\
+-	xorq	%rax, %rcx;	\
+-	STRNLEN_PROLOG;	\
+-	sarq	%cl, %rdx;	\
+-	test	%rdx, %rdx;	\
+-	je	L(lab);	\
+-	bsfq	%rdx, %rax;	\
+-	SHIFT_RETURN;		\
+-	ret
+-
+-#ifdef AS_STRNLEN
+-	andq	$-16, %rax
+-	FIND_ZERO
+-#else
+-	/* Test first 16 bytes unaligned.  */
+-	movdqu	(%rax), %xmm4
+-	PCMPEQ	%xmm0, %xmm4
+-	pmovmskb	%xmm4, %edx
+-	test	%edx, %edx
+-	je 	L(next48_bytes)
+-	bsf	%edx, %eax /* If eax is zeroed 16bit bsf can be used.  */
+-	SHIFT_RETURN
+-	ret
+-
+-L(next48_bytes):
+-/* Same as FIND_ZERO except we do not check first 16 bytes.  */
+-	andq	$-16, %rax
+-	PCMPEQ 16(%rax), %xmm1
+-	PCMPEQ 32(%rax), %xmm2
+-	PCMPEQ 48(%rax), %xmm3
+-	pmovmskb	%xmm1, %edx
+-	pmovmskb	%xmm2, %r8d
+-	pmovmskb	%xmm3, %ecx
+-	salq	$16, %rdx
+-	salq	$16, %rcx
+-	orq	%r8, %rcx
+-	salq	$32, %rcx
+-	orq	%rcx, %rdx
+-#endif
+-
+-	/* When no zero byte is found xmm1-3 are zero so we do not have to
+-	   zero them.  */
+-	PROLOG(loop)
+-
+-	.p2align 4
+-L(cross_page):
+-	andq	$-64, %rax
+-	FIND_ZERO
+-	PROLOG(loop_init)
+-
+-#ifdef AS_STRNLEN
+-/* We must do this check to correctly handle strnlen (s, -1).  */
+-L(strnlen_ret):
+-	bts	%rsi, %rdx
+-	sarq	%cl, %rdx
+-	test	%rdx, %rdx
+-	je	L(loop_init)
+-	bsfq	%rdx, %rax
+-	SHIFT_RETURN
+-	ret
+-#endif
+-	.p2align 4
+-L(loop_init):
+-	pxor	%xmm1, %xmm1
+-	pxor	%xmm2, %xmm2
+-	pxor	%xmm3, %xmm3
+-#ifdef AS_STRNLEN
+-	.p2align 4
+-L(loop):
+-
+-	addq	$64, %rax
+-	cmpq	%rax, %r10
+-	je	L(exit_end)
+-
+-	movdqa	(%rax), %xmm0
+-	PMINU	16(%rax), %xmm0
+-	PMINU	32(%rax), %xmm0
+-	PMINU	48(%rax), %xmm0
+-	PCMPEQ	%xmm3, %xmm0
+-	pmovmskb	%xmm0, %edx
+-	testl	%edx, %edx
+-	jne	L(exit)
+-	jmp	L(loop)
+-
+-	.p2align 4
+-L(exit_end):
+-	cmp	%rax, %r11
+-	je	L(first) /* Do not read when end is at page boundary.  */
+-	pxor	%xmm0, %xmm0
+-	FIND_ZERO
+-
+-L(first):
+-	bts	%r11, %rdx
+-	bsfq	%rdx, %rdx
+-	addq	%rdx, %rax
+-	subq	%rdi, %rax
+-	SHIFT_RETURN
+-	ret
+-
+-	.p2align 4
+-L(exit):
+-	pxor	%xmm0, %xmm0
+-	FIND_ZERO
+-
+-	bsfq	%rdx, %rdx
+-	addq	%rdx, %rax
+-	subq	%rdi, %rax
+-	SHIFT_RETURN
+-	ret
+-
+-#else
+-
+-	/* Main loop.  Unrolled twice to improve L2 cache performance on core2.  */
+-	.p2align 4
+-L(loop):
+-
+-	movdqa	64(%rax), %xmm0
+-	PMINU	80(%rax), %xmm0
+-	PMINU	96(%rax), %xmm0
+-	PMINU	112(%rax), %xmm0
+-	PCMPEQ	%xmm3, %xmm0
+-	pmovmskb	%xmm0, %edx
+-	testl	%edx, %edx
+-	jne	L(exit64)
+-
+-	subq	$-128, %rax
+-
+-	movdqa	(%rax), %xmm0
+-	PMINU	16(%rax), %xmm0
+-	PMINU	32(%rax), %xmm0
+-	PMINU	48(%rax), %xmm0
+-	PCMPEQ	%xmm3, %xmm0
+-	pmovmskb	%xmm0, %edx
+-	testl	%edx, %edx
+-	jne	L(exit0)
+-	jmp	L(loop)
+-
+-	.p2align 4
+-L(exit64):
+-	addq	$64, %rax
+-L(exit0):
+-	pxor	%xmm0, %xmm0
+-	FIND_ZERO
+-
+-	bsfq	%rdx, %rdx
+-	addq	%rdx, %rax
+-	subq	%rdi, %rax
+-	SHIFT_RETURN
+-	ret
+-
+-#endif
+-
+-END(strlen)
+ libc_hidden_builtin_def (strlen)
+diff --git a/sysdeps/x86_64/sysdep.h b/sysdeps/x86_64/sysdep.h
+index 0b73674f68..c8ad778fee 100644
+--- a/sysdeps/x86_64/sysdep.h
++++ b/sysdeps/x86_64/sysdep.h
+@@ -95,6 +95,28 @@ lose:									      \
+ #define R14_LP	r14
+ #define R15_LP	r15
+ 
++/* Zero upper vector registers and return with xtest.  NB: Use VZEROALL
++   to avoid RTM abort triggered by VZEROUPPER inside transactionally.  */
++#define ZERO_UPPER_VEC_REGISTERS_RETURN_XTEST \
++	xtest;							\
++	jz	1f;						\
++	vzeroall;						\
++	ret;							\
++1:								\
++	vzeroupper;						\
++	ret
++
++/* Zero upper vector registers and return.  */
++#ifndef ZERO_UPPER_VEC_REGISTERS_RETURN
++# define ZERO_UPPER_VEC_REGISTERS_RETURN \
++	VZEROUPPER;						\
++	ret
++#endif
++
++#ifndef VZEROUPPER_RETURN
++# define VZEROUPPER_RETURN	VZEROUPPER; ret
++#endif
++
+ #else	/* __ASSEMBLER__ */
+ 
+ /* Long and pointer size in bytes.  */
+diff --git a/sysdeps/x86_64/tst-rsi-strlen.c b/sysdeps/x86_64/tst-rsi-strlen.c
+new file mode 100644
+index 0000000000..a80c4f85c2
+--- /dev/null
++++ b/sysdeps/x86_64/tst-rsi-strlen.c
+@@ -0,0 +1,81 @@
++/* Test strlen with 0 in the RSI register.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#ifdef WIDE
++# define TEST_NAME "wcslen"
++#else
++# define TEST_NAME "strlen"
++#endif /* WIDE */
++
++#define TEST_MAIN
++#include <string/test-string.h>
++
++#ifdef WIDE
++# include <wchar.h>
++# define STRLEN wcslen
++# define CHAR wchar_t
++#else
++# define STRLEN strlen
++# define CHAR char
++#endif /* WIDE */
++
++IMPL (STRLEN, 1)
++
++typedef size_t (*proto_t) (const CHAR *);
++
++typedef struct
++{
++  void (*fn) (void);
++} parameter_t;
++
++size_t
++__attribute__ ((weak, noinline, noclone))
++do_strlen (parameter_t *a, int zero, const CHAR *str)
++{
++  return CALL (a, str);
++}
++
++static int
++test_main (void)
++{
++  test_init ();
++
++  size_t size = page_size / sizeof (CHAR) - 1;
++  CHAR *buf = (CHAR *) buf2;
++  buf[size] = 0;
++
++  parameter_t a;
++
++  int ret = 0;
++  FOR_EACH_IMPL (impl, 0)
++    {
++      a.fn = impl->fn;
++      /* NB: Pass 0 in RSI.  */
++      size_t res = do_strlen (&a, 0, buf);
++      if (res != size)
++	{
++	  error (0, 0, "Wrong result in function %s: %zu != %zu",
++		 impl->name, res, size);
++	  ret = 1;
++	}
++    }
++
++  return ret ? EXIT_FAILURE : EXIT_SUCCESS;
++}
++
++#include <support/test-driver.c>
+diff --git a/sysdeps/x86_64/tst-rsi-wcslen.c b/sysdeps/x86_64/tst-rsi-wcslen.c
+new file mode 100644
+index 0000000000..f45a7dfb51
+--- /dev/null
++++ b/sysdeps/x86_64/tst-rsi-wcslen.c
+@@ -0,0 +1,20 @@
++/* Test wcslen with 0 in the RSI register.
++   Copyright (C) 2021 Free Software Foundation, Inc.
++   This file is part of the GNU C Library.
++
++   The GNU C Library is free software; you can redistribute it and/or
++   modify it under the terms of the GNU Lesser General Public
++   License as published by the Free Software Foundation; either
++   version 2.1 of the License, or (at your option) any later version.
++
++   The GNU C Library is distributed in the hope that it will be useful,
++   but WITHOUT ANY WARRANTY; without even the implied warranty of
++   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
++   Lesser General Public License for more details.
++
++   You should have received a copy of the GNU Lesser General Public
++   License along with the GNU C Library; if not, see
++   <https://www.gnu.org/licenses/>.  */
++
++#define WIDE 1
++#include "tst-rsi-strlen.c"
diff --git a/debian/patches/hurd-i386/git-posix_openpt.diff b/debian/patches/hurd-i386/git-posix_openpt.diff
index cd655929..2078bb4c 100644
--- a/debian/patches/hurd-i386/git-posix_openpt.diff
+++ b/debian/patches/hurd-i386/git-posix_openpt.diff
@@ -13,14 +13,6 @@ Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
     (__posix_openpt): Replace stub with implementation on top of __bsd_openpt.
     (posix_openpt): Remove stub warning.
     
-    * sysdeps/unix/sysv/linux/getpt.c (__bsd_getpt): Replace prototype with
-    __bsd_openpt prototype.
-    (__getpt): Use __bsd_openpt instead of __bsd_getpt (as fallback when
-    _posix_openpt fails).
-    (getpt): Add alias
-    (__getpt): Do not define.
-    (HAVE_GETPT): Define.
-
 diff --git a/sysdeps/unix/bsd/getpt.c b/sysdeps/unix/bsd/getpt.c
 index 46207f4e62..0eff0b54a3 100644
 --- a/sysdeps/unix/bsd/getpt.c
@@ -70,34 +62,3 @@ index 46207f4e62..0eff0b54a3 100644
 -
 -stub_warning (posix_openpt)
  #endif
-diff --git a/sysdeps/unix/sysv/linux/getpt.c b/sysdeps/unix/sysv/linux/getpt.c
-index cdde8377f5..1cb99d5185 100644
---- a/sysdeps/unix/sysv/linux/getpt.c
-+++ b/sysdeps/unix/sysv/linux/getpt.c
-@@ -31,7 +31,7 @@
- #define _PATH_DEVPTS _PATH_DEV "pts"
- 
- /* Prototype for function that opens BSD-style master pseudo-terminals.  */
--extern int __bsd_getpt (void) attribute_hidden;
-+extern int __bsd_openpt (int oflag) attribute_hidden;
- 
- /* Open a master pseudo terminal and return its file descriptor.  */
- int
-@@ -88,14 +88,15 @@ __getpt (void)
- {
-   int fd = __posix_openpt (O_RDWR);
-   if (fd == -1)
--    fd = __bsd_getpt ();
-+    fd = __bsd_openpt (O_RDWR);
-   return fd;
- }
-+weak_alias (__getpt, getpt)
- 
- 
- #define PTYNAME1 "pqrstuvwxyzabcde";
- #define PTYNAME2 "0123456789abcdef";
- 
--#define __getpt __bsd_getpt
-+#define HAVE_GETPT
- #define HAVE_POSIX_OPENPT
- #include <sysdeps/unix/bsd/getpt.c>
diff --git a/debian/rules.d/build.mk b/debian/rules.d/build.mk
index 1f3fdaeb..77a8258c 100644
--- a/debian/rules.d/build.mk
+++ b/debian/rules.d/build.mk
@@ -110,6 +110,7 @@ endif
 		--enable-stackguard-randomization \
 		--enable-stack-protector=strong \
 		--enable-obsolete-rpc \
+		--with-default-link=no \
 		--with-pkgversion="Debian GLIBC $(DEB_VERSION)" \
 		--with-bugurl="http://www.debian.org/Bugs/"; \
 		$(if $(filter $(pt_chown),yes),--enable-pt_chown) \
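
(Not part of the upload: as a purely illustrative local smoke test, the
page-boundary and large-n code paths touched by the wide-string fixes above
can be exercised with a small standalone program run against the rebuilt
libc6. All names below are ad-hoc; this is only a sanity check and does not
reproduce the original overflow bugs.)

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <wchar.h>

int
main (void)
{
  size_t page = (size_t) sysconf (_SC_PAGESIZE);

  /* Two pages; the second is made inaccessible so that any read past the
     end of the string faults instead of going unnoticed.  */
  wchar_t *map = mmap (NULL, 2 * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (map == MAP_FAILED
      || mprotect ((char *) map + page, page, PROT_NONE) != 0)
    {
      perror ("mmap/mprotect");
      return EXIT_FAILURE;
    }

  /* Place the terminating L'\0' in the last wchar_t before the guard page.  */
  size_t nchars = page / sizeof (wchar_t);
  for (size_t i = 0; i < nchars - 1; i++)
    map[i] = L'A';
  map[nchars - 1] = L'\0';

  if (wcslen (map) != nchars - 1)
    {
      fputs ("wcslen: unexpected length\n", stderr);
      return EXIT_FAILURE;
    }
  /* Huge maximum length, as in the fixed wcsncmp corner case.  */
  if (wcsncmp (map, map, (size_t) -1) != 0)
    {
      fputs ("wcsncmp: unexpected result\n", stderr);
      return EXIT_FAILURE;
    }

  puts ("ok");
  return EXIT_SUCCESS;
}

Built with something like "gcc -O2 wcs-smoke.c -o wcs-smoke" and run on a
system carrying the updated libc6, it should simply print "ok".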
