path: root/sysdeps/aarch64/multiarch
Age | Commit message | Author | Files | Lines
2025-04-15 | aarch64: Add back non-temporal load/stores from oryon-1's memset | Andrew Pinski | 1 | -0/+26
I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memset for large sizes. The recommendation was only against using them with device memory, and memset is already not valid for use with device memory. This reverts commit e6590f0c86632c36c9a784cf96075f4be2e920d2. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-04-15 | aarch64: Add back non-temporal load/stores from oryon-1's memcpy | Andrew Pinski | 1 | -0/+40
I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memcpy for large sizes. The recommendation was only against using them with device memory, and memcpy is already not valid for use with device memory. This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-27 | AArch64: Use prefer_sve_ifuncs for SVE memset | Wilco Dijkstra | 1 | -1/+1
Use prefer_sve_ifuncs for SVE memset just like memcpy. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-24 | AArch64: Remove PTR_ARG/SIZE_ARG defines | Wilco Dijkstra | 12 | -50/+0
This series removes various ILP32 defines that are now no longer needed. Remove PTR_ARG/SIZE_ARG. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-20 | AArch64: Add SVE memset | Wilco Dijkstra | 4 | -0/+129
Add SVE memset based on the generic memset with predicated load for sizes < 16. Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned stores for the last 64 bytes. Performance of random memset benchmark improves by ~2% on Neoverse V1. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-01-01 | Update copyright dates with scripts/update-copyrights | Paul Eggert | 25 | -25/+25
2024-11-21 | aarch64: Remove non-temporal load/stores from oryon-1's memset | Andrew Pinski | 1 | -26/+0
The hardware architects have a new recommendation not to use non-temporal load/stores for memset. This patch removes this path. I found there was no difference in the memset speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-21 | aarch64: Remove non-temporal load/stores from oryon-1's memcpy | Andrew Pinski | 1 | -40/+0
The hardware architects have a new recommendation not to use non-temporal load/stores for memcpy. This patch removes this path. I found there was no difference in the memcpy speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-20 | AArch64: Remove thunderx{,2} memcpy | Andrew Pinski | 6 | -784/+0
ThunderX1 and ThunderX2 have been retired for a few years now. So let's remove the thunderx{,2} specific versions of memcpy. Their performance gain was for medium and large sizes, where the generic (aarch64) memcpy performs only slightly worse. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-10 | AArch64: Remove memset-reg.h | Wilco Dijkstra | 4 | -4/+28
Remove memset-reg.h by moving register definitions into the memset implementations. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-06-30 | Aarch64: Add new memset for Qualcomm's oryon-1 core | Andrew Pinski | 4 | -0/+176
Qualcomm's new core, oryon-1, has different characteristics for memset than the cores targeted by the current versions of memset. For non-zero, larger sizes, using GPRs rather than the SIMD stores is ~30% faster. For even larger sizes, nontemporal stores are needed to avoid polluting the L1/L2 caches. For zero values, `dc zva` should be used. Since we know the ZVA size will always be 64 bytes, we don't need to figure out the size there. I started with the emag memset and added back the `dc zva` code. Changes since v1: * v3: Fix comment formatting. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
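The three-way split described in this message can be pictured as a small selector. The following is only a sketch of the idea and not the oryon-1 assembly: the thresholds, the enum names and the function `choose_memset_path` are hypothetical, and the commit message does not say at which sizes each path starts.

```c
#include <stddef.h>

/* Hypothetical thresholds, chosen only for illustration; the commit
   message does not give the real crossover sizes.  */
enum { LARGE_MIN_BYTES = 128, NONTEMPORAL_MIN_BYTES = 32768 };

enum memset_path
{
  PATH_SMALL,        /* Small sizes: whatever the small-size code does.  */
  PATH_GPR,          /* Non-zero, larger sizes: stores from GPRs.  */
  PATH_NONTEMPORAL,  /* Non-zero, even larger sizes: nontemporal stores.  */
  PATH_DC_ZVA        /* Zero value: zero whole 64-byte blocks.  */
};

/* Pick a path from the fill value and the length.  Like memset, only
   the low 8 bits of VALUE are significant.  Applying the zero path
   only above LARGE_MIN_BYTES is an assumption of this sketch.  */
static enum memset_path
choose_memset_path (int value, size_t n)
{
  unsigned char byte = (unsigned char) value;

  if (n < LARGE_MIN_BYTES)
    return PATH_SMALL;
  if (byte == 0)
    return PATH_DC_ZVA;
  if (n >= NONTEMPORAL_MIN_BYTES)
    return PATH_NONTEMPORAL;
  return PATH_GPR;
}
```

With these made-up thresholds, `choose_memset_path (0xAB, 4096)` selects the GPR path and `choose_memset_path (0, 4096)` selects the `dc zva` path.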
2024-06-30 | Aarch64: Add memcpy for qualcomm's oryon-1 core | Andrew Pinski | 4 | -0/+310
Qualcomm's new core (oryon-1) has different performance characteristics than other cores. For memcpy, it is faster to use the GPRs to do the copy for large sizes (2x faster). For even larger sizes, it is better to use the nontemporal load/store instructions so we don't pollute the L1/L2 caches. For smaller sizes, the characteristics are very similar to other cores. I used the thunderx memcpy as a starting point and expanded from there. Changes since v1: * v2: Fix ordering in Makefile. * v3: Fix comment grammar about the ldnp/stnp instructions. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-05-23 | aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777) | Adhemerval Zanella | 2 | -0/+6
The generic version provides weak definitions of memchr/strlen, which are already provided by the ifunc resolvers. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-07 | elf: Only process multiple tunable once (BZ 31686) | Adhemerval Zanella | 1 | -0/+4
The 680c597e9c3 commit made the loader reject ill-formatted strings by first tracking all set tunables and then applying them. However, it does not take into consideration that the same tunable may be set multiple times: parse_tunables_string appends each tunable it finds without checking whether it is already in the list. This leads to a stack-based buffer overflow if a tunable is specified more times than the total number of tunables. For instance: GLIBC_TUNABLES=glibc.malloc.check=2:... (repeated more times than the total number of supported tunables). Instead, use the index of the tunable list to get the expected tunable entry. Since the initial list is now zero-initialized, the compiler might emit an extra memset, and this requires some minor adjustment on some ports. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reported-by: Yuto Maeda <maeda@cyberdefense.jp> Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
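The difference between appending to a list and indexing by tunable can be shown in a few lines. This is a simplified sketch and not the loader code: the tunable names, `struct pending` and `collect_tunables` are hypothetical, and unlike the real loader it skips malformed or unknown entries instead of rejecting the whole string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tunable table; the real list is generated at build time.  */
enum { NUM_TUNABLES = 3 };
static const char *const tunable_names[NUM_TUNABLES] =
  { "demo.check", "demo.perturb", "demo.arena" };

struct pending
{
  const char *value;  /* Points into the input string.  */
  size_t len;         /* Length of the value.  */
  int set;            /* Non-zero once the tunable was seen.  */
};

/* Return the table index of the name NAME[0..LEN), or -1 if unknown.  */
static int
find_tunable (const char *name, size_t len)
{
  for (int i = 0; i < NUM_TUNABLES; i++)
    if (strlen (tunable_names[i]) == len
        && memcmp (tunable_names[i], name, len) == 0)
      return i;
  return -1;
}

/* Record each "name=value" entry of the colon-separated STR in the slot
   that belongs to that tunable.  A repeated tunable overwrites its own
   slot, so at most NUM_TUNABLES slots are ever written no matter how
   often a name is repeated.  Returns the number of distinct tunables.  */
static int
collect_tunables (const char *str, struct pending slots[NUM_TUNABLES])
{
  int distinct = 0;
  memset (slots, 0, NUM_TUNABLES * sizeof slots[0]);

  while (*str != '\0')
    {
      size_t seg = strcspn (str, ":");
      const char *eq = memchr (str, '=', seg);
      if (eq != NULL)
        {
          int idx = find_tunable (str, (size_t) (eq - str));
          if (idx >= 0)
            {
              if (!slots[idx].set)
                distinct++;
              slots[idx].set = 1;
              slots[idx].value = eq + 1;
              slots[idx].len = seg - (size_t) (eq - str) - 1;
            }
        }
      str += seg;
      if (*str == ':')
        str++;
    }
  return distinct;
}
```

Because the slot is chosen by the tunable's own index, the storage requirement is fixed by the table size rather than by the length of the input string.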
2024-03-21 | AArch64: Check kernel version for SVE ifuncs | Wilco Dijkstra | 3 | -2/+4
Old Linux kernels disable SVE after every system call. Calling the SVE-optimized memcpy afterwards will then cause a trap to reenable SVE. As a result, applications with a high use of syscalls may run slower with the SVE memcpy. This is true for kernels between 4.15.0 and before 6.2.0, except for 5.14.0 which was patched. Avoid this by checking the kernel version and selecting the SVE ifunc on modern kernels. Parse the kernel version reported by uname() into a 24-bit kernel.major.minor value without calling any library functions. If uname() is not supported or if the version format is not recognized, assume the kernel is modern. Tested-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
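Packing the version into one integer makes the range check a plain comparison. The sketch below illustrates that idea under stated assumptions and is not the glibc source: the names `parse_kernel_version`, `sve_ifuncs_preferred` and `KERNEL_MODERN` are hypothetical, the release string is passed in by the caller rather than read with uname(), and clamping each field to 255 is a choice made here so that a large patch level still fits in 8 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returned when the release string is not recognized, so that the
   kernel is treated as modern.  */
#define KERNEL_MODERN 0xffffffu

/* Parse "a.b.c[suffix]" into (a << 16) | (b << 8) | c using only
   character arithmetic, with no library calls.  */
static uint32_t
parse_kernel_version (const char *release)
{
  uint32_t version = 0;
  const char *p = release;

  for (int field = 0; field < 3; field++)
    {
      if (*p < '0' || *p > '9')
        return KERNEL_MODERN;

      uint32_t val = 0;
      for (; *p >= '0' && *p <= '9'; p++)
        {
          val = val * 10 + (uint32_t) (*p - '0');
          if (val > 255)
            val = 255;  /* Clamp so the field fits in 8 bits.  */
        }
      version = (version << 8) | val;

      /* The first two fields must be followed by a dot.  */
      if (field < 2)
        {
          if (*p != '.')
            return KERNEL_MODERN;
          p++;
        }
    }
  return version;
}

/* Prefer the SVE ifuncs on 6.2.0 and later, and on the patched 5.14.0.  */
static bool
sve_ifuncs_preferred (const char *release)
{
  uint32_t v = parse_kernel_version (release);
  return v >= 0x060200 || v == 0x050e00;
}
```

For example, `sve_ifuncs_preferred ("5.15.0")` is false, while `"6.2.0"`, `"5.14.0-284.el9"` and an unrecognized string all yield true. A real selector would also need the CPU to report SVE support, which this sketch leaves out.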
2024-01-01 | Update copyright dates with scripts/update-copyrights | Paul Eggert | 25 | -25/+25
2023-12-04 | aarch64: fix tested ifunc variants | Szabolcs Nagy | 1 | -3/+3
Don't test a64fx string functions when BTI is enabled since they are not BTI compatible.
2023-11-13 | AArch64: Remove Falkor memcpy | Wilco Dijkstra | 5 | -324/+0
The latest implementations of memcpy are actually faster than the Falkor implementations [1], so remove the falkor/phecda ifuncs for memcpy and the now unused IS_FALKOR/IS_PHECDA defines. [1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2023-11-13 | AArch64: Add memset_zva64 | Wilco Dijkstra | 5 | -63/+33
Add a specialized memset for the common ZVA size of 64 to avoid the overhead of reading the ZVA size. Since the code is identical to __memset_falkor, remove the latter. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
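The "overhead of reading the ZVA size" refers to decoding the DCZID_EL0 system register, whose low four bits hold log2 of the block size in 4-byte words and whose bit 4 (DZP) says that DC ZVA is prohibited. A minimal sketch of that decoding follows; it takes the register value as a parameter so it can run anywhere, and the function names `zva_block_bytes` and `can_use_zva64` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode a DCZID_EL0 value into the DC ZVA block size in bytes.
   Returns 0 when the DZP bit (bit 4) prohibits DC ZVA.  */
static unsigned int
zva_block_bytes (uint64_t dczid)
{
  if (dczid & 0x10)
    return 0;
  /* BS (bits 3:0) is log2 of the block size in 4-byte words.  */
  return 4u << (dczid & 0xf);
}

/* A selector could pick a 64-byte specialization only when the decoded
   size matches, which lets the memset itself skip this decoding.  */
static bool
can_use_zva64 (uint64_t dczid)
{
  return zva_block_bytes (dczid) == 64;
}
```

A BS field of 4 decodes to 64 bytes and a BS field of 6 decodes to 256 bytes.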
2023-11-13 | AArch64: Cleanup emag memset | Wilco Dijkstra | 4 | -197/+90
Cleanup emag memset - merge the memset_base64.S file, remove the unused ZVA code (since it is disabled on emag). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2023-11-01 | AArch64: Cleanup ifuncs | Wilco Dijkstra | 17 | -124/+40
Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename strlen_mte to strlen_generic. Remove rtld-memset. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-10-24 | AArch64: Add support for MOPS memcpy/memmove/memset | Wilco Dijkstra | 9 | -1/+137
Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for memcpy, memmove and memset (use .inst for now so it works with all binutils versions without needing complex configure and conditional compilation). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-05-30 | Fix misspellings in sysdeps/ -- BZ 25337 | Paul Pluzhnikov | 1 | -4/+4
2023-02-06 | AArch64: Improve SVE memcpy and memmove | Wilco Dijkstra | 1 | -20/+14
Improve SVE memcpy by copying 2 vectors if the size is small enough. This improves performance of random memcpy by ~9% on Neoverse V1, and 33-64 byte copies are ~16% faster. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17 | AArch64: Improve strlen_asimd | Wilco Dijkstra | 1 | -12/+4
Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance improves slightly as a result. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-06 | Update copyright dates with scripts/update-copyrights | Joseph Myers | 25 | -25/+25
2022-10-26 | aarch64: Use memcpy_simd as the default memcpy | Wilco Dijkstra | 5 | -259/+0
Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy with it. If SVE is available, a SVE memcpy will be used by default (including for Neoverse N2).
2022-10-26 | aarch64: Cleanup memset ifunc | Wilco Dijkstra | 2 | -17/+26
Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of 256, so add an explicit check.
2022-10-10 | elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support | Adhemerval Zanella | 1 | -0/+24
Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can be used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-06-10 | Add bounds check to __libc_ifunc_impl_list | Wilco Dijkstra | 1 | -7/+2
Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2022-06-07 | AArch64: Sort makefile entries | Wilco Dijkstra | 1 | -6/+18
Sort makefile entries to reduce conflicts.
2022-06-07 | AArch64: Add SVE memcpy | Wilco Dijkstra | 5 | -42/+284
Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE vectors which improves the random memcpy benchmark significantly. Cleanup the memcpy and memmove ifunc selectors.
2022-01-06 | AArch64: Check for SVE in ifuncs [BZ #28744] | Wilco Dijkstra | 3 | -3/+3
Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove. This fixes BZ #28744.
2022-01-01 | Update copyright dates with scripts/update-copyrights | Paul Eggert | 24 | -24/+24
I used these shell commands:

    ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
    (cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not.

    remote: *** 912-#endif
    remote: *** 913:
    remote: *** 914-
    remote: *** error: lines with trailing whitespace found
    ...
    remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2021-12-02 | AArch64: Improve A64FX memcpy | Wilco Dijkstra | 1 | -321/+225
v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes. Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-24 | aarch64: Disable A64FX memcpy/memmove BTI unconditionally | Naohiro Tamura | 1 | -0/+3
This patch unconditionally disables BTI instruction insertion for the A64FX memcpy/memmove, as the A64FX memset patch [1] did, for performance. [1] commit 07b427296b8d59f439144029d9a948f6c1ce0a31 Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-06 | AArch64: Update A64FX memset not to degrade at 16KB | Naohiro Tamura | 1 | -1/+8
This patch updates the unroll8 code so that performance does not degrade at the 16KB peak on either FX1000 or FX700. The two instructions inserted at the beginning of the unroll8 loop, cmp and branch, are a workaround that was found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-06 | Revert "AArch64: Update A64FX memset not to degrade at 16KB" | Szabolcs Nagy | 1 | -8/+1
Because of wrong commit author. Will recommit it with right author. This reverts commit 23777232c23f80809613bdfa329f63aadf992922.
2021-09-03 | AArch64: Update A64FX memset not to degrade at 16KB | Naohiro Tamura via Libc-alpha | 1 | -1/+8
This patch updates the unroll8 code so that performance does not degrade at the 16KB peak on either FX1000 or FX700. The two instructions inserted at the beginning of the unroll8 loop, cmp and branch, are a workaround that was found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-08-10 | [5/5] AArch64: Improve A64FX memset medium loops | Wilco Dijkstra | 1 | -26/+19
Simplify the code for memsets smaller than L1. Improve the unroll8 and L1_prefetch loops. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 | [4/5] AArch64: Improve A64FX memset by removing unroll32 | Wilco Dijkstra | 1 | -17/+1
Remove unroll32 code since it doesn't improve performance. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 | [3/5] AArch64: Improve A64FX memset for remaining bytes | Wilco Dijkstra | 1 | -33/+13
Simplify handling of remaining bytes. Avoid lots of taken branches and complex whilelo computations, instead unconditionally write vectors from the end. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
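Writing "from the end" means the tail is covered by one full-width store that ends exactly at the last byte and may overlap bytes already written, so no branch or predicate on the remainder is needed. The following portable C sketch shows the idea with 16-byte blocks; it is not the A64FX code, which uses SVE vectors, and `fill_with_overlapping_tail` and `BLOCK` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

enum { BLOCK = 16 };  /* Stand-in for one vector store.  */

/* Fill DST[0..N) with VALUE.  For N >= BLOCK the remainder is handled
   by one unconditional block store that ends at DST + N and overlaps
   the previous block when N is not a multiple of BLOCK.  */
static void
fill_with_overlapping_tail (unsigned char *dst, unsigned char value, size_t n)
{
  unsigned char pattern[BLOCK];
  memset (pattern, value, sizeof pattern);

  if (n < BLOCK)
    {
      /* Too small for a full block; handled bytewise in this sketch.  */
      for (size_t i = 0; i < n; i++)
        dst[i] = value;
      return;
    }

  /* Full blocks from the start, stopping before the last one.  */
  for (size_t off = 0; off + BLOCK < n; off += BLOCK)
    memcpy (dst + off, pattern, BLOCK);

  /* Final block written from the end, with no remainder computation.  */
  memcpy (dst + n - BLOCK, pattern, BLOCK);
}
```

The overlap rewrites a few bytes with the same value, which is harmless for memset and usually cheaper than a chain of size-dependent branches.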
2021-08-10 | [2/5] AArch64: Improve A64FX memset for large sizes | Wilco Dijkstra | 1 | -60/+25
Improve performance of large memsets. Simplify alignment code. For zero memset use DC ZVA, which almost doubles performance. For non-zero memsets use the unroll8 loop which is about 10% faster. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10 | [1/5] AArch64: Improve A64FX memset for small sizes | Wilco Dijkstra | 1 | -60/+36
Improve performance of small memsets by reducing instruction counts and improving code alignment. Bench-memset shows 35-45% performance gain for small sizes. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-05-27 | aarch64: Added optimized memset for A64FX | Naohiro Tamura | 4 | -5/+286
This patch optimizes the performance of memset for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill and prefetch. SVE assembler code for memset is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27 | aarch64: Added optimized memcpy and memmove for A64FX | Naohiro Tamura | 6 | -13/+443
This patch optimizes the performance of memcpy/memmove for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining. SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-01-25 | aarch64: Fix the list of tested IFUNC variants [BZ #26818] | Szabolcs Nagy | 2 | -4/+6
Some IFUNC variants are not compatible with BTI and MTE so don't set them as usable for testing and benchmarking on a BTI or MTE enabled system. As far as IFUNC selectors are concerned a system is BTI enabled if the cpu supports it and glibc was built with BTI branch protection. Most IFUNC variants are BTI compatible, but thunderx2 memcpy and memmove use a jump table with indirect jump, without a BTI j. Fixes bug 26818.
2021-01-25 | aarch64: Move and update the definition of MTE_ENABLED | Szabolcs Nagy | 2 | -11/+11
The hwcap value is now in Linux 5.10 and in glibc bits/hwcap.h, so use that definition. Move the definition to init-arch.h so all ifunc selectors can use it and expose an "mte" shorthand for mte enabled runtime. For now we allow user code to enable tag checks and use PROT_MTE mappings without libc involvement. This is not guaranteed ABI, but can be useful for testing and debugging with MTE.
2021-01-21 | aarch64: revert memcpy optimze for kunpeng to avoid performance degradation | Shuo Wang | 1 | -1/+1
In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 was added to the default memcpy version. However, performance degrades when the copy size is large, e.g. 100k. This is the result, tested in glibc-2.28:

                 before backport   after backport   Performance improvement
    memcpy_1k    0.005             0.005               0.00%
    memcpy_10k   0.032             0.029              10.34%
    memcpy_100k  0.356             0.429             -17.02%
    memcpy_1m    7.470             11.153            -33.02%

This is the demo:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char a[1024*1024] = {12};
    char b[1024*1024] = {13};

    int main(int argc, char *argv[])
    {
        if (argc < 3)
            return 1;               /* usage: ./memcpy ITERATIONS SIZE_KB */
        int i = atoi(argv[1]);      /* number of iterations */
        int size = atoi(argv[2]);   /* copy size in KiB, at most 1024 */
        int j;
        for (j = 0; j < i; j++)
            memcpy(b, a, size*1024);
        return 0;
    }

    # gcc -g -O0 memcpy.c -o memcpy
    # time taskset -c 10 ./memcpy 100000 1024

Co-authored-by: liqingqing <liqingqing3@huawei.com>
2021-01-02 | Update copyright dates with scripts/update-copyrights | Paul Eggert | 22 | -22/+22
I used these shell commands:

    ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright
    (cd ../glibc && git commit -am"[this commit message]")

and then ignored the output, which consisted of lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO. I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah:

    remote: *** pre-commit check failed ...
    remote: *** error: lines with trailing whitespace found
    remote: error: hook declined to update refs/heads/master