glibc.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2025-04-15	aarch64: Add back non-temporal load/stores from oryon-1's memset	Andrew Pinski	1	-0/+26
	I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memset for large sizes. It was not recommended for their use with device memory and memset is already not valid to be used with device memory. This reverts commit e6590f0c86632c36c9a784cf96075f4be2e920d2. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-04-15	aarch64: Add back non-temporal load/stores from oryon-1's memcpy	Andrew Pinski	1	-0/+40
	I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memcpy for large sizes. It was not recommended for their use with device memory and memcpy is already not valid to be use with device memory. This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-27	AArch64: Use prefer_sve_ifuncs for SVE memset	Wilco Dijkstra	1	-1/+1
	Use prefer_sve_ifuncs for SVE memset just like memcpy. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-24	AArch64: Remove PTR_ARG/SIZE_ARG defines	Wilco Dijkstra	12	-50/+0
	This series removes various ILP32 defines that are now no longer needed. Remove PTR_ARG/SIZE_ARG. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-20	AArch64: Add SVE memset	Wilco Dijkstra	4	-0/+129
	Add SVE memset based on the generic memset with predicated load for sizes < 16. Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned stores for the last 64 bytes. Performance of random memset benchmark improves by ~2% on Neoverse V1. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	25	-25/+25

2024-11-21	aarch64: Remove non-temporal load/stores from oryon-1's memset	Andrew Pinski	1	-26/+0
	The hardware architects have a new recommendation not to use non-temporal load/stores for memset. This patch removes this path. I found there was no difference in the memset speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-21	aarch64: Remove non-temporal load/stores from oryon-1's memcpy	Andrew Pinski	1	-40/+0
	The hardware architects have a new recommendation not to use non-temporal load/stores for memcpy. This patch removes this path. I found there was no difference in the memcpy speed with/without non-temporal load/stores either. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-11-20	AArch64: Remove thunderx{,2} memcpy	Andrew Pinski	6	-784/+0
	ThunderX1 and ThunderX2 have been retired for a few years now. So let's remove the thunderx{,2} specific versions of memcpy. The performance gain or them was for medium and large sizes while the generic (aarch64) memcpy will handle just slightly worse. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-10	AArch64: Remove memset-reg.h	Wilco Dijkstra	4	-4/+28
	Remove memset-reg.h by moving register definitions into the memset implementations. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-06-30	Aarch64: Add new memset for Qualcomm's oryon-1 core	Andrew Pinski	4	-0/+176
	Qualcom's new core, oryon-1, has a different characteristics for memset than the current versions of memset. For non-zero, larger sizes, using GPRs rather than the SIMD stores is ~30% faster. For even larger sizes, using the nontemporal stores is needed not to polute the L1/L2 caches. For zero values, using `dc zva` should be used. Since we know the size will always be 64 bytes, we don't need to figure out the size there. I started with the emag memset and added back the `dc zva` code. Changes since v1: * v3: Fix comment formating Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-06-30	Aarch64: Add memcpy for qualcomm's oryon-1 core	Andrew Pinski	4	-0/+310
	Qualcomm's new core (oryon-1) has a different performance characteristic than other cores. For memcpy, it is faster to use the GPRs to do the copy for large sizes (2x faster). For even larger sizes, it is better to use the nontemporal load/store instructions so we don't pollute the L1/L2 caches. For smaller sizes, the characteristic are very similar to other cores. I used the thunderx memcpy as a starting point and expanded from there. Changes since v1: * v2: Fix ordering in Makefile. * v3: Fix comment grammar about the ldnp/stnp instructions. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-05-23	aarch64: Remove duplicate memchr/strlen in libc.a (BZ 31777)	Adhemerval Zanella	2	-0/+6
	The generic version provides weak definitions of memchr/strlen, which are already provided by the ifunc resolvers. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-07	elf: Only process multiple tunable once (BZ 31686)	Adhemerval Zanella	1	-0/+4
	The 680c597e9c3 commit made loader reject ill-formatted strings by first tracking all set tunables and then applying them. However, it does not take into consideration if the same tunable is set multiple times, where parse_tunables_string appends the found tunable without checking if it was already in the list. It leads to a stack-based buffer overflow if the tunable is specified more than the total number of tunables. For instance: GLIBC_TUNABLES=glibc.malloc.check=2:... (repeat over the number of total support for different tunable). Instead, use the index of the tunable list to get the expected tunable entry. Since now the initial list is zero-initialized, the compiler might emit an extra memset and this requires some minor adjustment on some ports. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reported-by: Yuto Maeda <maeda@cyberdefense.jp> Reported-by: Yutaro Shimizu <shimizu@cyberdefense.jp> Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2024-03-21	AArch64: Check kernel version for SVE ifuncs	Wilco Dijkstra	3	-2/+4
	Old Linux kernels disable SVE after every system call. Calling the SVE-optimized memcpy afterwards will then cause a trap to reenable SVE. As a result, applications with a high use of syscalls may run slower with the SVE memcpy. This is true for kernels between 4.15.0 and before 6.2.0, except for 5.14.0 which was patched. Avoid this by checking the kernel version and selecting the SVE ifunc on modern kernels. Parse the kernel version reported by uname() into a 24-bit kernel.major.minor value without calling any library functions. If uname() is not supported or if the version format is not recognized, assume the kernel is modern. Tested-by: Florian Weimer <fweimer@redhat.com> Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	25	-25/+25

2023-12-04	aarch64: fix tested ifunc variants	Szabolcs Nagy	1	-3/+3
	Don't test a64fx string functions when BTI is enabled since they are not BTI compatible.
2023-11-13	AArch64: Remove Falkor memcpy	Wilco Dijkstra	5	-324/+0
	The latest implementations of memcpy are actually faster than the Falkor implementations [1], so remove the falkor/phecda ifuncs for memcpy and the now unused IS_FALKOR/IS_PHECDA defines. [1] https://sourceware.org/pipermail/libc-alpha/2022-December/144227.html Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2023-11-13	AArch64: Add memset_zva64	Wilco Dijkstra	5	-63/+33
	Add a specialized memset for the common ZVA size of 64 to avoid the overhead of reading the ZVA size. Since the code is identical to __memset_falkor, remove the latter. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2023-11-13	AArch64: Cleanup emag memset	Wilco Dijkstra	4	-197/+90
	Cleanup emag memset - merge the memset_base64.S file, remove the unused ZVA code (since it is disabled on emag). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2023-11-01	AArch64: Cleanup ifuncs	Wilco Dijkstra	17	-124/+40
	Cleanup ifuncs. Remove uses of libc_hidden_builtin_def, use ENTRY rather than ENTRY_ALIGN, remove unnecessary defines and conditional compilation. Rename strlen_mte to strlen_generic. Remove rtld-memset. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-10-24	AArch64: Add support for MOPS memcpy/memmove/memset	Wilco Dijkstra	9	-1/+137
	Add support for MOPS in cpu_features and INIT_ARCH. Add ifuncs using MOPS for memcpy, memmove and memset (use .inst for now so it works with all binutils versions without needing complex configure and conditional compilation). Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-05-30	Fix misspellings in sysdeps/ -- BZ 25337	Paul Pluzhnikov	1	-4/+4

2023-02-06	AArch64: Improve SVE memcpy and memmove	Wilco Dijkstra	1	-20/+14
	Improve SVE memcpy by copying 2 vectors if the size is small enough. This improves performance of random memcpy by ~9% on Neoverse V1, and 33-64 byte copies are ~16% faster. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-17	AArch64: Improve strlen_asimd	Wilco Dijkstra	1	-12/+4
	Use shrn for the mask, merge tst+bne into cbnz, and tweak code alignment. Performance improves slightly as a result. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-01-06	Update copyright dates with scripts/update-copyrights	Joseph Myers	25	-25/+25

2022-10-26	aarch64: Use memcpy_simd as the default memcpy	Wilco Dijkstra	5	-259/+0
	Since __memcpy_simd is the fastest memcpy on almost all cores, replace the generic memcpy with it. If SVE is available, a SVE memcpy will be used by default (including for Neoverse N2).
2022-10-26	aarch64: Cleanup memset ifunc	Wilco Dijkstra	2	-17/+26
	Cleanup memset ifunc selectors. The A64FX memset relies on a ZVA size of 256, so add an explicit check.
2022-10-10	elf: Remove -fno-tree-loop-distribute-patterns usage on dl-support	Adhemerval Zanella	1	-0/+24
	Besides the option being gcc specific, this approach is still fragile and not future proof since we do not know if this will be the only optimization option gcc will add that transforms loops to memset (or any libcall). This patch adds a new header, dl-symbol-redir-ifunc.h, that can b used to redirect the compiler generated libcalls to port the generic memset implementation if required. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2022-06-10	Add bounds check to __libc_ifunc_impl_list	Wilco Dijkstra	1	-7/+2
	Add a proper bounds check to __libc_ifunc_impl_list. This makes MAX_IFUNC redundant and fixes several targets that will write outside the array. To avoid unnecessary large diffs, pass the maximum in the argument 'i' to IFUNC_IMPL_ADD - 'max' can be used in new ifunc definitions and existing ones can be updated if desired. Passes buildmanyglibc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2022-06-07	AArch64: Sort makefile entries	Wilco Dijkstra	1	-6/+18
	Sort makefile entries to reduce conflicts.
2022-06-07	AArch64: Add SVE memcpy	Wilco Dijkstra	5	-42/+284
	Add an initial SVE memcpy implementation. Copies up to 32 bytes use SVE vectors which improves the random memcpy benchmark significantly. Cleanup the memcpy and memmove ifunc selectors.
2022-01-06	AArch64: Check for SVE in ifuncs [BZ #28744]	Wilco Dijkstra	3	-3/+3
	Add a check for SVE in the A64FX ifuncs for memcpy, memset and memmove. This fixes BZ #28744.
2022-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	24	-24/+24
	I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 7061 files FOO. I then removed trailing white space from math/tgmath.h, support/tst-support-open-dev-null-range.c, and sysdeps/x86_64/multiarch/strlen-vec.S, to work around the following obscure pre-commit check failure diagnostics from Savannah. I don't know why I run into these diagnostics whereas others evidently do not. remote: * 912-#endif remote: * 913: remote: * 914- remote: * error: lines with trailing whitespace found ... remote: *** error: sysdeps/unix/sysv/linux/statx_cp.c: trailing lines
2021-12-02	AArch64: Improve A64FX memcpy	Wilco Dijkstra	1	-321/+225
	v2 is a complete rewrite of the A64FX memcpy. Performance is improved by streamlining the code, aligning all large copies and using a single unrolled loop for all sizes. The code size for memcpy and memmove goes down from 1796 bytes to 868 bytes. Performance is better in all cases: bench-memcpy-random is 2.3% faster overall, bench-memcpy-large is ~33% faster for large sizes, bench-memcpy-walk is 25% faster for small sizes and 20% for the largest sizes. The geomean of all tests in bench-memcpy is 5.1% faster, and total time is reduced by 4%. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-24	aarch64: Disable A64FX memcpy/memmove BTI unconditionally	Naohiro Tamura	1	-0/+3
	This patch disables A64FX memcpy/memmove BTI instruction insertion unconditionally such as A64FX memset patch [1] for performance. [1] commit 07b427296b8d59f439144029d9a948f6c1ce0a31 Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2021-09-06	AArch64: Update A64FX memset not to degrade at 16KB	Naohiro Tamura	1	-1/+8
	This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-09-06	Revert "AArch64: Update A64FX memset not to degrade at 16KB"	Szabolcs Nagy	1	-8/+1
	Because of wrong commit author. Will recommit it with right author. This reverts commit 23777232c23f80809613bdfa329f63aadf992922.
2021-09-03	AArch64: Update A64FX memset not to degrade at 16KB	Naohiro Tamura via Libc-alpha	1	-1/+8
	This patch updates unroll8 code so as not to degrade at the peak performance 16KB for both FX1000 and FX700. Inserted 2 instructions at the beginning of the unroll8 loop, cmp and branch, are a workaround that is found heuristically. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2021-08-10	[5/5] AArch64: Improve A64FX memset medium loops	Wilco Dijkstra	1	-26/+19
	Simplify the code for memsets smaller than L1. Improve the unroll8 and L1_prefetch loops. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10	[4/5] AArch64: Improve A64FX memset by removing unroll32	Wilco Dijkstra	1	-17/+1
	Remove unroll32 code since it doesn't improve performance. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10	[3/5] AArch64: Improve A64FX memset for remaining bytes	Wilco Dijkstra	1	-33/+13
	Simplify handling of remaining bytes. Avoid lots of taken branches and complex whilelo computations, instead unconditionally write vectors from the end. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10	[2/5] AArch64: Improve A64FX memset for large sizes	Wilco Dijkstra	1	-60/+25
	Improve performance of large memsets. Simplify alignment code. For zero memset use DC ZVA, which almost doubles performance. For non-zero memsets use the unroll8 loop which is about 10% faster. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-08-10	[1/5] AArch64: Improve A64FX memset for small sizes	Wilco Dijkstra	1	-60/+36
	Improve performance of small memsets by reducing instruction counts and improving code alignment. Bench-memset shows 35-45% performance gain for small sizes. Reviewed-by: Naohiro Tamura <naohirot@fujitsu.com>
2021-05-27	aarch64: Added optimized memset for A64FX	Naohiro Tamura	4	-5/+286
	This patch optimizes the performance of memset for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill and prefetch. SVE assembler code for memset is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-05-27	aarch64: Added optimized memcpy and memmove for A64FX	Naohiro Tamura	6	-13/+443
	This patch optimizes the performance of memcpy/memmove for A64FX [1] which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache per NUMA node. The performance optimization makes use of Scalable Vector Register with several techniques such as loop unrolling, memory access alignment, cache zero fill, and software pipelining. SVE assembler code for memcpy/memmove is implemented as Vector Length Agnostic code so theoretically it can be run on any SOC which supports ARMv8-A SVE standard. We confirmed that all testcases have been passed by running 'make check' and 'make xcheck' not only on A64FX but also on ThunderX2. And also we confirmed that the SVE 512 bit vector register performance is roughly 4 times better than Advanced SIMD 128 bit register and 8 times better than scalar 64 bit register by running 'make bench'. [1] https://github.com/fujitsu/A64FX Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
2021-01-25	aarch64: Fix the list of tested IFUNC variants [BZ #26818]	Szabolcs Nagy	2	-4/+6
	Some IFUNC variants are not compatible with BTI and MTE so don't set them as usable for testing and benchmarking on a BTI or MTE enabled system. As far as IFUNC selectors are concerned a system is BTI enabled if the cpu supports it and glibc was built with BTI branch protection. Most IFUNC variants are BTI compatible, but thunderx2 memcpy and memmove use a jump table with indirect jump, without a BTI j. Fixes bug 26818.
2021-01-25	aarch64: Move and update the definition of MTE_ENABLED	Szabolcs Nagy	2	-11/+11
	The hwcap value is now in linux 5.10 and in glibc bits/hwcap.h, so use that definition. Move the definition to init-arch.h so all ifunc selectors can use it and expose an "mte" shorthand for mte enabled runtime. For now we allow user code to enable tag checks and use PROT_MTE mappings without libc involvment, this is not guaranteed ABI, but can be useful for testing and debugging with MTE.
2021-01-21	aarch64: revert memcpy optimze for kunpeng to avoid performance degradation	Shuo Wang	1	-1/+1
	In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version, however, there is performance degradation when the copy size is some large bytes, eg: 100k. This is the result, tested in glibc-2.28: before backport after backport Performance improvement memcpy_1k 0.005 0.005 0.00% memcpy_10k 0.032 0.029 10.34% memcpy_100k 0.356 0.429 -17.02% memcpy_1m 7.470 11.153 -33.02% This is the demo #include "stdio.h" #include "string.h" #include "stdlib.h" char a[10241024] = {12}; char b[10241024] = {13}; int main(int argc, char argv[]) { int i = atoi(argv[1]); int j; int size = atoi(argv[2]); for (j = 0; j < i; j++) memcpy(b, a, size1024); return 0; } # gcc -g -O0 memcpy.c -o memcpy # time taskset -c 10 ./memcpy 100000 1024 Co-authored-by: liqingqing <liqingqing3@huawei.com>
2021-01-02	Update copyright dates with scripts/update-copyrights	Paul Eggert	22	-22/+22
	I used these shell commands: ../glibc/scripts/update-copyrights $PWD/../gnulib/build-aux/update-copyright (cd ../glibc && git commit -am"[this commit message]") and then ignored the output, which consisted lines saying "FOO: warning: copyright statement not found" for each of 6694 files FOO. I then removed trailing white space from benchtests/bench-pthread-locks.c and iconvdata/tst-iconv-big5-hkscs-to-2ucs4.c, to work around this diagnostic from Savannah: remote: * pre-commit check failed ... remote: * error: lines with trailing whitespace found remote: error: hook declined to update refs/heads/master