Age | Commit message | Author | Files, lines (-/+)

2025-04-14 | x86: Detect Intel Diamond Rapids (release/2.39/master) | H.J. Lu | 1 file, -0/+12

Detect Intel Diamond Rapids and tune it similarly to Intel Granite Rapids.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit de14f1959ee5f9b845a7cae43bee03068b8136f0)

2025-04-14 | x86: Handle unknown Intel processor with default tuning | Sunil K Pandey | 1 file, -143/+143

Enable default tuning for unknown Intel processor. Tested on x86, no
regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 9f0deff558d1d6b08c425c157f50de85013ada9c)

2025-04-14 | x86: Add ARL/PTL/CWF model detection support | Sunil K Pandey | 1 file, -0/+10

- Add ARROWLAKE model detection.
- Add PANTHERLAKE model detection.
- Add CLEARWATERFOREST model detection.

Intel® Architecture Instruction Set Extensions Programming Reference,
Section 1.2: https://cdrdv2.intel.com/v1/dl/getContent/671368

No regression; validated model detection on SDE.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit e53eb952b970ac94c97d74fb447418fb327ca096)

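Model detection of this kind keys off the CPUID family/model fields. A
minimal sketch of reading them with GCC's <cpuid.h> follows; it is
illustrative only and does not reproduce the specific glibc cases added by
this commit.

#include <cpuid.h>
#include <stdio.h>

/* Sketch: decode the display family/model from CPUID leaf 1, the values
   that per-model tuning switches on.  Not glibc's actual detection code. */
int
main (void)
{
  unsigned int eax, ebx, ecx, edx;
  if (!__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    return 1;
  unsigned int family = (eax >> 8) & 0xf;
  unsigned int model = (eax >> 4) & 0xf;
  if (family == 0xf)
    family += (eax >> 20) & 0xff;
  if (family == 0x6 || family == 0xf)
    model += ((eax >> 16) & 0xf) << 4;
  printf ("family 0x%x model 0x%x\n", family, model);
  return 0;
}
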
2025-04-14 | x86: Optimize xstate size calculation | Sunil K Pandey | 2 files, -56/+24

Scan xstate IDs up to the maximum supported xstate ID. Remove the separate
AMX xstate calculation. Instead, exclude the AMX space from the start of
TILECFG to the end of TILEDATA in xsave_state_size.

Completed validation on SKL/SKX/SPR/SDE and compared xsave state size with
"ld.so --list-diagnostics" option, no regression.

Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 70b648855185e967e54668b101d24704c3fb869d)

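For orientation, a rough sketch of the scan described above, using raw CPUID
and simplified relative to the real cpu_features code; the AMX component IDs
17 (TILECFG) and 18 (TILEDATA) are standard, the rest of the structure is
illustrative.

#include <cpuid.h>
#include <stdint.h>

/* Sketch only: walk CPUID leaf 0xD sub-leaves for every enabled state
   component and track the largest offset + size, then subtract the
   TILECFG..TILEDATA range when AMX state should be excluded. */
static unsigned int
xsave_state_size_sketch (uint64_t xcr0, int exclude_amx)
{
  unsigned int size = 512 + 64;              /* legacy area + XSAVE header */
  unsigned int amx_start = 0, amx_end = 0;

  for (unsigned int i = 2; i < 63; i++)
    {
      if (((xcr0 >> i) & 1) == 0)
        continue;
      unsigned int eax, ebx, ecx, edx;
      __cpuid_count (0xd, i, eax, ebx, ecx, edx);  /* eax = size, ebx = offset */
      if (ebx + eax > size)
        size = ebx + eax;
      if (i == 17)
        amx_start = ebx;                     /* start of TILECFG */
      else if (i == 18)
        amx_end = ebx + eax;                 /* end of TILEDATA  */
    }

  if (exclude_amx && amx_end > amx_start)
    size -= amx_end - amx_start;
  return size;
}
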
2025-04-14 | x86: Use `Avoid_Non_Temporal_Memset` to control non-temporal path | Noah Goldstein | 2 files, -8/+23

This is just a refactor and there should be no behavioral change from this
commit. The goal is to make `Avoid_Non_Temporal_Memset` a more universal knob
for controlling whether we use non-temporal memset rather than having extra
logic based on vendor.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit b93dddfaf440aa12f45d7c356f6ffe9f27d35577)

2025-04-14 | x86: Tunables may incorrectly set Prefer_PMINUB_for_stringop (bug 32047) | Florian Weimer | 1 file, -0/+1

Fixes commit 5bcf6265f215326d14dfacdce8532792c2c7f8f8 ("x86: Disable
non-temporal memset on Skylake Server").

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit 7a630f7d3392ca391a399486ce2846f9e4b4ee63)

2025-04-14 | x86: Disable non-temporal memset on Skylake Server | Noah Goldstein | 5 files, -12/+26

The original commit enabling non-temporal memset on Skylake Server had
erroneous benchmarks (actually done on ICX). Further benchmarks indicate
non-temporal stores may in fact be a regression on Skylake Server. This
commit may be over-cautious in some cases, but should avoid any regressions
for 2.40.

Tested using qemu on all x86_64 CPU archs supported by both qemu and glibc.

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bcf6265f215326d14dfacdce8532792c2c7f8f8)

2025-04-14 | x86: Fix value for `x86_memset_non_temporal_threshold` when it is undesirable | Noah Goldstein | 1 file, -3/+3

When we don't want to use non-temporal stores for memset, we set
`x86_memset_non_temporal_threshold` to SIZE_MAX. The current code, however,
was using `maximum_non_temporal_threshold` as the upper bound, which is
`SIZE_MAX >> 4`, so we ended up with a value of `0`. The fix is to just use
`SIZE_MAX` as the upper bound when setting the tunable.

Tested-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5b54a33435e5533653a9956728f2de9d16a3b4ee)

2025-04-14 | x86: Enable non-temporal memset tunable for AMD | Joe Damato | 1 file, -4/+4

In commit 46b5e98ef6f1 ("x86: Add separate non-temporal tunable for memset")
a tunable threshold for enabling non-temporal memset was added, but only for
Intel hardware. Since that commit, new benchmark results suggest that
non-temporal memset is beneficial on AMD as well, so allow this tunable to be
set for AMD.

See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
which has been updated to include data using different strategies for large
memset on AMD Zen2, Zen3, and Zen4.

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
(cherry picked from commit bef2a827a55fc759693ccc5b0f614353b8ad712d)

2025-04-14 | x86: Add separate non-temporal tunable for memset | Noah Goldstein | 7 files, -6/+49

The tuning for non-temporal stores for memset vs memcpy is not always the
same. This includes both the exact value and whether non-temporal stores are
profitable at all for a given arch. This patch adds
`x86_memset_non_temporal_threshold`. Currently we disable non-temporal stores
for non-Intel vendors as the only benchmarks showing its benefit have been on
Intel hardware.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 46b5e98ef6f1b9f4b53851f152ecb8209064b26c)

2025-03-31 | x86: Link tst-gnu2-tls2-x86-noxsave{,c,xsavec} with libpthread | Florian Weimer | 1 file, -0/+3

This fixes a test build failure on Hurd.

Fixes commit 145097dff170507fe73190e8e41194f5b5f7e6bf ("x86: Use separate
variable for TLSDESC XSAVE/XSAVEC state size (bug 32810)").

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit c6e2895695118ab59c7b17feb0fcb75a53e3478c)

2025-03-29 | x86: Use separate variable for TLSDESC XSAVE/XSAVEC state size (bug 32810) | Florian Weimer | 10 files, -8/+41

Previously, the initialization code reused the xsave_state_full_size member
of struct cpu_features for the TLSDESC state size. However, the tunable
processing code assumes that this member has the original XSAVE (non-compact)
state size, so that it can use its value if XSAVEC is disabled via tunable.

This change uses a separate variable and not a struct member because the
value is only needed in ld.so and the static libc, but not in libc.so. As a
result, struct cpu_features layout does not change, helping a future backport
of this change.

Fixes commit 9b7091415af47082664717210ac49d51551456ab ("x86-64: Update
_dl_tlsdesc_dynamic to preserve AMX registers").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 145097dff170507fe73190e8e41194f5b5f7e6bf)

2025-03-29 | x86: Skip XSAVE state size reset if ISA level requires XSAVE | Florian Weimer | 1 file, -0/+5

If we have to use XSAVE or XSAVEC trampolines, do not adjust the size
information they need. Technically, it is an operator error to try to run
with -XSAVE,-XSAVEC on such builds, but this change here disables some
unnecessary code with higher ISA levels and simplifies testing.

Related to commit befe2d3c4dec8be2cdd01a47132e47bdb7020922 ("x86-64: Don't
use SSE resolvers for ISA level 3 or above").

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 59585ddaa2d44f22af04bb4b8bd4ad1e302c4c02)

2025-03-19 | x86_64: Add atanh with FMA | Sunil K Pandey | 5 files, -0/+51

On SPR, it improves atanh bench performance by:

                        Before    After     Improvement
reciprocal-throughput   15.1715   14.8628   2%
latency                 57.1941   56.1883   2%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c7c4a5906f326f1290b1c2413a83c530564ec4b8)

2025-03-19 | x86_64: Add sinh with FMA | Sunil K Pandey | 5 files, -0/+58

On SPR, it improves sinh bench performance by:

                        Before    After     Improvement
reciprocal-throughput   14.2017   11.815    17%
latency                 36.4917   35.2114   4%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit dded0d20f67ba1925ccbcb9cf28f0c75febe0dbe)

2025-03-19 | x86_64: Add tanh with FMA | Sunil K Pandey | 4 files, -0/+49

On Skylake, it improves tanh bench performance by:

        Before    After     Improvement
max     110.89    95.826    14%
min     20.966    20.157    4%
mean    30.9601   29.8431   4%

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c6352111c72a20b3588ae304dd99b63e25dd6d85)

2025-03-19 | x86-64: Exclude FMA4 IFUNC functions for -mapxf | H.J. Lu | 6 files, -9/+70

When -mapxf is used to build glibc, the resulting glibc will never run on
FMA4 machines. Exclude FMA4 IFUNC functions when -mapxf is used. This
requires a GCC that defines __APX_F__ for -mapxf, i.e. one containing commit:

1df56719bd8 x86: Define __APX_F__ for -mapxf

Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 9e1f4aef865ddeffeb4b5f6578fefab606783120)

2025-03-12 | nptl: clear the whole rseq area before registration | Michael Jeanson | 2 files, -6/+6

Due to the extensible nature of the rseq area we can't explicitly initialize
fields that are not part of the ABI yet. It was agreed with upstream that all
new fields will be documented as zero initialized by userspace. Future
kernels configured with CONFIG_DEBUG_RSEQ will validate the content of all
fields during registration.

Replace the explicit field initialization with a memset of the whole rseq
area, which will cover fields as they are added to future kernels.

Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 689a62a4217fae78b9ce0db781dc2a421f2b1ab4)

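In outline, the change amounts to the pattern below. This is a simplified
sketch, not glibc's actual registration path: it assumes Linux kernel headers
that provide struct rseq and __NR_rseq, and RSEQ_SIG is just an illustrative
signature value.

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/rseq.h>

#define RSEQ_SIG 0x53053053   /* placeholder signature for the sketch */

static struct rseq rseq_area __attribute__ ((aligned (64)));

static int
register_rseq (void)
{
  /* Zero the entire area instead of assigning known fields one by one,
     so members added by future kernels start out zero-initialized too. */
  memset (&rseq_area, 0, sizeof rseq_area);
  return syscall (SYS_rseq, &rseq_area, sizeof rseq_area, 0, RSEQ_SIG);
}
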
2025-02-28 | math: Improve layout of exp/exp10 data | Wilco Dijkstra | 1 file, -2/+4

GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp_data struct slightly so that the fields are better aligned
and without gaps. As a result on targets that support them, more load-pair
instructions are used in exp. Exp10 is improved by moving invlog10_2N later
so that neglog10_2hiN and neglog10_2loN can be loaded using load-pair.

The exp benchmark improves 2.5%, "144bits" by 7.2%, "768bits" by 12.7% on
Neoverse V2. Exp10 improves by 1.5%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 5afaf99edb326fd9f36eb306a828d129a3a1d7f7)

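As an illustration of the layout idea only (the structure below is made up
and is not the real exp_data): keeping two constants that are read together
adjacent and 16-byte aligned lets AArch64 fetch them with a single load-pair.

#include <stdint.h>

/* Hypothetical layout sketch: fields the algorithm reads together are
   placed side by side so a single ldp (load pair) can fetch both. */
struct exp_data_sketch
{
  double invln2N;
  double shift;
  double negln2hiN;        /* read together with negln2loN ...      */
  double negln2loN;        /* ... adjacent + aligned => one ldp      */
  double poly[4];
  double neglog10_2hiN;    /* likewise for the exp10 constants       */
  double neglog10_2loN;
  double invlog10_2N;      /* placed after the hi/lo pair            */
  uint64_t tab[256];
} __attribute__ ((aligned (16)));
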
2025-02-28 | AArch64: Use prefer_sve_ifuncs for SVE memset | Wilco Dijkstra | 1 file, -1/+1

Use prefer_sve_ifuncs for SVE memset just like memcpy.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 0f044be1dae5169d0e57f8d487b427863aeadab4)

2025-02-28 | AArch64: Add SVE memset | Wilco Dijkstra | 4 files, -0/+129

Add SVE memset based on the generic memset with predicated load for sizes
< 16. Unaligned memsets of 128-1024 are improved by ~20% on average by using
aligned stores for the last 64 bytes. Performance of random memset benchmark
improves by ~2% on Neoverse V1.

Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
(cherry picked from commit 163b1bbb76caba4d9673c07940c5930a1afa7548)

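For flavor, a minimal C-intrinsics sketch of how an SVE predicate generated
with WHILELT lets a single predicated store cover a short memset without a
branch on the exact length. This is illustrative only; the actual
implementation is assembly.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: assumes n is no larger than the SVE vector length in bytes. */
static void
memset_small_sve (uint8_t *dst, int c, size_t n)
{
  svbool_t pg = svwhilelt_b8_u64 (0, n);          /* lanes 0..n-1 active */
  svst1_u8 (pg, dst, svdup_n_u8 ((uint8_t) c));   /* predicated store    */
}
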
2025-02-28 | math: Improve layout of expf data | Wilco Dijkstra | 1 file, -1/+1

GCC aligns global data to 16 bytes if their size is >= 16 bytes. This patch
changes the exp2f_data struct slightly so that the fields are better aligned.
As a result on targets that support them, load-pair instructions accessing
poly_scaled and invln2_scaled are now 16-byte aligned.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 44fa9c1080fe6a9539f0d2345b9d2ae37b8ee57a)

2025-02-28 | AArch64: Remove zva_128 from memset | Wilco Dijkstra | 1 file, -24/+1

Remove ZVA 128 support from memset - the new memset no longer guarantees
count >= 256, which can result in underflow and a crash if ZVA size is
128 ([1]). Since only one CPU uses a ZVA size of 128 and its memcpy
implementation was removed in commit
e162ab2bf1b82c40f29e1925986582fa07568ce8, remove this special case too.

[1] https://sourceware.org/pipermail/libc-alpha/2024-November/161626.html

Reviewed-by: Andrew Pinski <quic_apinski@quicinc.com>
(cherry picked from commit a08d9a52f967531a77e1824c23b5368c6434a72d)

2025-02-28 | AArch64: Optimize memset | Wilco Dijkstra | 1 file, -111/+84

Improve small memsets by avoiding branches and using overlapping stores. Use
DC ZVA for copies over 128 bytes. Remove unnecessary code for ZVA sizes other
than 64 and 128. Performance of random memset benchmark improves by 24% on
Neoverse N1.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cec3aef32412779e207f825db0d057ebb4628ae8)

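The overlapping-store trick can be illustrated in C (a sketch, not the actual
assembly): for any length between 8 and 16 bytes, two 8-byte stores that may
overlap cover the whole range without branching on the exact size.

#include <string.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: valid for 8 <= n <= 16; the stores overlap whenever n < 16. */
static void
memset_8_to_16 (unsigned char *dst, unsigned char c, size_t n)
{
  uint64_t v = 0x0101010101010101ULL * c;   /* byte value replicated 8x */
  memcpy (dst, &v, 8);                      /* first 8 bytes            */
  memcpy (dst + n - 8, &v, 8);              /* last 8 bytes             */
}
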
2025-02-28 | AArch64: Improve generic strlen | Wilco Dijkstra | 1 file, -12/+27

Improve performance by handling another 16 bytes before entering the loop.
Use ADDHN in the loop to avoid SHRN+FMOV when it terminates. Change final
size computation to avoid increasing latency.

On Neoverse V1 performance of the random strlen benchmark improves by 4.6%.

Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 3dc426b642dcafdbc11a99f2767e081d086f5fc7)

2025-02-27 | AArch64: Improve codegen for SVE logs | Yat Long Poon | 4 files, -47/+114

Reduce memory access by using lanewise MLA, moving constants to a struct, and
reducing the number of MOVPRFXs. Update maximum ULP error for double log_sve
from 1 to 2. Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).

(cherry picked from commit 32d193a372feb28f9da247bb7283d404b84429c6)

2025-02-27 | AArch64: Improve codegen in SVE tans | Luna Lamb | 2 files, -41/+68

Improves memory access.

Tan:  MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return.
Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move MOV away from return.

(cherry picked from commit aa6609feb20ebf8653db639dabe2a6afc77b02cc)

2025-02-27 | AArch64: Improve codegen of AdvSIMD atan(2)(f) | Joana Cruz | 3 files, -68/+160

Load the polynomial evaluation coefficients into 2 vectors and use lanewise
MLAs. 8% improvement in throughput microbenchmark on Neoverse V1.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 6914774b9d3460876d9ad4482782213ec01a752e)

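A rough C-intrinsics sketch of the technique (the coefficients and the
polynomial are made up, not the actual atan data): packing coefficients into
vectors and selecting them with lane-indexed FMLA avoids a separate load or
dup per coefficient in each step.

#include <arm_neon.h>

/* Illustrative only: evaluate c0 + c1*x + c2*x^2 + c3*x^3 with all four
   coefficients held in two vectors and consumed via lane-indexed FMLA. */
static float64x2_t
poly4_lanewise (float64x2_t x)
{
  static const double coeffs[4] = { 1.0, -0.33, 0.14, -0.08 };  /* placeholders */
  float64x2_t c01 = vld1q_f64 (coeffs);       /* c0, c1 */
  float64x2_t c23 = vld1q_f64 (coeffs + 2);   /* c2, c3 */

  float64x2_t x2 = vmulq_f64 (x, x);
  float64x2_t p01 = vfmaq_laneq_f64 (vdupq_laneq_f64 (c01, 0), x, c01, 1);
  float64x2_t p23 = vfmaq_laneq_f64 (vdupq_laneq_f64 (c23, 0), x, c23, 1);
  return vfmaq_f64 (p01, x2, p23);
}
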
2025-02-27 | AArch64: Improve codegen of AdvSIMD logf function family | Joana Cruz | 3 files, -40/+66

Load the polynomial evaluation coefficients into 2 vectors and use lanewise
MLAs. 8% improvement in throughput microbenchmark on Neoverse V1 for log2 and
log, and 2% for log10.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit d6e034f5b222a9ed1aeb5de0c0c7d0dda8b63da3)

2025-02-27 | AArch64: Improve codegen in AdvSIMD logs | Pierre Blanchard | 3 files, -106/+140

Remove spurious ADRP and a few MOVs. Reduce memory access by using more
indexed MLAs in polynomial. Align notation so that algorithms are easier to
compare. Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%).
Update error threshold in AdvSIMD log (now matches SVE log).

(cherry picked from commit 8eb5ad2ebc94cc5bedbac57c226c02ec254479c7)

2025-02-27 | AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines | Joe Ramsay | 5 files, -38/+30

This operation can be simplified to a multiply-round-convert sequence, which
uses fewer instructions and constants.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 16a59571e4e9fd019d3fc23a2e7d73c1df8bb5cb)

2025-02-27 | aarch64: Avoid redundant MOVs in AdvSIMD F32 logs | Joe Ramsay | 3 files, -45/+72

Since the last operation is destructive, the first argument to the FMA also
has to be the first argument to the special-case in order to avoid
unnecessary MOVs. Reorder arguments and adjust special-case bounds to
facilitate this.

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
(cherry picked from commit 8b09af572b208bfde4d31c6abbae047dcc217675)

2025-02-27 | aarch64: Fix AdvSIMD libmvec routines for big-endian | Joe Ramsay | 8 files, -31/+39

Previously many routines used * to load from vector types stored in the data
table. This is emitted as ldr, which byte-swaps the entire vector register,
and causes bugs for big-endian when not all lanes contain the same value.
When a vector is to be used this way, it has been replaced with an array and
the load with an explicit ld1 intrinsic, which byte-swaps only within lanes.

In addition, many routines previously used non-standard GCC syntax for vector
operations, such as indexing into vector types with [] and assembling vectors
using {}. This syntax should not be mixed with ACLE, as the former does not
respect endianness whereas the latter does. Such examples have been replaced
with, for instance, vcombine_* and vgetq_lane* intrinsics. Helpers which only
use the GCC syntax, such as the v_call helpers, do not need changing as they
do not use intrinsics.

Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
(cherry picked from commit 90a6ca8b28bf34e361e577e526e1b0f4c39a32a5)

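A small illustrative contrast of the two load styles described above (the
data values and function name are made up):

#include <arm_neon.h>

/* Data kept as a plain array rather than a vector object in the table. */
static const double coeffs[2] = { 0.5, 0.25 };   /* placeholder values */

float64x2_t
load_coeffs_endian_safe (void)
{
  /* vld1q_f64 (an ld1 load) swaps bytes only within each lane, so lane 0
     is coeffs[0] on both endiannesses.  Dereferencing a float64x2_t *
     pointing at the table would emit ldr, which byte-swaps the whole
     register on big-endian and scrambles the lanes. */
  return vld1q_f64 (coeffs);
}
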
2025-02-13 | assert: Add test for CVE-2025-0395 | Siddhesh Poyarekar | 2 files, -0/+93

Use the __progname symbol to override the program name to induce the failure
that CVE-2025-0395 describes. This is related to BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit cdb9ba84191ce72e86346fb8b1d906e7cd930ea2)

2025-01-25 | stdlib: Test using setenv with updated environ [BZ #32588] | H.J. Lu | 2 files, -0/+37

Add a test for setenv with updated environ. Verify that BZ #32588 is fixed.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 8ab34497de14e35aff09b607222fe1309ef156da)

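The scenario being exercised is an application that installs its own environ
array and then calls setenv. A minimal sketch of that pattern (not the actual
glibc test):

#include <stdlib.h>
#include <stdio.h>

extern char **environ;

int
main (void)
{
  /* Point environ at an application-owned array that setenv did not
     allocate; setenv must still be able to grow or replace it safely. */
  static char *my_env[] = { (char *) "FOO=1", NULL };
  environ = my_env;

  if (setenv ("BAR", "2", 1) != 0)
    return 1;
  printf ("BAR=%s\n", getenv ("BAR"));
  return 0;
}
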
2025-01-24 | malloc: obscure calloc use in tst-calloc | Sam James | 1 file, -4/+8

Similar to a9944a52c967ce76a5894c30d0274b824df43c7a and
f9493a15ea9cfb63a815c00c23142369ec09d8ce, we need to hide calloc use from the
compiler to accommodate GCC's r15-6566-g804e9d55d9e54c change.

First, include tst-malloc-aux.h, but then use `volatile` variables for size.
The test passes without the tst-malloc-aux.h change but IMO we want it there
for consistency and to avoid future problems (possibly silent).

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit c3d1dac96bdd10250aa37bb367d5ef8334a093a1)

2025-01-24 | Hide all malloc functions from compiler [BZ #32366] | H.J. Lu | 6 files, -5/+30

Since -1 isn't a power of two, the compiler may reject it; hide memalign from
Clang 19, which issues an error:

tst-memalign.c:86:31: error: requested alignment is not a power of 2 [-Werror,-Wnon-power-of-two-alignment]
   86 |   p = memalign (-1, pagesize);
      |                 ^~
tst-memalign.c:86:31: error: requested alignment must be 4294967296 bytes or smaller; maximum alignment assumed [-Werror,-Wbuiltin-assume-aligned-alignment]
   86 |   p = memalign (-1, pagesize);
      |                 ^~

Update tst-malloc-aux.h to hide all malloc functions and include it in all
malloc tests to prevent the compiler from optimizing out any malloc
functions.

Tested with Clang 19.1.5 and GCC 15 20241206 for BZ #32366.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
(cherry picked from commit f9493a15ea9cfb63a815c00c23142369ec09d8ce)

2025-01-22 | Fix underallocation of abort_msg_s struct (CVE-2025-0395) | Siddhesh Poyarekar | 3 files, -2/+11

Include the space needed to store the length of the message itself, in
addition to the message string. This resolves BZ #32582.

Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
Reviewed: Adhemerval Zanella <adhemerval.zanella@linaro.org>
(cherry picked from commit 68ee0f704cb81e9ad0a78c644a83e1e9cd2ee578)

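In outline, the fix is a sizing calculation of the following shape. This is a
sketch with an assumed struct layout, not the exact glibc code: the buffer
must cover the header that records the message length as well as the string
and its terminator.

#include <stddef.h>
#include <string.h>

/* Assumed shape of the structure: a length field followed by the
   flexible message buffer. */
struct abort_msg_s
{
  unsigned int size;
  char msg[];
};

/* Corrected sizing: header + string + NUL, not just the string. */
static size_t
abort_msg_alloc_size (const char *str)
{
  return sizeof (struct abort_msg_s) + strlen (str) + 1;
}
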
2025-01-08 | x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212] | Noah Goldstein | 1 file, -13/+13

The loop should be aligned to 32 bytes so that it can ideally run out of the
DSB. This is particularly important on Skylake-Server, where deficiencies in
its DSB implementation make it prone to not being able to run loops out of
the DSB.

For example, running strcmp-evex on a 200Mb string:

32-byte aligned loop:
  - 43,399,578,766 idq.dsb_uops
Not 32-byte aligned loop:
  - 6,060,139,704 idq.dsb_uops

This results in a 25% performance degradation for the non-aligned version.

The fix is to just ensure the code layout is such that the loop is aligned.
(Which was previously the case but was accidentally dropped in 84e7c46df.)

NB: The fix was actually 64-byte alignment. This is because 64-byte alignment
generally produces more stable performance than 32-byte aligned code (cache
line crosses can affect perf), so if we are going past 16-byte alignment,
might as well go to 64. 64-byte alignment also matches most other functions
we over-align, so it creates a common point of optimization.

Times are reported as the ratio of Time_With_Patch / Time_Without_Patch.
Lower is better. The values being reported are the geometric mean of the
ratio across all tests in bench-strcmp and bench-strncmp. Note this patch is
only attempting to improve the Skylake-Server strcmp for long strings. The
rest of the numbers are only to test for regressions.

Tigerlake Results Strings <= 512:
  strcmp : 1.026
  strncmp: 0.949

Tigerlake Results Strings > 512:
  strcmp : 0.994
  strncmp: 0.998

Skylake-Server Results Strings <= 512:
  strcmp : 0.945
  strncmp: 0.943

Skylake-Server Results Strings > 512:
  strcmp : 0.778
  strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by changes in
alignment of code handling small sizes (mostly on the page-cross logic).
These should be safe to ignore because 1) we previously only 16-byte aligned
the function, so this behavior is not new and was essentially up to chance
before this patch, and 2) this type of alignment-related regression on small
sizes really only comes up in tight micro-benchmark loops and is unlikely to
have any effect on real-world performance.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 483443d3211532903d7e790211af5a1d55fdb1f3)

2025-01-08 | x86: Improve large memset perf with non-temporal stores [RHEL-29312] | Noah Goldstein | 1 file, -58/+91

Previously we used `rep stosb` for all medium/large memsets. This is notably
worse than non-temporal stores for large (above a few MBs) memsets. See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data using different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster on SKX.
Historically, these numbers would not have been so good because of the
zero-over-zero writeback optimization that `rep stosb` is able to do. But the
zero-over-zero writeback optimization has been removed as a potential
side-channel attack, so there is no longer any good reason to only rely on
`rep stosb` for large memsets. On the flip side, non-temporal writes avoid
the read-for-ownership (RFO) of the destination data, saving memory
bandwidth.

All of the other changes to the file are to re-organize the code blocks to
maintain "good" alignment given the new code added in the `L(stosb_local)`
case.

The results from running the GLIBC memset benchmarks on TGL-client for N=20
runs:

Geometric Mean across the suite New / Old EXEX256: 0.979
Geometric Mean across the suite New / Old EXEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show that
adding the non-temporal case didn't add any regressions to the other cases.

The results on the memset-large benchmark suite on TGL-client for N=20 runs:

Geometric Mean across the suite New / Old EXEX256: 0.926
Geometric Mean across the suite New / Old EXEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers (likely
because clients typically have faster single-core bandwidth, so saving
bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
(cherry picked from commit 5bf0ab80573d66e4ae5d94b094659094336da90f)

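A minimal C-intrinsics sketch of the non-temporal store idea (illustrative
only and compiled with -mavx2; the glibc routine is assembly with alignment
and tail handling):

#include <immintrin.h>
#include <stddef.h>

/* Zero a large, 32-byte-aligned buffer (n a multiple of 32) with streaming
   stores, which bypass the cache and skip the RFO reads a normal store
   would trigger.  A fence orders the streaming stores at the end. */
static void
memset_zero_nontemporal (void *dst, size_t n)
{
  __m256i zero = _mm256_setzero_si256 ();
  for (size_t i = 0; i < n; i += 32)
    _mm256_stream_si256 ((__m256i *) ((char *) dst + i), zero);
  _mm_sfence ();
}
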
2024-12-17 | x86: Avoid integer truncation with large cache sizes (bug 32470) | Florian Weimer | 2 files, -2/+3

Some hypervisors report 1 TiB L3 cache size. This results in some variables
incorrectly getting zeroed, causing crashes in memcpy/memmove because
invariants are violated.

(cherry picked from commit 61c3450db96dce96ad2b24b4f0b548e6a46d68e5)

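As an illustration of the failure mode (the variable names and arithmetic
below are hypothetical, not the glibc code): deriving a threshold from a
1 TiB cache size in a 32-bit type can wrap to exactly zero.

#include <stdio.h>

int
main (void)
{
  long long l3_size = 1LL << 40;           /* 1 TiB, as reported        */
  unsigned int threshold = l3_size / 4;    /* 2^38 truncated to 32 bits */
  printf ("%u\n", threshold);              /* prints 0                  */
  return 0;
}
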
2024-12-11 | math: Exclude internal math symbols for tests [BZ #32414] | H.J. Lu | 2 files, -1/+6

Since internal tests don't have access to internal symbols in libm, exclude
them for internal tests. Also make tst-strtod5 and tst-strtod5i depend on
$(libm) to support older versions of GCC which can't inline copysign family
functions. This fixes BZ #32414.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sunil K Pandey <skpgkp2@gmail.com>
(cherry picked from commit 5df09b444835fca6e64b3d4b4a5beb19b3b2ba21)

2024-12-10 | malloc: add indirection for malloc(-like) functions in tests [BZ #32366] | Sam James | 9 files, -2/+53

GCC 15 introduces allocation dead code removal (DCE) for PR117370 in
r15-5255-g7828dc070510f8. This breaks various glibc tests which want to
assert various properties of the allocator without doing anything obviously
useful with the allocated memory.

Alexander Monakov rightly pointed out that we can and should do better than
passing -fno-malloc-dce to paper over the problem. Not least because GCC 14
already does such DCE where there's no testing of malloc's return value
against NULL, and LLVM has such optimisations too.

Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure from the compiler that we're calling malloc
(et al.).

Reviewed-by: Paul Eggert <eggert@cs.ucla.edu>
(cherry picked from commit a9944a52c967ce76a5894c30d0274b824df43c7a)

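The wrapper technique looks roughly like this (a sketch of the idea, not the
actual tst-malloc-aux.h contents):

#include <stdlib.h>

/* Route the call through a volatile function pointer so the compiler
   cannot prove which function is called and therefore cannot treat the
   "unused" allocation as dead code. */
static void *(*volatile calloc_indirect) (size_t, size_t) = calloc;

int
main (void)
{
  volatile size_t nmemb = 16;      /* a volatile size defeats folding too */
  char *p = calloc_indirect (nmemb, 1);
  if (p == NULL)
    return 1;
  p[0] = 1;                        /* touch the memory */
  free (p);
  return 0;
}
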
2024-12-06 | Pass -nostdlib -nostartfiles together with -r [BZ #31753] | H.J. Lu | 1 file, -1/+2

Since -r in GCC 6/7/8 doesn't imply -nostdlib -nostartfiles, update the
link-static-libc.out rule to also pass -nostdlib -nostartfiles. This fixes
BZ #31753.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
(cherry picked from commit 2be3352f0b1ebaa39