glibc.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2025-03-18	AArch64: Optimize algorithm in users of SVE expf helper	Pierre Blanchard	3	-26/+16
	Polynomial order was unnecessarily high, unlocking multiple optimizations. Max error for new SVE expf is 0.88 +0.5ULP. Max error for new SVE coshf is 2.56 +0.5ULP. Performance improvement on Neoverse V1: expf (30%), coshf (26%). Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-24	AArch64: Simplify lrint	Wilco Dijkstra	1	-51/+0
	Simplify lrint. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-13	AArch64: Improve codegen for SVE powf	Yat Long Poon	1	-58/+59
	Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 3%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13	AArch64: Improve codegen for SVE pow	Yat Long Poon	1	-103/+142
	Move constants to struct. Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 24%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13	AArch64: Improve codegen for SVE erfcf	Yat Long Poon	1	-6/+6
	Reduce number of MOV/MOVPRFXs and use unpredicated FMUL. Replace MUL with LSL. Speedup on Neoverse V1: 6%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13	Aarch64: Improve codegen in SVE exp and users, and update expf_inline	Luna Lamb	5	-49/+59
	Use unpredicted muls, and improve memory access. 7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1, for exp, exp2 and cosh respectively. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13	Aarch64: Improve codegen in SVE asinh	Luna Lamb	1	-34/+77
	Use unpredicated muls, use lanewise mla's and improve memory access. 1% regression in throughput microbenchmark on Neoverse V1. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-13	aarch64: Use 64-bit variable to access the special registers	Adhemerval Zanella	2	-12/+27
	clang issues: error: value size does not match register size specified by the constraint and modifier [-Werror,-Wasm-operand-widths] while tryng to use 32 bit variables with 'mrs' to get/set the fpsr, dczid_el0, and ctr.
2025-01-03	AArch64: Improve codegen in SVE expm1f and users	Luna Lamb	4	-45/+44
	Use unpredicated muls, use absolute compare and improve memory access. Expm1f, sinhf and tanhf show 7%, 5% and 1% improvement in throughput microbenchmark on Neoverse V1.
2025-01-03	AArch64: Add vector tanpi routines	Joe Ramsay	12	-1/+336
	Vector variant of the new C23 tanpi. New tests pass on AArch64.
2025-01-03	AArch64: Add vector cospi routines	Joe Ramsay	12	-0/+319
	Vector variant of the new C23 cospi. New tests pass on AArch64.
2025-01-03	AArch64: Add vector sinpi to libmvec	Joe Ramsay	12	-0/+309
	Vector variant of the new C23 sinpi. New tests pass on AArch64.
2025-01-03	AArch64: Improve codegen for SVE log1pf users	Yat Long Poon	5	-122/+95
	Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs. Move log1pf implementation to inline helper function. Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
2025-01-03	AArch64: Improve codegen for SVE logs	Yat Long Poon	3	-46/+113
	Reduce memory access by using lanewise MLA and moving constants to struct and reduce number of MOVPRFXs. Update maximum ULP error for double log_sve from 1 to 2. Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
2025-01-03	AArch64: Improve codegen in SVE tans	Luna Lamb	2	-41/+68
	Improves memory access. Tan: MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return. Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move mov away from return.
2025-01-03	AArch64: Improve codegen in AdvSIMD asinh	Luna Lamb	1	-55/+119
	Improves memory access and removes spills. Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. Reduces MOVs 6->3 , LDR 11->5, STR/STP 2->0, ADRP 3->2.
2025-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	183	-183/+183

2024-12-17	AArch64: Improve codegen of AdvSIMD expf family	Joana Cruz	5	-118/+127
	Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. Also use intrinsics instead of native operations. expf: 3% improvement in throughput microbenchmark on Neoverse V1, exp2f: 5%, exp10f: 13%, coshf: 14%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-12-17	AArch64: Improve codegen of AdvSIMD atan(2)(f)	Joana Cruz	3	-68/+160
	Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. 8% improvement in throughput microbenchmark on Neoverse V1. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-12-17	AArch64: Improve codegen of AdvSIMD logf function family	Joana Cruz	3	-40/+66
	Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. 8% improvement in throughput microbenchmark on Neoverse V1 for log2 and log, and 2% for log10. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-12-09	AArch64: Improve codegen in users of ADVSIMD expm1 helper	Pierre Blanchard	5	-162/+135
	Add inline helper for expm1 and rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Speedup on Neoverse V1 for expm1 (19%), sinh (8.5%), and tanh (7.5%).
2024-12-09	AArch64: Improve codegen in users of ADVSIMD log1p helper	Pierre Blanchard	4	-127/+93
	Add inline helper for log1p and rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Speedup on Neoverse V1 for log1p (3.5%), acosh (7.5%) and atanh (10%).
2024-12-09	AArch64: Improve codegen in AdvSIMD logs	Pierre Blanchard	3	-106/+140
	Remove spurious ADRP and a few MOVs. Reduce memory access by using more indexed MLAs in polynomial. Align notation so that algorithms are easier to compare. Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%). Update error threshold in AdvSIMD log (now matches SVE log).
2024-12-09	AArch64: Improve codegen in AdvSIMD pow	Pierre Blanchard	1	-53/+62
	Remove spurious ADRP. Improve memory access by shuffling constants and using more indexed MLAs. A few more optimisation with no impact on accuracy - force fmas contraction - switch from shift-aided rint to rint instruction Between 1 and 5% throughput improvement on Neoverse V1 depending on benchmark.
2024-11-01	AArch64: Remove SVE erf and erfc tables	Joe Ramsay	16	-2691/+50
	By using a combination of mask-and-add instead of the shift-based index calculation the routines can share the same table as other variants with no performance degradation. The tables change name because of other changes in downstream AOR. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-10-28	AArch64: Small optimisation in AdvSIMD erf and erfc	Joe Ramsay	2	-15/+23
	In both routines, reduce register pressure such that GCC 14 emits no spills for erf and fewer spills for erfc. Also use more efficient comparison for the special-case in erf. Benchtests show erf improves by 6.4%, erfc by 1.0%.
2024-09-23	AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines	Joe Ramsay	5	-38/+30
	This operation can be simplified to use simpler multiply-round-convert sequence, which uses fewer instructions and constants. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23	AArch64: Improve codegen in users of ADVSIMD expm1f helper	Joe Ramsay	4	-91/+58
	Rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23	AArch64: Improve codegen in users of AdvSIMD log1pf helper	Joe Ramsay	5	-139/+146
	log1pf is quite register-intensive - use fewer registers for the polynomial, and make various changes to shorten dependency chains in parent routines. There is now no spilling with GCC 14. Accuracy moves around a little - comments adjusted accordingly but does not require regen-ulps. Use the helper in log1pf as well, instead of having separate implementations. The more accurate polynomial means special-casing can be simplified, and the shorter dependency chain avoids the usual dance around v0, which is otherwise difficult. There is a small duplication of vectors containing 1.0f (or 0x3f800000) - GCC is not currently able to efficiently handle values which fit in FMOV but not MOVI, and are reinterpreted to integer. There may be potential for more optimisation if this is fixed. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23	AArch64: Improve codegen in SVE F32 logs	Joe Ramsay	3	-47/+69
	Reduce MOVPRFXs by using unpredicated (non-destructive) instructions where possible. Similar to the recent change to AdvSIMD F32 logs, adjust special-case arguments and bounds to allow for more optimal register usage. For all 3 routines one MOVPRFX remains in the reduction, which cannot be avoided as immediate AND and ASR are both destructive. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23	AArch64: Improve codegen in SVE expf & related routines	Joe Ramsay	5	-148/+136
	Reduce MOV and MOVPRFX by improving special-case handling. Use inline helper to duplicate the entire computation between the special- and non-special case branches, removing the contention for z0 between x and the return value. Also rearrange some MLAs and MLSs - by making the multiplicand the destination we can avoid a MOVPRFX in several cases. Also change which constants go in the vector used for lanewise ops - the last lane is no longer wasted. Spotted that shift was incorrect in exp2f and exp10f, w.r.t. to the comment that explains it. Fixed - worst-case ULP for exp2f moves around but it doesn't change significantly for either routine. Worst-case error for coshf increases due to passing x to exp rather than abs(x) - updated the comment, but does not require regen-ulps. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-19	AArch64: Add vector logp1 alias for log1p	Joe Ramsay	7	-0/+25
	This enables vectorisation of C23 logp1, which is an alias for log1p. There are no new tests or ulp entries because the new symbols are simply aliases. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-09	aarch64: Avoid redundant MOVs in AdvSIMD F32 logs	Joe Ramsay	3	-45/+72
	Since the last operation is destructive, the first argument to the FMA also has to be the first argument to the special-case in order to avoid unnecessary MOVs. Reorder arguments and adjust special-case bounds to facilitate this. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-05-21	aarch64/fpu: Add vector variants of pow	Joe Ramsay	19	-12/+2223
	Plus a small amount of moving includes around in order to be able to remove duplicate definition of asuint64. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16	aarch64/fpu: Add vector variants of cbrt	Joe Ramsay	12	-0/+513
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16	aarch64/fpu: Add vector variants of hypot	Joe Ramsay	12	-0/+316
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-14	aarch64: Fix AdvSIMD libmvec routines for big-endian	Joe Ramsay	17	-85/+119
	Previously many routines used * to load from vector types stored in the data table. This is emitted as ldr, which byte-swaps the entire vector register, and causes bugs for big-endian when not all lanes contain the same value. When a vector is to be used this way, it has been replaced with an array and the load with an explicit ld1 intrinsic, which byte-swaps only within lanes. As well, many routines previously used non-standard GCC syntax for vector operations such as indexing into vectors types with [] and assembling vectors using {}. This syntax should not be mixed with ACLE, as the former does not respect endianness whereas the latter does. Such examples have been replaced with, for instance, vcombine_* and vgetq_lane* intrinsics. Helpers which only use the GCC syntax, such as the v_call helpers, do not need changing as they do not use intrinsics. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of erfc	Joe Ramsay	15	-1/+4884
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of tanh	Joe Ramsay	12	-1/+366
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of sinh	Joe Ramsay	14	-0/+559
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of atanh	Joe Ramsay	12	-0/+275
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of asinh	Joe Ramsay	12	-0/+476
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of acosh	Joe Ramsay	17	-0/+640
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of cosh	Joe Ramsay	16	-1/+635
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04	aarch64/fpu: Add vector variants of erf	Joe Ramsay	17	-1/+4518
	Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-02-26	aarch64/fpu: Sync libmvec routines from 2.39 and before with AOR	Joe Ramsay	18	-105/+111
	This includes a fix for big-endian in AdvSIMD log, some cosmetic changes, and numerous small optimisations mainly around inlining and using indexed variants of MLA intrinsics. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	121	-121/+121

2023-12-20	aarch64: Add SIMD attributes to math functions with vector versions	Joe Ramsay	2	-0/+113
	Added annotations for autovec by GCC and GFortran - this enables GCC >= 9 to autovectorise math calls at -Ofast. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-12-20	aarch64: Add half-width versions of AdvSIMD f32 libmvec routines	Joe Ramsay	18	-14/+108
	Compilers may emit calls to 'half-width' routines (two-lane single-precision variants). These have been added in the form of wrappers around the full-width versions, where the low half of the vector is simply duplicated. This will perform poorly when one lane triggers the special-case handler, as there will be a redundant call to the scalar version, however this is expected to be rare at Ofast. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2023-11-29	aarch64: Improve special-case handling in AdvSIMD double-precision libmvec ↵	Joe Ramsay	1	-1/+7
	routines Avoids emitting many saves/restores of vector registers, reduces the amount of code generated around the scalar fallback.