path: root/sysdeps/aarch64/fpu
2025-03-18  AArch64: Optimize algorithm in users of SVE expf helper  (Pierre Blanchard, 3 files changed, -26/+16)
The polynomial order was unnecessarily high; reducing it unlocked multiple optimizations. Max error for the new SVE expf is 0.88 +0.5 ULP; for the new SVE coshf, 2.56 +0.5 ULP. Performance improvement on Neoverse V1: expf 30%, coshf 26%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-24  AArch64: Simplify lrint  (Wilco Dijkstra, 1 file changed, -51/+0)
Simplify lrint. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-13  AArch64: Improve codegen for SVE powf  (Yat Long Poon, 1 file changed, -58/+59)
Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 3%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13  AArch64: Improve codegen for SVE pow  (Yat Long Poon, 1 file changed, -103/+142)
Move constants to struct. Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 24%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
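For illustration, a minimal sketch of the two ideas this and the neighbouring SVE commits describe: constants gathered into one struct so a single base register can address them, and unpredicated, lane-indexed MLAs so coefficients share a vector instead of occupying one register each. The names and coefficient values below are placeholders, not the glibc pow code.

    #include <arm_sve.h>

    static const struct
    {
      double c0, c1, c2, c3;   /* placeholder coefficients */
    } data = { 0x1p-1, -0x1.8p-3, 0x1.8p-4, -0x1p-5 };

    static inline svfloat64_t
    poly4_sketch (svbool_t pg, svfloat64_t r)
    {
      /* One quadword load per pair of coefficients, both addressed off
         the same base register.  */
      svfloat64_t c01 = svld1rq_f64 (svptrue_b64 (), &data.c0);
      svfloat64_t c23 = svld1rq_f64 (svptrue_b64 (), &data.c2);

      svfloat64_t r2 = svmul_f64_x (pg, r, r);
      /* Indexed FMLA: the odd coefficient comes from a lane, the even
         one is broadcast with DUP; both forms are unpredicated.  */
      svfloat64_t p01 = svmla_lane_f64 (svdup_lane_f64 (c01, 0), r, c01, 1);
      svfloat64_t p23 = svmla_lane_f64 (svdup_lane_f64 (c23, 0), r, c23, 1);
      /* c0 + c1*r + (c2 + c3*r) * r^2  */
      return svmla_f64_x (pg, p01, r2, p23);
    }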
2025-02-13  AArch64: Improve codegen for SVE erfcf  (Yat Long Poon, 1 file changed, -6/+6)
Reduce number of MOV/MOVPRFXs and use unpredicated FMUL. Replace MUL with LSL. Speedup on Neoverse V1: 6%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13  Aarch64: Improve codegen in SVE exp and users, and update expf_inline  (Luna Lamb, 5 files changed, -49/+59)
Use unpredicated muls and improve memory access. 7%, 3% and 1% improvement in throughput microbenchmark on Neoverse V1 for exp, exp2 and cosh respectively. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13  Aarch64: Improve codegen in SVE asinh  (Luna Lamb, 1 file changed, -34/+77)
Use unpredicated muls, use lanewise MLAs and improve memory access. 1% regression in throughput microbenchmark on Neoverse V1. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-13  aarch64: Use 64-bit variable to access the special registers  (Adhemerval Zanella, 2 files changed, -12/+27)
clang issues the error "value size does not match register size specified by the constraint and modifier [-Werror,-Wasm-operand-widths]" when trying to use 32-bit variables with 'mrs' to get/set the fpsr, dczid_el0, and ctr registers.
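A minimal sketch of the pattern this describes (illustrative helpers, not the actual glibc change): the special register is read and written through a 64-bit variable so clang accepts the operand width implied by the mrs/msr constraint.

    #include <stdint.h>

    /* Read FPSR into a 64-bit variable; a 32-bit variable makes clang
       reject the operand width of the "mrs" constraint.  */
    static inline uint64_t
    get_fpsr (void)
    {
      uint64_t fpsr;
      __asm__ __volatile__ ("mrs %0, fpsr" : "=r" (fpsr));
      return fpsr;
    }

    static inline void
    set_fpsr (uint64_t fpsr)
    {
      __asm__ __volatile__ ("msr fpsr, %0" : : "r" (fpsr));
    }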
2025-01-03  AArch64: Improve codegen in SVE expm1f and users  (Luna Lamb, 4 files changed, -45/+44)
Use unpredicated muls, use absolute compare and improve memory access. Expm1f, sinhf and tanhf show 7%, 5% and 1% improvement in throughput microbenchmark on Neoverse V1.
2025-01-03  AArch64: Add vector tanpi routines  (Joe Ramsay, 12 files changed, -1/+336)
Vector variant of the new C23 tanpi. New tests pass on AArch64.
2025-01-03  AArch64: Add vector cospi routines  (Joe Ramsay, 12 files changed, -0/+319)
Vector variant of the new C23 cospi. New tests pass on AArch64.
2025-01-03  AArch64: Add vector sinpi to libmvec  (Joe Ramsay, 12 files changed, -0/+309)
Vector variant of the new C23 sinpi. New tests pass on AArch64.
2025-01-03  AArch64: Improve codegen for SVE log1pf users  (Yat Long Poon, 5 files changed, -122/+95)
Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs. Move log1pf implementation to inline helper function. Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
2025-01-03  AArch64: Improve codegen for SVE logs  (Yat Long Poon, 3 files changed, -46/+113)
Reduce memory access by using lanewise MLA and moving constants to struct and reduce number of MOVPRFXs. Update maximum ULP error for double log_sve from 1 to 2. Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
2025-01-03  AArch64: Improve codegen in SVE tans  (Luna Lamb, 2 files changed, -41/+68)
Improves memory access. Tan: MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return. Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move MOV away from return.
2025-01-03  AArch64: Improve codegen in AdvSIMD asinh  (Luna Lamb, 1 file changed, -55/+119)
Improves memory access and removes spills. Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. Reduces MOVs 6->3, LDR 11->5, STR/STP 2->0, ADRP 3->2.
2025-01-01  Update copyright dates with scripts/update-copyrights  (Paul Eggert, 183 files changed, -183/+183)
2024-12-17  AArch64: Improve codegen of AdvSIMD expf family  (Joana Cruz, 5 files changed, -118/+127)
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. Also use intrinsics instead of native operations. expf: 3% improvement in throughput microbenchmark on Neoverse V1, exp2f: 5%, exp10f: 13%, coshf: 14%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
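As a rough illustration of the coefficients-in-two-vectors / lanewise-MLA idea used in this and several later commits, here is a sketch only; the function name, coefficient layout and polynomial degree are assumptions, not the glibc code.

    #include <arm_neon.h>

    /* Evaluate p(r) = c0 + c1*r + ... + c7*r^7 with c0..c3 in one vector
       and c4..c7 in another; each FMA reads its odd coefficient from a
       lane instead of holding it in a register of its own.  */
    static inline float32x4_t
    poly8_estrin (float32x4_t r, float32x4_t c03, float32x4_t c47)
    {
      float32x4_t r2 = vmulq_f32 (r, r);
      float32x4_t r4 = vmulq_f32 (r2, r2);

      float32x4_t p01 = vfmaq_laneq_f32 (vdupq_laneq_f32 (c03, 0), r, c03, 1);
      float32x4_t p23 = vfmaq_laneq_f32 (vdupq_laneq_f32 (c03, 2), r, c03, 3);
      float32x4_t p45 = vfmaq_laneq_f32 (vdupq_laneq_f32 (c47, 0), r, c47, 1);
      float32x4_t p67 = vfmaq_laneq_f32 (vdupq_laneq_f32 (c47, 2), r, c47, 3);

      float32x4_t p03 = vfmaq_f32 (p01, p23, r2);  /* terms in c0..c3 */
      float32x4_t p47 = vfmaq_f32 (p45, p67, r2);  /* terms in c4..c7 */
      return vfmaq_f32 (p03, p47, r4);
    }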
2024-12-17  AArch64: Improve codegen of AdvSIMD atan(2)(f)  (Joana Cruz, 3 files changed, -68/+160)
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. 8% improvement in throughput microbenchmark on Neoverse V1. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-12-17  AArch64: Improve codegen of AdvSIMD logf function family  (Joana Cruz, 3 files changed, -40/+66)
Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. 8% improvement in throughput microbenchmark on Neoverse V1 for log2 and log, and 2% for log10. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-12-09  AArch64: Improve codegen in users of ADVSIMD expm1 helper  (Pierre Blanchard, 5 files changed, -162/+135)
Add inline helper for expm1 and rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Speedup on Neoverse V1 for expm1 (19%), sinh (8.5%), and tanh (7.5%).
2024-12-09  AArch64: Improve codegen in users of ADVSIMD log1p helper  (Pierre Blanchard, 4 files changed, -127/+93)
Add inline helper for log1p and rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Speedup on Neoverse V1 for log1p (3.5%), acosh (7.5%) and atanh (10%).
2024-12-09  AArch64: Improve codegen in AdvSIMD logs  (Pierre Blanchard, 3 files changed, -106/+140)
Remove spurious ADRP and a few MOVs. Reduce memory access by using more indexed MLAs in polynomial. Align notation so that algorithms are easier to compare. Speedup on Neoverse V1 for log10 (8%), log (8.5%), and log2 (10%). Update error threshold in AdvSIMD log (now matches SVE log).
2024-12-09  AArch64: Improve codegen in AdvSIMD pow  (Pierre Blanchard, 1 file changed, -53/+62)
Remove spurious ADRP. Improve memory access by shuffling constants and using more indexed MLAs. A few more optimisations with no impact on accuracy: force FMA contraction, and switch from shift-aided rint to the rint instruction. Between 1% and 5% throughput improvement on Neoverse V1 depending on benchmark.
2024-11-01  AArch64: Remove SVE erf and erfc tables  (Joe Ramsay, 16 files changed, -2691/+50)
By using a combination of mask-and-add instead of the shift-based index calculation the routines can share the same table as other variants with no performance degradation. The tables change name because of other changes in downstream AOR. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-10-28  AArch64: Small optimisation in AdvSIMD erf and erfc  (Joe Ramsay, 2 files changed, -15/+23)
In both routines, reduce register pressure such that GCC 14 emits no spills for erf and fewer spills for erfc. Also use more efficient comparison for the special-case in erf. Benchtests show erf improves by 6.4%, erfc by 1.0%.
2024-09-23  AArch64: Simplify rounding-multiply pattern in several AdvSIMD routines  (Joe Ramsay, 5 files changed, -38/+30)
This operation can be simplified to use simpler multiply-round-convert sequence, which uses fewer instructions and constants. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
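As a hedged before/after sketch of the kind of simplification described here (constants, names and the exact sequence are assumptions, not the patch itself): the rounded quotient n used in exp-style reduction can come from a shift-constant trick or directly from a rounding instruction; the rounded result then typically feeds a convert such as vcvtq_s32_f32.

    #include <arm_neon.h>

    /* Old-style reduction: add a large "shift" constant so rounding
       happens as a side effect of the add, then subtract it again.  */
    static inline float32x4_t
    round_scaled_shift (float32x4_t x, float32x4_t inv_ln2, float32x4_t shift)
    {
      float32x4_t n = vfmaq_f32 (shift, x, inv_ln2);  /* shift + x/ln2 */
      return vsubq_f32 (n, shift);
    }

    /* Multiply-round sequence: one constant fewer, using the FRINTA
       round-to-nearest instruction directly.  */
    static inline float32x4_t
    round_scaled_frinta (float32x4_t x, float32x4_t inv_ln2)
    {
      return vrndaq_f32 (vmulq_f32 (x, inv_ln2));
    }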
2024-09-23  AArch64: Improve codegen in users of ADVSIMD expm1f helper  (Joe Ramsay, 4 files changed, -91/+58)
Rearrange operations so MOV is not necessary in reduction or around the special-case handler. Reduce memory access by using more indexed MLAs in polynomial. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23  AArch64: Improve codegen in users of AdvSIMD log1pf helper  (Joe Ramsay, 5 files changed, -139/+146)
log1pf is quite register-intensive - use fewer registers for the polynomial, and make various changes to shorten dependency chains in parent routines. There is now no spilling with GCC 14. Accuracy moves around a little - comments adjusted accordingly but does not require regen-ulps. Use the helper in log1pf as well, instead of having separate implementations. The more accurate polynomial means special-casing can be simplified, and the shorter dependency chain avoids the usual dance around v0, which is otherwise difficult. There is a small duplication of vectors containing 1.0f (or 0x3f800000) - GCC is not currently able to efficiently handle values which fit in FMOV but not MOVI, and are reinterpreted to integer. There may be potential for more optimisation if this is fixed. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23  AArch64: Improve codegen in SVE F32 logs  (Joe Ramsay, 3 files changed, -47/+69)
Reduce MOVPRFXs by using unpredicated (non-destructive) instructions where possible. Similar to the recent change to AdvSIMD F32 logs, adjust special-case arguments and bounds to allow for more optimal register usage. For all 3 routines one MOVPRFX remains in the reduction, which cannot be avoided as immediate AND and ASR are both destructive. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-23  AArch64: Improve codegen in SVE expf & related routines  (Joe Ramsay, 5 files changed, -148/+136)
Reduce MOV and MOVPRFX by improving special-case handling. Use inline helper to duplicate the entire computation between the special- and non-special case branches, removing the contention for z0 between x and the return value. Also rearrange some MLAs and MLSs - by making the multiplicand the destination we can avoid a MOVPRFX in several cases. Also change which constants go in the vector used for lanewise ops - the last lane is no longer wasted. Spotted that shift was incorrect in exp2f and exp10f, w.r.t. the comment that explains it. Fixed - worst-case ULP for exp2f moves around but it doesn't change significantly for either routine. Worst-case error for coshf increases due to passing x to exp rather than abs(x) - updated the comment, but does not require regen-ulps. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-09-19  AArch64: Add vector logp1 alias for log1p  (Joe Ramsay, 7 files changed, -0/+25)
This enables vectorisation of C23 logp1, which is an alias for log1p. There are no new tests or ulp entries because the new symbols are simply aliases. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
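For reference, one way such an alias can be expressed in GNU C; this is a sketch with made-up names and a placeholder body, and glibc itself uses its internal alias macros rather than this literal code.

    /* Placeholder definition standing in for the real log1p entry point.  */
    double
    my_log1p (double x)
    {
      return x;
    }

    /* logp1 is the C23 name: the same code exported under a second
       symbol via GCC's alias attribute.  */
    extern __typeof (my_log1p) my_logp1 __attribute__ ((alias ("my_log1p")));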
2024-09-09  aarch64: Avoid redundant MOVs in AdvSIMD F32 logs  (Joe Ramsay, 3 files changed, -45/+72)
Since the last operation is destructive, the first argument to the FMA also has to be the first argument to the special-case in order to avoid unnecessary MOVs. Reorder arguments and adjust special-case bounds to facilitate this. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2024-05-21  aarch64/fpu: Add vector variants of pow  (Joe Ramsay, 19 files changed, -12/+2223)
Also moves some includes around so that the duplicate definition of asuint64 can be removed. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16  aarch64/fpu: Add vector variants of cbrt  (Joe Ramsay, 12 files changed, -0/+513)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-16  aarch64/fpu: Add vector variants of hypot  (Joe Ramsay, 12 files changed, -0/+316)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-05-14  aarch64: Fix AdvSIMD libmvec routines for big-endian  (Joe Ramsay, 17 files changed, -85/+119)
Previously many routines used * to load from vector types stored in the data table. This is emitted as ldr, which byte-swaps the entire vector register, and causes bugs for big-endian when not all lanes contain the same value. When a vector is to be used this way, it has been replaced with an array and the load with an explicit ld1 intrinsic, which byte-swaps only within lanes. In addition, many routines previously used non-standard GCC syntax for vector operations, such as indexing into vector types with [] and assembling vectors using {}. This syntax should not be mixed with ACLE, as the former does not respect endianness whereas the latter does. Such examples have been replaced with, for instance, vcombine_* and vgetq_lane* intrinsics. Helpers which only use the GCC syntax, such as the v_call helpers, do not need changing as they do not use intrinsics. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
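A small sketch of the two replacements this describes, with a hypothetical table and helper names (the real routines differ):

    #include <arm_neon.h>

    struct table
    {
      float c[4];   /* plain array instead of a float32x4_t member */
    };

    static inline float32x4_t
    load_coeffs (const struct table *t)
    {
      /* Explicit ld1: byte-swaps within lanes only, so it is safe on
         big-endian, unlike dereferencing a float32x4_t * (ldr).  */
      return vld1q_f32 (t->c);
    }

    static inline float32x4_t
    join_halves (float32x2_t lo, float32x2_t hi)
    {
      /* ACLE intrinsic instead of the GCC-specific { lo, hi } syntax.  */
      return vcombine_f32 (lo, hi);
    }

    static inline float
    first_lane (float32x4_t v)
    {
      /* ACLE intrinsic instead of v[0].  */
      return vgetq_lane_f32 (v, 0);
    }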
2024-04-04  aarch64/fpu: Add vector variants of erfc  (Joe Ramsay, 15 files changed, -1/+4884)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of tanh  (Joe Ramsay, 12 files changed, -1/+366)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of sinh  (Joe Ramsay, 14 files changed, -0/+559)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of atanh  (Joe Ramsay, 12 files changed, -0/+275)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of asinh  (Joe Ramsay, 12 files changed, -0/+476)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of acosh  (Joe Ramsay, 17 files changed, -0/+640)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of cosh  (Joe Ramsay, 16 files changed, -1/+635)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-04-04  aarch64/fpu: Add vector variants of erf  (Joe Ramsay, 17 files changed, -1/+4518)
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
2024-02-26  aarch64/fpu: Sync libmvec routines from 2.39 and before with AOR  (Joe Ramsay, 18 files changed, -105/+111)
This includes a fix for big-endian in AdvSIMD log, some cosmetic changes, and numerous small optimisations mainly around inlining and using indexed variants of MLA intrinsics. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-01-01  Update copyright dates with scripts/update-copyrights  (Paul Eggert, 121 files changed, -121/+121)
2023-12-20  aarch64: Add SIMD attributes to math functions with vector versions  (Joe Ramsay, 2 files changed, -0/+113)
Added annotations for autovectorisation by GCC and GFortran; this enables GCC >= 9 to autovectorise math calls at -Ofast. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
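A hedged sketch of what such an annotation buys: the attribute below is GCC's generic "simd" mechanism and the function names are illustrative; the actual glibc declarations live in its math headers.

    /* Declaring the scalar prototype with GCC's "simd" attribute tells
       the vectoriser that vector variants following the vector ABI
       exist for this function.  */
    __attribute__ ((__simd__ ("notinbranch"))) double cos (double);

    void
    cos_all (double *a, int n)
    {
      /* With GCC >= 9 at -Ofast this loop can be autovectorised into
         calls to the libmvec vector cos.  */
      for (int i = 0; i < n; i++)
        a[i] = cos (a[i]);
    }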
2023-12-20  aarch64: Add half-width versions of AdvSIMD f32 libmvec routines  (Joe Ramsay, 18 files changed, -14/+108)
Compilers may emit calls to 'half-width' routines (two-lane single-precision variants). These have been added in the form of wrappers around the full-width versions, where the low half of the vector is simply duplicated. This will perform poorly when one lane triggers the special-case handler, as there will be a redundant call to the scalar version; however, this is expected to be rare at -Ofast. Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
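A sketch of the wrapper scheme described above; the function names are illustrative, not the actual libmvec symbols.

    #include <arm_neon.h>

    /* Full-width (four-lane) routine, provided elsewhere.  */
    float32x4_t v_sinf_full (float32x4_t x);

    /* Half-width (two-lane) wrapper: duplicate the low half so every
       lane of the full-width routine sees valid data, then return only
       the low half of the result.  */
    static inline float32x2_t
    v_sinf_half (float32x2_t x)
    {
      return vget_low_f32 (v_sinf_full (vcombine_f32 (x, x)));
    }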
2023-11-29  aarch64: Improve special-case handling in AdvSIMD double-precision libmvec routines  (Joe Ramsay, 1 file changed, -1/+7)
Avoids emitting many saves/restores of vector registers, and reduces the amount of code generated around the scalar fallback.