path: root/sysdeps/aarch64
Age | Commit message | Author | Files | Lines
2025-04-15 | aarch64: Add back non-temporal load/stores from oryon-1's memset | Andrew Pinski | 1 | -0/+26
I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memset for large sizes. The recommendation against them applies only to device memory, and memset is already not valid to be used with device memory. This reverts commit e6590f0c86632c36c9a784cf96075f4be2e920d2.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
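For context, a minimal sketch of what a non-temporal store looks like from C (illustrative only, not the glibc memset code; the helper name is invented): stnp stores a register pair while hinting that the cache line need not be retained, which is why it pays off for very large fills whose data will not be re-read soon.

  #include <stdint.h>

  /* Illustrative sketch only, not glibc's memset: store 16 bytes with a
     non-temporal pair store.  STNP hints that the cache line need not be
     kept, which is what makes it attractive for very large memsets.  */
  static inline void
  store16_nontemporal (uint8_t *dst, uint64_t lo, uint64_t hi)
  {
    asm volatile ("stnp %1, %2, [%0]"
                  : /* no outputs */
                  : "r" (dst), "r" (lo), "r" (hi)
                  : "memory");
  }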
2025-04-15 | aarch64: Add back non-temporal load/stores from oryon-1's memcpy | Andrew Pinski | 1 | -0/+40
I misunderstood the recommendation from the hardware team about non-temporal load/stores. It is still recommended to use them in memcpy for large sizes. The recommendation against them applies only to device memory, and memcpy is already not valid to be used with device memory. This reverts commit eb5eeb47403e0a91de834868e501b4d62b8d2cb9.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-03-31 | aarch64: Fix _dl_tlsdesc_dynamic unwind for pac-ret (BZ 32612) | Adhemerval Zanella | 1 | -12/+12
When libgcc is built with pac-ret, it needs to authenticate the unwinding frame based on CFI information. _dl_tlsdesc_dynamic uses a custom calling convention, where it is responsible for saving and restoring all registers it might use (even volatile ones). The pac-ret support added by 1be3d6eb823d8b952fa54b7bbc90cbecb8981380 covered only the slow path, but the fast path also adds DWARF register rule instructions (cfi_adjust_cfa_offset) since it needs to save/restore some auxiliary registers. This does not seem to be fully supported by either libgcc or the AArch64 ABI [1]. Instead, move paciasp/autiasp to the function prologue/epilogue so they cover both the fast and slow paths.

I also corrected the _dl_tlsdesc_dynamic comment description; it was copied from the i386 implementation without any adjustment.

Checked on aarch64-linux-gnu with a toolchain built with --enable-standard-branch-protection on a system with pac-ret support.

[1] https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst#id1
Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-03-18 | AArch64: Optimize algorithm in users of SVE expf helper | Pierre Blanchard | 3 | -26/+16
The polynomial order was unnecessarily high; lowering it unlocked multiple optimizations. Max error for the new SVE expf is 0.88 +0.5ULP. Max error for the new SVE coshf is 2.56 +0.5ULP. Performance improvement on Neoverse V1: expf (30%), coshf (26%).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-03-12 | math: Refactor how to use libm-test-ulps | Adhemerval Zanella | 2 | -1783/+0
The current approach tracks the maximum supported math errors by explicitly setting them per function and per architecture. On newer implementations or newer compiler versions, the file is updated with higher values if the tests show them. The idea is to track the maximum known error and to update the manual with the obtained values.

The constant updating of libm-test-ulps shows little value: it is usually a mechanical change done by the maintainer, for past releases it is usually ignored whether the ulp change resulted from a compiler regression, and the math tests already have a maximum ulp error that triggers a regression. A recent update after the new, correctly rounded acosf implementation [1] showed that the libm-test-ulps change was indeed caused by a compiler issue.

This patch removes all arch-specific libm-test-ulps files, adds generic system libm-test-ulps files where applicable, and changes their semantics. The generic files now track specific implementation constraints, such as whether the implementation is expected to be correctly rounded or whether the system-specific version has different error expectations. Multiple libm-test-ulps files can now be defined, with the system-specific file overriding the generic one. This covers the case where an arch-specific implementation might show worse precision than the generic one, for instance cbrtf on i686.

Regressions are only reported if an implementation shows errors larger than 9 ulps (13 for IBM long double) unless overridden by libm-test-ulps, and the maximum error is no longer printed at the end of the tests. The regen-ulps rule is also removed since it no longer makes sense to update libm-test-ulps automatically.

The manual error table is also removed; Paul Zimmermann and others have been tracking libm precision with a more comprehensive analysis for several releases, so link to his work instead.

[1] https://sourceware.org/git/?p=glibc.git;a=commit;h=9cc9f8e11e8fb8f54f1e84d9f024917634a78201
2025-02-27 | AArch64: Use prefer_sve_ifuncs for SVE memset | Wilco Dijkstra | 1 | -1/+1
Use prefer_sve_ifuncs for SVE memset just like memcpy. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
2025-02-24 | AArch64: Remove LP64 and ILP32 ifdefs | Wilco Dijkstra | 3 | -32/+9
Remove LP64 and ILP32 ifdefs. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 | AArch64: Simplify lrint | Wilco Dijkstra | 1 | -51/+0
Simplify lrint. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 | AArch64: Remove AARCH64_R macro | Wilco Dijkstra | 3 | -20/+16
Remove AArch64_R relocation macro. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 | AArch64: Cleanup pointer mangling | Wilco Dijkstra | 3 | -34/+16
Cleanup pointer mangling. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 | AArch64: Remove PTR_REG defines | Wilco Dijkstra | 6 | -67/+40
Remove PTR_REG defines. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-24 | AArch64: Remove PTR_ARG/SIZE_ARG defines | Wilco Dijkstra | 30 | -93/+0
This series removes various ILP32 defines that are now no longer needed. Remove PTR_ARG/SIZE_ARG. Reviewed-by: Adhemerval Zanella  <adhemerval.zanella@linaro.org>
2025-02-20 | AArch64: Add SVE memset | Wilco Dijkstra | 4 | -0/+129
Add SVE memset based on the generic memset with predicated load for sizes < 16. Unaligned memsets of 128-1024 are improved by ~20% on average by using aligned stores for the last 64 bytes. Performance of random memset benchmark improves by ~2% on Neoverse V1. Reviewed-by: Yury Khrustalev <yury.khrustalev@arm.com>
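To make the small-size strategy concrete, here is a rough sketch in SVE intrinsics (illustrative only, not the glibc assembly; the function name is invented): a whilelt predicate enables just the first n lanes, so one predicated store covers any size up to a vector length and no scalar tail is needed for sizes below 16 bytes.

  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Sketch only: set n bytes (n no larger than one SVE vector, which always
     holds for n < 16) with a single predicated store.  */
  static void
  sve_memset_small (uint8_t *dst, uint8_t c, size_t n)
  {
    svbool_t pg = svwhilelt_b8_u64 (0, n);   /* first n lanes active */
    svst1_u8 (pg, dst, svdup_n_u8 (c));      /* predicated byte store */
  }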
2025-02-13 | AArch64: Improve codegen for SVE powf | Yat Long Poon | 1 | -58/+59
Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 3%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13 | AArch64: Improve codegen for SVE pow | Yat Long Poon | 1 | -103/+142
Move constants to struct. Improve memory access with indexed/unpredicated instructions. Eliminate register spills. Speedup on Neoverse V1: 24%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13 | AArch64: Improve codegen for SVE erfcf | Yat Long Poon | 1 | -6/+6
Reduce number of MOV/MOVPRFXs and use unpredicated FMUL. Replace MUL with LSL. Speedup on Neoverse V1: 6%. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13 | Aarch64: Improve codegen in SVE exp and users, and update expf_inline | Luna Lamb | 5 | -49/+59
Use unpredicated muls and improve memory access. 7%, 3% and 1% improvement in the throughput microbenchmark on Neoverse V1, for exp, exp2 and cosh respectively.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-02-13 | Aarch64: Improve codegen in SVE asinh | Luna Lamb | 1 | -34/+77
Use unpredicated muls, use lanewise MLAs and improve memory access. 1% regression in the throughput microbenchmark on Neoverse V1.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
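To illustrate what a lanewise MLA buys (a sketch under assumptions, not the routine itself; the helper name is invented): several polynomial coefficients can sit in one vector, and the indexed FMLA intrinsic picks a coefficient by lane, so no separate broadcast load or governing predicate is needed for each multiply-add.

  #include <arm_sve.h>

  /* Illustrative sketch: one polynomial step acc + x * c, with c taken from
     lane 1 of each 128-bit segment of a vector holding the coefficients.
     svmla_lane_f32 maps to the indexed, unpredicated FMLA.  */
  static inline svfloat32_t
  poly_step (svfloat32_t acc, svfloat32_t x, svfloat32_t coeffs)
  {
    return svmla_lane_f32 (acc, x, coeffs, 1);
  }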
2025-02-12 | math: Use tanpif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic tanpif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  85.1683    47.7990    43.88%
x86_64v2                76.8219    41.4679    46.02%
x86_64v3                73.7775    37.7734    48.80%
aarch64 (Neoverse)      35.4514    18.0742    49.02%
power8                  22.7604    10.1054    55.60%
power10                 22.1358     9.9553    55.03%

reciprocal-throughput   master     patched    improvement
x86_64                  41.0174    19.4718    52.53%
x86_64v2                34.8565    11.3761    67.36%
x86_64v3                34.0325     9.6989    71.50%
aarch64 (Neoverse)      25.4349     9.2017    63.82%
power8                  13.8626     3.8486    72.24%
power10                 11.7933     3.6420    69.12%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use sinpif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic sinpif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  47.5710    38.4455    19.18%
x86_64v2                46.8828    40.7563    13.07%
x86_64v3                44.0034    34.1497    22.39%
aarch64 (Neoverse)      19.2493    14.1968    26.25%
power8                  23.5312    16.3854    30.37%
power10                 22.6485    10.2888    54.57%

reciprocal-throughput   master     patched    improvement
x86_64                  21.8858    11.6717    46.67%
x86_64v2                22.0620    11.9853    45.67%
x86_64v3                21.5653    11.3291    47.47%
aarch64 (Neoverse)      13.0615     6.5499    49.85%
power8                  16.2030     6.9580    57.06%
power10                 12.8911     4.2858    66.75%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use cospif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic cospif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  47.4679    38.4157    19.07%
x86_64v2                46.9686    38.3329    18.39%
x86_64v3                43.8929    31.8510    27.43%
aarch64 (Neoverse)      18.8867    13.2089    30.06%
power8                  22.9435     7.8023    65.99%
power10                 15.4472     7.77505   49.67%

reciprocal-throughput   master     patched    improvement
x86_64                  20.9518    11.4991    45.12%
x86_64v2                19.8699    10.5921    46.69%
x86_64v3                19.3475     9.3998    51.42%
aarch64 (Neoverse)      12.5767     6.2158    50.58%
power8                  15.0566     3.2654    78.31%
power10                  9.2866     3.1147    66.46%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use atanpif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic atanpif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  66.3296    52.7558    20.46%
x86_64v2                66.0429    51.4007    22.17%
x86_64v3                60.6294    48.7876    19.53%
aarch64 (Neoverse)      24.3163    20.9110    14.00%
power8                  16.5766    13.3620    19.39%
power10                 16.5115    13.4072    18.80%

reciprocal-throughput   master     patched    improvement
x86_64                  30.8599    16.0866    47.87%
x86_64v2                29.2286    15.4688    47.08%
x86_64v3                23.0960    12.8510    44.36%
aarch64 (Neoverse)      15.4619    10.6752    30.96%
power8                   7.9200     5.2483    33.73%
power10                  6.8539     4.6262    32.50%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use atan2pif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic atan2pif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  79.4006    70.8726    10.74%
x86_64v2                77.5136    69.1424    10.80%
x86_64v3                71.8050    68.1637     5.07%
aarch64 (Neoverse)      27.8363    24.7700    11.02%
power8                  39.3893    17.2929    56.10%
power10                 19.7200    16.8187    14.71%

reciprocal-throughput   master     patched    improvement
x86_64                  38.3457    30.9471    19.29%
x86_64v2                37.4023    30.3112    18.96%
x86_64v3                33.0713    24.4891    25.95%
aarch64 (Neoverse)      19.3683    15.3259    20.87%
power8                  19.5507     8.27165   57.69%
power10                  9.05331    7.63775   15.64%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use asinpif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic asinpif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  46.4996    41.6126    10.51%
x86_64v2                46.7551    38.8235    16.96%
x86_64v3                42.6235    33.7603    20.79%
aarch64 (Neoverse)      17.4161    14.3604    17.55%
power8                  10.7347     9.0193    15.98%
power10                 10.6420     9.0362    15.09%

reciprocal-throughput   master     patched    improvement
x86_64                  24.7208    16.5544    33.03%
x86_64v2                24.2177    14.8938    38.50%
x86_64v3                20.5617    10.5452    48.71%
aarch64 (Neoverse)      13.4827     7.17613   46.78%
power8                   6.46134    3.56089   44.89%
power10                  5.79007    3.49544   39.63%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-02-12 | math: Use acospif from CORE-MATH | Adhemerval Zanella | 1 | -4/+0
The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance than the generic acospif. The code was adapted to glibc style and to use the definitions from math_config.h (to handle errno, overflow, and underflow).

Benchtest on x86_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1):

latency                 master     patched    improvement
x86_64                  54.8281    42.9070    21.74%
x86_64v2                54.1717    42.7497    21.08%
x86_64v3                49.3552    34.1512    30.81%
aarch64 (Neoverse)      17.9395    14.3733    19.88%
power8                  20.3110     8.8609    56.37%
power10                 11.3113     8.84067   21.84%

reciprocal-throughput   master     patched    improvement
x86_64                  21.2301    14.4803    31.79%
x86_64v2                20.6858    13.9506    32.56%
x86_64v3                16.1944    11.3377    29.99%
aarch64 (Neoverse)      11.4474     7.13282   37.69%
power8                  10.6916     3.57547   66.56%
power10                  4.64269    3.54145   23.72%

Reviewed-by: DJ Delorie <dj@redhat.com>
2025-01-20 | aarch64: Fix tests not compatible with targets supporting GCS | Yury Khrustalev | 1 | -1/+3
- Add GCS marking to some of the tests when target supports GCS
- Fix tst-ro-dynamic-mod.map linker script to avoid removing GNU properties
- Add header with macros for GNU properties
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 | aarch64: Add GCS user-space allocation logic | Szabolcs Nagy | 3 | -1/+93
Allocate GCS based on the stack size; this can be used for coroutines (makecontext) and thread creation (if the kernel allows user-allocated GCS).
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
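As a rough sketch of what user-space GCS allocation involves (assumptions: __NR_map_shadow_stack and the SHADOW_STACK_SET_TOKEN flag come from the Linux uapi headers, and the sizing heuristic below is invented merely to mirror the "based on the stack size" idea; this is not the glibc code):

  #include <stddef.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef SHADOW_STACK_SET_TOKEN
  # define SHADOW_STACK_SET_TOKEN (1UL << 0)   /* assumed uapi value */
  #endif

  /* Sketch only: a Guarded Control Stack cannot come from plain mmap; the
     kernel hands it out via the map_shadow_stack syscall.  */
  static void *
  allocate_gcs_for_stack (size_t stack_size)
  {
    /* Hypothetical heuristic: the GCS only holds return addresses, so a
       fraction of the thread stack size is ample; clamp to one page.  */
    size_t gcs_size = stack_size / 2;
    if (gcs_size < 4096)
      gcs_size = 4096;
    return (void *) syscall (__NR_map_shadow_stack, 0UL, gcs_size,
                             SHADOW_STACK_SET_TOKEN);
  }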
2025-01-20 | aarch64: Ignore GCS property of ld.so | Szabolcs Nagy | 1 | -0/+5
check_gcs is called for each dependency of a DSO, but the GNU property of ld.so is not processed, so ldso->l_mach.gcs may not be correct. Just assume ld.so is GCS compatible independently of its ELF marking.
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 | aarch64: Handle GCS marking | Szabolcs Nagy | 3 | -6/+103
- Handle GCS marking
- Use l_searchlist.r_list for gcs (allows using the same function for static exe)
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-20 | aarch64: Use l_searchlist.r_list for bti | Szabolcs Nagy | 1 | -3/+2
Allows using the same function for static exe. Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 | aarch64: Mark objects with GCS property note | Szabolcs Nagy | 1 | -2/+3
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-20 | aarch64: Enable GCS in dynamic linked exe | Szabolcs Nagy | 4 | -6/+66
Use the dynamic linker start code to enable GCS in the dynamically linked case after _dl_start returns and before _dl_start_user, which marks the point after which user code may run. As in the statically linked case, this ensures that GCS is enabled on a top-level stack frame.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-20 | aarch64: Add glibc.cpu.aarch64_gcs tunable | Szabolcs Nagy | 1 | -0/+6
This tunable controls the Guarded Control Stack (GCS) for the process.

0 = disabled: do not enable GCS
1 = enforced: check markings and fail if any binary is not marked
2 = optional: check markings but keep GCS off if a binary is unmarked
3 = override: enable GCS, markings are ignored

By default it is 0, so GCS is disabled; value 1 will enable GCS.

The status is stored into GL(dl_aarch64_gcs) early and only applied later, since enabling GCS is tricky: it must happen on a top-level stack frame. GL is used instead of GLRO because it may need updates depending on loaded libraries, which happen after read-only protection is applied; however, library-marking-based GCS setting is not yet implemented.

Describe the new tunable in the manual.
Co-authored-by: Yury Khrustalev <yury.khrustalev@arm.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-20 | aarch64: Mark swapcontext with indirect_return | Szabolcs Nagy | 1 | -0/+36
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-20 | aarch64: Add GCS support to longjmp | Szabolcs Nagy | 2 | -0/+40
This implementation ensures that longjmp across different stacks works: it scans for the GCS cap token and switches GCS if necessary, then the target GCSPR is restored with a GCSPOPM loop once the current GCSPR is on the same GCS. This makes longjmp linear time in the number of jumped-over stack frames when GCS is enabled.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 | aarch64: Define jmp_buf offset for GCS | Szabolcs Nagy | 1 | -0/+62
The target-specific internal __longjmp is called with a __jmp_buf argument which has its size exposed in the ABI. On aarch64 this has no space left, so GCSPR cannot be restored in longjmp in the usual way, which is needed for the Guarded Control Stack (GCS) extension.

setjmp is implemented via __sigsetjmp, which has a jmp_buf argument; however, it is also called with a __pthread_unwind_buf_t argument cast to jmp_buf (in cancellation cleanup code built with -fno-exceptions). The two types, jmp_buf and __pthread_unwind_buf_t, have common bits beyond the __jmp_buf field and there is unused space there which we can use for saving GCSPR.

For this to work, some bits of those two generic types have to be reserved for target-specific use, and the generic code in glibc has to ensure that __longjmp is always called with a __jmp_buf that is embedded into one of those two types. Morally __longjmp should be changed to take jmp_buf as argument, but that is an intrusive change across targets.

Note: longjmp is never called with __pthread_unwind_buf_t from user code, only the internal __libc_longjmp is called with that type, and thus the two types could have separate longjmp implementations on a target. We don't rely on this now (but might in the future, given that cancellation unwind does not need to restore GCSPR).

Given the above, this patch finds an unused slot for GCSPR. This placement is not exposed in the ABI so it may change in the future. It is also very target-ABI specific, so the generic types cannot be easily changed to clearly mark the reserved fields.

Reviewed-by: Carlos O'Donell <carlos@redhat.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-20 | aarch64: Add asm helpers for GCS | Szabolcs Nagy | 1 | -0/+7
The Guarded Control Stack instructions can be present even if the hardware does not support the extension (runtime checked feature), so the asm code should be backward compatible with old assemblers. Reviewed-by: Carlos O'Donell <carlos@redhat.com> Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
2025-01-13 | aarch64: Use 64-bit variable to access the special registers | Adhemerval Zanella | 3 | -13/+28
clang issues

  error: value size does not match register size specified by the constraint
  and modifier [-Werror,-Wasm-operand-widths]

when trying to use 32-bit variables with 'mrs' to get/set the fpsr, dczid_el0, and ctr registers.
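A minimal example of the resulting pattern (hedged: the helper name is invented, but mrs and dczid_el0 are architectural): mrs always targets a full 64-bit X register, so the C variable it is tied to must be 64-bit wide for clang to accept the operand.

  #include <stdint.h>

  /* Minimal sketch: read an AArch64 system register into a 64-bit variable.
     Using a 32-bit variable here is what triggers clang's
     -Wasm-operand-widths error.  */
  static inline uint64_t
  read_dczid_el0 (void)
  {
    uint64_t val;
    asm volatile ("mrs %0, dczid_el0" : "=r" (val));
    return val;
  }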
2025-01-09 | math: Fix acosf when building with gcc <= 11 | Adhemerval Zanella | 1 | -2/+0
GCC <= 11 wrongly assumes the rounding mode is to-nearest and performs constant folding where it should instead defer the evaluation to runtime, since the result is not exact [1].

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57245
2025-01-03 | AArch64: Improve codegen in SVE expm1f and users | Luna Lamb | 4 | -45/+44
Use unpredicated muls, use absolute compare and improve memory access. Expm1f, sinhf and tanhf show 7%, 5% and 1% improvement in throughput microbenchmark on Neoverse V1.
2025-01-03 | AArch64: Add vector tanpi routines | Joe Ramsay | 13 | -1/+344
Vector variant of the new C23 tanpi. New tests pass on AArch64.
2025-01-03 | AArch64: Add vector cospi routines | Joe Ramsay | 13 | -0/+327
Vector variant of the new C23 cospi. New tests pass on AArch64.
2025-01-03 | AArch64: Add vector sinpi to libmvec | Joe Ramsay | 13 | -0/+317
Vector variant of the new C23 sinpi. New tests pass on AArch64.
2025-01-03 | AArch64: Improve codegen for SVE log1pf users | Yat Long Poon | 5 | -122/+95
Reduce memory access by using lanewise MLA and reduce number of MOVPRFXs. Move log1pf implementation to inline helper function. Speedup on Neoverse V1 for log1pf (10%), acoshf (-1%), atanhf (2%), asinhf (2%).
2025-01-03 | AArch64: Improve codegen for SVE logs | Yat Long Poon | 4 | -47/+114
Reduce memory access by using lanewise MLA, moving constants to a struct, and reducing the number of MOVPRFXs. Update the maximum ULP error for double log_sve from 1 to 2. Speedup on Neoverse V1 for log (3%), log2 (5%), and log10 (4%).
2025-01-03 | AArch64: Improve codegen in SVE tans | Luna Lamb | 2 | -41/+68
Improves memory access. Tan: MOVPRFX 7 -> 2, LD1RD 12 -> 5, move MOV away from return. Tanf: MOV 2 -> 1, MOVPRFX 6 -> 3, LD1RW 5 -> 4, move mov away from return.
2025-01-03 | AArch64: Improve codegen in AdvSIMD asinh | Luna Lamb | 1 | -55/+119
Improves memory access and removes spills. Load the polynomial evaluation coefficients into 2 vectors and use lanewise MLAs. Reduces MOVs 6->3, LDR 11->5, STR/STP 2->0, ADRP 3->2.