Commit Graph

2428 Commits

Author SHA1 Message Date
Annop Wongwathanarat
edef2e4441 Fix bug in ARM64 sbgemv_t 2025-03-13 20:55:31 +00:00
Martin Kroeker
b55ca71d5b Merge pull request #5182 from annop-w/sgemm_ncopy
Optimize aarch64 sgemm_ncopy
2025-03-13 16:04:39 +01:00
Martin Kroeker
2f778554b8 Merge pull request #5181 from taoye9/change_sbgemn_cast_bf16
replace custom bf16_to_fp32 with arm neon vcvtah_f32_bf16
2025-03-13 13:50:26 +01:00
Annop Wongwathanarat
9807f56580 Optimize aarch64 sgemm_ncopy 2025-03-13 10:17:43 +00:00
Martin Kroeker
a3e7b16072 Merge pull request #5157 from manaalmj/feature
Optimize gemv_n_sve kernel
2025-03-12 21:08:23 +01:00
Ye Tao
4c00099ed6 replace custom bf16_to_fp32 with arm neon vcvtah_f32_bf16 2025-03-12 16:20:15 +00:00
Annop Wongwathanarat
a085b6c9ec Fix aarch64 sbgemv_t compilation error for GCC < 13 2025-03-12 14:52:42 +00:00
manjam01
5c4e38ab17 Optimize gemv_n_sve kernel 2025-03-10 16:39:20 +00:00
Martin Kroeker
1d5ed5c46b Merge pull request #5168 from taoye9/add_sbgemvn_on_neonversen2
Add dispatch of SBGEMVNKERNEL for NEOVERSEN2 and NEOVERSEV2
2025-03-04 16:39:22 +01:00
Ye Tao
6b8b35cdf2 fix minor issues of redeclaration of float x0,x1 in sbgemv_n_neon.c 2025-03-03 11:55:27 +00:00
Ye Tao
38ee7c9301 Add dispatch of SBGEMVNKERNEL for NEOVERSEN2 and NEOVERSEV2 2025-03-03 11:32:05 +00:00
Martin Kroeker
2b941c44b5 Merge branch 'develop' into sbgemv_n_neon 2025-03-02 22:39:32 +01:00
Ye Tao
35bdbca153 Add sbgemv_n_neon kernel for arm64. 2025-02-28 14:37:06 +00:00
Annop Wongwathanarat
edaf51dd99 Add sbgemv_t_bfdot kernel for ARM64
This improves performance for sbgemv_t by up to 100x on NEOVERSEV1.
The geometric mean speedup is ~61x for M=N=[2,512].
2025-02-28 12:31:50 +00:00
Martin Kroeker
77fba0f400 Fix "dummy2" flag handling 2025-02-22 20:09:21 +01:00
Martin Kroeker
eb84aac7ad Merge pull request #5084 from quic/topic/sgemm_direct_sme1
Support for SGEMM_DIRECT Kernel based on SME1
2025-02-19 10:56:49 +01:00
Martin Kroeker
b9ae246f20 define USE_TRMM for RISCV64 targets as well 2025-02-16 23:18:04 +01:00
Vaisakh K V
f66ca05b31 Merge branch 'develop' into topic/sgemm_direct_sme1 2025-02-13 14:54:37 +05:30
Vaisakh K V
d23eb3b93e Support for SME1 based sgemm_direct kernel for cblas_sgemm level 3 API
* Added ARMV9SME target
* Added SGEMM_DIRECT kernel based on SME1
2025-02-13 14:51:21 +05:30
Martin Kroeker
8d487ef6eb Merge pull request #5124 from XiWeiGu/LoongArch64-LA264-lapack-fixed
LoongArch64: Fixed lapack test for LA264
2025-02-12 14:58:30 +01:00
Martin Kroeker
81eed868b6 Restore the non-vectorized code from before PR4880 for POWER8 2025-02-12 09:07:20 +01:00
Martin Kroeker
98b5ef929c Restore the non-vectorized code from before PR4880 for POWER8 2025-02-12 09:04:22 +01:00
gxw
2c4a5cc6e6 LoongArch64: Fixed snrm2_lsx.S and cnrm2_lsx.S
When the data type is single-precision real or single-precision complex,
converting it to double precision does not prevent overflow (as exposed in LAPACK tests).
The only solution is to follow C's approach: find the maximum value in the
array and divide each element by that maximum to avoid this issue
2025-02-12 15:48:01 +08:00
gxw
9e75d6b3d1 LoongArch64: Fixed swap_lsx.S
Fixed the error when the stride is zero
2025-02-12 14:57:35 +08:00
gxw
e8c740368c LoongArch64: Fixed rot_lsx.S and crot_lsx.S
Do not check whether the input parameters c and s are zero,
as this may cause errors with special values (same as scal).
Although OpenBLAS's own test suite doesn't catch this, it will
cause LAPACK test cases to fail.
2025-02-12 14:52:49 +08:00
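The point about special values in the commit message above can be illustrated with a plain C rotation that applies the standard formula unconditionally. This is a sketch, not the LSX kernel: the name `plain_srot` and the unit-stride signature are assumptions for this example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reference rotation (not the OpenBLAS kernel):
 *   x[i] <- c*x[i] + s*y[i]
 *   y[i] <- c*y[i] - s*x[i]
 * There is deliberately no shortcut for c == 0 or s == 0, so a NaN or
 * infinity in either vector still reaches both outputs through 0 * NaN. */
void plain_srot(size_t n, float *x, float *y, float c, float s)
{
    assert((x != NULL && y != NULL) || n == 0);

    for (size_t i = 0; i < n; i++) {
        float xi = x[i];
        float yi = y[i];
        x[i] = c * xi + s * yi;
        y[i] = c * yi - s * xi;
    }
}
```

With c = 1, s = 0, x = {NaN} and y = {1}, this version leaves NaN in y as well, because 0 * NaN is NaN. A kernel that skipped the s terms when s is zero would return y unchanged, which is the kind of divergence the commit message attributes to the LAPACK test failures.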
Hao Chen
c2212d0abd LoongArch64: Fixed copy_lsx.S
Fixed incorrect store operation

Signed-off-by: gxw <guxiwei-hf@loongson.cn>
2025-02-12 14:52:20 +08:00
Hao Chen
7f1ebc7ae6 LoongArch64: Fixed iamax_lsx.S
Fixed index retrieval issue when there are
identical maximum absolute values

Signed-off-by: Hao Chen <chenhao@loongson.cn>
Signed-off-by: gxw <guxiwei-hf@loongson.cn>
2025-02-12 14:44:44 +08:00
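The tie case mentioned in the commit message above can be shown with a small C reference. The commit does not state which index is expected; this sketch assumes the reference BLAS convention of returning the 1-based index of the first element with the largest absolute value. The name `first_isamax` and the unit-stride signature are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical reference (not the OpenBLAS kernel): 1-based index of the
 * first element with the largest absolute value, or 0 when n == 0. */
size_t first_isamax(size_t n, const float *x)
{
    assert(x != NULL || n == 0);

    if (n == 0)
        return 0;

    size_t best = 0;
    float maxabs = fabsf(x[0]);
    for (size_t i = 1; i < n; i++) {
        float a = fabsf(x[i]);
        /* Strict '>' keeps the earliest index when magnitudes are equal. */
        if (a > maxabs) {
            maxabs = a;
            best = i;
        }
    }
    return best + 1;
}
```

For {1, -5, 5, 2} the elements at positions 2 and 3 share the maximum magnitude, and the strict comparison selects position 2. A vectorized kernel has to reproduce the same choice when it reduces across lanes.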
Hao Chen
31d326f895 LoongArch64: Fixed dot_lsx.S
Fixed incorrect register usage in instructions

Signed-off-by: gxw <guxiwei-hf@loongson.cn>
2025-02-12 14:44:11 +08:00
Hao Chen
5d6356bc16 LoongArch64: Fixed amax_lsx.S
Fixed register zeroing operation

Signed-off-by: Hao Chen <chenhao@loongson.cn>
Signed-off-by: gxw <guxiwei-hf@loongson.cn>
2025-02-12 14:39:29 +08:00
Ye Tao
c748e6a338 optimized sbgemm kernel for neoverse-v1 (sve-256)
Signed-off-by: Ye Tao <ye.tao@arm.com>
2025-02-05 10:06:37 +00:00
Aditya Tewari
4379a6fbe3 * checkpoint sbgemm for SVE-256 2025-02-03 12:49:49 +00:00
Martin Kroeker
d7036cfd74 Remove trailing blanks that break the cmake parser 2025-01-27 09:32:17 +01:00
Martin Kroeker
6e393a5599 Merge branch 'develop' into gemv_t 2025-01-25 12:54:04 +01:00
Martin Kroeker
876ba58e28 Merge pull request #5091 from goplanid/develop
Small gemm kernel improvements for AArch64
2025-01-24 10:59:16 +01:00
Martin Kroeker
180ba5e7d0 Merge pull request #5069 from tingboliao/dev_rotm_20250107
Further rearranged the rotm kernel for the different architectures.
2025-01-23 10:16:43 +01:00
Deeksha Goplani
d1bfa979f7 small gemm kernel packing modifications 2025-01-23 09:41:45 +05:30
Martin Kroeker
1a6a9fb22f add another generator line for rotm 2025-01-23 00:17:04 +01:00
Martin Kroeker
4924319c50 fix position of srotm, qrotm 2025-01-22 16:07:35 +01:00
Martin Kroeker
b58cba9eb6 fix qrotm build rules 2025-01-22 15:51:49 +01:00
tingbo.liao
3c8df6358f Further rearranged the rotm kernel for the different architectures.
Signed-off-by: tingbo.liao <tingbo.liao@starfivetech.com>
2025-01-22 11:41:12 +08:00
Annop Wongwathanarat
c0318cea6e Simplify gemv_t_sve_v1x3 kernel 2025-01-21 13:40:17 +00:00
Martin Kroeker
87083fdbf6 [WIP] Work around assembler limitations in current LLVM for Windows on Arm (#5076)
* Protect align directives in assembly files that are currently problematic with LLVM on WoA

* use the armv8 zdot on WoA to work around other LLVM issues
2025-01-18 16:45:56 +01:00
tingbo.liao
ef7f54b357 Optimized the gemm_tcopy_8_rvv to be compatible with the vlens 128 and 256.
Signed-off-by: tingbo.liao <tingbo.liao@starfivetech.com>
2025-01-15 11:31:28 +08:00
gxw
e0a8216554 LoongArch64: Update dsymv LSX version 2025-01-14 19:45:42 +08:00
gxw
a9070ba3f9 LoongArch64: Update ssymv LSX version 2025-01-14 09:06:59 +00:00
Xi Ruoyao
af10c132b8 LoongArch64: Fix dsymv and ssymv LASX version
"fmov.d $f2, $f4" leaves all the bits higher than the 63-th bit
unpredictable but it's obvious that the following code uses the value of
those high bits.  We actually want to replicate the lower 64 bits here,
so we should use xvreplve0.d instead.

LA464 (Loongson 3[A-Z]-5000) happens to replicate them for us due to
some uarch internal details so the issue was not detected, but for LA664
(Loongson 3[A-Z]-6000) and future uarch we need to do things correctly
or we end up getting a lot of test failures.

Closes: https://bbs.aosc.io/t/topic/302
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2025-01-13 22:16:00 +08:00
Martin Kroeker
d74eb02954 Merge pull request #5057 from martin-frbg/issue5050
Replace while loop in generic C/ZGEMM_BETA to avoid going out of bounds
2025-01-11 11:33:56 -08:00
Martin Kroeker
30f7a4120b Merge pull request #5056 from tingboliao/dev_omatcopy_20250108
Optimize the omatcopy_cn/zomatcopy_cn kernels with RVV 1.0 intrinsic.
2025-01-11 09:42:57 -08:00
gxw
20a8e48f25 LoongArch64: Update ssymv LASX version 2025-01-10 16:02:54 +08:00
gxw
e0748588b8 LoongArch64: Update dsymv LASX version 2025-01-10 14:52:57 +08:00