OpenBLAS

mirror of https://github.com/OpenMathLib/OpenBLAS synced 2026-06-05 00:17:12 +08:00

Files

Fadi Arafeh f30202b705 Accelerate SVE128 SBGEMM/BGEMM

This accelerates SBGEMM/BGEMM by extending the existing 8x4 kernel to 8x8 (unrolling N by 8 instead of 4).
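To illustrate what widening the tile in N means, here is a minimal scalar sketch in C. It is not the SVE kernel from this commit: the names `micro_kernel` and `sweep_n`, the packed layout of A, and the use of truncating bf16 conversion are all assumptions made for illustration. It only shows that an 8-wide tile covers the same block of C with half as many kernel calls as a 4-wide tile, so each value loaded from the A panel is reused across twice as many columns.

```c
#include <stdint.h>
#include <string.h>

#define MR 8      /* rows of the C tile (fixed in this sketch) */
#define NR_MAX 8  /* widest tile in N this sketch supports */

/* bfloat16 is the upper 16 bits of an IEEE-754 binary32 value. */
static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Truncating conversion (no rounding); enough for small test data. */
static inline uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Hypothetical reference micro-kernel: C[0..MR) x [0..nr) += A * B.
 * a: k x MR panel, a[p*MR + i]   (assumed packed layout)
 * b: k x nr slice, b[p*ldb + j]
 * c: row-major fp32 output with leading dimension ldc
 * Requires nr <= NR_MAX. */
static void micro_kernel(int nr, int k, const uint16_t *a,
                         const uint16_t *b, int ldb, float *c, int ldc) {
    float acc[MR][NR_MAX] = {{0}};
    for (int p = 0; p < k; p++) {
        for (int i = 0; i < MR; i++) {
            /* One load of A is reused for all nr columns of the tile. */
            float ai = bf16_to_f32(a[p * MR + i]);
            for (int j = 0; j < nr; j++)
                acc[i][j] += ai * bf16_to_f32(b[p * ldb + j]);
        }
    }
    for (int i = 0; i < MR; i++)
        for (int j = 0; j < nr; j++)
            c[i * ldc + j] += acc[i][j];
}

/* Cover an MR x n block of C in steps of nr columns (n % nr == 0 assumed).
 * Returns the number of micro-kernel calls, i.e. passes over the A panel. */
int sweep_n(int nr, int n, int k, const uint16_t *a,
            const uint16_t *b, int ldb, float *c, int ldc) {
    int calls = 0;
    for (int j0 = 0; j0 < n; j0 += nr) {
        micro_kernel(nr, k, a, b + j0, ldb, c + j0, ldc);
        calls++;
    }
    return calls;
}
```

Calling `sweep_n` with `nr = 4` and with `nr = 8` on the same inputs produces the same C block; the 8-wide variant simply reads the A panel once per 8 columns instead of once per 4. A real kernel keeps the accumulators in vector registers, so the tile size is bounded by the register file.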

I am not sure whether deleting the previous 8x4 kernel is a good idea.

Here are the speedups on a single Neoverse-V2 core (SVE128) compared to the previous state:

Per-shape speedup:
  M=N=K=64: SBGEMM 1.164x (16.42%), BGEMM 1.133x (13.30%)
  M=N=K=128: SBGEMM 1.220x (22.02%), BGEMM 1.186x (18.56%)
  M=N=K=256: SBGEMM 1.241x (24.08%), BGEMM 1.235x (23.54%)
  M=N=K=512: SBGEMM 1.240x (23.95%), BGEMM 1.227x (22.75%)
  M=N=K=1024: SBGEMM 1.251x (25.11%), BGEMM 1.232x (23.23%)
  M=N=K=2048: SBGEMM 1.235x (23.47%), BGEMM 1.246x (24.64%)

Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com>

2026-03-05 13:50:07 +00:00

alpha

Further rearranged the rotm kernel for the different architectures.

2025-01-22 11:41:12 +08:00

arm

Merge pull request #5081 from XiWeiGu/kernel_generic_fixed_cscal_zscal

2025-06-12 01:03:00 -07:00

arm64

Accelerate SVE128 SBGEMM/BGEMM

2026-03-05 13:50:07 +00:00

csky

Further rearranged the rotm kernel for the different architectures.