Yichao Yu
b94e9b92ad
Fix compilation on ARM
...
Define a dummy function if SME is not supported, following what sgemm does
2025-10-11 20:28:59 -04:00
Martin Kroeker
e40714cabd
Merge pull request #5450 from quic/topic/strmm_direct_sme1
...
Support for SME1 based strmm_direct kernel for cblas_strmm level 3 API
2025-10-11 15:20:19 -07:00
changjua
644ea07ef9
Support for SME1 based strmm_direct kernel for cblas_strmm level 3 API
2025-10-10 10:48:27 +08:00
Chris Sidebottom
578e7dae85
Fix bf16->f32 conversion for NEOVERSEV1 and NEOVERSEN2 targets
...
This fixes an issue originally introduced with the BGEMM kernel.
I've updated the tests to run with `beta=1.0` so as to test loading and
updating from C.
Alongside this, the tests now return sensible return values to reduce
the risk of them being ignored.
Also fixed a bug in `generic/gemv_t.c` resulting in weird outputs for
`bgemv`.
2025-10-06 18:05:58 +00:00
Rajendra Prasad Matcha
19268471cc
Support for SME1 based ssymm_direct kernel for cblas_ssymm level 3 API
2025-09-30 15:05:33 +05:30
h-motoki
855945befb
Implementing SVE in [SD]AXPY Kernels for A64FX and Graviton3E
2025-08-21 20:56:58 +09:00
Martin Kroeker
f3b2a15fad
Merge pull request #5420 from yuanjia111/develop
...
Move the value assignment of vector x in gemv_n_sve.c to the outermos…
2025-08-16 12:06:53 -07:00
yuanjia
803e8d4838
Move the value assignment of vector x in gemv_n_sve.c to the outermost loop to reduce the repeated data retrieval.
...
1. Verified correctness using BLAS-Tester.
2. Verified performance using the built-in benchmark: float and double performance improved by about 60% and about 40%, respectively. The test commands are:
export OMP_NUM_THREADS=1;numactl -C 10 -l ./sgemv.goto 3000 4000 100
export OMP_NUM_THREADS=1;numactl -C 10 -l ./dgemv.goto 3000 4000 100
2025-08-12 18:03:16 +08:00
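The hoisting described in this commit can be sketched in scalar C. This is a minimal illustration, not the actual SVE kernel: the function name `gemv_n_hoisted`, its signature, and the column-major layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch: y += alpha * A * x for a column-major
 * m-by-n matrix A with leading dimension lda. Not the real kernel. */
static void gemv_n_hoisted(size_t m, size_t n, float alpha,
                           const float *a, size_t lda,
                           const float *x, size_t inc_x,
                           float *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; j++) {
        /* x[j] is read and scaled once per column in the outer loop ... */
        const float temp = alpha * x[j * inc_x];
        const float *col = a + j * lda;
        for (size_t i = 0; i < m; i++) {
            /* ... so the inner loop only touches A and y. */
            y[i] += temp * col[i];
        }
    }
}
```

Keeping the scaled element of x in a local variable means the inner loop performs no repeated reads of x, which is the effect the commit message describes.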
Chris Sidebottom
5f47b872f1
Remove older kernels for BGEMM on NEOVERSEV1
2025-08-11 09:25:19 +00:00
Chris Sidebottom
114316f361
Optimize SBGEMM / BGEMM for NEOVERSEV1 further
...
This changes the kernels to pack full SVE vectors and reduces the
overall complexity of the inner GEMM loop.
2025-08-11 09:25:13 +00:00
Martin Kroeker
f1ee61ea30
Include NEON header for the bfloat conversion functions
2025-08-04 00:21:39 -07:00
Martin Kroeker
b3ffd5524a
Include NEON header for the bfloat conversion functions
2025-08-04 00:20:28 -07:00
Martin Kroeker
a5e7c0e3e0
Merge pull request #5396 from abhishek-iitmadras/abhishekk_bfloat16
...
ARM64: Enable bfloat16 kernels by default
2025-07-28 13:39:08 -07:00
abhishek-fujitsu
0bc79da587
add neon header
2025-07-25 11:10:20 +05:30
Chris Sidebottom
ea2faf0c9a
Add optimized BGEMM for NEOVERSEN2 target
...
This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
2025-07-24 10:59:28 +00:00
Chris Sidebottom
2c3cdaf74e
Optimized BGEMV for NEOVERSEV1 target
...
- Adds bgemv T based off of sbgemv T kernel
- Adds bgemv N, slightly altered so that it does not use Y as an
accumulator, because the output is bf16 and accumulating in it loses
precision
- Enables BGEMM_GEMV_FORWARD to proxy BGEMM to BGEMV with new kernels
2025-07-23 10:51:41 +01:00
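The precision concern behind the bgemv N change can be shown with a small C sketch. It assumes the alternative is to accumulate in f32 and convert to bf16 once at the end; the helper names (`f32_to_bf16`, `bf16_to_f32`, `sum_in_bf16`, `sum_in_f32`) are hypothetical and are not the OpenBLAS conversion functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round-to-nearest-even f32 -> bf16 (NaN handling omitted for brevity). */
static uint16_t f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return (uint16_t)(u >> 16);
}

/* bf16 -> f32 is exact: the 16 bits become the top half of the float. */
static float bf16_to_f32(uint16_t h)
{
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Add v to a bf16 accumulator n times, rounding after every step. */
static float sum_in_bf16(float v, int n)
{
    assert(n >= 0);
    uint16_t acc = f32_to_bf16(0.0f);
    for (int i = 0; i < n; i++)
        acc = f32_to_bf16(bf16_to_f32(acc) + v);
    return bf16_to_f32(acc);
}

/* Accumulate in f32 and round to bf16 only once at the end. */
static float sum_in_f32(float v, int n)
{
    assert(n >= 0);
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += v;
    return bf16_to_f32(f32_to_bf16(acc));
}
```

With only 8 significant bits, a bf16 accumulator stops growing once the addend falls below half the spacing between representable values: summing 1.0 three hundred times stalls at 256, while the f32 accumulator reaches 300, which bf16 can represent exactly.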
Martin Kroeker
39c90f9859
Merge pull request #5380 from quic/topic/sgemm_direct_sme1_alpha_beta
...
SME1 based direct kernel (with alpha and beta) for cblas_sgemm level 3
2025-07-18 23:23:39 +02:00
Rajendra Prasad Matcha
eae0abfdb6
SME1 based direct kernel with alpha and beta for cblas_sgemm level 3 API.
2025-07-17 16:14:31 +05:30
Chris Sidebottom
740efd71c4
Add optimized BGEMM kernel for NEOVERSEV1 target
...
This also improves the testing and generic kernel by re-using the BF16
conversion functions.
Built on top of https://github.com/OpenMathLib/OpenBLAS/pull/5357 and derived from https://github.com/OpenMathLib/OpenBLAS/pull/5287
Co-authored-by: Ye Tao <ye.tao@arm.com>
2025-07-10 23:23:27 +00:00
Martin Kroeker
fd37406817
Merge branch 'develop' into optimized_gemv_n_1x3
2025-07-08 21:05:30 +02:00
Iha, Taisei
f7ad906b49
Performance improvements of [SD]DOT with loop-unrolling on A64FX
2025-07-04 22:57:44 +09:00
Martin Kroeker
ee26caffb3
Merge pull request #5309 from davidz-ampere/dev-ampereone
...
Add support for Ampere AmpereOne processors
2025-06-24 12:27:08 +02:00
davidz-ampere
aa90ab4142
Add support for Ampere AmpereOne processors
2025-06-24 00:12:34 -04:00
Ian McInerney
badef1d32e
Update sbgemm_tcopy_4_neoversev1 kernel to use standard C types
2025-06-19 14:26:16 +01:00
davidz-ampere
84730068af
reduce duplicate kernel code
2025-06-17 03:05:34 -04:00
davidz-ampere
be68ef03b4
Add support for Ampere processors
2025-06-15 22:00:40 -04:00
Martin Kroeker
58eeb9041c
fix handling of dummy2
2025-06-12 03:03:01 -07:00
Martin Kroeker
1589d0b21e
Merge pull request #5281 from martin-frbg/zscal_arm64
...
kernel/arm64: fixed cscal and zscal
2025-06-12 01:04:18 -07:00
Sharif Inamdar
8279e68805
Optimize gemv_n_sve_v1x3 kernel
...
- Calculate predicate outside the loop
- Divide matrix in blocks of 3
2025-06-11 10:16:56 +00:00
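The "blocks of 3" idea can be sketched in scalar C as follows. This is only an illustration under assumptions: the name `gemv_n_blocks_of_3`, the signature, and the column-major layout are hypothetical, and the SVE predicate computation mentioned in the commit is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar sketch: y += alpha * A * x, consuming the columns of a
 * column-major m-by-n matrix A three at a time. Not the real SVE kernel. */
static void gemv_n_blocks_of_3(size_t m, size_t n, float alpha,
                               const float *a, size_t lda,
                               const float *x, float *y)
{
    assert(lda >= m);
    size_t j = 0;

    /* Main loop: each pass over y handles three columns, so y is loaded
     * and stored once per block instead of once per column. */
    for (; j + 3 <= n; j += 3) {
        const float t0 = alpha * x[j];
        const float t1 = alpha * x[j + 1];
        const float t2 = alpha * x[j + 2];
        const float *c0 = a + j * lda;
        const float *c1 = c0 + lda;
        const float *c2 = c1 + lda;
        for (size_t i = 0; i < m; i++)
            y[i] += t0 * c0[i] + t1 * c1[i] + t2 * c2[i];
    }

    /* Remainder: zero, one, or two leftover columns. */
    for (; j < n; j++) {
        const float t = alpha * x[j];
        const float *c = a + j * lda;
        for (size_t i = 0; i < m; i++)
            y[i] += t * c[i];
    }
}
```

Processing several columns per pass reduces the traffic on y, at the cost of a separate remainder loop when the column count is not a multiple of three.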
Arne Juul
5442aff218
Accumulate results in output register explicitly
2025-06-09 19:03:22 +00:00
Martin Kroeker
28f8fdaf0f
support flag for NaN/Inf handling and fix scaling of NaN/Inf values
2025-05-23 14:59:59 +02:00
Martin Kroeker
5141a90993
Fix ARMV9SME target in DYNAMIC_ARCH and add SME query code for MacOS (#5222)
...
* Fix ARMV9SME target and add support_sme1 code for MacOS
* make sgemm_direct unconditionally available on all arm64
* build a (dummy) sgemm_direct kernel on all arm64
* Update dynamic_arm64.c
2025-05-10 22:39:32 +02:00
Martin Kroeker
151b74284e
Merge pull request #5203 from quic/fix-sgemmdirect-sme1
...
Add vector registers to clobber list to prevent compiler optimization.
2025-05-09 05:39:47 -07:00
abhishek-fujitsu
9c02cdb073
optimise dot using thread throttling for NEOVERSE V1
2025-04-23 22:35:05 +05:30
Martin Kroeker
d0e8fd6d40
Merge pull request #5239 from annop-w/gemv_n_sve
...
Use SVE kernel for S/DGEMVN for SVE machines
2025-04-22 10:19:49 -07:00
Iha, Taisei
08b5c18d70
Fixed a potential out-of-bounds access in gemv.
2025-04-22 19:56:44 +09:00
Annop Wongwathanarat
e11744a411
Use SVE kernel for S/DGEMVN for SVE machines
2025-04-22 09:40:13 +00:00
Martin Kroeker
dd38b4e811
Merge pull request #5225 from annop-w/gemv_n
...
Improve performance for SGEMVN on NEOVERSEN1
2025-04-17 01:54:10 -07:00
Martin Kroeker
0241d516f6
Merge pull request #5220 from iha-taisei/sdgemv_n_unroll
...
Further performance improvements to non-transposed [SD]GEMV kernels for A64FX and Neoverse V1.
2025-04-16 12:55:55 -07:00
Annop Wongwathanarat
d535728803
Improve performance for SGEMVN on NEOVERSEN1
2025-04-16 09:54:30 +00:00
Usui, Tetsuzo
d711906e3e
Add symv kernels for arm64
2025-04-11 20:39:52 +09:00
Iha, Taisei
f1e628b889
Further performance improvements to [SD]GEMV.
2025-04-11 20:00:33 +09:00
Annop Wongwathanarat
ec146157d3
Use SVE kernel for S/DGEMVT for SVE machines
2025-04-09 20:38:14 +00:00
Vaisakh K V
04915be829
Add vector registers to clobber list to prevent compiler optimization.
...
The SME-based SGEMMDIRECT kernel uses the vector (z) registers; adding them
to the clobber list informs the compiler that the assembly modifies them.
2025-04-03 12:18:43 +05:30
Ye Tao
f27ba5efd1
fix bugs in aarch64 sbgemv_n kernel
2025-03-14 17:55:40 +00:00
Annop Wongwathanarat
edef2e4441
Fix bug in ARM64 sbgemv_t
2025-03-13 20:55:31 +00:00
Martin Kroeker
b55ca71d5b
Merge pull request #5182 from annop-w/sgemm_ncopy
...
Optimize aarch64 sgemm_ncopy
2025-03-13 16:04:39 +01:00
Martin Kroeker
2f778554b8
Merge pull request #5181 from taoye9/change_sbgemn_cast_bf16
...
replace custom bf16_to_fp32 with Arm NEON vcvtah_f32_bf16
2025-03-13 13:50:26 +01:00
Annop Wongwathanarat
9807f56580
Optimize aarch64 sgemm_ncopy
2025-03-13 10:17:43 +00:00
Martin Kroeker
a3e7b16072
Merge pull request #5157 from manaalmj/feature
...
Optimize gemv_n_sve kernel
2025-03-12 21:08:23 +01:00