Martin Kroeker
1ee8879c78
Add VORTEXM4
2025-08-20 09:59:32 -07:00
Martin Kroeker
edaa73fd24
Hide the local 2VLx2VL symbol, as static is insufficient for this with gcc
2025-08-20 06:33:28 -07:00
Martin Kroeker
501728a354
adjust register 20 accesses to 21 after moving x18
2025-08-20 06:24:38 -07:00
Martin Kroeker
107c883c8a
Update SME-related kernels
2025-08-19 05:13:28 -07:00
Martin Kroeker
05dbb54362
Delete misplaced file
2025-08-19 05:12:09 -07:00
Martin Kroeker
4609732e69
Relax version number requirement for AppleClang
2025-08-18 14:54:20 -07:00
Martin Kroeker
bf98e448eb
Add VORTEXM4 to DYNAMIC_ARCH list
2025-08-18 14:43:08 -07:00
Martin Kroeker
0bc19a1335
Update SME kernel details
2025-08-18 14:38:16 -07:00
Martin Kroeker
426b5f23ed
Add compiler options for VORTEXM4
2025-08-18 14:35:36 -07:00
Martin Kroeker
4328c91e27
relax requirements in compiler SME capability check
2025-08-18 14:34:51 -07:00
Martin Kroeker
c794d0a4ce
Add VORTEXM4
2025-08-18 14:33:24 -07:00
Martin Kroeker
a4f5fec46e
Add compiler options for VORTEXM4
2025-08-18 14:32:07 -07:00
Martin Kroeker
ca542f319f
Add VORTEXM4
2025-08-18 08:41:38 -07:00
Martin Kroeker
18f9582f3e
Add VORTEXM4
2025-08-18 01:54:09 -07:00
Martin Kroeker
4e2a8c18e5
Split VORTEXM4 from VORTEX target due to SME support
2025-08-18 01:53:04 -07:00
Martin Kroeker
30970460b8
Add VORTEXM4 target
2025-08-18 01:52:05 -07:00
Martin Kroeker
b0a00fbd62
Add minimal compiler flags for VORTEXM4
2025-08-18 01:51:10 -07:00
Martin Kroeker
ccfd0170fb
Enable SME on MacOS and add VORTEXM4 to DYNAMIC_ARCH list
2025-08-18 01:50:13 -07:00
Martin Kroeker
ef0b883dff
Add sgemm_direct_performant for ARM64
2025-08-18 01:48:08 -07:00
Martin Kroeker
e76c39099a
Add sgemm_direct_performant for ARM64
2025-08-18 01:47:17 -07:00
Martin Kroeker
202a7a0e2a
Separate VORTEXM4 from VORTEX and ARMV9SME
2025-08-18 01:45:40 -07:00
Martin Kroeker
de91afd2ae
Move SGEMM_DIRECT after the CBLAS parameter check and add sgemm_direct_performant for ARM64
2025-08-18 01:44:21 -07:00
Martin Kroeker
0203657f40
Add sgemm_direct_performant for ARM64
2025-08-18 01:42:32 -07:00
Martin Kroeker
e82bcd2740
Update ARM64 sgemm_direct object generation
2025-08-18 01:41:13 -07:00
Martin Kroeker
731f4dd686
Add VORTEXM4 settings
2025-08-18 01:39:35 -07:00
Martin Kroeker
53d3bb50cc
Get symbol name from build system; change b.first to b.mi for AppleClang compatibility
2025-08-18 01:37:50 -07:00
Martin Kroeker
08a00326a4
Build symbol name from build system variables
2025-08-18 01:35:41 -07:00
Martin Kroeker
89898fc499
Add sgemm_direct_performant for switching between direct and regular kernels
2025-08-18 01:31:40 -07:00
Martin Kroeker
22c6607db9
Use ASMNAME to get symbol name from build system; leave x18 unused as reserved on MacOS
2025-08-18 01:30:10 -07:00
Martin Kroeker
ca22e28ca1
Rename sgemm_direct_sme1.S to sgemm_direct_sme1_2VLx2VL.S
2025-08-18 01:25:44 -07:00
Martin Kroeker
9c43301b6d
Merge pull request #5421 from reibax-marcus/develop
...
fix: broken cblas installation when using makefile based builds
2025-08-17 03:03:05 -07:00
Martin Kroeker
9d6df1dd3e
Merge pull request #5422 from ChipKerchner/addRVVVectorizedPacking
...
Add and use vectorized packing in ZVL128B and ZVL256B for RISCV
2025-08-16 13:45:35 -07:00
Martin Kroeker
f3b2a15fad
Merge pull request #5420 from yuanjia111/develop
...
Move the value assignment of vector x in gemv_n_sve.c to the outermos…
2025-08-16 12:06:53 -07:00
Chip Kerchner
64401b4417
Disable vectorized packing for DGEMM - since it is slower than scalar.
2025-08-13 13:41:12 +00:00
Martin Kroeker
5e43ba948c
Merge pull request #5419 from Mousius/bgemm-optimisation
...
Optimize SBGEMM / BGEMM for NEOVERSEV1 further
2025-08-13 02:10:20 -07:00
Chip Kerchner
c00afc86a6
Add and use vectorized packing to ZVL128B and ZVL256B. Up to 3x+ faster than generic scalar functions.
2025-08-12 17:18:56 +00:00
Xabier Marquiegui
3a6b79c50f
fix: broken cblas installation when using makefile based builds
...
Fix cblas.h missing from target directory if NO_CBLAS is defined but has
a value that indicates you do want cblas built and installed.
2025-08-12 14:41:15 +02:00
yuanjia
803e8d4838
Move the value assignment of vector x in gemv_n_sve.c to the outermost loop to reduce the repeated data retrieval.
...
1. Verified correctness using BLAS-Tester.
2. Verified performance using the built-in benchmark: float and double performance improved by about 60% and about 40%, respectively. The test commands are:
export OMP_NUM_THREADS=1;numactl -C 10 -l ./sgemv.goto 3000 4000 100
export OMP_NUM_THREADS=1;numactl -C 10 -l ./dgemv.goto 3000 4000 100
2025-08-12 18:03:16 +08:00
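The idea behind the change above can be sketched in scalar C. This is only an illustration under assumptions, not the actual SVE kernel in gemv_n_sve.c: the function names `gemv_n_naive` and `gemv_n_hoisted` are hypothetical, the matrix is assumed column-major, and a compiler may well perform this hoisting on its own for scalar code. The point is just that the scaled element of x depends only on the column index, so it can be computed once per column rather than once per row.

```c
#include <stddef.h>

/* Hypothetical reference: y += alpha * A * x, with A stored column-major
 * (m rows, n columns, leading dimension lda). x[j] is read and scaled
 * again on every inner-loop iteration. */
static void gemv_n_naive(size_t m, size_t n, float alpha,
                         const float *a, size_t lda,
                         const float *x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t i = 0; i < m; i++) {
            y[i] += alpha * x[j] * a[i + j * lda];
        }
    }
}

/* Hypothetical variant: the value derived from x[j] is assigned once in
 * the outer loop and reused across the whole column. */
static void gemv_n_hoisted(size_t m, size_t n, float alpha,
                           const float *a, size_t lda,
                           const float *x, float *y)
{
    for (size_t j = 0; j < n; j++) {
        const float xj = alpha * x[j];   /* loaded and scaled once per column */
        for (size_t i = 0; i < m; i++) {
            y[i] += xj * a[i + j * lda];
        }
    }
}
```

Both functions are meant to compute the same result; only the placement of the x access differs.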
Chris Sidebottom
5f47b872f1
Remove older kernels for BGEMM on NEOVERSEV1
2025-08-11 09:25:19 +00:00
Chris Sidebottom
114316f361
Optimize SBGEMM / BGEMM for NEOVERSEV1 further
...
This changes the kernels to pack full SVE vectors and reduces the
overall complexity of the inner GEMM loop.
2025-08-11 09:25:13 +00:00
Martin Kroeker
75c6ab4036
CI: Update WoA job to use LLVM 20.1.8 and avoid stray preinstalled LLVM19 (#5411)
...
* Update to 20.1.8
* fix PATH to avoid the obsolete LLVM19 that appeared in the preinstalled msvc folder hierarchy
2025-08-09 12:28:24 +02:00
Martin Kroeker
5c5f852ee3
Merge pull request #5415 from martin-frbg/Fixum-5399
...
Fix compilation of the NeoverseN2 SBGEMM kernel
2025-08-04 04:29:26 -07:00
Martin Kroeker
f1ee61ea30
Include NEON header for the bfloat conversion functions
2025-08-04 00:21:39 -07:00
Martin Kroeker
b3ffd5524a
Include NEON header for the bfloat conversion functions
2025-08-04 00:20:28 -07:00
Martin Kroeker
d23680b81d
Merge pull request #5407 from nakagawa-fj/feature/gemm_divide_rate_for_neoversev1
...
Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1
2025-07-30 13:19:50 -07:00
Martin Kroeker
b4cc4be2ce
Merge pull request #5410 from martin-frbg/issue5404
...
Adjust multithreading threshold in S/DGER and add an intermediate step
2025-07-30 12:16:05 -07:00
Martin Kroeker
0968dddf1a
Merge pull request #5409 from martin-frbg/issue5372
...
Work around gcc15.1 on POWER misoptimizing DGEMV at -O3
2025-07-30 10:36:39 -07:00
Martin Kroeker
eddfe1e6b3
Merge pull request #5408 from ChipKerchner/fixRISCV64GEMVInitializationAndWarnings
...
Fix bad vector zero initializer and other compiler warnings for RISC-V.
2025-07-30 08:43:08 -07:00
Martin Kroeker
30d11bc92c
Adjust multithreading threshold and add an intermediate step
2025-07-30 08:13:33 -07:00
Martin Kroeker
a3b9c933c5
mark xbuffer as volatile to work around gcc15.1 optimizer bug
2025-07-30 17:05:36 +02:00