Commit Graph

9543 Commits

Author SHA1 Message Date
Martin Kroeker
1ee8879c78 Add VORTEXM4 2025-08-20 09:59:32 -07:00
Martin Kroeker
edaa73fd24 Hide the local 2VLx2VL symbol, as static is insufficient for this with gcc 2025-08-20 06:33:28 -07:00
Martin Kroeker
501728a354 adjust register 20 accesses to 21 after moving x18 2025-08-20 06:24:38 -07:00
Martin Kroeker
107c883c8a Update SME-related kernels 2025-08-19 05:13:28 -07:00
Martin Kroeker
05dbb54362 Delete misplaced file 2025-08-19 05:12:09 -07:00
Martin Kroeker
4609732e69 Relax version number requirement for AppleClang 2025-08-18 14:54:20 -07:00
Martin Kroeker
bf98e448eb Add VORTEXM4 to DYNAMIC_ARCH list 2025-08-18 14:43:08 -07:00
Martin Kroeker
0bc19a1335 Update SME kernel details 2025-08-18 14:38:16 -07:00
Martin Kroeker
426b5f23ed Add compiler options for VORTEXM4 2025-08-18 14:35:36 -07:00
Martin Kroeker
4328c91e27 relax requirements in compiler SME capability check 2025-08-18 14:34:51 -07:00
Martin Kroeker
c794d0a4ce Add VORTEXM4 2025-08-18 14:33:24 -07:00
Martin Kroeker
a4f5fec46e Add compiler options for VORTEXM4 2025-08-18 14:32:07 -07:00
Martin Kroeker
ca542f319f Add VORTEXM4 2025-08-18 08:41:38 -07:00
Martin Kroeker
18f9582f3e Add VORTEXM4 2025-08-18 01:54:09 -07:00
Martin Kroeker
4e2a8c18e5 Split VORTEXM4 from VORTEX target due to SME support 2025-08-18 01:53:04 -07:00
Martin Kroeker
30970460b8 Add VORTEXM4 target 2025-08-18 01:52:05 -07:00
Martin Kroeker
b0a00fbd62 Add minimal compiler flags for VORTEXM4 2025-08-18 01:51:10 -07:00
Martin Kroeker
ccfd0170fb Enable SME on MacOS and add VORTEXM4 to DYNAMIC_ARCH list 2025-08-18 01:50:13 -07:00
Martin Kroeker
ef0b883dff Add sgemm_direct_performant for ARM64 2025-08-18 01:48:08 -07:00
Martin Kroeker
e76c39099a Add sgemm_direct_performant for ARM64 2025-08-18 01:47:17 -07:00
Martin Kroeker
202a7a0e2a Separate VORTEXM4 from VORTEX and ARMV9SME 2025-08-18 01:45:40 -07:00
Martin Kroeker
de91afd2ae Move SGEMM_DIRECT after the CBLAS parameter check and add sgemm_direct_performant for ARM64 2025-08-18 01:44:21 -07:00
Martin Kroeker
0203657f40 Add sgemm_direct_performant for ARM64 2025-08-18 01:42:32 -07:00
Martin Kroeker
e82bcd2740 Update ARM64 sgemm_direct object generation 2025-08-18 01:41:13 -07:00
Martin Kroeker
731f4dd686 Add VORTEXM4 settings 2025-08-18 01:39:35 -07:00
Martin Kroeker
53d3bb50cc Get symbol name from build system; change b.first to b.mi for AppleClang compatibility 2025-08-18 01:37:50 -07:00
Martin Kroeker
08a00326a4 Build symbol name from build system variables 2025-08-18 01:35:41 -07:00
Martin Kroeker
89898fc499 Add sgemm_direct_performant for switching between direct and regular kernels 2025-08-18 01:31:40 -07:00
Martin Kroeker
22c6607db9 Use ASMNAME to get symbol name from build system; leave x18 unused as reserved on MacOS 2025-08-18 01:30:10 -07:00
Martin Kroeker
ca22e28ca1 Rename sgemm_direct_sme1.S to sgemm_direct_sme1_2VLx2VL.S 2025-08-18 01:25:44 -07:00
Martin Kroeker
9c43301b6d Merge pull request #5421 from reibax-marcus/develop
fix: broken cblas installation when using makefile based builds
2025-08-17 03:03:05 -07:00
Martin Kroeker
9d6df1dd3e Merge pull request #5422 from ChipKerchner/addRVVVectorizedPacking
Add and use vectorized packing in ZVL128B and ZVL256B for RISCV
2025-08-16 13:45:35 -07:00
Martin Kroeker
f3b2a15fad Merge pull request #5420 from yuanjia111/develop
Move the value assignment of vector x in gemv_n_sve.c to the outermos…
2025-08-16 12:06:53 -07:00
Chip Kerchner
64401b4417 Disable vectorized packing for DGEMM - since it is slower than scalar. 2025-08-13 13:41:12 +00:00
Martin Kroeker
5e43ba948c Merge pull request #5419 from Mousius/bgemm-optimisation
Optimize SBGEMM / BGEMM for NEOVERSEV1 further
2025-08-13 02:10:20 -07:00
Chip Kerchner
c00afc86a6 Add and use vectorized packing to ZVL128B and ZVL256B. Up to 3x+ faster than generic scalar functions. 2025-08-12 17:18:56 +00:00
Xabier Marquiegui
3a6b79c50f fix: broken cblas installation when using makefile based builds
Fix cblas.h missing from the target directory when NO_CBLAS is defined but
set to a value indicating that cblas should be built and installed.
2025-08-12 14:41:15 +02:00
yuanjia
803e8d4838 Move the value assignment of vector x in gemv_n_sve.c to the outermost loop to reduce the repeated data retrieval.
1. Verify correctness using BLAS-Tester.
2. Verify performance using the built-in benchmark: float and double performance improved by about 60% and 40%, respectively. The test commands are:
     export OMP_NUM_THREADS=1;numactl -C 10 -l ./sgemv.goto 3000 4000 100
     export OMP_NUM_THREADS=1;numactl -C 10 -l ./dgemv.goto 3000 4000 100
2025-08-12 18:03:16 +08:00
Chris Sidebottom
5f47b872f1 Remove older kernels for BGEMM on NEOVERSEV1 2025-08-11 09:25:19 +00:00
Chris Sidebottom
114316f361 Optimize SBGEMM / BGEMM for NEOVERSEV1 further
This changes the kernels to pack full SVE vectors and reduces the
overall complexity of the inner GEMM loop.
2025-08-11 09:25:13 +00:00
Martin Kroeker
75c6ab4036 CI: Update WoA job to use LLVM 20.1.8 and avoid stray preinstalled LLVM19 (#5411)
* Update to 20.1.8

* fix PATH to avoid the obsolete LLVM19 that appeared in the preinstalled msvc folder hierarchy
2025-08-09 12:28:24 +02:00
Martin Kroeker
5c5f852ee3 Merge pull request #5415 from martin-frbg/Fixum-5399
Fix compilation of the NeoverseN2 SBGEMM kernel
2025-08-04 04:29:26 -07:00
Martin Kroeker
f1ee61ea30 Include NEON header for the bfloat conversion functions 2025-08-04 00:21:39 -07:00
Martin Kroeker
b3ffd5524a Include NEON header for the bfloat conversion functions 2025-08-04 00:20:28 -07:00
Martin Kroeker
d23680b81d Merge pull request #5407 from nakagawa-fj/feature/gemm_divide_rate_for_neoversev1
Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1
2025-07-30 13:19:50 -07:00
Martin Kroeker
b4cc4be2ce Merge pull request #5410 from martin-frbg/issue5404
Adjust multithreading threshold in S/DGER and add an intermediate step
2025-07-30 12:16:05 -07:00
Martin Kroeker
0968dddf1a Merge pull request #5409 from martin-frbg/issue5372
Work around gcc15.1 on POWER misoptimizing DGEMV at -O3
2025-07-30 10:36:39 -07:00
Martin Kroeker
eddfe1e6b3 Merge pull request #5408 from ChipKerchner/fixRISCV64GEMVInitializationAndWarnings
Fix bad vector zero initializer and other compiler warnings for RISC-V.
2025-07-30 08:43:08 -07:00
Martin Kroeker
30d11bc92c Adjust multithreading threshold and add an intermediate step 2025-07-30 08:13:33 -07:00
Martin Kroeker
a3b9c933c5 mark xbuffer as volatile to work around gcc15.1 optimizer bug 2025-07-30 17:05:36 +02:00