189 Commits

Author SHA1 Message Date
yuanjia
e955736005 Fix: Remove invalid parentheses after endif 2026-02-04 09:57:48 +08:00
Chris Sidebottom
2c3cdaf74e Optimized BGEMV for NEOVERSEV1 target
- Adds bgemv T based off of sbgemv T kernel
- Adds bgemv N which is slightly alterated to not use Y as an
accumulator due to the output being bf16 which results in loss of
precision
- Enables BGEMM_GEMV_FORWARD to proxy BGEMM to BGEMV with new kernels
2025-07-23 10:51:41 +01:00
Chris Sidebottom
740efd71c4 Add optimized BGEMM kernel for NEOVERSEV1 target
This also improves the testing and generic kernel by re-using the BF16
conversion functions.

Built on top of https://github.com/OpenMathLib/OpenBLAS/pull/5357 and derived from https://github.com/OpenMathLib/OpenBLAS/pull/5287

Co-authored-by: Ye Tao <ye.tao@arm.com>
2025-07-10 23:23:27 +00:00
Srangrang
9f13b2c6ac style: modify HALF to BFLOAT16 in benchmark folder 2025-06-15 20:57:05 +08:00
gkdddd
670ec6f757 Added shgemm_kernel_8x8 for RISCV64_ZVL128B and shgemm_kernel_16x8 for RISCV64_ZVL256B
Added HFLOAT16 support for RISCV64
Added shgemm_kernel_8x8 for RISCV64_ZVL128B and shgemm_kernel_16x8 for RISCV64_ZVL256B based on HFLOAT16
The instruction sets used are ZVFH and ZFH, which need to be supported by RVV1.0

Related to issue #5279
Co-authored-by Linjin Li <linjin_li@163.com>
2025-06-03 20:14:30 +08:00
Martin Kroeker
0f8ff82592 Add build notes for Windows and flang from gh Discussion 5008 2024-12-06 01:35:42 -08:00
gxw
ffaa5765a4 Bench: Add omatcopy 2024-10-18 11:07:52 +08:00
Martin Kroeker
4460d3ee7f re-enable the sgesdd benchmark 2024-07-26 15:07:52 +02:00
Evgeni Burovski
cd3c167c28 ignore sgesdd failure on codspeed
In https://github.com/OpenMathLib/OpenBLAS/issues/4776
we're hitting
** On entry to SLASCL parameter number  4 had an illegal value

on codspeed, but not outside (either locally or on github runners)
2024-07-03 12:35:26 +03:00
Evgeni Burovski
28fb95d0be BENCH: actually add gemv/gbmv f2py wrappers 2024-06-27 12:38:47 +03:00
Evgeni Burovski
11a0c56166 BENCH: add BLAS level 2 gemv and gbmv 2024-06-27 11:14:22 +03:00
Evgeni Burovski
400cf9f63d restore the problem sizes for codspeed benchmarks 2024-06-24 16:47:20 +03:00
Evgeni Burovski
37a854718b BENCH: sync codspeed-benchmarks with BLAS-benchmarks 2024-06-24 14:33:06 +03:00
Evgeni Burovski
81cf0db047 DOC: add a readme for benchmarks/pybench 2024-05-18 15:30:00 +03:00
Evgeni Burovski
9f28161837 BENCH: add benchmarks using codspeed.io 2024-05-18 15:25:16 +03:00
Martin Kroeker
3f1ec74fe7 Fix OPENBLAS_LOOPS assignment 2024-03-20 19:22:48 +01:00
Martin Kroeker
fe39c891a6 Fix OPENBLAS_LOOPS assignment 2024-03-20 19:21:37 +01:00
Martin Kroeker
ffcbaca167 Fix OPENBLAS_LOOPS assignment 2024-03-20 19:20:16 +01:00
Martin Kroeker
05d0438c25 Fix OPENBLAS_LOOPS assignment 2024-03-20 19:19:11 +01:00
Sergei Lewis
3ffd6868d7 Merge branch 'develop' into dev/slewis/merge-from-riscv 2024-02-01 11:29:41 +00:00
gxw
3d4dfd0085 Benchmark: Rename the executable file names for {sc/dz}a{min/max}
No interface named {c/z}a{min/max}, keeping it would
cause ambiguity
2024-01-30 11:33:01 +08:00
Sergei Lewis
1093def0d1 Merge branch 'risc-v' into develop 2024-01-29 11:11:39 +00:00
Shiyou Yin
f745f02f35 benchmark: Fix missing colons in outputs of ./strsv.goto 2023-11-24 14:55:18 +08:00
martin-frbg
fec4867748 Fix file permissions (issue 4095) 2023-07-23 20:31:55 +02:00
Chris Sidebottom
ec334e69dc Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1
This re-spins #3869 with some additional copy unrolling which helps maintain SYRK performance.

After #3868, the SVE kernels represent a pretty good boost.

This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).
2023-04-17 17:38:42 +01:00
Xianyi Zhang
e5313f53d5 Merge branch 'develop' of https://github.com/HellerZheng/OpenBLAS_riscv_x280 into HellerZheng-develop 2022-12-03 12:00:52 +08:00
Bart Oldeman
bae45d94d1 scal benchmark: eliminate y, move init/timing out of loop
Removing y avoids cache effects (if y is the size of the L1 cache, the
main array x is removed from it).
Moving init and timing out of the loop makes the scal benchmark behave like
the gemm benchmark, and allows higher accuracy for smaller test cases since
the loop overhead is much smaller than the timing overhead.

Example:
OPENBLAS_LOOPS=10000 ./dscal.goto 1024 8192 1024
on AMD Zen2 (7532) with 32k (4k doubles) L1 cache per core.

Before
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     5627.08 MFlops   0.000000 sec
   2048 :     5907.34 MFlops   0.000000 sec
   3072 :     5553.30 MFlops   0.000001 sec
   4096 :     5446.38 MFlops   0.000001 sec
   5120 :     5504.61 MFlops   0.000001 sec
   6144 :     5501.80 MFlops   0.000001 sec
   7168 :     5547.43 MFlops   0.000001 sec
   8192 :     5548.46 MFlops   0.000001 sec

After
From : 1024  To : 8192 Step = 1024 Inc_x = 1 Inc_y = 1 Loops = 10000
   SIZE       Flops
   1024 :     6310.28 MFlops   0.000000 sec
   2048 :     6396.29 MFlops   0.000000 sec
   3072 :     6439.14 MFlops   0.000000 sec
   4096 :     6327.14 MFlops   0.000001 sec
   5120 :     5628.24 MFlops   0.000001 sec
   6144 :     5616.41 MFlops   0.000001 sec
   7168 :     5553.13 MFlops   0.000001 sec
   8192 :     5600.88 MFlops   0.000001 sec

We can see the L1->L2 switchover point is now where it should be, and the
number of flops for L1 is more accurate.
2022-11-29 08:02:45 -05:00
HellerZheng
943372bdf5 Merge branch 'develop' into develop 2022-11-18 10:12:46 +08:00
Martin Kroeker
f92dd6e303 change line endings from CRLF to LF 2022-11-17 10:18:36 +01:00
Heller Zheng
bef47917bd Initial version for riscv sifive x280 2022-11-15 00:06:25 -08:00
Bart Oldeman
9e6b060bf3 Fix comment.
It stores the pointer, not an offset (that would be an alternative approach).
2022-10-20 20:11:09 -04:00
Bart Oldeman
9959a60873 Benchmarks: align malloc'ed buffers.
Benchmarks should allocate with cacheline (often 64 bytes) alignment
to avoid unreliable timings. This technique, storing the offset in the
byte before the pointer, doesn't require C11's aligned_alloc for
compatibility with older compilers.

For example, Glibc's x86_64 malloc returns 16-byte aligned buffers, which is
not sufficient for AVX/AVX2 (32-byte preferred) or AVX512 (64-byte).
2022-10-20 13:28:20 -04:00
Marius Hillenbrand
f119e26354 Fix flipped indices in benchmark for gemv
Fixes #3439
2021-11-03 12:45:09 +01:00
Martin Kroeker
14e33e0f7e Handle OPENBLAS_LOOPS in SYR2 benchmark 2021-07-10 21:27:53 +02:00
Martin Kroeker
4ed99c2ce3 Merge pull request #3292 from martin-frbg/syrk_limit
Add lower limit for multithreading in xSYRK
2021-07-07 20:46:28 +02:00
Martin Kroeker
a4543e4918 Handle OPENBLAS_LOOP 2021-07-04 16:59:43 +02:00
Martin Kroeker
dcfc5cf714 Handle OPENBLAS_LOOPS for more stable results 2021-07-01 17:39:37 +02:00
Martin Kroeker
06e3b07ecb Handle OPENBLAS_LOOPS and OPENBLAS_TEST options 2021-07-01 17:38:45 +02:00
Martin Kroeker
1f8bda71b9 Add OPENBLAS_LOOPS support to potrf/potrs/potri benchmark 2021-06-26 23:46:00 +02:00
Martin Kroeker
d57c681a6d Fix compilation on older OSX versions 2021-03-26 22:29:29 +01:00
Martin Kroeker
38dcf3454b Support timing Apple M1 2021-03-02 17:50:55 +01:00
Qiyu8
f917c26e83 Refractoring remaining benchmark cases. 2020-10-26 10:25:05 +08:00
Qiyu8
dd6ebdfdab Refactor the performance measurement system 2020-10-23 10:32:03 +08:00
Martin Kroeker
7ae9e8960e Change "HALF" and "sh" to "BFLOAT16" and "sb" 2020-10-12 00:08:29 +02:00
Martin Kroeker
5464eb13ea Change ifdef linux to __linux for C11 compatibility 2020-09-30 22:59:41 +02:00
Martin Kroeker
6f8fad87c5 Use POSIX2001 clock.gettime for higher resolution 2020-09-05 19:44:01 +02:00
Martin Kroeker
ced49466f0 Use the fortran compiler to link LAPACK-related benchmarks
to fix linking problems with (at least) the AMD version of flang that creates dependencies on more than just the fortran runtime.
2020-05-29 13:35:51 +02:00
Martin Kroeker
6e270f91ec add support for RETURN_BY_STACK semantics, e.g. clang 2020-05-29 13:29:10 +02:00
Rajalakshmi Srinivasaraghavan
ce90e2bd3f Include shgemm in benchtest
This patch is to enable benchtest for half precision gemm
when BUILD_HALF is set during make.
2020-05-11 09:57:46 -05:00
l00536773
6b7ef6543a [OpenBLAS]: benchmark error of potrf
[description]: when the matrix size goes higher than 5800 during the cpotrf test, error info, such as "Potrf info = 5679", will be returned on ARM64 and x86 machines. Uplo = L & F.
[solution]: changed the func for building the matrix so that the complex Hermitian matrix can stay positive definite during the computation.
[dts]:
2020-04-16 10:55:10 +08:00