add gemm_batch, gemm_batch_strided, bgemm/bgemv and fp16 extensions

2026-05-31 00:45:48 +08:00 · 2026-03-22 22:34:27 +01:00
parent 1e48eca408
commit 496af0d8bb
1 changed files with 13 additions and 1 deletions
--- a/docs/extensions.md
+++ b/docs/extensions.md
@@ -13,7 +13,9 @@ This page documents those non-standard APIs.
 | ?omatcopy     | s,d,c,z       | out-of-place transposition/copying              |
 | ?geadd        | s,d,c,z       | ATLAS-like matrix add `B = &alpha;*A+&beta;*B`  |
 | ?gemmt        | s,d,c,z       | `gemm` but only a triangular part updated       |
-
+| cblas_?gemm_batch | s,d,c,z,b | `gemm` with several groups of input data |
+| cblas_?gemm_batch_strided | s,d,c,z,b | `gemm` with groups of data stored at fixed offsets in the input arrays |

 ## bfloat16 functionality

@@ -26,6 +28,15 @@ BLAS-like and conversion functions for `bfloat16` (available when OpenBLAS was c
 * `float cblas_sbdot` computes the dot product of two bfloat16 arrays
 * `void cblas_sbgemv` performs the matrix-vector operations of GEMV with the input matrix and X vector as bfloat16
 * `void cblas_sbgemm` performs the matrix-matrix operations of GEMM with both input arrays containing bfloat16
+* `void cblas_bgemv` performs the matrix-vector operations of GEMV with the input matrix, X vector and result as bfloat16
+* `void cblas_bgemm` performs the matrix-matrix operations of GEMM with both input arrays containing bfloat16 and the output being bfloat16 as well
+
+## half-precision float or fp16 functionality
+
+BLAS-like functions for `hfloat16` (available when OpenBLAS was compiled with `BUILD_HFLOAT16=1`):
+
+* `void cblas_shgemm` performs the matrix-matrix operations of GEMM with both input arrays containing hfloat16
+

 ## Utility functions

@@ -36,4 +47,5 @@ BLAS-like and conversion functions for `bfloat16` (available when OpenBLAS was c
 * `char * openblas_get_config()` returns the options OpenBLAS was built with, something like `NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell`
 * `int openblas_set_affinity(int thread_index, size_t cpusetsize, cpu_set_t *cpuset)` sets the CPU affinity mask of the given thread
  to the provided cpuset. Only available on Linux, with semantics identical to `pthread_setaffinity_np`.
+* `openblas_set_thread_callback_function` replaces the default multithreading backend with the callback function passed as its argument
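
The table row for `cblas_?gemm_batch_strided` describes groups of data "stored at fixed offsets in the input arrays". A minimal plain-C sketch of what that layout could mean is shown below. It is a reference loop for illustration only: the name `ref_dgemm_batch_strided`, its parameter order, and the restriction to row-major storage without transposition are assumptions made here, not the OpenBLAS signature or implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of OpenBLAS).
 * Row-major, no transposition. For each p in [0, batch):
 *   C_p = alpha * A_p * B_p + beta * C_p
 * where A_p starts at a + p * stride_a, B_p at b + p * stride_b,
 * and C_p at c + p * stride_c. */
static void ref_dgemm_batch_strided(size_t m, size_t n, size_t k,
                                    double alpha,
                                    const double *a, size_t lda, size_t stride_a,
                                    const double *b, size_t ldb, size_t stride_b,
                                    double beta,
                                    double *c, size_t ldc, size_t stride_c,
                                    size_t batch)
{
    /* Leading dimensions must cover one full row of each matrix. */
    assert(lda >= k && ldb >= n && ldc >= n);

    for (size_t p = 0; p < batch; ++p) {
        /* Locate the p-th matrices at their fixed offsets. */
        const double *ap = a + p * stride_a;
        const double *bp = b + p * stride_b;
        double *cp = c + p * stride_c;

        for (size_t i = 0; i < m; ++i) {
            for (size_t j = 0; j < n; ++j) {
                double acc = 0.0;
                for (size_t l = 0; l < k; ++l)
                    acc += ap[i * lda + l] * bp[l * ldb + j];
                cp[i * ldc + j] = alpha * acc + beta * cp[i * ldc + j];
            }
        }
    }
}
```

With two 2x2 problems packed back to back (a stride of 4 elements for each of `a`, `b` and `c`), a single call updates both result matrices. The non-strided `cblas_?gemm_batch` variant instead takes the inputs in several groups, as the table states.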