Mirror of https://github.com/OpenMathLib/OpenBLAS (synced 2026-06-05 00:17:12 +08:00)
The new MMA instructions need bfloat16 inputs in 4x2 order, so change the format produced by the copy/packing code to match. This avoids permute instructions in the gemm kernel inner loop.
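
To illustrate the idea, here is a minimal sketch of a packing routine that emits 4x2 tiles. It is not the code from this change: the function name pack_4x2_bf16, the bf16_t typedef, the row-major input, and the reading of "4x2" as 4 rows times 2 consecutive k values per tile are all assumptions made for illustration. It also assumes m is a multiple of 4 and k is a multiple of 2, so no edge handling is shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: raw bfloat16 bit patterns stored as 16-bit integers. */
typedef uint16_t bf16_t;

/*
 * Pack a row-major m x k matrix `a` (leading dimension lda) into `out`
 * as a sequence of 4x2 tiles. Each tile holds 4 rows and 2 consecutive
 * k values, stored as r0k0 r0k1 r1k0 r1k1 r2k0 r2k1 r3k0 r3k1.
 * Assumption: m % 4 == 0 and k % 2 == 0.
 */
static void pack_4x2_bf16(const bf16_t *a, size_t lda,
                          size_t m, size_t k, bf16_t *out)
{
    for (size_t i = 0; i < m; i += 4) {          /* panel of 4 rows */
        for (size_t p = 0; p < k; p += 2) {      /* pair of k values */
            for (size_t r = 0; r < 4; r++) {     /* row within the panel */
                const bf16_t *src = a + (i + r) * lda + p;
                *out++ = src[0];
                *out++ = src[1];
            }
        }
    }
}
```

With the data already laid out this way, a kernel can load each 8-element tile with one contiguous vector load and pass it straight to the MMA instruction, which is why the reordering cost moves out of the inner loop and into the one-time packing step.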