mirror of
https://github.com/OpenMathLib/OpenBLAS
synced 2026-05-31 00:45:48 +08:00
As the new MMA instructions need the inputs in 4x2 order for bfloat16, changing the format in copy/packing code. This avoids permute instructions in the gemm kernel inner loop.