mirror of
https://github.com/OpenMathLib/OpenBLAS
synced 2026-05-31 00:45:48 +08:00
for skylake kernels. This is the same method as used in [sd]asum. _mm_set1_epi64x was commented out for zasum, but has the advantage of avoiding possible undefined behaviour (using an uninitialized variable), optimized out by NVHPC and icx. The new code works fine with those compilers. For GCC 12.3 the generated code is identical; no matter what method you use, the compiler optimizes the code into a compile-time constant, there is no performance benefit using mm_cmpeq_epi8 since the corresponding instruction (VPCMPEQB) isn't actually generated!