    core: vectorize dotProd_32s · 33fb253a
    Paul E. Murphy authored
    Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
    x86 this showed about a 1.4x improvement.
    
    For PPC, do a full multiply (32x32->64b), convert to DP,
    then accumulate. This may be slightly less precise for
    some inputs, but it is 1.5x faster than the FMA approach
    above, which is itself about 1.5x faster, for a ~2.5x
    overall speedup.
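
The two accumulation strategies described above can be sketched in portable scalar C++. This is an illustration only, not the code from this commit: the function names `dot_fma4` and `dot_widemul` are hypothetical, and plain `double` variables stand in for SIMD lanes. The first function keeps four independent FMA chains so consecutive fused multiply-adds do not wait on each other. The second forms the exact 64-bit integer product, converts it to double, then adds, which can round twice per element (once on conversion, once on the add) where the FMA rounds once. That is one plausible source of the small precision difference the message mentions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: four independent FMA chains over int32 inputs.
// Every int32 converts to double exactly, and each fma rounds once.
double dot_fma4(const int32_t* a, const int32_t* b, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 = std::fma(double(a[i]),     double(b[i]),     s0);
        s1 = std::fma(double(a[i + 1]), double(b[i + 1]), s1);
        s2 = std::fma(double(a[i + 2]), double(b[i + 2]), s2);
        s3 = std::fma(double(a[i + 3]), double(b[i + 3]), s3);
    }
    // Tail elements that do not fill a group of four.
    for (; i < n; ++i)
        s0 = std::fma(double(a[i]), double(b[i]), s0);
    // Combine the partial sums pairwise.
    return (s0 + s1) + (s2 + s3);
}

// Hypothetical sketch: full 32x32->64-bit multiply, convert to double,
// then accumulate. The integer product is exact; the conversion may
// round when the product magnitude exceeds 2^53.
double dot_widemul(const int32_t* a, const int32_t* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const int64_t p = int64_t(a[i]) * int64_t(b[i]);
        sum += double(p);
    }
    return sum;
}
```

For small inputs both functions return the same exact value. They can differ in the last bits only when products or running sums exceed the 53-bit significand of a double.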
matmul.simd.hpp 90.4 KB