modules/core/src/matmul.simd.hpp · 048ddbf9ee9fac8d2560d5003d7c0b2bcc811f49 · submodule / opencv

Paul E. Murphy authored Aug 20, 2019

Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
x86 this showed about a 1.4x improvement.
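
To illustrate the idea of several independent FMA chains, here is a minimal scalar sketch. It is not the code from this commit: the function name `dot_i32_fma4` is hypothetical, the 32-bit integer input type is an assumption, and plain `double` accumulators stand in for the 2-lane FP64 vectors a 128-bit SIMD target would use. Keeping four accumulators means each FMA does not have to wait on the result of the previous one.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not OpenCV's implementation: dot product of two
// int32 arrays, accumulated in double with four independent FMA chains.
double dot_i32_fma4(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    // Four separate accumulators, so consecutive FMAs have no data dependency.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 = std::fma(static_cast<double>(a[i]),     static_cast<double>(b[i]),     s0);
        s1 = std::fma(static_cast<double>(a[i + 1]), static_cast<double>(b[i + 1]), s1);
        s2 = std::fma(static_cast<double>(a[i + 2]), static_cast<double>(b[i + 2]), s2);
        s3 = std::fma(static_cast<double>(a[i + 3]), static_cast<double>(b[i + 3]), s3);
    }

    // Combine the chains once, after the main loop.
    double sum = (s0 + s1) + (s2 + s3);

    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i)
        sum = std::fma(static_cast<double>(a[i]), static_cast<double>(b[i]), sum);

    return sum;
}
```

Calling `dot_i32_fma4(a, b, n)` on two arrays of length `n` returns their dot product as a `double`. The summation order differs from a single-accumulator loop, so results can differ in the last bits once intermediate values are no longer exactly representable.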

For PPC, do a full multiply (32x32->64b), convert to DP,
then accumulate. This may be slightly less precise for
some inputs, but it is 1.5x faster than the above, which
is about 1.5x faster than the FMA above, for a ~2.5x speedup.
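
The multiply, convert, accumulate sequence can be sketched in scalar form as follows. This is an illustration only, not the PPC vector code from the commit; the function name `dot_i32_widen` is hypothetical. It also suggests one plausible reason for the precision note: a 32x32 product can need up to 62 bits, while a double has a 53-bit significand, so the conversion may round before the addition rounds again.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not OpenCV's implementation: dot product of two
// int32 arrays using an exact 64-bit integer product per element.
double dot_i32_widen(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
    {
        // Full multiply: 32x32 -> 64 bits, exact in integer arithmetic.
        const std::int64_t p = static_cast<std::int64_t>(a[i]) * b[i];

        // Convert to double precision (may round when |p| exceeds 2^53),
        // then accumulate (a second rounding step).
        sum += static_cast<double>(p);
    }
    return sum;
}
```

For small inputs the result is exact; for example, the arrays `{1, 2, 3}` and `{4, 5, 6}` give 32.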

33fb253a

matmul.simd.hpp 90.4 KB