    core: vectorize dotProd_32s · 33fb253a
    Paul E. Murphy authored
    Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
    x86 this showed about 1.4x improvement.
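    A minimal scalar sketch of the "4x FMA chains" idea, assuming the goal is to keep four independent accumulators so consecutive fused multiply-adds do not wait on each other. The function name `dot_prod_32s_fma4` is hypothetical and this is not the code from the commit, which targets 128-bit SIMD registers:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only: four independent FMA accumulator chains.
// A SIMD version would hold each chain in a 128-bit register (2 x FP64).
double dot_prod_32s_fma4(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
    std::size_t i = 0;

    // Each int32 converts to double exactly, so every product is exact
    // inside the FMA and rounding happens once per accumulate.
    for (; i + 4 <= n; i += 4) {
        acc0 = std::fma(static_cast<double>(a[i]),     static_cast<double>(b[i]),     acc0);
        acc1 = std::fma(static_cast<double>(a[i + 1]), static_cast<double>(b[i + 1]), acc1);
        acc2 = std::fma(static_cast<double>(a[i + 2]), static_cast<double>(b[i + 2]), acc2);
        acc3 = std::fma(static_cast<double>(a[i + 3]), static_cast<double>(b[i + 3]), acc3);
    }

    // Combine the four chains, then handle the remaining 0..3 elements.
    double sum = (acc0 + acc1) + (acc2 + acc3);
    for (; i < n; ++i)
        sum = std::fma(static_cast<double>(a[i]), static_cast<double>(b[i]), sum);
    return sum;
}
```

    Splitting the sum across several chains changes the order of the additions, so results can differ in the last bits from a strictly sequential loop.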
    
    For PPC, do a full 32x32->64-bit integer multiply, convert the
    product to double precision, then accumulate. This may be slightly
    less precise for some inputs, but it is about 1.5x faster than the
    FMA approach above, which is itself about 1.5x faster than before,
    for a total speedup of about 2.5x.
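    A rough scalar sketch of the multiply-then-convert approach, assuming it works per element as described: an exact 64-bit integer product, a conversion to double, then a plain add. The function name `dot_prod_32s_mul64` is hypothetical and this is not the PPC vector code from the commit:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only: full 32x32->64-bit multiply, convert, accumulate.
double dot_prod_32s_mul64(const std::int32_t* a, const std::int32_t* b, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // The product of two int32 values always fits in int64, so this is exact.
        const std::int64_t prod =
            static_cast<std::int64_t>(a[i]) * static_cast<std::int64_t>(b[i]);

        // Converting to double rounds when |prod| exceeds 2^53, and the add
        // rounds again. An FMA on double inputs rounds only once, which is
        // one way the precision difference can arise.
        sum += static_cast<double>(prod);
    }
    return sum;
}
```

    For inputs whose products stay below 2^53 in magnitude the conversion is exact, so any difference from the FMA version comes only from the accumulation.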
Name
.github
3rdparty
apps
cmake
data
doc
include
modules
platforms
samples
.editorconfig
.gitattributes
.gitignore
CMakeLists.txt
CONTRIBUTING.md
LICENSE
README.md