-
Paul E. Murphy authored
Use 4x FMA chains to sum on SIMD 128 FP64 targets. On x86 this showed about 1.4x improvement. For PPC, do a full multiply (32x32->64b), convert to DP then accumulate. This may be slightly less precise for some inputs. But is 1.5x faster than the above which is about 1.5x than the FMA above for ~2.5x speedup.
33fb253a
Name |
Last commit
|
Last update |
---|---|---|
.github | ||
3rdparty | ||
apps | ||
cmake | ||
data | ||
doc | ||
include | ||
modules | ||
platforms | ||
samples | ||
.editorconfig | ||
.gitattributes | ||
.gitignore | ||
CMakeLists.txt | ||
CONTRIBUTING.md | ||
LICENSE | ||
README.md |