    core: vectorize dotProd_32s · 33fb253a
    Paul E. Murphy authored
    Use 4x FMA chains to sum on SIMD 128 FP64 targets. On
    x86 this showed about 1.4x improvement.
    
    For PPC, do a full multiply (32x32->64b), convert to DP
    then accumulate. This may be slightly less precise for
    some inputs, but it is 1.5x faster than the above, which
    is about 1.5x faster than the FMA above, for a ~2.5x speedup.
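
The two accumulation strategies named in the commit message can be sketched in portable scalar C++. This is only an illustration under stated assumptions, not the code from the commit: the function names `dot_fma4` and `dot_mul64` are hypothetical, and the real change presumably uses SIMD intrinsics rather than scalar loops. The first function keeps four independent FMA chains so consecutive fused multiply-adds do not wait on each other. The second performs the full 32x32->64-bit integer multiply, converts the product to double, then accumulates.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch (not the commit's code): four independent FMA chains.
// Each chain accumulates every fourth product, so the chains carry no
// dependency on one another inside the loop.
double dot_fma4(const int32_t* a, const int32_t* b, std::size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 = std::fma(static_cast<double>(a[i]),     static_cast<double>(b[i]),     s0);
        s1 = std::fma(static_cast<double>(a[i + 1]), static_cast<double>(b[i + 1]), s1);
        s2 = std::fma(static_cast<double>(a[i + 2]), static_cast<double>(b[i + 2]), s2);
        s3 = std::fma(static_cast<double>(a[i + 3]), static_cast<double>(b[i + 3]), s3);
    }
    // Tail: fold the remaining 0..3 elements into the first chain.
    for (; i < n; ++i)
        s0 = std::fma(static_cast<double>(a[i]), static_cast<double>(b[i]), s0);
    return (s0 + s1) + (s2 + s3);
}

// Hypothetical sketch (not the commit's code): full 32x32->64-bit integer
// multiply, convert the product to double, then accumulate.
double dot_mul64(const int32_t* a, const int32_t* b, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const int64_t prod = static_cast<int64_t>(a[i]) * static_cast<int64_t>(b[i]);
        sum += static_cast<double>(prod);  // rounds on conversion, then again on the add
    }
    return sum;
}
```

One plausible reading of the "slightly less precise" remark: a product of two 32-bit values can need up to 62 bits, so converting it to double may round before the addition rounds again, whereas a fused multiply-add rounds only once.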
Name
calib3d
core
cudaarithm
cudabgsegm
cudacodec
cudafeatures2d
cudafilters
cudaimgproc
cudalegacy
cudaobjdetect
cudaoptflow
cudastereo
cudawarping
cudev
dnn
features2d
flann
highgui
imgcodecs
imgproc
java
js
ml
objdetect
photo
python
shape
stitching
superres
ts
video
videoio
videostab
viz
world
CMakeLists.txt