    Gaussian reorder for benefit of A73 · f0a9d6d2
    Frank Barchard authored
    Roughly, instead of 4 loads and 8 multiplies, use 1 load and 2 multiplies,
    4 times over.  The original code, like the C code generated by clang and
    gcc, did all the loads, then all the math, then the store.  The new code
    does a load, then the math, then the next load, etc.
    This schedules better on current ARM64 CPUs.
    The number of registers is also reduced by reusing the same registers.
    
    HiSilicon ARM A73:
    
    Now
    TestGaussRow_Opt (890 ms)
    TestGaussCol_Opt (571 ms)
    
    Was
    TestGaussRow_Opt (1061 ms)
    TestGaussCol_Opt (595 ms)
    
    Qualcomm 821 (Pixel):
    
    Now
    TestGaussRow_Opt (571 ms)
    TestGaussCol_Opt (474 ms)
    
    Was
    TestGaussRow_Opt (751 ms)
    TestGaussCol_Opt (520 ms)
    
    TBR=kjellander@chromium.org
    BUG=libyuv:719
    TEST=LibYUVPlanarTest.TestGaussRow_Opt
    
    Reviewed-on: https://chromium-review.googlesource.com/627478
    Reviewed-by: Cheng Wang <wangcheng@google.com>
    Reviewed-by: Frank Barchard <fbarchard@google.com>
    Change-Id: I5ec81191d460801f0d4a89f0384f89925ff036de
    Reviewed-on: https://chromium-review.googlesource.com/634448
    Commit-Queue: Frank Barchard <fbarchard@google.com>
row_neon64.cc 129 KB