    I422ToYUY2Row_AVX2 use vpmovzxbd instead of vpermq · 7ff53f32
    Frank Barchard authored
    I422ToYUY2Row_AVX2 optimized from 7 cycles per 32 pixels to 6 cycles.
    Instead of 2 vpermq and vpunpcklbw:
    vmovdqu    (%1),%%xmm2
    vmovdqu    0x00(%1,%2,1),%%xmm3
    lea        0x10(%1),%1
    vpermq     $0xd8,%%ymm2,%%ymm2
    vpermq     $0xd8,%%ymm3,%%ymm3
    vpunpcklbw %%ymm3,%%ymm2,%%ymm2
    
    use vpmovzxbd to zero-extend the bytes to dwords, then vpslld and vpor:
    vpmovzxbd  (%1),%%ymm2
    vpmovzxbd  0x00(%1,%2,1),%%ymm3
    vpslld     $0x10,%%ymm3,%%ymm3
    vpor       %%ymm3,%%ymm2,%%ymm2
    which reduces the port 5 bottleneck by 1 cycle.
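
A scalar sketch of what the four-instruction sequence above computes per element, taken literally from the listing. The function and parameter names (`widen_shift_or`, `u`, `v`, `out`) are hypothetical, and it is an assumption that `(%1)` addresses the U row and `0x00(%1,%2,1)` the V row; this models the arithmetic only, not the surrounding row function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the vpmovzxbd / vpslld / vpor sequence.
 * u, v: n source bytes each (assumed to be the U and V rows).
 * out:  n 32-bit results, one per (u[i], v[i]) pair. */
static void widen_shift_or(const uint8_t *u, const uint8_t *v,
                           uint32_t *out, size_t n) {
  assert(u != NULL && v != NULL && out != NULL);
  for (size_t i = 0; i < n; ++i) {
    uint32_t lo = (uint32_t)u[i];        /* vpmovzxbd: zero-extend byte to dword */
    uint32_t hi = (uint32_t)v[i] << 16;  /* vpmovzxbd, then vpslld $0x10 */
    out[i] = lo | hi;                    /* vpor: merge both into one dword */
  }
}
```

With n = 8 this corresponds to one ymm register, since vpmovzxbd with a ymm destination reads 8 source bytes. Neither the widening loads nor the shift and OR are shuffle instructions, which is consistent with the reduced port 5 pressure described above.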
    
    Bug: libyuv:556
    Test: out/Release/libyuv_unittest --gtest_filter=*I42?To*UY*Opt
    
    I422ToYUY2Row_AVX2 optimization
    
    Improve performance of AVX2 code by avoiding vpermq
    
    Bug: libyuv:556
    Test: /usr/local/google/home/fbarchard/iaca-lin64/bin/iaca.sh -reduceout -arch BDW out/Release/obj/libyuv_internal/row_gcc.o
    Change-Id: Ie36732da23ecea1ffcc6b297bacc962780b59ef1
    Reviewed-on: https://chromium-review.googlesource.com/898067
    Commit-Queue: Frank Barchard <fbarchard@chromium.org>
    Reviewed-by: richard winterton <rrwinterton@gmail.com>
row_gcc.cc 278 KB