    I422ToUYVYRow_AVX2 use vpmovzxbd instead of vpermq · 5790a765
    Frank Barchard authored
    I422ToUYVYRow_AVX2 optimized from 7 cycles per 32 pixels to 4.6 cycles.
    Instead of 2 vpermq and vpunpcklbw:
    vmovdqu    (%1),%%xmm2
    vmovdqu    0x00(%1,%2,1),%%xmm3
    vpermq     $0xd8,%%ymm2,%%ymm2
    vpermq     $0xd8,%%ymm3,%%ymm3
    vpunpcklbw %%ymm3,%%ymm2,%%ymm2
    
    ...use vpmovzxbd to zero-extend each byte to a 32-bit lane, then vpslld and vpor, so that U and V each end up in the low byte of a 16-bit short:
    vpmovzxbd  (%1),%%ymm2
    vpmovzxbd  0x00(%1,%2,1),%%ymm3
    vpslld     $0x10,%%ymm3,%%ymm3
    vpor       %%ymm3,%%ymm2,%%ymm2
    which reduces the port 5 bottleneck by 1 cycle.
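
The four-instruction sequence above can be modeled in scalar C to make the resulting byte layout explicit. This is only an illustrative sketch of the instructions exactly as quoted in this message, assuming little-endian byte order; the function name `ZeroExtendShiftOr` is hypothetical and is not part of libyuv.

```c
#include <stdint.h>

// Hypothetical scalar model (not libyuv code) of the quoted sequence:
//   vpmovzxbd (u)       -> each U byte zero-extended into a 32-bit lane
//   vpmovzxbd (v)       -> each V byte zero-extended into a 32-bit lane
//   vpslld    $0x10     -> V lanes shifted left by 16 bits
//   vpor                -> U and V lanes combined
// One 256-bit ymm register holds 8 such lanes, so n would be 8 per iteration.
static void ZeroExtendShiftOr(const uint8_t* u, const uint8_t* v,
                              uint32_t* lanes, int n) {
  for (int i = 0; i < n; ++i) {
    uint32_t u_lane = u[i];                  // zero-extend: upper 24 bits are 0
    uint32_t v_lane = (uint32_t)v[i] << 16;  // zero-extend, then shift into bits 16..23
    lanes[i] = u_lane | v_lane;              // combine into one 32-bit lane
  }
}
```

For example, U = 0x10 and V = 0x80 give the lane value 0x00800010, which is stored in memory as the bytes 10 00 80 00. Neither step is a shuffle, which is how this form avoids the vpermq and vpunpcklbw work the message attributes to port 5.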
    
    Bug: libyuv:556
    Test: out/Release/libyuv_unittest --gtest_filter=*I42?To*UY*Opt
    
    Change-Id: I53799e53cc6b090a1a695c839094c193be3eecaf
    Reviewed-on: https://chromium-review.googlesource.com/899873
    Commit-Queue: Frank Barchard <fbarchard@chromium.org>
    Reviewed-by: richard winterton <rrwinterton@gmail.com>
    Reviewed-by: Cheng Wang <wangcheng@google.com>