    nvgpu reduce to scalar optimization (#1491) · 5f40d957
    Fenglei authored
    * add cuda reduce
    
    * clang format
    
    * fix bugs
    
    * fix bug
    
    * add 1d reduce
    
    * clang format
    
    * fix bugs
    
    * unroll loop
    
    * remove debug info
    
    * revert tests
    
    * unroll 1D reduce op
    
    * add comments
    
    * using cudnn for nd to scalar reduction
    
    * remove cuda 1d reduction since cudnn version is faster
    
    * remove 1D kernel
    
    * fix bugs
    
    * 1d multi block size
    
    * remove debug
    
    * change kernel name
    
    * add reduce to scalar optimization, add test
    
    * fix bugs and tune parameters
    
    * clang format
    
    * update comments
    
    * update comments
    
    * update comments
    
    * clang format
    
    * update comments
    
    * remove wrong comments, apply clang format
    
    * resolve Bob's comment
    
    * clang format
    
    * pass shared mem size from cuLaunchKernel, set unroll loop size through host code
    
    * remove unused code, clang format
    
    * change reduce to use shfl within each warp first
    
    * add seed
    
    * unroll size
backend_test.in.cpp 365 KB