-
Fenglei authored
* add cuda reduce * clang format * fix bugs * fix bug * add 1d reduce * clang format * fix bugs * unroll loop * remove debug info * revert tests * unroll 1D reduce op * add comments * using cudnn for nd to scalar reduction * remove cuda 1d reduction since cudnn version is faster * remove 1D kernel * fix bugs * 1d multi block size * remove debug * change kernel name * add reduce to scalar optimization, add test * fix bugs and tune parameters * clang format * update comments * update comments * update comments * clang format * update comments * remove wrong comments, apply clang format * resolve Bob's comment * clang format * pass shared mem size from cuLaunchKernel, set unroll loop size through host code * remove unused code.clang format * change reduce to thread with shfl for each warp first * add seed * unroll size
5f40d957