nvgpu cuda softmax optimization (#2101)
* add some helper functions
* update with new helper function
* update reduce to nd with new helper function
* update float sum to stable sum (see the compensated-sum sketch after this list)
* fix bug
* update all reduce to stable sum for float
* fix bug and pass the stable sum test
* remove debug info
* style
* update with shape
* fix bug
* add host parameters to cuda_emitter
* clang format
* fix bugs
* add element::type support
* format
* add a cached value with datatype name
* add init_reduce_value
* unroll loop
* optimization
* remove the need for init_value
* add memset kernel
* add memcpy
* working version
* remove debug info
* add comments, clean up code.
* change in_idx to input_idx
* fix bug
* change args name for memset in emitter
* pass element::Type instead of string
* op::reduce comes with an init value, add support for it
* resolve codacy-bot comment
* fix bug
* resolve codacy-bot comment
* add soft_max_block_reduce kernel (see the block-reduce softmax sketch after this list)
* fix bugs
* add softmax_block_reduce to cuda_emitter
* compiling ok, result wrong
* fix bug in kernel
* working version
* remove unused code
* remove unused comments, resolve comments
* cuda reduce for max, min, mul; reduce op init value; format
* use type::info
* use type info for numeric_limits
* remove code from gpu_host_parameters
* header
* remove outdated comments
* add helper to check if stable sum is needed
* add stable sum test for double
* remove extra line
* consolidate helper functions
* no need for the list now
* remove extra ;
* clang format
* style
* add test skips on the CPU and IntelGPU side
* resolve more conflicts
* update comment
* fix a warning
* Update src/ngraph/runtime/gpu/gpu_cuda_kernel_builder.cpp: using load.
  Co-Authored-By: fengleitian <35274053+fengleitian@users.noreply.github.com>
* using WARPSIZE instead of 32, using a lambda (see the warp-reduce sketch after this list)
* more WARPSIZE instead of 32
* fix block_size_x bug
* using __expf
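The float reductions above were switched to a "stable sum". A minimal sketch of compensated (Kahan) summation follows, assuming that is the stable-sum scheme the emitted kernels use; the function name and signature are illustrative only, since the real code is generated as a string by the emitter.

```cuda
// Illustrative device helper: compensated (Kahan) summation.
// Assumption: this is the kind of "stable sum" used for float reductions;
// it is not the generated kernel code itself.
__device__ float stable_sum(const float* in, int n)
{
    float sum = 0.0f;
    float c = 0.0f; // running compensation for lost low-order bits
    for (int i = 0; i < n; ++i)
    {
        float y = in[i] - c;
        float t = sum + y;
        c = (t - sum) - y; // recover the part of y that was rounded away
        sum = t;
    }
    return sum;
}
```

The compensation term keeps the accumulated rounding error bounded even when many small values are added to a large running sum, which is why only the float path needs it and why a separate double test was added.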
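The soft_max_block_reduce kernel assigns one thread block per reduction row and performs three passes: a block-wide max, a block-wide sum of exponentials, and the normalization. The sketch below is hypothetical and only shows that structure; the real kernel is emitted per shape and type, and here blockDim.x is assumed to be a power of two with blockDim.x * sizeof(float) bytes of dynamic shared memory.

```cuda
#include <cfloat>

// Hypothetical block-reduce softmax: one block per row, shared-memory
// tree reductions for the row max and the exp-sum. Not the emitted code.
__global__ void softmax_block_reduce(const float* in, float* out, int n)
{
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const float* row_in = in + blockIdx.x * n;
    float* row_out = out + blockIdx.x * n;

    // 1) row max, for numerical stability
    float m = -FLT_MAX;
    for (int i = tid; i < n; i += blockDim.x)
        m = fmaxf(m, row_in[i]);
    sdata[tid] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    const float row_max = sdata[0];
    __syncthreads();

    // 2) sum of exp(x - max); __expf is the fast intrinsic the last commit mentions
    float sum = 0.0f;
    for (int i = tid; i < n; i += blockDim.x)
        sum += __expf(row_in[i] - row_max);
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    const float row_sum = sdata[0];

    // 3) normalize
    for (int i = tid; i < n; i += blockDim.x)
        row_out[i] = __expf(row_in[i] - row_max) / row_sum;
}
```

A launch in this style would look roughly like softmax_block_reduce<<<rows, threads, threads * sizeof(float)>>>(in, out, n) with threads a power of two.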
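The last few commits replace hard-coded 32s with a WARPSIZE constant and switch exp to the __expf intrinsic. Below is a hypothetical warp-level reduction in that style, assuming a shuffle-based lane exchange (__shfl_down_sync, available since CUDA 9); it is a sketch of the pattern, not the emitter's output.

```cuda
#define WARPSIZE 32 // named constant instead of a bare 32, per the commits

// Illustrative warp-level sum: each halving step pulls a value from the lane
// WARPSIZE/2, WARPSIZE/4, ... positions away; lane 0 ends with the warp total.
__device__ float warp_reduce_sum(float val)
{
    for (int offset = WARPSIZE / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
```

Using __expf rather than expf trades a small amount of precision for speed, which is generally acceptable inside softmax since the result is renormalized by the row sum.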