CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070)
* Added op::ReplaceSlice and enabled the respective tests.
* Renamed div64 -> division_by_invariant_multiplication.
* Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
* Added GPUShape and reworked the Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape (see the first sketch after this list).
* Updated the shape helper signatures and added conversion operators/constructors for GPUShape.
* Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides; these were updated as well to take advantage of the GPUShape conversion operators.
* Fixed the lambda for workspace allocations to match that of the argspace allocations.
* Adjusted row_major_strides to avoid a reversed copy.
* Moved a declaration out of a loop for clang.
* Moved gpu_shape to the GPU transformer and removed no-longer-necessary headers.
* Added the stdexcept header to gpu_shape.hpp.
* Coordinate -> GPUShape.
* Refactored replace_slice into CudaKernelBuilder and simplified its allocations using the new GPUAllocator and GPUMemoryManager.
* Refactored allocations to make use of the primitive emitter. Memory primitives are now registered at compile time and the GPU memory address is resolved at runtime by invoking the primitive (see the second sketch after this list).
* Changed the check on 64-bit shapes to test whether the high bits are set.
* Added a const qualifier to the data being copied in GPUAllocator::reserve_argspace.
* Replaced runtime host-to-device memcpys with GPUAllocator reservations in order to move them to compile time, and removed the no-longer-necessary buffer freeing from the op emitters.
* Removed the replace_slice diffs, then updated the replace_slice op to utilize GPUShape and GPUMemoryManager; added back missing changes after timeline resolution.
* Added spacing between functions in GPUShape and boolean operators in shape.hpp.
* Template parameters are UPPER_SNAKE_CASE.
* Added unit tests for GPUMemoryManager and added checks to ensure that device memory is allocated prior to address resolution by the memory primitives. Also exposed the allocation size of the memory manager.
* The return type of shape_size should be large enough to encapsulate the full stride of the tensor; it should be 64 bits wide regardless of the underlying value_type of the ShapeType. Upstreamed the changes to shape_size (which returns size_t).
* Added a cuDNN softmax implementation for all-axis activation, and a catch for per-axis activations.
* Removed commented-out headers.
* Added an explicit function for queueing kernel argument data rather than doing it inline in the reservation function, per @fengleitian's recommendation.
* Added a softmax CUDA kernel. It relies on atomic addition to global memory; this adds contention and should be optimized in the future. A multilevel reduction can be found in cs/gpu_softmax_cuda_shfl, but it requires some further engineering.
* Refactored the reduce coordinate-transform code into a helper and applied it to broadcast. Broadcast was added to CUDAEmitter and now supports multiple non-consecutive axes (see the third sketch after this list).
* Removed the change to the data_types variable and updated/removed comments.
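For illustration, a minimal sketch of the Shape -> GPUShape conversion and the high-bit check described above. The uint32_t element type reflects the intent (device-side shapes narrower than size_t), but the exact members and error message are assumptions, not the actual nGraph code:

```cuda
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch: device-side shapes use 32-bit elements, so the implicit
// conversion must reject any dimension whose high 32 bits are set.
class GPUShape : public std::vector<uint32_t>
{
public:
    GPUShape(const std::vector<size_t>& dims) // e.g. constructed from Shape
    {
        for (size_t dim : dims)
        {
            // The "check on 64bit shape": test the high bits directly
            // rather than comparing against a maximum value.
            if ((dim >> 32) != 0)
            {
                throw std::runtime_error("Shape element exceeds the width of uint32_t");
            }
            this->push_back(static_cast<uint32_t>(dim));
        }
    }
};
```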
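Next, a rough sketch of the compile-time registration / runtime resolution pattern: reservations made during compilation return an index, and the emitted code resolves the actual device address later by invoking the stored primitive. All names here (memory_primitive, reserve_workspace, resolve) are illustrative, not the real GPUMemoryManager API:

```cuda
#include <cstddef>
#include <functional>
#include <vector>

// Each memory primitive is a callable that resolves a device address.
using memory_primitive = std::function<void*()>;

class MemoryManagerSketch
{
public:
    // Compile time: record an offset into one aggregate buffer and hand
    // back an index that the emitted kernel wrapper holds on to.
    size_t reserve_workspace(size_t size_in_bytes)
    {
        size_t offset = m_buffer_size;
        m_buffer_size += size_in_bytes;
        m_primitives.emplace_back([this, offset]() {
            return static_cast<void*>(static_cast<char*>(m_base) + offset);
        });
        return m_primitives.size() - 1;
    }

    // Runtime: by now the aggregate buffer has been allocated (a single
    // device allocation for everything), so invocation yields a valid pointer.
    void* resolve(size_t primitive_index) { return m_primitives[primitive_index](); }

    void* m_base = nullptr; // allocated once, before the first resolution
    size_t m_buffer_size = 0;
    std::vector<memory_primitive> m_primitives;
};
```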
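Finally, the coordinate-transform idea behind the multi-axis broadcast, sketched as a device helper: walk the output coordinate dimension by dimension and drop the contributions of the broadcast axes, which need not be consecutive. The stride-array layout and the function name are assumptions for illustration, not the CUDAEmitter helper itself:

```cuda
// Sketch: map a flat output index to the flat input index it reads from.
// out_strides are the output's row-major strides; in_strides[i] is the
// input stride contributed by output dimension i (ignored for broadcast
// axes, where the input has no corresponding dimension).
__device__ size_t broadcast_source_index(size_t out_idx,
                                         const size_t* out_strides,
                                         const size_t* in_strides,
                                         const bool* is_broadcast_axis,
                                         size_t rank)
{
    size_t in_idx = 0;
    for (size_t i = 0; i < rank; i++)
    {
        size_t coord = out_idx / out_strides[i]; // coordinate along axis i
        out_idx -= coord * out_strides[i];
        if (!is_broadcast_axis[i])
        {
            in_idx += coord * in_strides[i];
        }
    }
    return in_idx;
}
```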
* Refactored softmax into the emission of two fused elementwise collective ops and added the fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast (see the first sketch after this list).
* Added a default parameter to GPUAllocator::reserve_workspace to request memory initialization on each invocation of the memory primitive. GPU workspace memory is zero-initialized by default, but this can be turned off if desired (see the second sketch after this list).
* Added a template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE, to specify the ngraph op type to use for the reduction in the fused ew_collective kernel.
* Renamed variables and updated a comment.
* Removed the outdated softmax kernel to avoid confusion. It can be added back later when the atomic reduce is replaced.
* Clang complained about the lack of an explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switched to std::set<size_t>. This also has the benefit that if, in the future, we wish to emit kernels without ngraph core (for example, in a standalone binary via a serialized graph manifest), we don't depend on AxisSet.
* Fixed softmax -> broadcast naming in build_broadcast.
* Separated elementwise and elementwise_collective kernels.
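To make the exp_sum_reduce + div_broadcast decomposition concrete, here is a simplified standalone pair of kernels for a softmax over the last axis of a 2-D input. This is an illustration of the atomic-reduction strategy described above, not the generated nGraph kernels; it assumes `sums` is zero-initialized (see the workspace sketch below) and omits max-subtraction for numerical stability:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Phase 1 (exp_sum_reduce): exponentiate each element and accumulate the
// per-row sum via atomicAdd to global memory -- the contention noted above.
__global__ void exp_sum_reduce(const float* in, float* out, float* sums, int rows, int cols)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < rows * cols)
    {
        float e = expf(in[tid]);
        out[tid] = e;
        atomicAdd(&sums[tid / cols], e);
    }
}

// Phase 2 (div_broadcast): divide each element by its row's sum, i.e. an
// elementwise divide fused with a broadcast of the reduced values.
__global__ void div_broadcast(float* out, const float* sums, int rows, int cols)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < rows * cols)
    {
        out[tid] /= sums[tid / cols];
    }
}
```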
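And a sketch of the zero-initialization default mentioned above. The signature is hypothetical, not the actual GPUAllocator::reserve_workspace, but it shows why the default matters for ops like the atomic softmax, whose sum buffer must start at zero on every invocation:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Sketch: what the memory primitive for a workspace reservation might do
// on each invocation when zero-initialization is requested (the default).
void* resolve_workspace(void* base, size_t offset, size_t size_in_bytes,
                        bool zero_initialize = true)
{
    void* ptr = static_cast<char*>(base) + offset;
    if (zero_initialize)
    {
        // Cleared on every invocation, so kernels that accumulate into the
        // workspace (e.g. the atomicAdd sums above) always start from zero.
        cudaMemset(ptr, 0, size_in_bytes);
    }
    return ptr;
}
```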