    CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070) · 83e6aa5f
    Chris Sullivan authored
    * Added op::ReplaceSlice and enabled respective tests.
    
    * div64 -> division_by_invariant_multiplication
    
    * Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
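
    A minimal sketch of the idea, using hypothetical names (GPUMemoryManagerSketch,
    reserve_argspace, reserve_workspace) rather than the actual ngraph interface:
    argument data is queued into a host-side staging buffer at compile time and later
    uploaded with a single allocation and copy, while workspace requests share one
    reusable pool.

    ```cpp
    #include <cstddef>
    #include <vector>

    class GPUMemoryManagerSketch
    {
    public:
        // Queue constant kernel-argument data; everything queued here is later
        // materialized as one device allocation and one host-to-device copy.
        size_t reserve_argspace(const void* data, size_t size)
        {
            size_t offset = m_staging.size();
            const char* bytes = static_cast<const char*>(data);
            m_staging.insert(m_staging.end(), bytes, bytes + size);
            return offset; // becomes a device address once memory is allocated
        }

        // Reserve scratch space; a single pool is allocated and reused at runtime.
        size_t reserve_workspace(size_t size)
        {
            size_t offset = m_workspace_size;
            m_workspace_size += size;
            return offset;
        }

    private:
        std::vector<char> m_staging; // host copy of all queued argument data
        size_t m_workspace_size = 0; // total workspace requested so far
    };
    ```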
    
    * Added GPUShape and reworked the Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape.
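
    A minimal sketch of what the implicit conversion could look like, assuming Shape
    behaves like std::vector<size_t> (the declarations here are illustrative, not the
    exact ngraph ones):

    ```cpp
    #include <cstdint>
    #include <limits>
    #include <stdexcept>
    #include <vector>

    using Shape = std::vector<size_t>; // stand-in for ngraph::Shape

    class GPUShape : public std::vector<uint32_t>
    {
    public:
        GPUShape() = default;

        // Intentionally non-explicit so that a Shape converts implicitly.
        // Throws if any of the high 32 bits of a dimension are set.
        GPUShape(const Shape& shape)
        {
            for (size_t dim : shape)
            {
                if (dim > std::numeric_limits<uint32_t>::max())
                {
                    throw std::invalid_argument("Shape dimension does not fit in 32 bits");
                }
                this->push_back(static_cast<uint32_t>(dim));
            }
        }
    };
    ```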
    
    * Updated shape helpers' signatures and added conversion operators/constructors for GPUShape.
    
    * Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape conversion operators.
    
    * Forgot to fix lambda for workspace allocations to match that of argspace allocations.
    
    * Adjust row_major_strides to avoid reversed-copy.
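
    A minimal sketch of the single-pass version (illustrative, not the exact helper):
    the strides vector is filled back to front, so no reversed temporary needs to be
    built and copied.

    ```cpp
    #include <cstddef>
    #include <vector>

    std::vector<size_t> row_major_strides(const std::vector<size_t>& shape)
    {
        std::vector<size_t> strides(shape.size());
        size_t stride = 1;
        for (size_t i = shape.size(); i-- > 0;) // innermost dimension first
        {
            strides[i] = stride;
            stride *= shape[i];
        }
        return strides;
    }
    ```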
    
    * Moved declaration out of loop for clang.
    
    * Moved gpu_shape to gpu transformer.
    
    * Removed no longer necessary headers.
    
    * Added stdexcept header to gpu_shape.hpp
    
    * Coordinate->GPUShape
    
    * Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
    
    * Refactor allocations to make use of the primitive emitter.
    Now memory primitives are registered at compile time and
    the GPU memory address is resolved at runtime by invoking
    the primitive.
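
    A minimal sketch of the mechanism, with hypothetical names: a memory primitive is a
    closure registered at compile time that, when invoked at runtime, resolves an offset
    against a base device pointer that only exists after allocation.

    ```cpp
    #include <cstddef>
    #include <functional>
    #include <stdexcept>
    #include <vector>

    struct MemoryPrimitiveTableSketch
    {
        char* m_base = nullptr; // device base pointer, set once memory is allocated
        std::vector<std::function<void*()>> m_primitives;

        // Compile time: record an offset and return the primitive's index,
        // which is what gets baked into the emitted kernel launch code.
        size_t register_primitive(size_t offset)
        {
            m_primitives.emplace_back([this, offset]() -> void* {
                if (m_base == nullptr)
                {
                    throw std::runtime_error("device memory not allocated yet");
                }
                return m_base + offset;
            });
            return m_primitives.size() - 1;
        }

        // Runtime: invoke the primitive to obtain the concrete device address.
        void* resolve(size_t index) { return m_primitives[index](); }
    };
    ```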
    
    * Changed the check on 64-bit shape dimensions to test whether the high bits are set.
    
    * Added const qualifier to data being copied in GPUAllocator::reserve_argspace
    
    * Replaced runtime host-to-device memcpys with GPUAllocator reservations in order to move them to compile time.
    
    * Forgot to remove no longer necessary buffer freeing from op emitters.
    
    * Removed replace slice.
    
    * Removed more replace_slice diffs.
    
    * Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
    
    * Added back missing changes after timeline resolution.
    
    * Added spacing between functions in GPUShape and boolean operators in shape.hpp.
    
    * Template parameters are UPPER_SNAKE_CASE.
    
    * Added unit tests for GPUMemoryManager and added checks that ensure the
    device memory is allocated prior to address resolution by the memory_primitives.
    Also exposed the allocation size of the memory manager.
    
    * Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
    This should be 64 bits wide regardless of the underlying value_type of the ShapeType.
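
    A minimal sketch of the intent: accumulate the product in size_t so that a narrower
    ShapeType::value_type (e.g. the 32-bit extents of GPUShape) cannot truncate the result.

    ```cpp
    #include <cstddef>

    template <typename ShapeType>
    size_t shape_size(const ShapeType& shape)
    {
        size_t size = 1; // 64-bit accumulator regardless of the shape's value_type
        for (auto dim : shape)
        {
            size *= static_cast<size_t>(dim);
        }
        return size;
    }
    ```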
    
    * Upstreaming changes to shape_size (which returns size_t).
    
    * cuDNN softmax implementation for all-axis activation.
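
    A minimal sketch of how an all-axis softmax can be expressed through the cuDNN API
    (illustrative only, not the ngraph emitter code; error checking omitted): the tensor
    is flattened into a single instance so CUDNN_SOFTMAX_MODE_INSTANCE normalizes over
    every element.

    ```cpp
    #include <cudnn.h>

    void softmax_all_axes(cudnnHandle_t handle, const float* x, float* y, int n_elements)
    {
        cudnnTensorDescriptor_t desc;
        cudnnCreateTensorDescriptor(&desc);
        // Describe the whole tensor as one instance of shape 1 x n_elements x 1 x 1.
        cudnnSetTensor4dDescriptor(
            desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, n_elements, 1, 1);

        const float alpha = 1.0f, beta = 0.0f;
        cudnnSoftmaxForward(handle,
                            CUDNN_SOFTMAX_ACCURATE,
                            CUDNN_SOFTMAX_MODE_INSTANCE,
                            &alpha,
                            desc, x,
                            &beta,
                            desc, y);
        cudnnDestroyTensorDescriptor(desc);
    }
    ```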
    
    * Added catch for per-axis activations.
    
    * Removed commented-out headers.
    
    * Added an explicit function for queueing kernel argument data rather than doing it inline in the reservation function, per @fengleitian's recommendation.
    
    * Added softmax CUDA kernel. It relies on atomic addition to global
    memory, which adds contention and should be optimized in the
    future. A multilevel reduction can be found in
    cs/gpu_softmax_cuda_shfl, but it requires some further engineering.
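
    A minimal sketch of the atomic approach (all-axis case, float only; the max-subtraction
    usually done for numerical stability is omitted for brevity): the first kernel accumulates
    the sum of exponentials into global memory with atomicAdd, the second normalizes.

    ```cpp
    // Pass 1: exponentiate and accumulate the sum into *sum (initialized to 0).
    __global__ void softmax_exp_sum(const float* in, float* exp_out, float* sum, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            float e = expf(in[i]);
            exp_out[i] = e;
            atomicAdd(sum, e); // global-memory contention; the known bottleneck
        }
    }

    // Pass 2: divide every exponential by the shared sum.
    __global__ void softmax_divide(float* exp_out, const float* sum, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            exp_out[i] /= *sum;
        }
    }
    ```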
    
    * Refactored the reduce coordinate transform code into a helper and applied it to broadcast.
    Broadcast was added to CUDAEmitter and now supports multiple non-consecutive axes.
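
    A minimal sketch of the coordinate transform (illustrative kernel, not the emitted
    code): the flat output index is decomposed with the output strides and remapped
    through input strides that are zero on broadcast axes, which is what lets
    non-consecutive axes be handled uniformly.

    ```cpp
    __global__ void broadcast_sketch(const float* in,
                                     float* out,
                                     const uint32_t* out_strides, // row-major strides of the output
                                     const uint32_t* in_strides,  // 0 on broadcast axes
                                     uint32_t rank,
                                     uint32_t n)
    {
        uint32_t tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n)
        {
            uint32_t remainder = tid;
            uint32_t in_index = 0;
            for (uint32_t d = 0; d < rank; d++)
            {
                uint32_t coord = remainder / out_strides[d];
                remainder = remainder % out_strides[d];
                in_index += coord * in_strides[d];
            }
            out[tid] = in[in_index];
        }
    }
    ```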
    
    * Removed change to data_types variable and updated/removed comments.
    
    * Refactored softmax into the emission of two fused elementwise collective ops.
    Added fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast.
    
    * Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
    
    * GPU workspace memory is zero initialized by default but can be turned off if desired.
    
    * Added template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE,
    to specify the ngraph op type to use for the reduction in the fused ew_collective kernel.
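
    A minimal sketch of the idea behind the template parameter, with hypothetical names
    (CudaOpMapSketch is an illustrative trait, not the actual ngraph mapping): the op type
    passed as REDUCE_OP_TYPE selects the operator string emitted for the reduction inside
    the fused elementwise + collective kernel.

    ```cpp
    #include <string>

    struct Add {};      // stand-ins for the ngraph op types
    struct Multiply {};

    // Hypothetical trait mapping an op type to the operator it reduces with.
    template <typename OP> struct CudaOpMapSketch;
    template <> struct CudaOpMapSketch<Add>      { static constexpr const char* op = "+"; };
    template <> struct CudaOpMapSketch<Multiply> { static constexpr const char* op = "*"; };

    template <typename REDUCE_OP_TYPE>
    std::string reduction_operator()
    {
        // e.g. reduction_operator<Add>() returns "+"
        return CudaOpMapSketch<REDUCE_OP_TYPE>::op;
    }
    ```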
    
    * Renamed variables and updated a comment.
    
    * Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.
    
    * Clang complained about the lack of an explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switch to std::set<size_t>.
    This also has the benefit that in the future, if we wish to emit kernels without ngraph core (for example, in a standalone binary via a
    serialized graph manifest), we don't depend on AxisSet.
    
    * softmax -> broadcast in build_broadcast.
    
    * Separate elementwise and elementwise_collective.
gpu_cuda_kernel_emitters.cpp 7.97 KB