    CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070) · 83e6aa5f
    Chris Sullivan authored
    * Added op::ReplaceSlice and enabled respective tests.
    
    * div64 -> division_by_invariant_multiplication
    
    * Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
    
    * Added GPUShape and reworked Shape helpers to be
    compatible with different shape types.
    Shape is now implicitly convertible to GPUShape.
    
    * Updated shape helper signatures and added conversion operators/constructors for GPUShape.
    
    * Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape conversion operators.
    
    * Forgot to fix lambda for workspace allocations to match that of argspace allocations.
    
    * Adjust row_major_strides to avoid reversed-copy.
    
    * Moved declaration out of loop for clang.
    
    * Moved gpu_shape to gpu transformer.
    
    * Removed no longer necessary headers.
    
    * Added stdexcept header to gpu_shape.hpp
    
    * Coordinate->GPUShape
    
    * Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
    
    * Refactored allocations to make use of the primitive emitter.
    Memory primitives are now registered at compile time and
    the GPU memory address is resolved at runtime by invoking
    the primitive.
    
    * Changed check on 64bit shape to check if high bits are set.
    
    * Added const qualifier to data being copied in GPUAllocator::reserve_argspace
    
    * Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.
    
    * Forgot to remove no longer necessary buffer freeing from op emitters.
    
    * Removed replace slice.
    
    * Removed more replace_slice diffs.
    
    * Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
    
    * Added back missing changes after timeline resolution.
    
    * Added spacing between functions in GPUShape and boolean operators in shape.hpp.
    
    * Template parameters are UPPER_SNAKE_CASE.
    
    * Added unit tests for GPUMemoryManager and added checks that ensure the
    device memory is allocated prior to address resolution by the memory_primitives.
    Also exposed the allocation size of the memory manager.
    
    * Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
    This should be 64bits wide regardless of the underlying value_type of the ShapeType.
    
    * Upstreaming changes to shape_size (which returns size_t).
    
    * cuDNN softmax impl. for all axis activation.
    
    * Added catch for per-axis activations.
    
    * Removed commented-out headers.
    
    * Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.
    
    * Added softmax CUDA kernel. It relies on atomic additions to global
    memory, which adds contention and should be optimized in the
    future. A multilevel reduction can be found in
    cs/gpu_softmax_cuda_shfl, but it requires further engineering.
    
    * Refactored reduce coordinate transform code into a helper and applied it to broadcast.
    Broadcast added to CUDAEmitter, now supports multiple non-consecutive axes.
    
    * Removed change to data_types variable and updated/removed comments.
    
    * Refactored softmax into the emission of two fused elementwise collective ops.
    Added fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast.
    
    * Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
    
    * GPU workspace memory is zero initialized by default but can be turned off if desired.
    
    * Added template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE,
    to specify the ngraph op type to use for the reduction in the fused ew_collective kernel.
    
    * Renamed variables and updated a comment.
    
    * Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.
    
    * Clang complained about the lack of an explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switched to std::set<size_t>.
    This also has the benefit that in the future, if we wish to emit kernels without ngraph core (for example in a standalone binary via a
    serialized graph manifest), we don't depend on AxisSet.
    
    * softmax -> broadcast in build_broadcast.
    
    * Separate elementwise and elementwise_collective.
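
Note on "div64 -> division_by_invariant_multiplication": the idea is to replace a division by a divisor that is fixed for the whole kernel with a precomputed multiply and shifts. The block below is a minimal host-side sketch of the textbook unsigned 32-bit scheme (Granlund and Montgomery), not the code from this commit. The names `InvariantDivisor`, `make_invariant_divisor` and `divide` are hypothetical, and the emitted kernels may use a different magic/shift layout.

```cpp
#include <cstdint>

// Hypothetical container for the precomputed constants of one divisor.
struct InvariantDivisor
{
    uint32_t magic;  // multiplier m' in Granlund and Montgomery's notation
    uint32_t shift1; // min(l, 1), where l = ceil(log2(d))
    uint32_t shift2; // max(l - 1, 0)
};

// Precompute the constants once per divisor (d must be >= 1).
InvariantDivisor make_invariant_divisor(uint32_t d)
{
    uint32_t l = 0;
    while ((uint64_t{1} << l) < d)
    {
        ++l; // l = ceil(log2(d))
    }
    // m' = floor(2^32 * (2^l - d) / d) + 1; fits in 32 bits because 2^l < 2d.
    const uint64_t m = ((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1;
    return {static_cast<uint32_t>(m), l < 1u ? l : 1u, l > 0u ? l - 1u : 0u};
}

// Compute n / d using only a multiply, an add, a subtract and two shifts.
uint32_t divide(uint32_t n, const InvariantDivisor& div)
{
    // High 32 bits of the 64-bit product (what __umulhi returns on the device).
    const uint32_t t = static_cast<uint32_t>((uint64_t{div.magic} * n) >> 32);
    return (t + ((n - t) >> div.shift1)) >> div.shift2;
}
```

The constants would be computed on the host for each shape or stride value and passed as kernel arguments, so the device code never issues an integer divide.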
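
Note on the memory primitives: the entries above describe reservations that are registered at compile time, aggregated into a single allocation, and resolved to an address only when the primitive is invoked, with a check that the memory has been allocated first. The block below sketches that pattern with a host buffer standing in for device memory. `MemoryManagerSketch` and all of its members are hypothetical and are not the GPUMemoryManager or GPUAllocator interface.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in: a host buffer plays the role of device memory.
class MemoryManagerSketch
{
public:
    MemoryManagerSketch() = default;
    // The primitives capture `this`, so copying is disabled.
    MemoryManagerSketch(const MemoryManagerSketch&) = delete;
    MemoryManagerSketch& operator=(const MemoryManagerSketch&) = delete;

    // "Compile time": record the request and return the index of a primitive
    // that resolves the address later.
    size_t reserve(size_t bytes)
    {
        const size_t offset = m_size;
        m_size += bytes;
        m_primitives.push_back([this, offset]() -> void* {
            if (!m_allocated)
            {
                throw std::runtime_error("address resolved before allocation");
            }
            return m_buffer.data() + offset;
        });
        return m_primitives.size() - 1;
    }

    // One aggregated, zero-initialized allocation covering every reservation.
    void allocate()
    {
        m_buffer.assign(m_size, 0);
        m_allocated = true;
    }

    size_t allocation_size() const { return m_size; }

    // "Runtime": invoke the primitive to obtain the concrete address.
    void* invoke(size_t primitive) const { return m_primitives.at(primitive)(); }

private:
    size_t m_size = 0;
    bool m_allocated = false;
    std::vector<char> m_buffer;
    std::vector<std::function<void*()>> m_primitives;
};
```

Because callers hold only a primitive index until invocation, the manager is free to place all reservations in one buffer after every request is known.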
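
Note on softmax as exp_sum_reduce + div_broadcast: the two fused passes can be illustrated on the host in plain C++. This is a sketch under assumptions, not the emitted CUDA kernel. The pass names come from the message above, while `reduced_index`, `softmax` and their signatures are hypothetical. The reduction axes are passed as std::set<size_t> and may be non-consecutive. The sums vector plays the role of the zero-initialized workspace that the kernel would update with atomic adds.

```cpp
#include <cmath>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper: maps a flat row-major index into `shape` to the flat
// index of the tensor that remains once `axes` are removed.
size_t reduced_index(size_t flat, const std::vector<size_t>& shape, const std::set<size_t>& axes)
{
    // Peel off coordinates from the innermost dimension outwards.
    std::vector<size_t> coord(shape.size());
    for (size_t d = shape.size(); d-- > 0;)
    {
        coord[d] = flat % shape[d];
        flat /= shape[d];
    }
    // Re-linearize using only the dimensions that are kept.
    size_t out = 0;
    for (size_t d = 0; d < shape.size(); ++d)
    {
        if (axes.count(d) == 0)
        {
            out = out * shape[d] + coord[d];
        }
    }
    return out;
}

std::vector<double> softmax(const std::vector<double>& x,
                            const std::vector<size_t>& shape,
                            const std::set<size_t>& axes)
{
    size_t reduced_size = 1;
    for (size_t d = 0; d < shape.size(); ++d)
    {
        if (axes.count(d) == 0)
        {
            reduced_size *= shape[d];
        }
    }
    std::vector<double> sums(reduced_size, 0.0); // zero-initialized workspace
    std::vector<double> out(x.size());

    // Pass 1 (exp_sum_reduce): elementwise exp fused with a sum over `axes`.
    for (size_t i = 0; i < x.size(); ++i)
    {
        out[i] = std::exp(x[i]);
        sums[reduced_index(i, shape, axes)] += out[i];
    }
    // Pass 2 (div_broadcast): divide by the sum broadcast back over `axes`.
    for (size_t i = 0; i < x.size(); ++i)
    {
        out[i] /= sums[reduced_index(i, shape, axes)];
    }
    return out;
}
```

The sketch does not subtract the per-slice maximum before exponentiating, so very large inputs would overflow.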
gpu_cuda_kernel_builder.cpp 22.1 KB