    CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070) · 83e6aa5f
    Chris Sullivan authored
    * Added op::ReplaceSlice and enabled respective tests.
    
    * div64 -> division_by_invariant_multiplication
    
    * Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations (sketch below).
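
    A minimal sketch of the aggregation idea; ArgSpace, append, and finalize are illustrative names, not the actual ngraph API:

        // Sketch: collect all kernel-argument blobs at compile time and issue
        // one device allocation + one host-to-device copy, instead of one per op.
        #include <cuda_runtime.h>
        #include <vector>

        class ArgSpace
        {
        public:
            // Queue host data; returns the byte offset it will occupy on device.
            size_t append(const void* data, size_t bytes)
            {
                size_t offset = m_host.size();
                const char* p = static_cast<const char*>(data);
                m_host.insert(m_host.end(), p, p + bytes);
                return offset;
            }
            // One allocation and one copy for everything that was queued.
            void* finalize()
            {
                cudaMalloc(&m_device, m_host.size());
                cudaMemcpy(m_device, m_host.data(), m_host.size(), cudaMemcpyHostToDevice);
                return m_device;
            }
        private:
            std::vector<char> m_host;
            void* m_device = nullptr;
        };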
    
    * Added GPUShape and reworked the Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape (see the sketch below).
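
    A minimal sketch of the implicit conversion, assuming GPUShape stores 32-bit extents; everything except the GPUShape/Shape names is illustrative:

        // Sketch: GPUShape as a vector of 32-bit extents with an implicit
        // converting constructor from the host-side Shape (vector of size_t).
        // The range check mirrors the "high bits set" test mentioned below.
        #include <cstdint>
        #include <stdexcept>
        #include <vector>

        using Shape = std::vector<size_t>;

        class GPUShape : public std::vector<uint32_t>
        {
        public:
            GPUShape() = default;
            GPUShape(const Shape& shape) // intentionally non-explicit
            {
                for (size_t dim : shape)
                {
                    if (dim >> 32) // dimension would not fit in 32 bits
                    {
                        throw std::invalid_argument("Shape dimension exceeds 32 bits");
                    }
                    push_back(static_cast<uint32_t>(dim));
                }
            }
        };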
    
    * Updated the shape helpers' signatures and added conversion operators/constructors for GPUShape.
    
    * Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides; these were updated as well to take advantage of the GPUShape conversion operators.
    
    * Forgot to fix lambda for workspace allocations to match that of argspace allocations.
    
    * Adjusted row_major_strides to avoid the reversed copy (sketch below).
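
    A sketch of the single-pass form, filling strides right to left instead of building a reversed vector and copying it back:

        #include <vector>

        std::vector<size_t> row_major_strides(const std::vector<size_t>& shape)
        {
            std::vector<size_t> strides(shape.size());
            size_t stride = 1;
            // Walk the axes from innermost to outermost, writing in place.
            for (size_t i = shape.size(); i-- > 0;)
            {
                strides[i] = stride;
                stride *= shape[i];
            }
            return strides;
        }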
    
    * Moved declaration out of loop for clang.
    
    * Moved gpu_shape to gpu transformer.
    
    * Removed no longer necessary headers.
    
    * Added stdexcept header to gpu_shape.hpp
    
    * Coordinate->GPUShape
    
    * Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
    
    * Refactored allocations to make use of the primitive emitter.
    Memory primitives are now registered at compile time and
    the GPU memory address is resolved at runtime by invoking
    the primitive (sketch below).
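
    A rough sketch of the resolution scheme; MemoryPrimitive, register_workspace, and pool_base are illustrative names:

        // Sketch: at compile time only an index is handed out; the device
        // pointer is materialized when the primitive is invoked at runtime,
        // after the memory pool has actually been allocated.
        #include <cstddef>
        #include <functional>
        #include <vector>

        using MemoryPrimitive = std::function<void*()>;

        std::vector<MemoryPrimitive> primitives;
        char* pool_base = nullptr; // set once the pool is cudaMalloc'd

        size_t register_workspace(size_t offset)
        {
            primitives.emplace_back([offset]() -> void* {
                return pool_base + offset; // resolved at call time
            });
            return primitives.size() - 1; // index emitted into generated code
        }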
    
    * Changed the check on 64-bit shapes to test whether the high bits are set.
    
    * Added const qualifier to data being copied in GPUAllocator::reserve_argspace
    
    * Replaced runtime host-to-device memcpys with GPUAllocator reservations in order to move them to compile time.
    
    * Forgot to remove no longer necessary buffer freeing from op emitters.
    
    * Removed replace slice.
    
    * Removed more replace_slice diffs.
    
    * Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
    
    * Added back missing changes after timeline resolution.
    
    * Added spacing between functions in GPUShape and boolean operators in shape.hpp.
    
    * Template parameters are UPPER_SNAKE_CASE.
    
    * Added unit tests for GPUMemoryManager and added checks that ensure the
    device memory is allocated prior to address resolution by the memory_primitives.
    Also exposed the allocation size of the memory manager.
    
    * The return type of shape_size should be large enough to encapsulate the full stride of the tensor.
    This should be 64 bits wide regardless of the underlying value_type of the ShapeType (sketch below).
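
    A sketch of the signature change; SHAPE_TYPE follows the UPPER_SNAKE_CASE convention noted above:

        #include <cstddef>

        // Accumulate in size_t regardless of the shape's element type, so a
        // large tensor's total element count cannot overflow a narrower
        // value_type such as GPUShape's 32-bit extents.
        template <typename SHAPE_TYPE>
        size_t shape_size(const SHAPE_TYPE& shape)
        {
            size_t size = 1;
            for (auto dim : shape)
            {
                size *= static_cast<size_t>(dim);
            }
            return size;
        }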
    
    * Upstreaming changes to shape_size (which returns size_t).
    
    * cuDNN softmax implementation for all-axis activation.
    
    * Added catch for per-axis activations.
    
    * Removed commented-out headers.
    
    * Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.
    
    * Added softmax CUDA kernel (sketch below). It relies on atomic addition to global
    memory, which adds contention and should be optimized in the
    future. A multilevel reduction can be found in
    cs/gpu_softmax_cuda_shfl, but it requires some further engineering.
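
    A sketch of the atomic reduction stage under the simplifying assumption of a 2D tensor reduced over its last axis; exp_sum_atomic is an illustrative name, not the emitted kernel:

        // Every thread adds exp(x) for its element into the per-row sum in
        // global memory via atomicAdd. Contention on `sums` is the bottleneck
        // the commit calls out; a shuffle-based multilevel reduction avoids it.
        extern "C" __global__ void exp_sum_atomic(const float* in,
                                                  float* sums, // pre-zeroed, one per row
                                                  size_t rows,
                                                  size_t cols)
        {
            size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid < rows * cols)
            {
                size_t row = tid / cols;
                atomicAdd(&sums[row], expf(in[tid]));
            }
        }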
    
    * Refactored the reduce coordinate transform code into a helper and applied it to broadcast.
    Broadcast was added to CUDAEmitter and now supports multiple non-consecutive axes (sketch below).
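
    A sketch of the shared coordinate transform; it assumes precomputed row-major output strides and input strides zeroed on the broadcast axes:

        // Map a flat output index to the flat input index by decomposing with
        // the output strides and dropping axes that are broadcast (input
        // stride 0 on those axes). Non-consecutive axes fall out naturally,
        // and reduction can reuse the same transform with the roles reversed.
        __device__ size_t transform_index(size_t out_idx,
                                          const size_t* out_strides,
                                          const size_t* in_strides, // 0 on broadcast axes
                                          size_t rank)
        {
            size_t in_idx = 0;
            for (size_t axis = 0; axis < rank; axis++)
            {
                size_t coord = out_idx / out_strides[axis];
                out_idx -= coord * out_strides[axis];
                in_idx += coord * in_strides[axis];
            }
            return in_idx;
        }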
    
    * Removed change to data_types variable and updated/removed comments.
    
    * Refactored softmax into the emission of two fused elementwise collective ops.
    Added fused elementwise + collective kernels; softmax is then just the combination of exp_sum_reduce + div_broadcast (sketch of the second stage below).
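
    A sketch of the second stage under the same 2D simplification as above; div_broadcast here is an illustrative kernel pairing with the exp_sum_atomic sketch, not the emitted code:

        // Each element recomputes exp(x) and divides by its row's sum of
        // exponentials, completing softmax = exp(x) / sum(exp(x)) over the row.
        extern "C" __global__ void div_broadcast(const float* in,
                                                 const float* sums, // one per row
                                                 float* out,
                                                 size_t rows,
                                                 size_t cols)
        {
            size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
            if (tid < rows * cols)
            {
                out[tid] = expf(in[tid]) / sums[tid / cols];
            }
        }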
    
    * Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
    
    * GPU workspace memory is zero-initialized by default, but this can be turned off if desired (sketch below).
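
    A sketch of the zero-initialization hook; resolve_workspace is an illustrative name:

        #include <cuda_runtime.h>
        #include <cstddef>

        // The workspace primitive optionally memsets its slice to zero each
        // time it is invoked, so kernels that accumulate (e.g. the atomicAdd
        // reduction above) see a clean buffer. Defaults to on, per the commit.
        void* resolve_workspace(char* pool_base, size_t offset, size_t bytes,
                                bool zero_initialize = true)
        {
            void* ptr = pool_base + offset;
            if (zero_initialize)
            {
                cudaMemset(ptr, 0, bytes); // fresh zeros on every invocation
            }
            return ptr;
        }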
    
    * Added a template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE,
    to specify the ngraph op type to use for the reduction in the fused ew_collective kernel (sketch below).
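
    A sketch of the template-parameter dispatch; CudaOpMap and emit_reduce_expr are illustrative, and the stand-in Add/Multiply types represent ngraph op types:

        #include <string>

        template <typename T>
        struct CudaOpMap; // specialized per op type

        struct Add;      // stand-ins for ngraph op types
        struct Multiply;

        template <>
        struct CudaOpMap<Add>
        {
            static const char* op() { return "+"; }
        };

        template <>
        struct CudaOpMap<Multiply>
        {
            static const char* op() { return "*"; }
        };

        // Emits the accumulation statement for the fused kernel's reduction,
        // e.g. emit_reduce_expr<Add>("sum", "exp(x)") -> "sum = sum + exp(x);"
        template <typename REDUCE_OP_TYPE>
        std::string emit_reduce_expr(const std::string& acc, const std::string& val)
        {
            return acc + " = " + acc + " " + CudaOpMap<REDUCE_OP_TYPE>::op() + " " + val + ";";
        }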
    
    * Renamed variables and updated a comment.
    
    * Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.
    
    * Clang complained about the lack of an explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switched to std::set<size_t>.
    This also has the benefit that, in the future, if we wish to emit kernels without ngraph core (for example, in a standalone binary via a
    serialized graph manifest), we don't depend on AxisSet.
    
    * softmax -> broadcast in build_broadcast.
    
    * Separate elementwise and elementwise_collective.