    Merge pull request #14827 from YashasSamaga:cuda4dnn-csl-low · 613c12e5
    Yashas Samaga B L authored
    CUDA backend for the DNN module
    
    * stub cuda4dnn design
    
    * minor fixes for tests and doxygen
    
    * add csl public api directory to module headers
    
    * add low-level CSL components
    
    * add high-level CSL components
    
    * integrate csl::Tensor into backbone code
    
    * switch to CPU iff unsupported; otherwise, fail on error
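
    A hedged sketch of the dispatch policy this bullet describes (not actual
    OpenCV code: supportBackend and initCUDA appear later in this log, but the
    surrounding names and arguments are invented):

    // Unsupported layers fall back to the CPU path; failures in supported
    // layers propagate as errors instead of silently falling back.
    if (!layer->supportBackend(DNN_BACKEND_CUDA))
        node = fallbackToCPU(layer);                       // hypothetical helper
    else
        node = layer->initCUDA(context, inputs, outputs);  // assumed arguments;
                                                           // errors here must throw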
    
    * add fully connected layer
    
    * add softmax layer
    
    * add activation layers
    
    * support arbitrary rank TensorDescriptor
    
    * pass input wrappers to `initCUDA()`
    
    * add 1d/2d/3d-convolution
    
    * add pooling layer
    
    * reorganize and refactor code
    
    * fixes for gcc, clang and doxygen; remove cxx14/17 code
    
    * add blank_layer
    
    * add LRN layer
    
    * add rounding modes for pooling layer
    
    * split tensor.hpp into tensor.hpp and tensor_ops.hpp
    
    * add concat layer
    
    * add scale layer
    
    * add batch normalization layer
    
    * split math.cu into activations.cu and math.hpp
    
    * add eltwise layer
    
    * add flatten layer
    
    * add tensor transform api
    
    * add asymmetric padding support for convolution layer
    
    * add reshape layer
    
    * fix rebase issues
    
    * add permute layer
    
    * add padding support for concat layer
    
    * refactor and reorganize code
    
    * add normalize layer
    
    * optimize bias addition in scale layer
    
    * add prior box layer
    
    * fix and optimize normalize layer
    
    * add asymmetric padding support for pooling layer
    
    * add event API
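
    For illustration, a minimal RAII wrapper over CUDA events of the kind such
    an API typically provides; the actual csl::Event interface is not shown in
    this log, so the shape below is an assumption.

    // Sketch of an RAII CUDA event wrapper (names and interface assumed).
    #include <cuda_runtime.h>

    class Event {
    public:
        Event() { cudaEventCreateWithFlags(&event_, cudaEventDisableTiming); }
        ~Event() { cudaEventDestroy(event_); }
        Event(const Event&) = delete;
        Event& operator=(const Event&) = delete;

        // mark a point in the stream's work queue
        void record(cudaStream_t stream) { cudaEventRecord(event_, stream); }
        // block the host until the recorded point has been reached
        void synchronize() { cudaEventSynchronize(event_); }

    private:
        cudaEvent_t event_ = nullptr;
    };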
    
    * improve pooling performance for some padding scenarios
    
    * avoid over-allocation of compute resources to kernels
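
    A sketch of the usual technique behind such a change, assuming it means
    capping the launch grid: start only enough blocks to saturate the device
    and let a grid-stride loop cover the remaining elements (the per-SM
    multiplier below is an arbitrary example, not the cuda4dnn heuristic).

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>

    std::size_t choose_grid_size(std::size_t n, std::size_t block_size)
    {
        int device = 0, num_sms = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

        // enough blocks to cover every element...
        std::size_t required = (n + block_size - 1) / block_size;
        // ...but no more than a few resident blocks per SM (assumed factor)
        return std::min(required, static_cast<std::size_t>(num_sms) * 4);
    }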
    
    * improve prior box performance
    
    * enable layer fusion
    
    * add const layer
    
    * add resize layer
    
    * add slice layer
    
    * add padding layer
    
    * add deconvolution layer
    
    * fix channelwise ReLU initialization
    
    * add vector traits
    
    * add vectorized versions of relu, clipped_relu, power
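
    To show what the vectorization buys, a sketch of a float4-based ReLU
    kernel; the real kernels are generic over the vector traits added above,
    while this hardcodes float4 and assumes 16-byte-aligned pointers and an
    element count divisible by four.

    // Illustrative only: four floats per load/store instead of one.
    __global__ void relu_vec4(float* output, const float* input, unsigned int n_vec4)
    {
        auto in4  = reinterpret_cast<const float4*>(input);
        auto out4 = reinterpret_cast<float4*>(output);

        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_vec4)
        {
            float4 v = in4[i];
            v.x = fmaxf(v.x, 0.f); v.y = fmaxf(v.y, 0.f);
            v.z = fmaxf(v.z, 0.f); v.w = fmaxf(v.w, 0.f);
            out4[i] = v;
        }
    }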
    
    * add vectorized concat kernels
    
    * improve concat_with_offsets performance
    
    * vectorize scale and bias kernels
    
    * add support for multi-billion element tensors
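
    This presumably comes down to 64-bit indexing: a 32-bit index overflows
    past 2^32 - 1 elements. A grid-stride loop over std::size_t, as in the
    sketch below, handles tensors of any size without needing a grid large
    enough to cover every element.

    #include <cstddef>

    // Illustrative 64-bit-safe kernel skeleton, not the actual cuda4dnn code.
    __global__ void fill(float* data, std::size_t n, float value)
    {
        std::size_t stride = static_cast<std::size_t>(gridDim.x) * blockDim.x;
        for (std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
             i < n; i += stride)
            data[i] = value;
    }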
    
    * vectorize prior box kernels
    
    * fix address alignment check
    
    * improve bias addition performance of conv/deconv/fc layers
    
    * restructure code for supporting multiple targets
    
    * add DNN_TARGET_CUDA_FP64
    
    * add DNN_TARGET_FP16
    
    * improve vectorization
    
    * add region layer
    
    * improve tensor API, add dynamic ranks
    
    1. use ManagedPtr instead of a Tensor in backend wrapper
    2. add new methods to tensor classes
      - size_range: computes the combined size for a given axis range
      - tensor span/view can be constructed from a raw pointer and shape
    3. the tensor classes can change their rank at runtime (previously rank was fixed at compile-time)
    4. remove device code from tensor classes (as it is unused)
    5. enforce strict conditions on tensor class APIs to improve debugging ability
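
    A hedged sketch of how the reworked API might look in use; the names
    follow the points above, but the exact types and signatures are guesses
    from this log, not the real headers.

    // Hypothetical usage of the dynamic-rank tensor API described above;
    // device_ptr stands in for a pointer obtained elsewhere.
    csl::Tensor<float> tensor(2, 3, 4, 5);  // rank chosen at runtime
    auto count = tensor.size_range(1, 4);   // combined size over axes [1, 4): 3 * 4 * 5 = 60
    csl::TensorView<float> view(device_ptr, {2, 3, 4, 5});  // view from raw pointer + shape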
    
    * fix parametric relu activation
    
    * add squeeze/unsqueeze tensor API
    
    * add reorg layer
    
    * optimize permute and enable 2d permute
    
    * enable 1d and 2d slice
    
    * add split layer
    
    * add shuffle channel layer
    
    * allow tensors of different ranks in reshape primitive
    
    * patch SliceOp to allow Crop Layer
    
    * allow extra shape inputs in reshape layer
    
    * use `std::move_backward` instead of `std::move` for insert in resizable_static_array
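
    For context: shifting the tail of the array one slot to the right moves it
    into a range that overlaps the source, which std::move does not permit
    (elements would be overwritten before they are read); std::move_backward
    copies back-to-front, as in this minimal illustration.

    #include <algorithm>
    #include <cstddef>
    #include <utility>

    // Not the actual resizable_static_array code; caller must guarantee
    // that the backing storage has room for size + 1 elements.
    template <class T>
    void insert_at(T* data, std::size_t size, std::size_t pos, T value)
    {
        // moves [pos, size) to [pos + 1, size + 1), proceeding back-to-front
        std::move_backward(data + pos, data + size, data + size + 1);
        data[pos] = std::move(value);
    }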
    
    * improve workspace management
    
    * add spatial LRN
    
    * add nms (cpu) to region layer
    
    * add max pooling with argmax (and a fix to limits.hpp)
    
    * add max unpooling layer
    
    * rename DNN_TARGET_CUDA_FP32 to DNN_TARGET_CUDA
    
    * update supportBackend to be more rigorous
    
    * remove stray include that was preventing non-CUDA build
    
    * include op_cuda.hpp outside the #if condition
    
    * refactoring, fixes and many optimizations
    
    * drop DNN_TARGET_CUDA_FP64
    
    * fix gcc errors
    
    * increase max. tensor rank limit to six
    
    * add Interp layer
    
    * drop custom layers; use BackendNode
    
    * vectorize activation kernels
    
    * fixes for gcc
    
    * remove wrong assertion
    
    * fix broken assertion in unpooling primitive
    
    * fix build errors in non-CUDA build
    
    * completely remove workspace from public API
    
    * fix permute layer
    
    * enable accuracy and perf. tests for DNN_TARGET_CUDA
    
    * add asynchronous forward
    
    * vectorize eltwise ops
    
    * vectorize fill kernel
    
    * fixes for gcc
    
    * remove CSL headers from public API
    
    * remove csl header source group from cmake
    
    * update min. cudnn version in cmake
    
    * add numerically stable FP32 log1pexp
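
    For reference, the standard trick: log1pexp(x) = log(1 + e^x), and
    evaluating e^x directly overflows FP32 once x exceeds roughly 88, even
    though the true result is then simply ~ x. A piecewise evaluation like the
    sketch below avoids this; the exact cut-offs chosen in the commit may
    differ.

    // Illustrative piecewise log1pexp; thresholds are ballpark values.
    __device__ float log1pexp(float x)
    {
        if (x <= -20.f) return expf(x);          // log1p(e^x) ~ e^x for tiny e^x
        if (x <= 9.f)   return log1pf(expf(x));  // e^x cannot overflow here
        if (x <= 14.f)  return x + expf(-x);     // x + log1p(e^-x), log1p(t) ~ t
        return x;                                // e^-x vanishes at FP32 precision
    }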
    
    * refactor code
    
    * add FP16 specialization to cudnn based tensor addition
    
    * vectorize scale1 and bias1 + minor refactoring
    
    * fix doxygen build
    
    * fix invalid alignment assertion
    
    * clear backend wrappers before allocateLayers
    
    * ignore memory lock failures
    
    * do not allocate internal blobs
    
    * integrate NVTX
    
    * add numerically stable half precision log1pexp
    
    * fix indentation, follow coding style, improve docs
    
    * remove accidental modification of IE code
    
    * Revert "add asynchronous forward"
    
    This reverts commit 1154b9da9da07e9b52f8a81bdcea48cf31c56f70.
    
    * [cmake] throw error for unsupported CC versions
    
    * fix rebase issues
    
    * add more docs, refactor code, fix bugs
    
    * minor refactoring and fixes
    
    * resolve warnings/errors from clang
    
    * remove haveCUDA() checks from supportBackend()
    
    * remove NVTX integration
    
    * changes based on review comments
    
    * avoid exception when no CUDA device is present
    
    * add color code for CUDA in Net::dump