    Merge pull request #14827 from YashasSamaga:cuda4dnn-csl-low · 613c12e5
    Yashas Samaga B L authored
    CUDA backend for the DNN module
    
    * stub cuda4dnn design
    
    * minor fixes for tests and doxygen
    
    * add csl public api directory to module headers
    
    * add low-level CSL components
    
    * add high-level CSL components
    
    * integrate csl::Tensor into backbone code
    
    * switch to CPU iff unsupported; otherwise, fail on error
    
    * add fully connected layer
    
    * add softmax layer
    
    * add activation layers
    
    * support arbitrary rank TensorDescriptor
    
    * pass input wrappers to `initCUDA()`
    
    * add 1d/2d/3d-convolution
    
    * add pooling layer
    
    * reorganize and refactor code
    
    * fixes for gcc, clang and doxygen; remove cxx14/17 code
    
    * add blank_layer
    
    * add LRN layer
    
    * add rounding modes for pooling layer
    
    * split tensor.hpp into tensor.hpp and tensor_ops.hpp
    
    * add concat layer
    
    * add scale layer
    
    * add batch normalization layer
    
    * split math.cu into activations.cu and math.hpp
    
    * add eltwise layer
    
    * add flatten layer
    
    * add tensor transform api
    
    * add asymmetric padding support for convolution layer
    
    * add reshape layer
    
    * fix rebase issues
    
    * add permute layer
    
    * add padding support for concat layer
    
    * refactor and reorganize code
    
    * add normalize layer
    
    * optimize bias addition in scale layer
    
    * add prior box layer
    
    * fix and optimize normalize layer
    
    * add asymmetric padding support for pooling layer
    
    * add event API
    
    * improve pooling performance for some padding scenarios
    
    * avoid over-allocation of compute resources to kernels
    
    * improve prior box performance
    
    * enable layer fusion
    
    * add const layer
    
    * add resize layer
    
    * add slice layer
    
    * add padding layer
    
    * add deconvolution layer
    
    * fix channelwise ReLU initialization
    
    * add vector traits
    
    * add vectorized versions of relu, clipped_relu, power
    
    * add vectorized concat kernels
    
    * improve concat_with_offsets performance
    
    * vectorize scale and bias kernels
    
    * add support for multi-billion element tensors
    
    * vectorize prior box kernels
    
    * fix address alignment check
    
    * improve bias addition performance of conv/deconv/fc layers
    
    * restructure code for supporting multiple targets
    
    * add DNN_TARGET_CUDA_FP64
    
    * add DNN_TARGET_FP16
    
    * improve vectorization
    
    * add region layer
    
    * improve tensor API, add dynamic ranks
    
    1. use ManagedPtr instead of a Tensor in backend wrapper
    2. add new methods to tensor classes
      - size_range: computes the combined size for a given axis range (see the sketch after this list)
      - tensor span/view can be constructed from a raw pointer and shape
    3. the tensor classes can change their rank at runtime (previously rank was fixed at compile-time)
    4. remove device code from tensor classes (as it is unused)
    5. enforce strict conditions on tensor class APIs to improve debugging ability
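
    A rough standalone sketch of what size_range computes (the name comes from the list above; the signature and the NCHW shape are illustrative assumptions, not the actual csl API):

        #include <cstddef>
        #include <functional>
        #include <numeric>
        #include <vector>

        // Combined size over the half-open axis range [start, end) of a shape,
        // i.e. the product of the dimensions in that range.
        std::size_t size_range(const std::vector<std::size_t>& shape,
                               std::size_t start, std::size_t end)
        {
            return std::accumulate(shape.begin() + start, shape.begin() + end,
                                   std::size_t(1), std::multiplies<std::size_t>());
        }

        // e.g. for an NCHW shape {1, 64, 56, 56}, size_range(shape, 1, 4) == 64 * 56 * 56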
    
    * fix parametric relu activation
    
    * add squeeze/unsqueeze tensor API
    
    * add reorg layer
    
    * optimize permute and enable 2d permute
    
    * enable 1d and 2d slice
    
    * add split layer
    
    * add shuffle channel layer
    
    * allow tensors of different ranks in reshape primitive
    
    * patch SliceOp to allow Crop Layer
    
    * allow extra shape inputs in reshape layer
    
    * use `std::move_backward` instead of `std::move` for insert in resizable_static_array
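
    For context, a generic illustration (not the resizable_static_array code itself): shifting elements towards the back of the same buffer to open a gap means the destination overlaps the source from the right, so the move must run back-to-front; std::move would overwrite elements before reading them, while std::move_backward handles the overlap correctly.

        #include <algorithm>
        #include <array>
        #include <cstddef>

        // Shift elements [pos, size) one slot to the right inside the same buffer
        // to open a gap at 'pos' (requires size < buf.size()). The source and
        // destination ranges overlap, so the move proceeds from the back.
        void open_gap(std::array<int, 8>& buf, std::size_t size, std::size_t pos)
        {
            std::move_backward(buf.begin() + pos, buf.begin() + size,
                               buf.begin() + size + 1);
        }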
    
    * improve workspace management
    
    * add spatial LRN
    
    * add nms (cpu) to region layer
    
    * add max pooling with argmax (and a fix to limits.hpp)
    
    * add max unpooling layer
    
    * rename DNN_TARGET_CUDA_FP32 to DNN_TARGET_CUDA
    
    * update supportBackend to be more rigorous
    
    * remove stray include that was preventing non-CUDA build
    
    * include op_cuda.hpp outside the #if condition
    
    * refactoring, fixes and many optimizations
    
    * drop DNN_TARGET_CUDA_FP64
    
    * fix gcc errors
    
    * increase max. tensor rank limit to six
    
    * add Interp layer
    
    * drop custom layers; use BackendNode
    
    * vectorize activation kernels
    
    * fixes for gcc
    
    * remove wrong assertion
    
    * fix broken assertion in unpooling primitive
    
    * fix build errors in non-CUDA build
    
    * completely remove workspace from public API
    
    * fix permute layer
    
    * enable accuracy and perf. tests for DNN_TARGET_CUDA
    
    * add asynchronous forward
    
    * vectorize eltwise ops
    
    * vectorize fill kernel
    
    * fixes for gcc
    
    * remove CSL headers from public API
    
    * remove csl header source group from cmake
    
    * update min. cudnn version in cmake
    
    * add numerically stable FP32 log1pexp
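
    For reference, the usual numerically stable formulation (a generic sketch, not necessarily the exact kernel added here) avoids overflowing exp(x) for large x:

        #include <cmath>

        // log1pexp(x) = log(1 + exp(x)) computed without overflow for large x,
        // rewritten as max(x, 0) + log1p(exp(-|x|)).
        float log1pexp(float x)
        {
            return std::fmax(x, 0.0f) + std::log1p(std::exp(-std::fabs(x)));
        }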
    
    * refactor code
    
    * add FP16 specialization to cudnn based tensor addition
    
    * vectorize scale1 and bias1 + minor refactoring
    
    * fix doxygen build
    
    * fix invalid alignment assertion
    
    * clear backend wrappers before allocateLayers
    
    * ignore memory lock failures
    
    * do not allocate internal blobs
    
    * integrate NVTX
    
    * add numerically stable half precision log1pexp
    
    * fix indentation, follow coding style, improve docs
    
    * remove accidental modification of IE code
    
    * Revert "add asynchronous forward"
    
    This reverts commit 1154b9da9da07e9b52f8a81bdcea48cf31c56f70.
    
    * [cmake] throw error for unsupported CC versions
    
    * fix rebase issues
    
    * add more docs, refactor code, fix bugs
    
    * minor refactoring and fixes
    
    * resolve warnings/errors from clang
    
    * remove haveCUDA() checks from supportBackend()
    
    * remove NVTX integration
    
    * changes based on review comments
    
    * avoid exception when no CUDA device is present
    
    * add color code for CUDA in Net::dump
max_unpooling_layer.cpp
// This file is part of OpenCV project.
// It is subject to the license terms in the LICENSE file found in the top-level directory
// of this distribution and at http://opencv.org/license.html.

// Copyright (C) 2016, Intel Corporation, all rights reserved.
// Third party copyrights are property of their respective owners.

/*
Implementation of the Max Unpooling layer.
*/

#include "../precomp.hpp"
#include "layers_common.hpp"
#include "../op_cuda.hpp"
#include "../op_halide.hpp"
#include <opencv2/dnn/shape_utils.hpp>

#ifdef HAVE_CUDA
#include "../cuda4dnn/primitives/max_unpooling.hpp"
using namespace cv::dnn::cuda4dnn;
#endif

namespace cv
{
namespace dnn
{

class MaxUnpoolLayerImpl CV_FINAL : public MaxUnpoolLayer
{
public:
    MaxUnpoolLayerImpl(const LayerParams& params)
    {
        setParamsFrom(params);
        poolKernel = Size(params.get<int>("pool_k_w"), params.get<int>("pool_k_h"));
        poolPad = Size(params.get<int>("pool_pad_w"), params.get<int>("pool_pad_h"));
        poolStride = Size(params.get<int>("pool_stride_w"), params.get<int>("pool_stride_h"));
    }

    virtual bool supportBackend(int backendId) CV_OVERRIDE
    {
        return backendId == DNN_BACKEND_OPENCV ||
               backendId == DNN_BACKEND_CUDA ||
               (backendId == DNN_BACKEND_HALIDE && haveHalide() && !poolPad.width && !poolPad.height);
    }

    bool getMemoryShapes(const std::vector<MatShape> &inputs,
                         const int requiredOutputs,
                         std::vector<MatShape> &outputs,
                         std::vector<MatShape> &internals) const CV_OVERRIDE
    {
        CV_Assert(inputs.size() == 2 || inputs.size() == 3);
        CV_Assert(total(inputs[0]) == total(inputs[1]));

        MatShape outShape;
        if (inputs.size() == 2)
        {
            outShape = inputs[0];
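            // Unpooling inverts the pooling shape formula:
            //   out = (in - 1) * stride + kernel - 2 * pad
            // e.g. a 2x2 kernel with stride 2 and no padding maps a 56x56 input back to 112x112.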
            outShape[2] = (outShape[2] - 1) * poolStride.height + poolKernel.height - 2 * poolPad.height;
            outShape[3] = (outShape[3] - 1) * poolStride.width + poolKernel.width - 2 * poolPad.width;
        }
        else
            outShape = inputs[2];

        outputs.clear();
        outputs.push_back(outShape);

        return false;
    }

    void forward(InputArrayOfArrays inputs_arr, OutputArrayOfArrays outputs_arr, OutputArrayOfArrays internals_arr) CV_OVERRIDE
    {
        CV_TRACE_FUNCTION();
        CV_TRACE_ARG_VALUE(name, "name", name.c_str());

        if (inputs_arr.depth() == CV_16S)
        {
            forward_fallback(inputs_arr, outputs_arr, internals_arr);
            return;
        }

        std::vector<Mat> inputs, outputs;
        inputs_arr.getMatVector(inputs);
        outputs_arr.getMatVector(outputs);

        CV_Assert(inputs.size() == 2 || inputs.size() == 3);
        Mat& input = inputs[0];
        Mat& indices = inputs[1];

        CV_Assert(input.total() == indices.total());
        CV_Assert(input.size[0] == 1);
        CV_Assert(input.isContinuous());

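        // Scatter each input value into its output plane at the position stored
        // in the pooling layer's argmax indices; all other output elements stay
        // at the zero written by setTo(0).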
        for(int i_n = 0; i_n < outputs.size(); i_n++)
        {
            Mat& outBlob = outputs[i_n];
            outBlob.setTo(0);
            CV_Assert(input.size[1] == outBlob.size[1]);
            int outPlaneTotal = outBlob.size[2]*outBlob.size[3];

            for (int i_c = 0; i_c < input.size[1]; i_c++)
            {
                Mat outPlane = getPlane(outBlob, 0, i_c);
                int wh_area = input.size[2]*input.size[3];
                const float* inptr = input.ptr<float>(0, i_c);
                const float* idxptr = indices.ptr<float>(0, i_c);
                float* outptr = outPlane.ptr<float>();

                for(int i_wh = 0; i_wh < wh_area; i_wh++)
                {
                    int index = idxptr[i_wh];
                    if (!(0 <= index && index < outPlaneTotal))
                    {
                        std::cerr
                            << "i_n=" << i_n << std::endl
                            << "i_c=" << i_c << std::endl
                            << "i_wh=" << i_wh << std::endl
                            << "index=" << index << std::endl
                            << "maxval=" << inptr[i_wh] << std::endl
                            << "outPlaneTotal=" << outPlaneTotal << std::endl
                            << "input.size=" << input.size << std::endl
                            << "indices.size=" << indices.size << std::endl
                            << "outBlob=" << outBlob.size << std::endl
                            ;
                        CV_Assert(0 <= index && index < outPlaneTotal);
                    }
                    outptr[index] = inptr[i_wh];
                }
            }
        }
    }

#ifdef HAVE_CUDA
    Ptr<BackendNode> initCUDA(
        void *context_,
        const std::vector<Ptr<BackendWrapper>>& inputs,
        const std::vector<Ptr<BackendWrapper>>& outputs
    ) override
    {
        auto context = reinterpret_cast<csl::CSLContext*>(context_);

        cuda4dnn::MaxUnpoolingConfiguration config;
        auto& window_size = config.window_size;
        window_size.resize(2);
        window_size[0] = poolKernel.height;
        window_size[1] = poolKernel.width;

        auto& strides = config.strides;
        strides.resize(2);
        strides[0] = poolStride.height;
        strides[1] = poolStride.width;

        auto& pads_begin = config.pads_begin;
        pads_begin.resize(2);
        pads_begin[0] = poolPad.height;
        pads_begin[1] = poolPad.width;

        return make_cuda_node<cuda4dnn::MaxUnpoolingOp>(preferableTarget, std::move(context->stream), config);
    }
#endif

    virtual Ptr<BackendNode> initHalide(const std::vector<Ptr<BackendWrapper> > &input) CV_OVERRIDE
    {
#ifdef HAVE_HALIDE
        // The operation is only meaningful when kernel == stride: with
        // kernel > stride the result is not deterministic, and with
        // kernel < stride part of the input data is simply skipped
        // (you'd better change your model).
        if (poolKernel.width != poolStride.width ||
            poolKernel.height != poolStride.height)
            CV_Error(cv::Error::StsNotImplemented,
                     "Halide backend for maximum unpooling "
                     "does not support cases when kernel != stride");

        Halide::Var x("x"), y("y"), c("c"), n("n");
        Halide::Func top = (name.empty() ? Halide::Func() : Halide::Func(name));
        Halide::Buffer<float> inputBuffer = halideBuffer(input[0]);
        Halide::Buffer<float> indices = halideBuffer(input[1]);

        Halide::Expr pooledX = x / poolKernel.width;
        Halide::Expr pooledY = y / poolKernel.height;

        const int outW = inputBuffer.width() * poolKernel.width;
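        // A pixel of the unpooled output keeps the pooled value only when its
        // flattened position (y * outW + x) matches the argmax index recorded
        // during pooling; every other pixel is set to zero.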
        top(x, y, c, n) = select(y * outW + x == indices(pooledX, pooledY, c, n),
                                 inputBuffer(pooledX, pooledY, c, n), 0.0f);
        return Ptr<BackendNode>(new HalideBackendNode(top));
#endif  // HAVE_HALIDE
        return Ptr<BackendNode>();
    }
};

Ptr<MaxUnpoolLayer> MaxUnpoolLayer::create(const LayerParams& params)
{
    return Ptr<MaxUnpoolLayer>(new MaxUnpoolLayerImpl(params));
}

}
}