Commit d051f5fa authored by Chris Sullivan's avatar Chris Sullivan Committed by Robert Kimball

[CS:GPU::Part 1] Add GPUShape type, conversion operators, and generalized shape helpers. (#1031)

* Added GPUShape and reworked Shape helpers to be
compatible with different shape types.
Shape is now implicitly convertable to GPUShape.

* Updated shape helpers signature and add conversion operators/constructors for GPUShape.

* Adjust row_major_strides to avoid reversed-copy.

* Moved declaration out of loop for clang.

* Moved gpu_shape to gpu transformer.

* Removed no longer necessary headers.

* Added stdexcept header to gpu_shape.hpp

* Changed check on 64bit shape to check if high bits are set.

* Added spacing between functions in GPUShape and boolean operators in shape.hpp.

* Template parameters are UPPER_SNAKE_CASE.

* Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
This should be 64bits wide regardless of the underlying value_type of the ShapeType.

* [CS:GPU::Part 2] Add GPUMemoryManager, GPUAllocator, and memory primitives. (#1034)

This is a big PR which introduces the GPUMemoryManager, GPUAllocator, and the concept of memory primitives.

A memory primitive is a closure which yields the device memory address for a reserved memory space. When a memory reservation is made, the request is recorded along with the data that should be copied (for kernel arguments, but not for workspace memory). The reservation does not yield an address eagerly but instead does so lazily by returning an index which can be used to look up the memory_primitive at runtime. This allows the GPUMemoryManager to delay resolution of the memory address until all reservations have been made. 

Ideally, the temporary allocations needed by each kernel could be captured by the liveness lists in the GPU_External_Function. This way the pass::MemoryManager would capture these allocations along with the needed tensor allocations.

For now, rather than rearchitect the gpu_emitter and external function, we utilize the GPUMemoryManager, which maintains its own internal pass::MemoryManager, and the GPUAllocator. Liveness is handled by the GPUAllocator: all workspace allocation/reservations created in the same (or sub)scope as the GPUAllocator will persist until the GPUAllocator goes out of scope and deconstructs. At that time, the GPUAllocator will mark the requested temporary buffers as free, and their liveness will be removed (effectively). That way the next kernels that construct a GPUAllocator can reuse the workspace memory that was needed for previous kernels.

Additional notes:
* This PR updates the CUDAEmitter to exclusively utilize GPUShape instead of Shape.

   Commits:
   * Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusuable memory space for workspace allocations.

   * Added GPUShape and reworked Shape helpers to be
compatible with different shape types.

  * Removed several unecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape convertion operators.

   * Coordinate->GPUShape

   * Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.

  * Refactor allocations to make use of primitive emitter. Now memory primitives are registered at compile time and the gpu memory address is resolved at runtime by invoking the primitive.

   * Added const qualifier to data being copied in GPUAllocator::reserve_argspace

   * Removed more replace_slice diffs.

   * Added unit tests for GPUMemoryManager and added checks that ensure the
device memory is allocated prior to address resolution by the memory_primitives.
Also exposed the allocation size of the memory manager.

   * Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.

[CS:GPU::Part 3] Refactoring of several ops to use GPUMemoryManager (#1035)

This PR implements the new GPUMemoryManager and allocator for all the ops which were previously implemented but required allocations and copies for kernel arguments at runtime. 

Limitations:
The convolution workspaces could not be added because the relevant descriptors were not available at compile time due to the codegen. If convolution is later added to the CUDNN emitter, the GPUAllocator can be used to avoid workspace allocation at runtime.

   Commits:
   * Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.

   * Forgot to remove no longer necessary buffer freeing from op emitters.

[CS:GPU::Part4] Added op::ReplaceSlice and enabled respective tests. (#999)

This PR implements ReplaceSlice using the coordinate transformation strategy. A thread for each tensor element of the input tensor is chosen and it's position in the source tensor coordinate system is calculated. If it is within the source slice, the source is loaded and written out, otherwise the input tensor is loaded. 

* Relevant tests are enabled.

* This op was refactored to utilize the new GPUAllocator and memory manager.

   Commits: 

   * Updated replace_slice op to utilize GPUShape and GPUMemoryManager.

   * Added back missing changes after timeline resolution.

* Fixed clang warnings and bug. The cudnn_handle was not initialized ahead of emission time and so any eager cudnn calls would fail.
To fix this, the cudnn and cublas handle creation was moved to the external function constructor.

* Changed row_major_strides to always return vector<size_t> to avoid overflow for tensors with many dimensions. Handle the conversion to 32 bits for GPU shapes with an explicit conversion constructor from vector<size_t>.

* During merge the allocation line from external_function was left out. Adding it back.
parent 61b2e93a
......@@ -134,7 +134,6 @@ set (SRC
runtime/interpreter/int_backend.cpp
runtime/tensor_view.cpp
serializer.cpp
shape.cpp
type/element_type.cpp
type/type.cpp
util.cpp
......@@ -295,6 +294,7 @@ endif()
runtime/gpu/cudnn_emitter.cpp
runtime/gpu/cuda_emitter.cpp
runtime/gpu/type_info.cpp
runtime/gpu/gpu_memory_manager.cpp
)
set_property(SOURCE codegen/compiler.cpp APPEND_STRING PROPERTY COMPILE_DEFINITIONS
"CUDA_HEADER_PATHS=\"${CUDA_INCLUDE_DIRS}\";")
......
......@@ -22,6 +22,7 @@
#include "ngraph/codegen/code_writer.hpp"
#include "ngraph/runtime/gpu/cuda_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_cuda_kernel_builder.hpp"
#include "ngraph/runtime/gpu/gpu_invoke.hpp"
#include "ngraph/runtime/gpu/gpu_primitive_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_runtime_context.hpp"
#include "ngraph/runtime/gpu/gpu_util.hpp"
......@@ -69,11 +70,11 @@ runtime::gpu::CUDAEmitter::CUDAEmitter(runtime::gpu::GPUPrimitiveEmitter* emitte
size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
const Shape& padding_below,
const Shape& padding_above,
const Shape& padding_interior,
GPUShape input_shape,
GPUShape output_shape,
GPUShape padding_below,
GPUShape padding_above,
GPUShape padding_interior,
const std::string& pad_value)
{
// Need to check: are there models in which some tensors will have different types? if so, this
......@@ -104,9 +105,9 @@ size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContex
if (compiled_kernel == nullptr)
{
// normalize pad dimensions to shape dimensions
Shape pad_below(input_shape.size(), 0);
Shape pad_above(input_shape.size(), 0);
Shape pad_interior(input_shape.size(), 0);
GPUShape pad_below(input_shape.size(), 0);
GPUShape pad_above(input_shape.size(), 0);
GPUShape pad_interior(input_shape.size(), 0);
// if padding_interior is not zero length, it
// is from op::Pad for which padding_below will
......@@ -126,8 +127,8 @@ size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContex
pad_interior = padding_interior;
}
auto input_strides = row_major_strides(input_shape);
auto output_strides = row_major_strides(output_shape);
GPUShape input_strides = row_major_strides(input_shape);
GPUShape output_strides = row_major_strides(output_shape);
int offset = 0;
for (size_t i = 0; i < output_strides.size(); i++)
......@@ -193,7 +194,7 @@ size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContex
pad.reset(new gpu::primitive{[=](void** inputs, void** outputs) {
void* args_list[] = {&inputs[1], &inputs[0], &outputs[0]};
CUDA_SAFE_CALL(cuLaunchKernel(*compiled_kernel.get(),
static_cast<unsigned int>(nthreads),
static_cast<uint32_t>(nthreads),
1,
1, // grid dim
1,
......@@ -211,7 +212,7 @@ size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContex
pad.reset(new gpu::primitive{[=](void** inputs, void** outputs) {
void* args_list[] = {&inputs[0], &outputs[0]};
CUDA_SAFE_CALL(cuLaunchKernel(*compiled_kernel.get(),
static_cast<unsigned int>(nthreads),
static_cast<uint32_t>(nthreads),
1,
1, // grid dim
1,
......@@ -232,8 +233,8 @@ size_t runtime::gpu::CUDAEmitter::build_pad(const runtime::gpu::GPURuntimeContex
size_t runtime::gpu::CUDAEmitter::build_1d_max_pool(const GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
GPUShape input_shape,
GPUShape output_shape,
size_t window_width,
size_t window_stride)
{
......@@ -253,7 +254,7 @@ size_t runtime::gpu::CUDAEmitter::build_1d_max_pool(const GPURuntimeContext* ctx
return primitive_index;
}
auto nthreads = shape_size(output_shape);
size_t nthreads = shape_size(output_shape);
// if the kernel has not been compiled, build it
auto compiled_kernel = ctx->compiled_kernel_pool->get(hash);
......@@ -297,7 +298,7 @@ size_t runtime::gpu::CUDAEmitter::build_1d_max_pool(const GPURuntimeContext* ctx
std::unique_ptr<gpu::primitive> pool(new gpu::primitive{[=](void** inputs, void** outputs) {
void* args_list[] = {&inputs[0], &outputs[0]};
CUDA_SAFE_CALL(cuLaunchKernel(*compiled_kernel.get(),
static_cast<unsigned int>(nthreads),
static_cast<uint32_t>(nthreads),
1,
1, // grid dim
1,
......@@ -315,67 +316,67 @@ size_t runtime::gpu::CUDAEmitter::build_1d_max_pool(const GPURuntimeContext* ctx
return primitive_index;
}
pooling_op_shape avgpool_shape(
const Shape& in, const Shape& out, const Shape& window, const Shape& strides, const Shape& pad)
pooling_op_shape
avgpool_shape(GPUShape in, GPUShape out, GPUShape window, GPUShape strides, GPUShape pad)
{
pooling_op_shape shape;
shape.N = static_cast<int>(in[0]);
shape.C = static_cast<int>(in[1]);
shape.N = in[0];
shape.C = in[1];
shape.K = shape.C; // pooling feature maps is
shape.J = shape.C; // not currently supported
if (in.size() == 3)
{
shape.D = 1;
shape.H = 1;
shape.W = static_cast<int>(in[2]);
shape.W = in[2];
shape.M = 1;
shape.P = 1;
shape.Q = static_cast<int>(out[2]);
shape.Q = out[2];
shape.T = 1;
shape.R = 1;
shape.S = static_cast<int>(window[0]);
shape.S = window[0];
shape.STRIDE_D = 0;
shape.STRIDE_H = 0;
shape.STRIDE_W = static_cast<int>(strides[0]);
shape.STRIDE_W = strides[0];
shape.PAD_D = 0;
shape.PAD_H = 0;
shape.PAD_W = static_cast<int>(pad[0]);
shape.PAD_W = pad[0];
}
else if (in.size() == 4)
{
shape.D = 1;
shape.H = static_cast<int>(in[2]);
shape.W = static_cast<int>(in[3]);
shape.H = in[2];
shape.W = in[3];
shape.M = 1;
shape.P = static_cast<int>(out[2]);
shape.Q = static_cast<int>(out[3]);
shape.P = out[2];
shape.Q = out[3];
shape.T = 1;
shape.R = static_cast<int>(window[0]);
shape.S = static_cast<int>(window[1]);
shape.R = window[0];
shape.S = window[1];
shape.STRIDE_D = 0;
shape.STRIDE_H = static_cast<int>(strides[0]);
shape.STRIDE_W = static_cast<int>(strides[1]);
shape.STRIDE_H = strides[0];
shape.STRIDE_W = strides[1];
shape.PAD_D = 0;
shape.PAD_H = static_cast<int>(pad[0]);
shape.PAD_W = static_cast<int>(pad[1]);
shape.PAD_H = pad[0];
shape.PAD_W = pad[1];
}
else if (in.size() == 5)
{
shape.D = static_cast<int>(in[2]);
shape.H = static_cast<int>(in[3]);
shape.W = static_cast<int>(in[4]);
shape.M = static_cast<int>(out[2]);
shape.P = static_cast<int>(out[3]);
shape.Q = static_cast<int>(out[4]);
shape.T = static_cast<int>(window[0]);
shape.R = static_cast<int>(window[1]);
shape.S = static_cast<int>(window[2]);
shape.STRIDE_D = static_cast<int>(strides[0]);
shape.STRIDE_H = static_cast<int>(strides[1]);
shape.STRIDE_W = static_cast<int>(strides[2]);
shape.PAD_D = static_cast<int>(pad[0]);
shape.PAD_H = static_cast<int>(pad[1]);
shape.PAD_W = static_cast<int>(pad[2]);
shape.D = in[2];
shape.H = in[3];
shape.W = in[4];
shape.M = out[2];
shape.P = out[3];
shape.Q = out[4];
shape.T = window[0];
shape.R = window[1];
shape.S = window[2];
shape.STRIDE_D = strides[0];
shape.STRIDE_H = strides[1];
shape.STRIDE_W = strides[2];
shape.PAD_D = pad[0];
shape.PAD_H = pad[1];
shape.PAD_W = pad[2];
}
else
{
......@@ -386,11 +387,11 @@ pooling_op_shape avgpool_shape(
size_t runtime::gpu::CUDAEmitter::build_avg_pool(const GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
const Shape& window_shape,
const Shape& window_stride,
const Shape& padding_below,
GPUShape input_shape,
GPUShape output_shape,
GPUShape window_shape,
GPUShape window_stride,
GPUShape padding_below,
bool include_pad)
{
// assumes NCDHW format
......@@ -441,9 +442,11 @@ size_t runtime::gpu::CUDAEmitter::build_avg_pool(const GPURuntimeContext* ctx,
writer << "const int q = blockIdx.x;\n";
writer << "const int mp = blockIdx.y;\n";
writer << "const int nk = blockIdx.z;\n";
writer << "const int k = div64(nk, magic_N, shift_N);\n";
writer << "const int k = division_by_invariant_multiplication(nk, magic_N, "
"shift_N);\n";
writer << "const int n = nk - k * N;\n";
writer << "const int m = div64(mp, magic_P, shift_P);\n";
writer << "const int m = division_by_invariant_multiplication(mp, magic_P, "
"shift_P);\n";
writer << "const int p = mp - m * P;\n";
writer << "out += n*KMPQ + k*MPQ + m*PQ + mad16(p, Q, q);\n";
......@@ -463,9 +466,11 @@ size_t runtime::gpu::CUDAEmitter::build_avg_pool(const GPURuntimeContext* ctx,
writer << "for (int trs = tid; trs < TRS; trs += 32)\n";
writer.block_begin();
{
writer << "int t = div64(trs, magic_RS, shift_RS);\n";
writer << "int t = division_by_invariant_multiplication(trs, magic_RS, "
"shift_RS);\n";
writer << "int rs = mod16(trs, t, RS);\n";
writer << "int r = div64(rs, magic_S, shift_S);\n";
writer
<< "int r = division_by_invariant_multiplication(rs, magic_S, shift_S);\n";
writer << "int s = mod16(rs, r, S);\n";
// coordinate transformation from TRS to DHW
......@@ -607,7 +612,7 @@ size_t runtime::gpu::CUDAEmitter::build_avg_pool(const GPURuntimeContext* ctx,
size_t runtime::gpu::CUDAEmitter::build_elementwise_n_to_1(const GPURuntimeContext* ctx,
const std::vector<std::string>& dtypes,
const Shape& tensor_shape,
GPUShape tensor_shape,
const char* op,
const char* kernel)
{
......@@ -659,7 +664,7 @@ size_t runtime::gpu::CUDAEmitter::build_elementwise_n_to_1(const GPURuntimeConte
args_list.push_back(&outputs[0]);
args_list.push_back(&nthreads);
CUDA_SAFE_CALL(cuLaunchKernel(*compiled_kernel.get(),
static_cast<unsigned int>(nthreads),
static_cast<uint32_t>(nthreads),
1,
1, // grid dim
1,
......@@ -677,9 +682,148 @@ size_t runtime::gpu::CUDAEmitter::build_elementwise_n_to_1(const GPURuntimeConte
return primitive_index;
}
size_t runtime::gpu::CUDAEmitter::build_replace_slice(const GPURuntimeContext* ctx,
const std::array<std::string, 3>& dtypes,
GPUShape tensor_shape,
GPUShape source_shape,
GPUShape lower_bounds,
GPUShape upper_bounds,
GPUShape slice_strides)
{
// assumes NC{d1,...,dn} format
std::string kernel_name = "repslices_" + join(dtypes, "_");
std::replace(kernel_name.begin(), kernel_name.end(), ' ', '_');
std::stringstream ss;
ss << kernel_name << "_s" << join(tensor_shape, "_") << "_ssrc" << join(source_shape, "_")
<< "_sll" << join(lower_bounds, "_") << "_slu" << join(upper_bounds, "_") << "_slst"
<< join(slice_strides, "_");
auto hash = ss.str();
// check if the requested kernel is already an inserted primitive
size_t primitive_index = m_primitive_emitter->lookup(hash);
if (primitive_index != std::numeric_limits<size_t>::max())
{
return primitive_index;
}
constexpr const int nthreads_per_block = 32;
// if the kernel has not been compiled, build it
auto compiled_kernel = ctx->compiled_kernel_pool->get(kernel_name);
if (compiled_kernel == nullptr)
{
codegen::CodeWriter writer;
writer << include_helpers();
runtime::gpu::CudaKernelBuilder::get_replace_slice_op(
writer, kernel_name, dtypes, nthreads_per_block);
compiled_kernel = ctx->compiled_kernel_pool->set(kernel_name, writer.get_code());
}
// calculate strides
GPUShape input_strides = row_major_strides(tensor_shape);
GPUShape source_strides = row_major_strides(source_shape);
// precacluate invariants for integer division via multiplication
std::vector<int> dmagics;
std::vector<int> dshifts;
std::vector<int> smagics;
std::vector<int> sshifts;
for (int i = 0; i < tensor_shape.size(); i++)
{
int magic;
int shift;
std::tie(magic, shift) = idiv_magic_u64(input_strides[i]);
dmagics.push_back(magic);
dshifts.push_back(shift);
std::tie(magic, shift) = idiv_magic_u64(slice_strides[i]);
smagics.push_back(magic);
sshifts.push_back(shift);
}
// get an allocator for transient per kernel gpu memory
GPUAllocator allocator = this->m_primitive_emitter->get_memory_allocator();
// TODO factor into range based for loop of arguments
// (lazy) allocation for kernel arguments
size_t idx_input_strides =
allocator.reserve_argspace(input_strides.data(), (input_strides.size() - 1) * sizeof(int));
size_t idx_dmagics = allocator.reserve_argspace(dmagics.data(), dmagics.size() * sizeof(int));
size_t idx_dshifts = allocator.reserve_argspace(dshifts.data(), dshifts.size() * sizeof(int));
size_t idx_lower_bounds =
allocator.reserve_argspace(lower_bounds.data(), lower_bounds.size() * sizeof(int));
size_t idx_upper_bounds =
allocator.reserve_argspace(upper_bounds.data(), upper_bounds.size() * sizeof(int));
size_t idx_slice_strides =
allocator.reserve_argspace(slice_strides.data(), slice_strides.size() * sizeof(int));
size_t idx_smagics = allocator.reserve_argspace(smagics.data(), smagics.size() * sizeof(int));
size_t idx_sshifts = allocator.reserve_argspace(sshifts.data(), sshifts.size() * sizeof(int));
size_t idx_source_shape =
allocator.reserve_argspace(source_shape.data(), source_shape.size() * sizeof(int));
size_t idx_source_strides =
allocator.reserve_argspace(source_strides.data(), source_strides.size() * sizeof(int));
int rank = static_cast<int>(tensor_shape.size());
size_t nthreads = shape_size(tensor_shape);
int nblocks = 1 + ((static_cast<int>(nthreads) - 1) / nthreads_per_block); // ceil_div(nthreads)
// TODO: blending factors are not currently implemented
float alpha = 1.0f;
float beta = 0.0f;
std::unique_ptr<gpu::primitive> replace_slice(
new gpu::primitive{[=](void** inputs, void** outputs) mutable {
void* param_dstr = runtime::gpu::invoke_memory_primitive(ctx, idx_input_strides);
void* param_dmagic = runtime::gpu::invoke_memory_primitive(ctx, idx_dmagics);
void* param_dshift = runtime::gpu::invoke_memory_primitive(ctx, idx_dshifts);
void* param_lbound = runtime::gpu::invoke_memory_primitive(ctx, idx_lower_bounds);
void* param_ubound = runtime::gpu::invoke_memory_primitive(ctx, idx_upper_bounds);
void* param_slice_str = runtime::gpu::invoke_memory_primitive(ctx, idx_slice_strides);
void* param_slice_magic = runtime::gpu::invoke_memory_primitive(ctx, idx_smagics);
void* param_slice_shift = runtime::gpu::invoke_memory_primitive(ctx, idx_sshifts);
void* param_dsource = runtime::gpu::invoke_memory_primitive(ctx, idx_source_shape);
void* param_sourcestr = runtime::gpu::invoke_memory_primitive(ctx, idx_source_strides);
void* args_list[] = {&inputs[0],
&inputs[1],
&outputs[0],
&alpha,
&beta,
&param_dstr,
&param_dmagic,
&param_dshift,
&param_lbound,
&param_ubound,
&param_slice_str,
&param_slice_magic,
&param_slice_shift,
&param_dsource,
&param_sourcestr,
&rank,
&nthreads};
CUDA_SAFE_CALL(cuLaunchKernel(*compiled_kernel.get(),
nblocks,
1,
1,
nthreads_per_block,
1,
1,
rank * nthreads_per_block * sizeof(int),
NULL,
args_list,
0));
CUDA_SAFE_CALL(cuCtxSynchronize());
}});
primitive_index = this->m_primitive_emitter->insert(std::move(replace_slice));
m_primitive_emitter->cache(hash, primitive_index);
return primitive_index;
}
void runtime::gpu::CUDAEmitter::print_tensor_from_gpu(codegen::CodeWriter& writer,
const std::string& tensor_name,
const Shape& shape)
GPUShape shape)
{
auto strides = row_major_strides(shape);
writer << "__syncthreads();\n";
......@@ -723,12 +867,13 @@ std::string runtime::gpu::CUDAEmitter::include_helpers()
)";
#endif
// div64: fast integer division via magic multiplication and shifting
// division_by_invariant_multiplication:
// fast integer division via invariant multiplication and shifting
// if value is a power of 2, magic will be 1 and only shifting
// is required (predicate p in div64)
// is required (predicate p below)
// load: helper to load from constant memory for fast access
ss << R"(
__device__ __forceinline__ int div64(int value, int magic, int shift)
__device__ __forceinline__ int division_by_invariant_multiplication(int value, int magic, int shift)
{
int result;
asm("{\n\t"
......
......@@ -19,10 +19,11 @@
#include <array>
#include "ngraph/codegen/code_writer.hpp"
#include "ngraph/runtime/gpu/gpu_cuda_kernel_ops.hpp"
#include "ngraph/runtime/gpu/gpu_shape.hpp"
namespace ngraph
{
class Shape;
class GPUShape;
namespace runtime
{
......@@ -38,47 +39,55 @@ namespace ngraph
public:
size_t build_pad(const GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
const Shape& pad_below,
const Shape& pad_above,
const Shape& pad_interior,
GPUShape input_shape,
GPUShape output_shape,
GPUShape pad_below,
GPUShape pad_above,
GPUShape pad_interior,
const std::string& pad_value = "");
size_t build_1d_max_pool(const GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
GPUShape input_shape,
GPUShape output_shape,
size_t window_width,
size_t window_stride);
size_t build_avg_pool(const GPURuntimeContext* ctx,
const std::array<std::string, 2>& dtypes,
const Shape& input_shape,
const Shape& output_shape,
const Shape& window_shape,
const Shape& window_stride,
const Shape& padding_below,
GPUShape input_shape,
GPUShape output_shape,
GPUShape window_shape,
GPUShape window_stride,
GPUShape padding_below,
bool include_pad = false);
template <typename T>
size_t build_elementwise(const GPURuntimeContext* ctx,
const std::vector<std::string>& dtypes,
const Shape& tensor_shape)
GPUShape tensor_shape)
{
return build_elementwise_n_to_1(
ctx, dtypes, tensor_shape, CudaOpMap<T>::op, CudaOpMap<T>::math_kernel);
}
size_t build_replace_slice(const GPURuntimeContext* ctx,
const std::array<std::string, 3>& dtypes,
GPUShape tensor_shape,
GPUShape source_shape,
GPUShape lower_bounds,
GPUShape upper_bounds,
GPUShape slice_stride);
private:
CUDAEmitter(GPUPrimitiveEmitter* emitter);
void print_tensor_from_gpu(codegen::CodeWriter& writer,
const std::string& tensor_name,
const Shape& shape);
GPUShape shape);
std::string include_helpers();
size_t build_elementwise_n_to_1(const GPURuntimeContext* ctx,
const std::vector<std::string>& dtypes,
const Shape& tensor_shape,
GPUShape tensor_shape,
const char* op,
const char* kernel);
......
......@@ -19,6 +19,7 @@
#include <vector>
#include "ngraph/runtime/gpu/cudnn_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_invoke.hpp"
#include "ngraph/runtime/gpu/gpu_primitive_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_runtime_context.hpp"
#include "ngraph/runtime/gpu/gpu_util.hpp"
......@@ -137,6 +138,13 @@ size_t runtime::gpu::CUDNNEmitter::build_reduce_forward(const runtime::gpu::GPUR
}
auto& output_desc = tensor_descriptor_from_shape(output_shape);
// get an allocator for transient per kernel gpu memory
GPUAllocator allocator = this->m_primitive_emitter->get_memory_allocator();
size_t workspace_size = 0;
CUDNN_SAFE_CALL(cudnnGetReductionWorkspaceSize(
*ctx->cudnn_handle, desc, input_desc, output_desc, &workspace_size));
size_t workspace_idx = allocator.reserve_workspace(workspace_size);
// emit reduce operation
std::unique_ptr<gpu::primitive> reduce(
new gpu::primitive{[=, &desc, &input_desc, &output_desc](void** inputs, void** outputs) {
......@@ -146,10 +154,9 @@ size_t runtime::gpu::CUDNNEmitter::build_reduce_forward(const runtime::gpu::GPUR
CUDNN_NOT_PROPAGATE_NAN,
CUDNN_REDUCE_TENSOR_NO_INDICES,
CUDNN_32BIT_INDICES));
size_t workspace_size = 0;
CUDNN_SAFE_CALL(cudnnGetReductionWorkspaceSize(
*ctx->cudnn_handle, desc, input_desc, output_desc, &workspace_size));
auto workspace_ptr = create_gpu_buffer(workspace_size);
void* workspace_ptr = runtime::gpu::invoke_memory_primitive(ctx, workspace_idx);
float alpha = 1.0, beta = 0.0;
CUDNN_SAFE_CALL(cudnnReduceTensor(*ctx->cudnn_handle,
desc,
......@@ -163,7 +170,6 @@ size_t runtime::gpu::CUDNNEmitter::build_reduce_forward(const runtime::gpu::GPUR
&beta,
output_desc,
outputs[0]));
free_gpu_buffer(workspace_ptr);
}});
primitive_index = this->m_primitive_emitter->insert(std::move(reduce));
......
......@@ -37,8 +37,6 @@ runtime::gpu::GPU_CallFrame::GPU_CallFrame(std::shared_ptr<GPU_ExternalFunction>
runtime::gpu::GPU_CallFrame::~GPU_CallFrame()
{
cublasDestroy(m_cublas_handle);
cudnnDestroy(m_cudnn_handle);
cleanup_runtime_context();
}
......@@ -68,26 +66,11 @@ void runtime::gpu::GPU_CallFrame::call(
void runtime::gpu::GPU_CallFrame::setup_runtime_context()
{
cublasStatus_t cublasStatus = cublasCreate(&m_cublas_handle);
if (cublasStatus != CUBLAS_STATUS_SUCCESS)
{
throw runtime_error("cuBLAS create handle failed");
}
cudnnStatus_t cudnnStatus = cudnnCreate(&m_cudnn_handle);
if (cudnnStatus != CUDNN_STATUS_SUCCESS)
{
throw runtime_error("cuDnn create handle failed");
}
// Pass scalars as reference on the Device
cublasSetPointerMode(m_cublas_handle, CUBLAS_POINTER_MODE_DEVICE);
// add pointers to gpu primitives into the gpu runtime context
const auto& primitive_emitter = m_external_function->get_primitive_emitter();
m_external_function->m_ctx->gpu_primitives = primitive_emitter->get_primitives().data();
// register with c-api runtime context
m_external_function->m_ctx->cublas_handle = &m_cublas_handle;
m_external_function->m_ctx->cudnn_handle = &m_cudnn_handle;
m_external_function->m_ctx->gpu_memory_primitives =
primitive_emitter->get_memory_primitives().data();
}
void runtime::gpu::GPU_CallFrame::cleanup_runtime_context()
......
......@@ -61,8 +61,6 @@ namespace ngraph
protected:
std::shared_ptr<GPU_ExternalFunction> m_external_function;
EntryPoint m_compiled_function;
cublasHandle_t m_cublas_handle;
cudnnHandle_t m_cudnn_handle;
static bool init;
};
}
......
......@@ -242,6 +242,79 @@ void runtime::gpu::CudaKernelBuilder::get_reverse_op(codegen::CodeWriter& writer
writer.block_end();
}
void runtime::gpu::CudaKernelBuilder::get_replace_slice_op(
codegen::CodeWriter& writer,
const std::string& name,
const std::array<std::string, 3>& data_types,
int nthreads_per_block)
{
writer << "extern \"C\" __global__ void cuda_" << name << "(" << data_types[0] << "* in, "
<< data_types[1] << "* source, " << data_types[2] << "* out, "
<< "float alpha, float beta, "
<< "int* dim_strides, "
<< "int* dim_magic, "
<< "int* dim_shift, "
<< "int* lower_bounds, "
<< "int* upper_bounds, "
<< "int* slice_str, "
<< "int* slice_magic, "
<< "int* slice_shift, "
<< "int* dim_source, "
<< "int* src_strides, "
<< "int rank,"
<< "size_t nthreads"
<< ")\n";
writer.block_begin();
{
writer << "extern __shared__ int dimensions[];\n";
writer << "const int tid = blockDim.x*blockIdx.x + threadIdx.x;\n";
writer << "if (tid < nthreads)\n";
writer.block_begin();
{
writer << "int dim_product = tid;\n";
writer << "int data_idx = 0;\n";
writer << "for (int i = threadIdx.x; i < (rank - 1) * " << nthreads_per_block
<< "; i += " << nthreads_per_block << ")\n";
writer.block_begin();
{
writer << "dimensions[i] = division_by_invariant_multiplication(dim_product, "
"dim_magic[data_idx], "
"dim_shift[data_idx]);\n";
writer << "dim_product -= (dimensions[i] * dim_strides[data_idx]);\n";
writer << "data_idx++;\n";
}
writer.block_end();
writer << "dimensions[threadIdx.x + (rank-1) * " << nthreads_per_block
<< "] = dim_product;\n";
writer << "data_idx = 0;\n";
writer << "bool in_bounds = true;\n";
writer << "int source_idx = 0;\n";
writer << "for (int i = threadIdx.x; i < rank * " << nthreads_per_block
<< "; i += " << nthreads_per_block << ")\n";
writer.block_begin();
{
writer << "int source_di = division_by_invariant_multiplication(dimensions[i], "
"slice_magic[data_idx], "
"slice_shift[data_idx]);\n";
writer << "bool on_stride = (mod16(dimensions[i], source_di, "
"slice_str[data_idx]) == 0);\n";
// within slice of input tensor and a multiple of the slice stride
writer << "bool in_slice_di = (dimensions[i] >= lower_bounds[data_idx]) && "
"(dimensions[i] < upper_bounds[data_idx]) && on_stride;\n";
writer << "in_bounds = in_bounds && in_slice_di;\n";
// subtract off lower bound to convert to source index
writer << "source_di -= lower_bounds[data_idx];\n";
writer << "source_idx += source_di * src_strides[data_idx];\n";
writer << "data_idx++;\n";
}
writer.block_end();
writer << "out[tid] = in_bounds ? source[source_idx] : in[tid];\n";
}
writer.block_end();
}
writer.block_end();
}
void runtime::gpu::CudaKernelBuilder::get_device_helper(codegen::CodeWriter& writer,
const std::string& name,
const std::string& math_kernel,
......
......@@ -63,6 +63,11 @@ namespace ngraph
const std::string& name,
const std::array<std::string, 2>& data_types);
static void get_replace_slice_op(codegen::CodeWriter& writer,
const std::string& name,
const std::array<std::string, 3>& data_types,
int nthreads_per_block);
static void get_device_helper(codegen::CodeWriter& writer,
const std::string& name,
const std::string& math_kernel,
......
......@@ -806,12 +806,13 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
writer.block_begin(" // " + node->get_name());
writer << "int count = " << out[0].get_size() << ";\n";
writer << "int num_inputs = " << args.size() << ";\n";
writer << "std::vector<size_t> block_strides_h = {" << join(block_strides)
<< "};\n";
writer << "void* block_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * num_inputs);\n";
writer << "runtime::gpu::cuda_memcpyHtD(block_strides_d, block_strides_h.data(), "
"sizeof(size_t) * num_inputs);\n";
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
size_t idx_block_strides = allocator.reserve_argspace(
block_strides.data(), block_strides.size() * sizeof(size_t));
writer << "void* block_strides_d = runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_block_strides << ");\n";
writer << "ngraph::runtime::gpu::emit_concat_op(\"" << node->description() << "\""
<< ", std::vector<std::string>{";
......@@ -905,31 +906,20 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
{
trans_strides[input_order[i]] = output_strides[i];
}
writer << "size_t rank = " << arg_rank << ";\n";
writer << "std::vector<size_t> input_strides_h = {" << input_strides[0] << "UL";
for (int i = 1; i < arg_rank; i++)
{
writer << ", " << input_strides[i] << "UL";
}
writer << "};\n";
writer << "std::vector<size_t> trans_strides_h = {" << trans_strides[0] << "UL";
for (int i = 1; i < arg_rank; i++)
{
writer << ", " << trans_strides[i] << "UL";
}
writer << "};\n";
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
size_t idx_input_strides = allocator.reserve_argspace(
input_strides.data(), input_strides.size() * sizeof(size_t));
size_t idx_trans_strides = allocator.reserve_argspace(
trans_strides.data(), trans_strides.size() * sizeof(size_t));
writer << "void* input_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
"runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_input_strides << ");\n";
writer << "void* trans_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
writer
<< "runtime::gpu::cuda_memcpyHtD(input_strides_d, input_strides_h.data(), "
"sizeof(size_t) * rank);\n";
writer
<< "runtime::gpu::cuda_memcpyHtD(trans_strides_d, trans_strides_h.data(), "
"sizeof(size_t) * rank);\n";
"runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_trans_strides << ");\n";
writer << "runtime::gpu::emit_reshape(\"" << node->description() << "\", {\""
<< args[0].get_type() << "\", \"" << out[0].get_type() << "\"}"
<< ", ctx"
......@@ -938,8 +928,6 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
<< ", "
<< "CUdeviceptr(input_strides_d), CUdeviceptr(trans_strides_d)"
<< ", " << arg_rank << ", " << args[0].get_size() << ");\n";
writer << "runtime::gpu::free_gpu_buffer(input_strides_d);\n";
writer << "runtime::gpu::free_gpu_buffer(trans_strides_d);\n";
}
writer.block_end();
}
......@@ -968,35 +956,30 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
}
else
{
writer << "size_t rank = " << arg_rank << ";\n";
writer << "std::vector<size_t> input_strides_h = {"
<< join(input_strides, "UL,") << "UL};\n";
writer << "std::vector<size_t> output_strides_h = {"
<< join(output_strides, "UL,") << "UL};\n";
writer << "std::vector<size_t> lower_bounds_h = {" << join(lower_bounds, "UL,")
<< "UL};\n";
writer << "std::vector<size_t> slice_strides_h = {"
<< join(slice_strides, "UL,") << "UL};\n";
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
size_t idx_input_strides = allocator.reserve_argspace(
input_strides.data(), input_strides.size() * sizeof(size_t));
size_t idx_output_strides = allocator.reserve_argspace(
output_strides.data(), output_strides.size() * sizeof(size_t));
size_t idx_lower_bounds = allocator.reserve_argspace(
lower_bounds.data(), lower_bounds.size() * sizeof(size_t));
size_t idx_slice_strides = allocator.reserve_argspace(
slice_strides.data(), slice_strides.size() * sizeof(size_t));
writer << "size_t rank = " << arg_rank << ";\n";
writer << "void* input_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, " << idx_input_strides
<< ");\n";
writer << "void* output_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, " << idx_output_strides
<< ");\n";
writer << "void* slice_strides_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, " << idx_slice_strides
<< ");\n";
writer << "void* lower_bounds_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
writer
<< "runtime::gpu::cuda_memcpyHtD(input_strides_d, input_strides_h.data(), "
"sizeof(size_t) * rank);\n";
writer << "runtime::gpu::cuda_memcpyHtD(output_strides_d, "
"output_strides_h.data(), "
"sizeof(size_t) * rank);\n";
writer
<< "runtime::gpu::cuda_memcpyHtD(slice_strides_d, slice_strides_h.data(), "
"sizeof(size_t) * rank);\n";
writer << "runtime::gpu::cuda_memcpyHtD(lower_bounds_d, lower_bounds_h.data(), "
"sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, " << idx_lower_bounds
<< ");\n";
writer << "runtime::gpu::emit_slice(\"" << node->description()
<< "\", CUdeviceptr(" << args[0].get_name() << "), CUdeviceptr("
......@@ -1008,10 +991,6 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
<< "CUdeviceptr(input_strides_d), CUdeviceptr(lower_bounds_d), "
"CUdeviceptr(slice_strides_d), CUdeviceptr(output_strides_d)"
<< ", " << arg_rank << ", " << out[0].get_size() << ");\n";
writer << "runtime::gpu::free_gpu_buffer(input_strides_d);\n";
writer << "runtime::gpu::free_gpu_buffer(output_strides_d);\n";
writer << "runtime::gpu::free_gpu_buffer(slice_strides_d);\n";
writer << "runtime::gpu::free_gpu_buffer(lower_bounds_d);\n";
}
writer.block_end();
}
......@@ -1041,21 +1020,20 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
}
else
{
writer << "size_t rank = " << arg_rank << ";\n";
writer << "std::vector<size_t> input_shapes_h = {" << join(arg_shape, "UL,")
<< "UL};\n";
writer << "std::vector<size_t> reverse_axes_h = {"
<< join(reverse_axes_flag, "UL,") << "UL};\n";
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
size_t idx_arg_shape = allocator.reserve_argspace(
arg_shape.data(), arg_shape.size() * sizeof(size_t));
size_t idx_reverse_axes_flag = allocator.reserve_argspace(
reverse_axes_flag.data(), reverse_axes_flag.size() * sizeof(size_t));
writer << "size_t rank = " << arg_rank << ";\n";
writer << "void* input_shapes_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, " << idx_arg_shape
<< ");\n";
writer << "void* reverse_axes_d = "
"runtime::gpu::create_gpu_buffer(sizeof(size_t) * rank);\n";
writer << "runtime::gpu::cuda_memcpyHtD(input_shapes_d, input_shapes_h.data(), "
"sizeof(size_t) * rank);\n";
writer << "runtime::gpu::cuda_memcpyHtD(reverse_axes_d, "
"reverse_axes_h.data(), "
"sizeof(size_t) * rank);\n";
<< " runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_reverse_axes_flag << ");\n";
writer << "runtime::gpu::emit_reverse(\"" << node->description()
<< "\", CUdeviceptr(" << args[0].get_name() << "), CUdeviceptr("
......@@ -1066,8 +1044,6 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
<< "ctx, "
<< "CUdeviceptr(input_shapes_d), CUdeviceptr(reverse_axes_d), "
<< arg_rank << ", " << out[0].get_size() << ");\n";
writer << "runtime::gpu::free_gpu_buffer(input_shapes_d);\n";
writer << "runtime::gpu::free_gpu_buffer(reverse_axes_d);\n";
}
writer.block_end();
}
......@@ -1204,10 +1180,16 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
// one of args[] axes has zero size, zero output
if (args[0].get_size() == 0)
{
writer << "std::vector<float> temp(" << out[0].get_size()
<< ", -std::numeric_limits<float>::infinity());\n";
writer << "runtime::gpu::cuda_memcpyHtD(" << out[0].get_name()
<< ", (void*)temp.data(), " << out[0].get_size() << " * "
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
std::vector<float> negative_inf(
out[0].get_size(), -std::numeric_limits<float>::infinity());
size_t idx_float_inf = allocator.reserve_argspace(
negative_inf.data(), negative_inf.size() * sizeof(float));
writer << "void* temp_d = runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_float_inf << ");\n";
writer << "runtime::gpu::cuda_memcpyDtD(" << out[0].get_name()
<< ", temp_d, " << out[0].get_size() << " * "
<< out[0].get_element_type().size() << ");\n";
}
else if (args[0].get_shape().size() == out[0].get_shape().size())
......@@ -1246,10 +1228,16 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
// one of args[] axes has zero size, zero output
if (args[0].get_size() == 0)
{
writer << "std::vector<float> temp(" << out[0].get_size()
<< ", std::numeric_limits<float>::infinity());\n";
writer << "runtime::gpu::cuda_memcpyHtD(" << out[0].get_name()
<< ", (void*)temp.data(), " << out[0].get_size() << " * "
GPUAllocator allocator =
external_function->get_primitive_emitter()->get_memory_allocator();
std::vector<float> positive_inf(out[0].get_size(),
std::numeric_limits<float>::infinity());
size_t idx_float_inf = allocator.reserve_argspace(
positive_inf.data(), positive_inf.size() * sizeof(float));
writer << "void* temp_d = runtime::gpu::invoke_memory_primitive(ctx, "
<< idx_float_inf << ");\n";
writer << "runtime::gpu::cuda_memcpyDtD(" << out[0].get_name()
<< ", temp_d, " << out[0].get_size() << " * "
<< out[0].get_element_type().size() << ");\n";
}
else if (args[0].get_shape().size() == out[0].get_shape().size())
......@@ -1521,7 +1509,7 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
shape_to_pool,
padding_below,
padding_above,
/*padding_interior*/ {},
Shape{},
ss.str());
writer << "gpu::invoke_primitive(ctx, " << pad_index << ", ";
......@@ -1903,6 +1891,58 @@ CUDNN_SAFE_CALL(cudnnSetOpTensorDescriptor(opTensorDesc,
writer.block_end();
}
template <>
void GPU_Emitter::EMITTER_DECL(ngraph::op::ReplaceSlice)
{
// assumes NC{d1,d2,...} format
auto rep_slice = static_cast<const ngraph::op::ReplaceSlice*>(node);
writer.block_begin(" // " + node->get_name());
{
auto& input_shape = args[0].get_shape();
auto& source_shape = args[1].get_shape();
auto& lower_bounds = rep_slice->get_lower_bounds();
auto& upper_bounds = rep_slice->get_upper_bounds();
auto& strides = rep_slice->get_strides();
Shape slice_shape(upper_bounds.size(), 0);
std::transform(upper_bounds.begin(),
upper_bounds.end(),
lower_bounds.begin(),
slice_shape.begin(),
std::minus<size_t>());
std::transform(slice_shape.begin(),
slice_shape.end(),
strides.begin(),
slice_shape.begin(),
std::divides<size_t>());
// replace the input with the source if the slice shape and input shape are equal
if (input_shape == slice_shape)
{
kernel::emit_memcpyDtD(writer, out[0], args[1]);
}
else
{
auto& cuda_emitter =
external_function->get_primitive_emitter()->get_cuda_emitter();
auto replace_slice_index = cuda_emitter->build_replace_slice(
external_function->ctx().get(),
{{args[0].get_type(), args[1].get_type(), out[0].get_type()}},
input_shape,
source_shape,
lower_bounds,
upper_bounds,
rep_slice->get_strides());
writer << "gpu::invoke_primitive(ctx, " << replace_slice_index << ", ";
writer << "std::vector<void*>{" << args[0].get_name() << ", "
<< args[1].get_name() << "}.data(), ";
writer << "std::vector<void*>{" << out[0].get_name() << "}.data()";
writer << ");\n";
}
}
writer.block_end();
}
template <>
void GPU_Emitter::EMITTER_DECL(ngraph::op::Softmax)
{
......
......@@ -260,11 +260,30 @@ runtime::gpu::GPU_ExternalFunction::GPU_ExternalFunction(
// http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
// #interoperability-between-runtime-and-driver-apis
ngraph::runtime::gpu::CudaContextManager::Instance().SetContextCurrent();
cublasStatus_t cublasStatus = cublasCreate(&m_cublas_handle);
if (cublasStatus != CUBLAS_STATUS_SUCCESS)
{
throw runtime_error("cuBLAS create handle failed");
}
cudnnStatus_t cudnnStatus = cudnnCreate(&m_cudnn_handle);
if (cudnnStatus != CUDNN_STATUS_SUCCESS)
{
throw runtime_error("cuDNN create handle failed");
}
// Pass scalars as reference on the Device
cublasSetPointerMode(m_cublas_handle, CUBLAS_POINTER_MODE_DEVICE);
// register with c-api runtime context
m_ctx->cublas_handle = &m_cublas_handle;
m_ctx->cudnn_handle = &m_cudnn_handle;
m_ctx->compiled_kernel_pool = new CudaFunctionPool;
}
runtime::gpu::GPU_ExternalFunction::~GPU_ExternalFunction()
{
cublasDestroy(m_cublas_handle);
cudnnDestroy(m_cudnn_handle);
delete m_ctx->compiled_kernel_pool;
}
......@@ -754,6 +773,9 @@ using namespace std;
}
// TODO: Cleanup and make this a utility function
// allocate device buffers for primitive arguments and workspace
m_primitive_emitter->allocate_primitive_memory();
string filename = file_util::path_join(s_output_dir, function_name + "_codegen.cpp");
ofstream out(filename);
string code = writer.get_code();
......
......@@ -91,6 +91,9 @@ namespace ngraph
bool m_release_function;
bool m_is_compiled;
bool m_timing;
cublasHandle_t m_cublas_handle;
cudnnHandle_t m_cudnn_handle;
std::unique_ptr<GPUPrimitiveEmitter> m_primitive_emitter;
std::unique_ptr<GPURuntimeContext> m_ctx;
};
......
......@@ -19,10 +19,16 @@
#include "ngraph/runtime/gpu/gpu_invoke.hpp"
#include "ngraph/runtime/gpu/gpu_runtime_context.hpp"
extern "C" void ngraph::runtime::gpu::invoke_primitive(GPURuntimeContext* ctx,
extern "C" void ngraph::runtime::gpu::invoke_primitive(const GPURuntimeContext* ctx,
size_t primitive_index,
void** args,
void** result)
{
(*ctx->gpu_primitives[primitive_index])(args, result);
}
extern "C" void* ngraph::runtime::gpu::invoke_memory_primitive(const GPURuntimeContext* ctx,
size_t primitive_index)
{
return ctx->gpu_memory_primitives[primitive_index]();
}
......@@ -25,10 +25,12 @@ namespace ngraph
namespace gpu
{
struct GPURuntimeContext;
extern "C" void invoke_primitive(GPURuntimeContext* ctx,
extern "C" void invoke_primitive(const GPURuntimeContext* ctx,
size_t primitive_index,
void** args,
void** result);
extern "C" void* invoke_memory_primitive(const GPURuntimeContext* ctx,
size_t primitive_index);
}
}
}
/*******************************************************************************
* Copyright 2018 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#include <cstring>
#include "ngraph/runtime/gpu/gpu_memory_manager.hpp"
#include "ngraph/runtime/gpu/gpu_primitive_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_util.hpp"
using namespace ngraph;
constexpr const uint32_t initial_buffer_size = 10 * 1024 * 1024;
runtime::gpu::GPUMemoryManager::GPUMemoryManager(GPUPrimitiveEmitter* emitter)
: m_buffer_offset(0)
, m_buffered_mem(initial_buffer_size)
, m_workspace_manager(alignment)
, m_argspace(nullptr)
, m_workspace(nullptr)
, m_allocation_size(0)
, m_primitive_emitter(emitter)
{
}
runtime::gpu::GPUMemoryManager::~GPUMemoryManager()
{
runtime::gpu::free_gpu_buffer(m_argspace);
runtime::gpu::free_gpu_buffer(m_workspace);
}
void runtime::gpu::GPUMemoryManager::allocate()
{
if (m_buffer_offset)
{
m_buffer_offset = pass::MemoryManager::align(m_buffer_offset, alignment);
m_argspace = runtime::gpu::create_gpu_buffer(m_buffer_offset);
runtime::gpu::cuda_memcpyHtD(m_argspace, m_buffered_mem.data(), m_buffer_offset);
m_allocation_size += m_buffer_offset;
}
auto workspace_size = m_workspace_manager.max_allocated();
if (workspace_size)
{
m_workspace = runtime::gpu::create_gpu_buffer(workspace_size);
m_allocation_size += workspace_size;
}
}
size_t runtime::gpu::GPUMemoryManager::queue_for_transfer(const void* data, size_t size)
{
// if the current allocation will overflow the host buffer
if (m_buffer_offset + size > m_buffered_mem.size())
{
// add more space to the managed buffer
size_t new_size = m_buffered_mem.size() / initial_buffer_size + 1;
m_buffered_mem.resize(new_size);
}
size_t offset = m_buffer_offset;
std::memcpy(m_buffered_mem.data() + offset, data, size);
m_buffer_offset += size;
return offset;
}
runtime::gpu::GPUAllocator::GPUAllocator(const GPUAllocator& g)
{
m_manager = g.m_manager;
m_active = g.m_active;
}
size_t runtime::gpu::GPUAllocator::reserve_argspace(const void* data, size_t size)
{
// add parameter data to host buffer that will be transfered to device
size_t offset = m_manager->queue_for_transfer(data, size);
// required to capture m_manager pointer
// directly rather than `this` pointer
auto manager = m_manager;
// return a lambda that will yield the gpu memory address. this
// should only be evaluated by the runtime invoked primitive
gpu::memory_primitive mem_primitive = [=]() {
if (manager->m_argspace == nullptr)
{
throw std::runtime_error("An attempt was made to use unallocated device memory.");
}
auto gpu_mem = static_cast<uint8_t*>(manager->m_argspace);
return static_cast<void*>(gpu_mem + offset);
};
return m_manager->m_primitive_emitter->insert(mem_primitive);
}
size_t runtime::gpu::GPUAllocator::reserve_workspace(size_t size)
{
size_t offset = m_manager->m_workspace_manager.allocate(size);
m_active.push(offset);
// required to capture m_manager pointer
// directly rather than `this` pointer
auto manager = m_manager;
// return a lambda that will yield the gpu memory address. this
// should only be evaluated by the runtime invoked primitive
gpu::memory_primitive mem_primitive = [=]() {
if (manager->m_workspace == nullptr)
{
throw std::runtime_error("An attempt was made to use unallocated device memory.");
}
auto gpu_mem = static_cast<uint8_t*>(manager->m_workspace);
return static_cast<void*>(gpu_mem + offset);
};
return m_manager->m_primitive_emitter->insert(mem_primitive);
}
runtime::gpu::GPUAllocator::~GPUAllocator()
{
while (!m_active.empty())
{
m_manager->m_workspace_manager.free(m_active.top());
m_active.pop();
}
}
/*******************************************************************************
* Copyright 2018 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#pragma once
#include <memory>
#include <stack>
#include <vector>
#include "ngraph/pass/memory_layout.hpp"
namespace ngraph
{
namespace runtime
{
namespace gpu
{
class GPUPrimitiveEmitter;
class GPUMemoryManager;
class GPUAllocator
{
public:
GPUAllocator() = delete;
GPUAllocator(GPUMemoryManager* mgr)
: m_manager(mgr)
{
}
GPUAllocator(const GPUAllocator& g);
~GPUAllocator();
size_t reserve_argspace(const void* data, size_t size);
size_t reserve_workspace(size_t size);
private:
GPUMemoryManager* m_manager;
std::stack<size_t> m_active;
};
class GPUMemoryManager
{
friend class GPUPrimitiveEmitter;
friend class GPUAllocator;
public:
~GPUMemoryManager();
void allocate();
size_t get_allocation_size() { return m_allocation_size; }
GPUAllocator build_allocator() { return GPUAllocator(this); }
private:
GPUMemoryManager(GPUPrimitiveEmitter* emitter);
size_t queue_for_transfer(const void* data, size_t size);
size_t m_buffer_offset;
std::vector<uint8_t> m_buffered_mem;
pass::MemoryManager m_workspace_manager;
static constexpr const uint16_t alignment = 4;
void* m_argspace;
void* m_workspace;
size_t m_allocation_size;
GPUPrimitiveEmitter* m_primitive_emitter;
};
}
}
}
......@@ -25,6 +25,7 @@ using namespace ngraph::runtime::gpu;
GPUPrimitiveEmitter::GPUPrimitiveEmitter()
: m_cuda_emitter(new CUDAEmitter(this))
, m_cudnn_emitter(new CUDNNEmitter(this))
, m_memory_manager(this)
{
}
......@@ -42,6 +43,11 @@ size_t GPUPrimitiveEmitter::insert(std::unique_ptr<gpu::primitive>&& f)
m_gpu_primitives.push_back(m_managed_primitives.back().get());
return m_gpu_primitives.size() - 1;
}
size_t GPUPrimitiveEmitter::insert(gpu::memory_primitive& f)
{
m_gpu_mem_primitives.push_back(f);
return m_gpu_mem_primitives.size() - 1;
}
size_t GPUPrimitiveEmitter::lookup(std::string hash)
{
if (m_primitive_map.count(hash) > 0)
......
......@@ -20,6 +20,7 @@
#include "ngraph/runtime/gpu/cuda_emitter.hpp"
#include "ngraph/runtime/gpu/cudnn_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_memory_manager.hpp"
#include "ngraph/runtime/gpu/gpu_runtime_context.hpp"
namespace ngraph
......@@ -38,16 +39,25 @@ namespace ngraph
std::unique_ptr<CUDAEmitter>& get_cuda_emitter();
std::unique_ptr<CUDNNEmitter>& get_cudnn_emitter();
std::vector<gpu::primitive*>& get_primitives() { return m_gpu_primitives; }
std::vector<gpu::memory_primitive>& get_memory_primitives()
{
return m_gpu_mem_primitives;
}
size_t insert(std::unique_ptr<gpu::primitive>&& f);
size_t insert(gpu::memory_primitive& f);
size_t lookup(std::string hash);
void cache(const std::string& hash, const size_t& index);
GPUAllocator get_memory_allocator() { return m_memory_manager.build_allocator(); }
void allocate_primitive_memory() { m_memory_manager.allocate(); }
size_t sizeof_device_allocation() { return m_memory_manager.get_allocation_size(); }
private:
std::unique_ptr<CUDAEmitter> m_cuda_emitter;
std::unique_ptr<CUDNNEmitter> m_cudnn_emitter;
std::vector<gpu::primitive*> m_gpu_primitives;
std::vector<gpu::memory_primitive> m_gpu_mem_primitives;
std::unordered_map<std::string, size_t> m_primitive_map;
std::vector<std::unique_ptr<gpu::primitive>> m_managed_primitives;
GPUMemoryManager m_memory_manager;
};
}
}
......
......@@ -30,6 +30,7 @@ namespace ngraph
namespace gpu
{
typedef std::function<void(void**, void**)> primitive;
typedef std::function<void*(void)> memory_primitive;
extern "C" {
struct GPURuntimeContext
......@@ -37,6 +38,7 @@ namespace ngraph
cudnnHandle_t* cudnn_handle;
cublasHandle_t* cublas_handle;
gpu::primitive* const* gpu_primitives;
const gpu::memory_primitive* gpu_memory_primitives;
CudaFunctionPool* compiled_kernel_pool;
// Note that in it's current state, calling methods of CudaFunctionPool
// or other native compiled C++ functions in ngraph from the JIT code is
......
/*******************************************************************************
* Copyright 2017-2018 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#pragma once
#include <cstdio>
#include <stdexcept>
#include <vector>
#include "ngraph/axis_set.hpp"
#include "ngraph/coordinate.hpp"
#include "ngraph/shape.hpp"
#include "ngraph/strides.hpp"
namespace ngraph
{
class Shape;
/// \brief Shape for a tensor resident on GPU.
class GPUShape : public std::vector<uint32_t>
{
public:
GPUShape(const std::initializer_list<uint32_t>& axis_lengths)
: std::vector<uint32_t>(axis_lengths)
{
}
GPUShape(const std::vector<uint32_t>& axis_lengths)
: std::vector<uint32_t>(axis_lengths)
{
}
GPUShape(const GPUShape& axis_lengths)
: std::vector<uint32_t>(axis_lengths)
{
}
explicit GPUShape(size_t n, uint32_t initial_value = 0)
: std::vector<uint32_t>(n, initial_value)
{
}
template <class InputIterator>
GPUShape(InputIterator first, InputIterator last)
: std::vector<uint32_t>(first, last)
{
}
GPUShape() {}
GPUShape& operator=(const GPUShape& v)
{
static_cast<std::vector<uint32_t>*>(this)->operator=(v);
return *this;
}
GPUShape& operator=(GPUShape&& v)
{
static_cast<std::vector<uint32_t>*>(this)->operator=(v);
return *this;
}
GPUShape(const std::vector<size_t>& vec)
{
for (size_t const& size : vec)
{
if (size >> 32 != 0)
{
throw std::runtime_error(
"Request exceeds the bitwidth available for GPUShapes (32)");
}
this->push_back(static_cast<uint32_t>(size));
}
}
GPUShape(const Shape& shape)
{
for (size_t const& size : shape)
{
if (size >> 32 != 0)
{
throw std::runtime_error(
"Request for Shape which exceeds the bitwidth available for GPUShapes "
"(32)");
}
this->push_back(static_cast<uint32_t>(size));
}
}
GPUShape(const Strides& strides)
{
for (size_t const& size : strides)
{
if (size >> 32 != 0)
{
throw std::runtime_error(
"Request for Strides which exceed the bitwidth available for GPUShapes "
"(32)");
}
this->push_back(static_cast<uint32_t>(size));
}
}
GPUShape(const Coordinate& coord)
{
for (size_t const& size : coord)
{
if (size >> 32 != 0)
{
throw std::runtime_error(
"Request for Coordinate which exceed the bitwidth available for GPUShapes "
"(32)");
}
this->push_back(static_cast<uint32_t>(size));
}
}
};
}
......@@ -33,12 +33,12 @@ size_t runtime::gpu::GPU_TensorViewWrapper::get_size() const
return m_tensor_view->get_tensor_view_layout()->get_size();
}
const vector<size_t>& runtime::gpu::GPU_TensorViewWrapper::get_shape() const
const Shape& runtime::gpu::GPU_TensorViewWrapper::get_shape() const
{
return m_tensor_view->get_tensor_view_layout()->get_shape();
}
const vector<size_t>& runtime::gpu::GPU_TensorViewWrapper::get_strides() const
const Strides& runtime::gpu::GPU_TensorViewWrapper::get_strides() const
{
return m_tensor_view->get_tensor_view_layout()->get_strides();
}
......
......@@ -39,8 +39,8 @@ public:
const std::string& alias = "");
size_t get_size() const;
const std::vector<size_t>& get_shape() const;
const std::vector<size_t>& get_strides() const;
const Shape& get_shape() const;
const Strides& get_strides() const;
const element::Type& get_element_type() const;
const std::string& get_name() const;
const std::string& get_type() const;
......
abc_int64
backwards_replace_slice
backwards_reverse_sequence_n4d2c3h2w2
backwards_reverse_sequence_n3_c2_h3
backwards_slice
......@@ -52,12 +51,6 @@ reduce_window_emulating_max_pool_1d_1channel_2image
reduce_window_emulating_max_pool_1d_2channel_2image
reduce_window_emulating_max_pool_2d_1channel_1image_strided
reduce_window_emulating_max_pool_2d_2channel_2image
replace_slice_3d
replace_slice_3d_strided
replace_slice_3d_strided_different_strides
replace_slice_matrix
replace_slice_scalar
replace_slice_vector
reverse_sequence_n4d2c3h2w2
reverse_sequence_n4c3h2w2
reverse_sequence_n2c3h4w2
......
/*******************************************************************************
* Copyright 2017-2018 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#include <algorithm>
#include <vector>
#include "ngraph/shape.hpp"
using namespace std;
using namespace ngraph;
size_t ngraph::shape_size(const Shape& shape)
{
size_t size = 1;
for (auto d : shape)
{
size *= d;
}
return size;
}
Strides ngraph::row_major_strides(const Shape& shape)
{
Strides strides;
size_t s = 1;
for (auto d = shape.rbegin(); d != shape.rend(); d++)
{
strides.push_back(s);
s *= *d;
}
reverse(strides.begin(), strides.end());
return strides;
}
......@@ -68,11 +68,41 @@ namespace ngraph
};
/// Number of elements in spanned by a shape
size_t shape_size(const Shape& shape);
template <typename SHAPE_TYPE>
size_t shape_size(const SHAPE_TYPE& shape)
{
size_t size = 1;
for (auto d : shape)
{
size *= d;
}
return size;
}
/// Row-major strides for a shape
Strides row_major_strides(const Shape& shape);
template <typename SHAPE_TYPE>
std::vector<size_t> row_major_strides(const SHAPE_TYPE& shape)
{
std::vector<size_t> strides(shape.size());
size_t s = 1;
auto st = strides.rbegin();
for (auto d = shape.rbegin(); d != shape.rend(); d++, st++)
{
*st = s;
s *= *d;
}
return strides;
}
template <typename SHAPE_TYPE>
inline bool is_scalar(const SHAPE_TYPE& shape)
{
return 0 == shape.size();
}
inline bool is_scalar(const Shape& shape) { return 0 == shape.size(); }
inline bool is_vector(const Shape& shape) { return 1 == shape.size(); }
template <typename SHAPE_TYPE>
inline bool is_vector(const SHAPE_TYPE& shape)
{
return 1 == shape.size();
}
}
......@@ -85,9 +85,7 @@ if(NGRAPH_GPU_ENABLE AND LLVM_INCLUDE_DIR)
link_directories(${CUDA_LIBRARIES})
link_directories(${CUDA_CUBLAS_LIBRARIES})
link_directories(${CUDNN_LIBRARIES})
set(SRC
${SRC}
cudnn.cpp)
set(SRC ${SRC} cudnn.cpp gpu_test.cpp)
# Disabled for testing
set(BACKEND_NAMES ${BACKEND_NAMES} "GPU")
endif()
......
/*******************************************************************************
* Copyright 2017-2018 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#include <iostream>
#include <vector>
#include "gtest/gtest.h"
#include "ngraph/ngraph.hpp"
#include "ngraph/runtime/gpu/gpu_primitive_emitter.hpp"
#include "ngraph/runtime/gpu/gpu_shape.hpp"
#include "ngraph/runtime/gpu/gpu_util.hpp"
using namespace ngraph;
TEST(gpu_test, gpu_shape_from_64bit_shape)
{
Shape shape{1UL << 33};
ASSERT_ANY_THROW([](GPUShape s) {}(shape););
}
TEST(gpu_test, memory_manager_unallocated)
{
runtime::gpu::GPUPrimitiveEmitter emitter;
auto allocator = emitter.get_memory_allocator();
size_t idx = allocator.reserve_workspace(10);
runtime::gpu::memory_primitive& mem_primitive = emitter.get_memory_primitives()[idx];
ASSERT_ANY_THROW(mem_primitive());
}
TEST(gpu_test, memory_manager_allocated)
{
runtime::gpu::GPUPrimitiveEmitter emitter;
auto allocator = emitter.get_memory_allocator();
size_t idx = allocator.reserve_workspace(10);
emitter.allocate_primitive_memory();
runtime::gpu::memory_primitive& mem_primitive = emitter.get_memory_primitives()[idx];
EXPECT_NO_THROW(mem_primitive());
}
TEST(gpu_test, memory_manager_extract_arguments)
{
runtime::gpu::GPUPrimitiveEmitter emitter;
auto allocator = emitter.get_memory_allocator();
std::vector<float> fp32_args = {2112.0f, 2112.0f};
size_t idx = allocator.reserve_argspace(fp32_args.data(), fp32_args.size() * sizeof(float));
emitter.allocate_primitive_memory();
runtime::gpu::memory_primitive& mem_primitive = emitter.get_memory_primitives()[idx];
std::vector<float> host(2, 0);
runtime::gpu::cuda_memcpyDtH(host.data(), mem_primitive(), host.size() * sizeof(float));
EXPECT_EQ(host, fp32_args);
}
TEST(gpu_test, memory_manager_argspace_size)
{
runtime::gpu::GPUPrimitiveEmitter emitter;
auto allocator = emitter.get_memory_allocator();
std::vector<float> fp32_args = {2112.0f, 2112.0f};
allocator.reserve_argspace(fp32_args.data(), fp32_args.size() * sizeof(float));
emitter.allocate_primitive_memory();
EXPECT_EQ(emitter.sizeof_device_allocation(), fp32_args.size() * sizeof(float));
}
TEST(gpu_test, memory_manager_overlapping_workspace_allocsize)
{
runtime::gpu::GPUPrimitiveEmitter emitter;
for (size_t i = 0; i < 8; i++)
{
auto allocator = emitter.get_memory_allocator();
allocator.reserve_workspace(std::pow(2, i));
}
emitter.allocate_primitive_memory();
EXPECT_EQ(emitter.sizeof_device_allocation(), 128);
void* first = nullptr;
for (size_t i = 0; i < 8; i++)
{
if (not first)
{
first = emitter.get_memory_primitives()[i]();
}
else
{
EXPECT_EQ(emitter.get_memory_primitives()[i](), first);
}
}
}
TEST(gpu_test, memory_manager_seperate_workspaces_allocsize)
{
size_t total_size = 0;
runtime::gpu::GPUPrimitiveEmitter emitter;
{
auto allocator = emitter.get_memory_allocator();
for (size_t i = 0; i < 8; i++)
{
size_t size = std::pow(2, i);
allocator.reserve_workspace(size);
total_size += pass::MemoryManager::align(size, 4);
}
}
emitter.allocate_primitive_memory();
EXPECT_EQ(emitter.sizeof_device_allocation(), total_size);
}
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment