- 19 Jun, 2018 3 commits
-
-
Robert Kimball authored
* fix mkldnn rpath
* fix compile warning
* close backends when exiting
* set output directory of backends to the ngraph output directory
* Aprocter/patch patch (#1119)
* Move more rpath stuff inside if(NOT APPLE)
* fix repatch problem with mkldnn library
* add updated patch command for older versions of cmake
-
Nick Korovaiko authored
* loop kernel + tests
* remove commented out code
* remove commented code; add comments
* copy_with_new_args + test
* add comment
* fix comp errors
-
Jayaram Bobba authored
* Move to depth-first serialization of graph for better cache behavior
* Added comment
* Force 64 byte stack alignment to avoid crashes from unaligned AVX loads/stores
* Revert "Force 64 byte stack alignment to avoid crashes from unaligned AVX loads/stores". This reverts commit 84346420fbd0fbd5d05a4a1e8f5fae12bdc7348b.
* revert to breadth-first serialization
-
- 18 Jun, 2018 6 commits
-
-
Jayaram Bobba authored
DEX Part 2
-
Jayaram Bobba authored
-
Nick Korovaiko authored
-
Jaikrishnan Menon authored
-
Fenglei authored
* enable more gpu ops test
-
Jayaram Bobba authored
-
- 17 Jun, 2018 2 commits
-
-
Nick Korovaiko authored
-
Jayaram Bobba authored
-
- 16 Jun, 2018 3 commits
-
-
Fenglei authored
* add reverse_sequence
* fix bugs, compiled
* fix index bug
* fix bug and clang format
* correct function name
* clang format
* remove extra ;
* remove tests from skip list
* add backward support, skip tests
* add back template<> line
* remove unnecessary lines in kernel
-
Nick Korovaiko authored
* optimized strided convolutions
* clean up debug messages
* format fixes
* more tests
* even more tests
* adapt to resnet-50.v1
* fix format errors; remove changes from diff PRs
-
Nick Korovaiko authored
* enable cse for reduction ops
* reduction tests
-
- 15 Jun, 2018 8 commits
-
-
Jaikrishnan Menon authored
-
Jaikrishnan Menon authored
-
Jaikrishnan Menon authored
-
Robert Kimball authored
-
-
Jaikrishnan Menon authored
-
Pruthvi authored
* Added graph pass for fusing RNN op across layers
  - Added test case for interpreter vs. CPU for verifying layer-fused RNN
  - more sanity checks in the RNN fusion graph pass
  - added support to replace the recurrent cell state correctly in the fused RNN op
* Fixed multi-layer RNN fusion unit test failure
* Addressed PR comments
-
Fenglei authored
* enable tests
* add function call
* working version
* remove test from skip list
-
- 14 Jun, 2018 3 commits
-
-
Jayaram Bobba authored
-
Robert Kimball authored
* remove comments within body of function
* change to emit each op exactly once
* update code per review comments
-
Chris Sullivan authored
-
- 13 Jun, 2018 6 commits
-
-
Jaikrishnan Menon authored
Also, formatting fixes
-
Jaikrishnan Menon authored
-
Jaikrishnan Menon authored
-
Robert Kimball authored
* backend libraries now found in tree; dynamically read header search paths; fix running from install
-
Nick Korovaiko authored
* group conv init
* add GroupConvolution op; refine checks in fusion logic
* add an emitter, cpu assignment
* cpu_layout
* add checks to algebraic simplification
* updating emitter logic for groupconvolution
* working before refactoring
* moving primitive creation logic to mkldnn_emitter
* group convolution graph test
* rename an opt
* address jbobba's feedback
-
Fenglei authored
* add pad_dilation function
* add dilation to gpu_emitter
* add CoordinateDiff constructor to GPUShape
* remove unnecessary cast
* working version for forward
* forward working
* forward test all pass
* deconvolution forward
* backward data dilation
* forward test passed
* initialize to 0
* fix bug for get_padded_shape and clang format
* code style, change variable names
* refactor convolution conditions
* fix bug padding_below_diff
* change pad_dilation to pad_dynamic, compare to pad
* remove passed convolution test from skip list, clang format
* change pad to use GPUShape
-
- 12 Jun, 2018 2 commits
-
-
Chris Sullivan authored
* Added op::ReplaceSlice and enabled respective tests.
* div64 -> division_by_invariant_multiplication
* Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
* Added GPUShape and reworked Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape.
* Updated shape helpers signature and add conversion operators/constructors for GPUShape.
* Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape conversion operators.
* Forgot to fix lambda for workspace allocations to match that of argspace allocations.
* Added GPUShape and reworked Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape.
* Updated shape helpers signature and add conversion operators/constructors for GPUShape.
* Adjust row_major_strides to avoid reversed-copy.
* Moved declaration out of loop for clang.
* Moved gpu_shape to gpu transformer.
* Removed no longer necessary headers.
* Added stdexcept header to gpu_shape.hpp
* Coordinate->GPUShape
* Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
* Refactor allocations to make use of primitive emitter. Now memory primitives are registered at compile time and the gpu memory address is resolved at runtime by invoking the primitive.
* Changed check on 64bit shape to check if high bits are set.
* Added const qualifier to data being copied in GPUAllocator::reserve_argspace
* Added const qualifier to data being copied in GPUAllocator::reserve_argspace
* Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.
* Forgot to remove no longer necessary buffer freeing from op emitters.
* Removed replace slice.
* Removed more replace_slice diffs.
* Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
* Added back missing changes after timeline resolution.
* Added spacing between functions in GPUShape and boolean operators in shape.hpp.
* Template parameters are UPPER_SNAKE_CASE.
* Added unit tests for GPUMemoryManager and added checks that ensure the device memory is allocated prior to address resolution by the memory_primitives. Also exposed the allocation size of the memory manager.
* Return type of shape_size should be large enough to encapsulate the full stride of the tensor. This should be 64bits wide regardless of the underlying value_type of the ShapeType.
* Upstreaming changes to shape_size (which returns size_t).
* cuDNN softmax impl. for all axis activation.
* Added catch for per-axis activations.
* Removed commented headers.
* Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.
* Add softmax cuda kernel. It relies on atomic memory addition to global memory; this will add contention and should be optimized in the future. A multilevel reduction can be found in cs/gpu_softmax_cuda_shfl but it requires some further engineering.
* Refactored reduce coordinate transform code into a helper and applied it to broadcast. Broadcast added to CUDAEmitter, now supports multiple non-consecutive axes.
* Removed change to data_types variable and updated/removed comments.
* Refactored softmax into the emission of two fused elementwise collective ops. Added fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast.
* Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
* GPU workspace memory is zero initialized by default but can be turned off if desired.
* Added template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE, to specify the ngraph op type to use for the reduction in the fused ew_collective kernel.
* Renamed variables and updated a comment.
* Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.
* Clang complained about lack of explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switch to std::set<size_t>. This also has the benefit that in the future, if we wish to emit kernels without ngraph core (for example in a standalone binary via a serialized graph manifest), we don't depend on AxisSet.
* softmax -> broadcast in build_broadcast.
* Separate elementwise and elementwise_collective.
-
Nick Korovaiko authored
-
- 11 Jun, 2018 2 commits
-
-
Chris Sullivan authored
* Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
* GPU workspace memory is zero initialized by default but can be turned off if desired.
-
Robert Kimball authored
* finally have something almost acceptable
-
- 08 Jun, 2018 2 commits
-
-
Jayaram Bobba authored
* Optimized eigen kernel for 2D reduction on a 4D tensor used for spatial mean
* revert change to serializer
-
Jaikrishnan Menon authored
* CPU: Direct Execution Part 1 with bare minimum infrastructure
* Refactor: Move build related functionality to a separate TU and external function method
* Add TU back after merge
* Remove an assert
* Remove commented-out code
-
- 07 Jun, 2018 2 commits
-
-
Robert Kimball authored
-
Louis Feng authored
* batch dot pattern wip.
* batch dot pattern wip.
* added batch dot op.
* batch dot compute testing.
* correct gemm parameters.
* renaming matrix fusions passes and update tests.
* clean up.
* clang format.
* more clean ups.
* clang format.
* added CPUBatchDotFusion to default cpu passes.
* added missing header.
* added element type check.
-
- 06 Jun, 2018 1 commit
-
-
L.S. Cook authored
-