- 05 Jun, 2018 3 commits
-
-
Adam Straw authored
* strip off attributes from backend type prior to shared lib dlopen
-
Fenglei authored
* change convolution to use cudnn emitter * convolution working * add asymmetric pad * forward with asymmetric working * backward asymmetric * padding to padding_below * pad still has bug on backward * change name * fix convolution back prop * fix code block * slice back from padded output * working code * extra , * Update gpu_emitter.cpp * split build_convolution into 3 functions * format and fix bugs * Update cudnn_emitter.hpp
-
Chris Sullivan authored
* Added per argument alignment to GPUAllocator::reserve_argspace. * Changed alignment in tests to match update to alignment in backend.
-
- 04 Jun, 2018 3 commits
-
-
Chris Sullivan authored
These tests should fail on all systems but they are somehow passing on the CI system.
-
Nick Korovaiko authored
-
Robert Kimball authored
* Update cmake files to more modern approach * disable building libraries that are not required * handle more build cases * add versions to backend libs. add start of package target. * add create_backend to backends * temporary workaround to tbb not linking correctly with gcc * install codegen lib * force tbb to link to the cpu backend so that it is available for codegen * fix clang build error * fix warning for codegen build * update cuda header paths * change error message for opening backend shared library * set lib path
-
- 02 Jun, 2018 1 commit
-
-
Yixing Lao authored
-
- 01 Jun, 2018 1 commit
-
-
Fenglei authored
-
- 31 May, 2018 4 commits
-
-
DawnStone authored
-
Louis Feng authored
-
Nishant Patel authored
-
Robert Kimball authored
* update serializer for all new ops
-
- 30 May, 2018 4 commits
-
-
Fenglei authored
* add reduce op * hack solution to get reduction function in reduce op * hack version working on all tests * add reduce_window op * fixed the reduction checking process * add reduce window op, save progress, not compilable yet * change push_back to preallocate for vector * fixed datatype vector * datatype and comments * pass op instead of using template * using new GPUShape and allocator * using GPUShape * add comment, change map inside function. * change to more meaningful name
-
Nick Korovaiko authored
-
Nick Korovaiko authored
* refactor cpworkspaceinsertion for mxnet * rename mxnet functions to adhere to ngraph naming convention * define a member static const int in a cpp file to resolve a linking issue
-
Nishant Patel authored
-
- 29 May, 2018 2 commits
-
-
Chris Sullivan authored
* Added GPUShape and reworked Shape helpers to be compatible with different shape types. Shape is now implicitly convertible to GPUShape.
* Updated shape helpers signature and added conversion operators/constructors for GPUShape.
* Adjust row_major_strides to avoid reversed-copy.
* Moved declaration out of loop for clang.
* Moved gpu_shape to gpu transformer.
* Removed no longer necessary headers.
* Added stdexcept header to gpu_shape.hpp
* Changed check on 64bit shape to check if high bits are set.
* Added spacing between functions in GPUShape and boolean operators in shape.hpp.
* Template parameters are UPPER_SNAKE_CASE.
* Return type of shape_size should be large enough to encapsulate the full stride of the tensor. This should be 64 bits wide regardless of the underlying value_type of the ShapeType.
* [CS:GPU::Part 2] Add GPUMemoryManager, GPUAllocator, and memory primitives. (#1034)
This is a big PR which introduces the GPUMemoryManager, GPUAllocator, and the concept of memory primitives. A memory primitive is a closure which yields the device memory address for a reserved memory space. When a memory reservation is made, the request is recorded along with the data that should be copied (for kernel arguments, but not for workspace memory). The reservation does not yield an address eagerly; instead it returns an index which can be used to look up the memory_primitive at runtime. This allows the GPUMemoryManager to delay resolution of the memory address until all reservations have been made.
Ideally, the temporary allocations needed by each kernel could be captured by the liveness lists in the GPU_External_Function. This way the pass::MemoryManager would capture these allocations along with the needed tensor allocations. For now, rather than rearchitect the gpu_emitter and external function, we utilize the GPUMemoryManager, which maintains its own internal pass::MemoryManager, and the GPUAllocator.
Liveness is handled by the GPUAllocator: all workspace allocations/reservations created in the same scope (or a subscope) as the GPUAllocator persist until the GPUAllocator goes out of scope and is destroyed. At that time, the GPUAllocator marks the requested temporary buffers as free, effectively ending their liveness. That way the next kernels that construct a GPUAllocator can reuse the workspace memory that was needed for previous kernels.
Additional notes:
* This PR updates the CUDAEmitter to exclusively utilize GPUShape instead of Shape.
Commits:
* Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
* Added GPUShape and reworked Shape helpers to be compatible with different shape types.
* Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape conversion operators.
* Coordinate->GPUShape
* Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
* Refactor allocations to make use of primitive emitter. Now memory primitives are registered at compile time and the gpu memory address is resolved at runtime by invoking the primitive.
* Added const qualifier to data being copied in GPUAllocator::reserve_argspace
* Removed more replace_slice diffs.
* Added unit tests for GPUMemoryManager and added checks that ensure the device memory is allocated prior to address resolution by the memory_primitives. Also exposed the allocation size of the memory manager.
* Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.
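The reservation scheme described in this PR can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the project's C++ code: the class names `MemoryManager` and `Allocator`, their attributes, and the method signatures are hypothetical stand-ins, and the "device" is simulated with plain integers. The sketch only shows the idea that a reservation returns an index, that the address is resolved lazily by invoking a closure, and that a scoped allocator lets later kernels reuse workspace memory.

```python
from typing import Callable, List, Optional


class MemoryManager:
    """Toy stand-in for the memory manager described above (hypothetical API)."""

    def __init__(self) -> None:
        self.arg_data = bytearray()        # queued kernel-argument bytes
        self.workspace_size = 0            # peak workspace requirement so far
        self.base: Optional[int] = None    # simulated device base address
        self.primitives: List[Callable[[], int]] = []

    def register(self, region: str, offset: int) -> int:
        def primitive() -> int:
            # Resolution is deferred: the workspace region starts after the
            # argument region, whose size is only final once every
            # reservation has been made.
            if self.base is None:
                raise RuntimeError("memory must be allocated before resolution")
            start = 0 if region == "args" else len(self.arg_data)
            return self.base + start + offset

        self.primitives.append(primitive)
        return len(self.primitives) - 1    # an index, not an address

    def allocate(self, base: int = 0x1000) -> None:
        # A real backend would perform one allocation and one copy here.
        self.base = base

    @property
    def allocation_size(self) -> int:
        return len(self.arg_data) + self.workspace_size


class Allocator:
    """Scoped helper: workspace reserved here is free again on scope exit.

    Simplification: only sequential (non-nested) scopes are modeled.
    """

    def __init__(self, manager: MemoryManager) -> None:
        self.manager = manager
        self.workspace_used = 0

    def __enter__(self) -> "Allocator":
        return self

    def __exit__(self, *exc) -> bool:
        self.workspace_used = 0            # mark temporary buffers as free
        return False

    def reserve_argspace(self, data: bytes, alignment: int = 8) -> int:
        m = self.manager
        m.arg_data.extend(b"\0" * (-len(m.arg_data) % alignment))  # pad
        offset = len(m.arg_data)
        m.arg_data.extend(data)            # record the data to copy later
        return m.register("args", offset)

    def reserve_workspace(self, size: int) -> int:
        m = self.manager
        offset = self.workspace_used
        self.workspace_used += size
        m.workspace_size = max(m.workspace_size, self.workspace_used)
        return m.register("workspace", offset)
```

In this sketch, two kernels that each open their own `Allocator` scope get workspace primitives that resolve to the same address, which mirrors the reuse described above; invoking a primitive before `allocate()` raises an error, analogous to the allocation check mentioned in the commit list.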
[CS:GPU::Part 3] Refactoring of several ops to use GPUMemoryManager (#1035)
This PR implements the new GPUMemoryManager and allocator for all the ops which were previously implemented but required allocations and copies for kernel arguments at runtime.
Limitations: The convolution workspaces could not be added because the relevant descriptors were not available at compile time due to the codegen. If convolution is later added to the CUDNN emitter, the GPUAllocator can be used to avoid workspace allocation at runtime.
Commits:
* Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.
* Forgot to remove no longer necessary buffer freeing from op emitters.
[CS:GPU::Part4] Added op::ReplaceSlice and enabled respective tests. (#999)
This PR implements ReplaceSlice using the coordinate transformation strategy. One thread is assigned to each element of the input tensor, and its position in the source tensor's coordinate system is calculated. If the position is within the source slice, the source is loaded and written out; otherwise the input tensor is loaded.
* Relevant tests are enabled.
* This op was refactored to utilize the new GPUAllocator and memory manager.
Commits:
* Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
* Added back missing changes after timeline resolution.
* Fixed clang warnings and bug. The cudnn_handle was not initialized ahead of emission time and so any eager cudnn calls would fail. To fix this, the cudnn and cublas handle creation was moved to the external function constructor.
* Changed row_major_strides to always return vector<size_t> to avoid overflow for tensors with many dimensions. Handle the conversion to 32 bits for GPU shapes with an explicit conversion constructor from vector<size_t>.
* During merge the allocation line from external_function was left out. Adding it back.
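The coordinate transformation strategy described for ReplaceSlice can be sketched on the host as follows. This is an illustrative Python reference, not the CUDA kernel from the PR: the function names, the parameter names (`lower`, `upper`, `slice_strides`), and the flat-list tensor representation are assumptions made for the example. The outer loop index plays the role of the thread id, with one iteration per input element.

```python
from typing import List, Sequence


def row_major_strides(shape: Sequence[int]) -> List[int]:
    """Element strides of a densely packed row-major tensor."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return strides


def replace_slice(
    input_flat: Sequence[float],
    input_shape: Sequence[int],
    source_flat: Sequence[float],
    lower: Sequence[int],
    upper: Sequence[int],
    slice_strides: Sequence[int],
) -> List[float]:
    """Return a copy of the input with the strided slice replaced by source."""
    in_strides = row_major_strides(input_shape)
    # Shape of the source tensor implied by the slice (ceiling division).
    source_shape = [
        -(-(u - l) // s) for l, u, s in zip(lower, upper, slice_strides)
    ]
    src_strides = row_major_strides(source_shape)

    out = []
    for tid in range(len(input_flat)):       # one "thread" per input element
        remainder = tid
        src_index = 0
        in_slice = True
        for axis in range(len(input_shape)):
            coord = remainder // in_strides[axis]
            remainder %= in_strides[axis]
            delta = coord - lower[axis]
            # The element belongs to the slice only if it lies inside the
            # bounds and on the slice's stride grid along every axis.
            if (
                coord < lower[axis]
                or coord >= upper[axis]
                or delta % slice_strides[axis] != 0
            ):
                in_slice = False
                break
            src_index += (delta // slice_strides[axis]) * src_strides[axis]
        out.append(source_flat[src_index] if in_slice else input_flat[tid])
    return out
```

For example, replacing columns 1 and 3 of the first two rows of a 3x4 tensor corresponds to `lower=[0, 1]`, `upper=[2, 4]`, `slice_strides=[1, 2]` with a 2x2 source. Python integers do not overflow, so the 64-bit stride concern raised in the commit list does not show up in this sketch.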
-
L.S. Cook authored
* Add markdown version of CONTRIB guidelines to ngraph root * Fix weird markdown issue with cpp code block rendering
-
- 26 May, 2018 2 commits
-
-
Nick Korovaiko authored
* serializer pass
-
Jayaram Bobba authored
* Bug fix to graph control logic to always compute output tensors * Remove stale comments
-
- 25 May, 2018 7 commits
-
-
Fenglei authored
-
Chris Sullivan authored
* cuDNN softmax impl. for all axis activation. * Added catch for per-axis activations.
-
Nick Korovaiko authored
* add any op
-
Robert Kimball authored
* fix the op list generator script
-
Fenglei authored
* enable more gpu test * enable more * more test * more tests
-
Fenglei authored
* add gpu product * enable test, change initial value for product
-
Fenglei authored
-
- 23 May, 2018 1 commit
-
-
Pruthvi authored
* Added pattern matcher for LSTM cell
* WIP added support to replace lstm cell instead of subgraph
* WIP LSTM pattern matcher, fuses recurrent cells
* WIP added RNN CPU op
* WIP mkldnn emitter code for fprop RNN
* WIP RNN mkldnn integration
  - Added mkldnn kernel for unidirectional LSTM in the CPU emitter
* add a getter for root node
* recurrent graph rewrite
* fix perms, rename match_root -> get_match_root
* fix comp errors
* make match_root return the topmost match; fix tests
* WIP GetOutputElement for handling multiple LSTM o/ps
  - use RecurrentGraphRewrite for replacing node after matching LSTM cells
* WIP LSTM multi Output + debug prints
* moved LSTM fusion to cpu_fusion
* WIP added RNN superfused OP
* WIP towards RNN layer fusion
* WIP multiple output slicing RNN
* WIP RNN multiple o/ps fusion across layer
* WIP corrected input params for fused RNN OP
* concat corresponding params across different LSTM to form inputs to RNN fused op
* i) Added test case for RNN kernel ii) runs without errors
* refactored and moved LSTM class to standalone file
* Rename RNN -> Rnn , LSTM -> Lstm
* WIP replace lstm slices to the consumer op
* Slicing works on multiple RNN layers
* fixed all bugs
* Added CPU RNN Recurrent Fusion
  - Added CPU LSTM fusion
  - removed debug code
  - style fix
* Added support to compute src_iter and dst_iter instead of taking zero_memory_desc
  - Added unit test to compute one LSTM cell
* changed RNN op signature to accept number of states in basic unit of RNN (GRU/LSTM/vanilla RNN) cell
* added sanity checks for RNN op
* Fixed issue related to patching the graph while replacing the RNN sliced outputs
* Fixed issue to feed the input symbols in the order X0, X1, ...Xt to the RNN op
* Added unit test for multi layer RNN fusion
* Removed debug statements
* i) Added multilayered serialized graph ii) fixed compilation issue
* Addressed PR comments
* i) WIP MKLDNN layout for RNN Op ii) added test case for INTERPRETER v/s CPU Rnn results
* Fixed bug w.r.t. src_layer feature size in rnn mkldnn emitter code
  - Refactored cpu_fusion rnn test case
* merge origin/master with branch pruthvi/lstm_fusion
* style fix
* Added test case for multiple RNN layers
* i) make rnn as mkldnn op if it meets the constraints ii) assert if rnn is not mkldnn op
* fix unit test failure
* Added support to reliably identify the hidden state and input symbols from the nodes collected by Pattern matcher
  - Fixed failing unit tests
* style fix
* removed "node type" dependency to replace the intermediate LSTM outputs
* Addressed PR comments
* Fix unit test
* added MKLDNN emitter for LSTM op
  - graph pass to concat LSTM input recurrent state tensors
  - CPU layout assignment for LSTM Op
  - Fixed bug in rnn/lstm unit tests
  - made changes to use replace_output instead of replace_node for replacing matched graph nodes in LSTM/RNN fusion pass (cherry picked from commit d16fc709265cc0a73e60c6d5f6d2878e7b908aca)
* style fix
* Renamed passes and style fixes
-
- 22 May, 2018 1 commit
-
-
Robert Kimball authored
-
- 21 May, 2018 4 commits
-
-
L.S. Cook authored
* editing how to execute computation file for clarity and linenos * Add placeholder for runtime docs * Update section on backends, interpreter, and FPGA options * add updated master to fix python_ci * Weird autosummary issue reverted * Clarify new section * remove renamed file * sentence structure
-
Jayaram Bobba authored
* Batch norm folding * Addressed PR feedback * Style fixes * Style fix
-
Yixing Lao authored
-
tsocha authored
-
- 18 May, 2018 3 commits
-
-
Nick Korovaiko authored
* use reference kernel for reverse_sequence for int * move tests * resolve CI errors * TEST to NGRAPH_TEST
-
tsocha authored
-
Michał Karzyński authored
-
- 17 May, 2018 2 commits
-
-
Adam Rogowiec authored
-
Sang Ik Lee authored
If the user manually provides MKLDNN_INCLUDE_DIR and MKLDNN_LIB_DIR, don't build mkl-dnn and just add a dummy external project "ext-_mkldnn" to satisfy the target dependency for the rest of the build.
-
- 16 May, 2018 1 commit
-
-
Nick Korovaiko authored
* give frontends some flexibility over fusions they would like to run * address jbobba's feedback
-
- 15 May, 2018 1 commit
-
-
L.S. Cook authored
* Make sure that generating pyapi does not throw errors due to directory structure or linenos * Update basic.py
-