Commits · b9a77a9d5018258f97bdaba9324930de6c41fb9a · submodule / ngraph

21 Jun, 2018 1 commit

Constant folding for Reshapes (#1130) · b9a77a9d

Adam Straw authored 6 years ago

* adding constant propagation pass

* adding test/constant_propagation.cpp

* template make_constant_reshape function

* code review feedback

* add missing files

b9a77a9d

20 Jun, 2018 1 commit
- Fix two bugs with concat for 0-size tensors (#1120) · 22e783ff
  Adam Procter authored 6 years ago
```
* Fix bug with concat for 0-size tensors

* Simplify test for zero-length axes, per PR comments
```
  22e783ff
19 Jun, 2018 2 commits

Bob/cmake (#1118) · 4847b2de

Robert Kimball authored 6 years ago

* fix mkldnn rpath

* fix compile warning

* close backends when exiting

* set backend output directory of backends to the ngraph output directory

* Aprocter/patch patch (#1119)

* Move more rpath stuff inside if(NOT APPLE)

* fix repatch problem with mkldnn library

* add updated patch command for older versions of cmake

4847b2de

Loop Kernel Op + Tests (#1028) · 96295aaa

Nick Korovaiko authored 6 years ago

* loop kernel + tests

* remove commented out code

* remove commented code; add comments

* copy_with_new_args +test

* add comment

* fix comp errors

96295aaa

16 Jun, 2018 2 commits

Strided Convolution (#1058) · 94844d13

Nick Korovaiko authored 6 years ago

* optimized strided convolutions

* clean up debug messages

* format fixes

* more tests

* even more tests

* adapt to resnet-50.v1

* fix format errors; remove changes from diff PRs

94844d13

enable cse for reduction ops (#1030) · 656dfa55
Nick Korovaiko authored 6 years ago
```
* enable cse for reduction ops

* reduction tests
```
656dfa55

15 Jun, 2018 2 commits

move tbb test from backend_test to cpu_test because it is CPU only (#1102) · 7d6a0d1c
Robert Kimball authored 6 years ago

7d6a0d1c

RNN fusion across layers (#1085) · f75b8006

Pruthvi authored 6 years ago

* - Added graph pass for fusing RNN op across layer
- Added test case for inter v/s cpu for verifying layer fused RNN
- more sanity checks in the RNN fusion graph pass
- added support to replace the recurrent cell state correctly in the fused RNN op

* Fixed multi layer rnn fusion unit test failure

* Addressed PR comments

f75b8006

13 Jun, 2018 3 commits

Ubuntu 18 build support (#1101) · 838ba3f1

Robert Kimball authored 6 years ago

* backend libraries now found in tree

dynamically read header search paths

fix running from install

838ba3f1

Group Convolution (#1041) · 4a2c3c9c

Nick Korovaiko authored 6 years ago

*  group conv init

* add GroupConvolution op; refine checks in fusion logic

* add an emitter, cpu assigment

* cpu_layout

* add checks to algebraic simplification

* updating emitter logic for groupconvolution

* working before refactoring

* moving primitive creation logic to mkldnn_emitter

* group convolution graph test

* rename an opt

* address jbobba's feedback

4a2c3c9c

gpu deconvolution (#1099) · 40069d27

Fenglei authored 6 years ago

* add pad_dilation function

* add dilation to gpu_emitter

* add CoordinateDiff constructor to GPUShape

* remove unecessary cast

* working version for forward

* forward working

* forward test all pass

* deconvolution forward

* backward data dilation

* forward test passed

* initial to 0

* fix bug for get_padded_shape and clang format

* code style, change variable names

* refactor convolution conditions

* fix bug padding_below_diff

* change pad_dilation to pad_dynamic, compare to pad

* remove passed convolution test from skip list, clang format

* change pad to use GPUShape

40069d27

12 Jun, 2018 1 commit

CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070) · 83e6aa5f

Chris Sullivan authored 6 years ago

* Added op::ReplaceSlice and enabled respective tests.

* div64 -> division_by_invariant_multiplication

* Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusuable memory space for workspace allocations.

* Added GPUShape and reworked Shape helpers to be
compatible with different shape types.
Shape is now implicitly convertable to GPUShape.

* Updated shape helpers signature and add conversion operators/constructors for GPUShape.

* Removed several unecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape convertion operators.

* Forgot to fix lambda for workspace allocations to match that of argspace allocations.

* Added GPUShape and reworked Shape helpers to be
compatible with different shape types.
Shape is now implicitly convertable to GPUShape.

* Updated shape helpers signature and add conversion operators/constructors for GPUShape.

* Adjust row_major_strides to avoid reversed-copy.

* Moved declaration out of loop for clang.

* Moved gpu_shape to gpu transformer.

* Removed no longer necessary headers.

* Added stdexcept header to gpu_shape.hpp

* Coordinate->GPUShape

* Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.

* Refactor allocations to make use of primitive emitter.
Now memory primitives are registered at compile time and
the gpu memory address is resolved at runtime by ivoking
the primitive.

* Changed check on 64bit shape to check if high bits are set.

* Added const qualifier to data being copied in GPUAllocator::reserve_argspace

* Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.

* Forgot to remove no longer necessary buffer freeing from op emitters.

* Removed replace slice.

* Removed more replace_slice diffs.

* Updated replace_slice op to utilize GPUShape and GPUMemoryManager.

* Added back missing changes after timeline resolution.

* Added spacing between functions in GPUShape and boolean operators in shape.hpp.

* Template parameters are UPPER_SNAKE_CASE.

* Added unit tests for GPUMemoryManager and added checks that ensure the
device memory is allocated prior to address resolution by the memory_primitives.
Also exposed the allocation size of the memory manager.

* Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
This should be 64bits wide regardless of the underlying value_type of the ShapeType.

* Upstreaming changes to shape_size (which returns size_t).

* cuDNN softmax impl. for all axis activation.

* Added catch for per-axis activations.

* Removed commended headers.

* Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.

* Add softmax cuda kernel. It relies on atomic memory addition to global
memory, this will add contention and should be optimized in the
future. A multilevel reduction can be found in
cs/gpu_softmax_cuda_shfl but it requires some further engineering.

* Refactored reduce coordinate transform code into a helper and applied it to broadcast.
Broadcast added to CUDAEmitter, now supports multiple non-consecutive axes.

* Removed change to data_types variable and updated/removed comments.

* Refactored softmax into the emission of two fused elementwise collective ops.
Added fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast.

* Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.

* GPU workspace memory is zero initialized by default but can be turned off if desired.

* Added template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE,
to specify the ngraph op type to use for the reduction in the fusted ew_collective kernel.

* Renamed variables and updated a comment.

* Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.

* Clang complained about lack of explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switch to std::set<size_t>.
This also has the benefit that in the future, if we wish to emit kernels without ngraph core (for example in a standalone binary via a
serialized graph manifest, we don't depend on AxisSet.

* softmax -> broadcast in build_broadcast.

* Separate elementwise and elementwise_collective.

83e6aa5f

07 Jun, 2018 1 commit

ngraph-1676 batch dot fusion (#1071) · 6f5e3ac7

Louis Feng authored 6 years ago

* batch dot pattern wip.

* batch dot pattern wip.

* added batch dot op.

* batch dot compute testing.

* correct gemm parameters.

* renaming matrix fusions passes and update tests.

* clean up.

* clang format.

* more clean ups.

* clang format.

* added CPUBatchDotFusion to default cpu passes.

* added missing header.

* added element type check.

6f5e3ac7

06 Jun, 2018 1 commit
- Support 3-D pool with mkldnn (#1079) · bb5c7f07
  Nishant Patel authored 6 years ago
```
* Support 3-D pool with mkldnn

* Move execute() to test_tools.hpp
```
  bb5c7f07
05 Jun, 2018 3 commits

Add StopGradient op to ngraph (#1067) · 52313f9e

Ashok Emani authored 6 years ago

* add StopGradient op

* add StopGradient op src

* remove adjoints and add interpreter

* fix compile issue

* use nop_elimination and add unit-test

* update cmake

* update unit-tests

52313f9e

Slice Concat Elimination (#948) · 91ecac9d

Nick Korovaiko authored 6 years ago

* slice elimination

* add comment for simplify_concat

* fix concat_slice

* another reshape-related fix

* added a missing header

* disable reshape-concat optimization

* test fix

91ecac9d

Added per argument alignment to GPUAllocator::reserve_argspace. (#1069) · 6638e02b

Chris Sullivan authored 6 years ago

* Added per argument alignment to GPUAllocator::reserve_argspace.

* Changed alignment in tests to match update to alignment in backend.

6638e02b

04 Jun, 2018 1 commit

Modernize cmake usage (#1032) · eef750df

Robert Kimball authored 6 years ago

* Update cmake files to more modern approach

* disable building libraries that are not required

* handle more build cases

* add versions to backend libs. add start of package target.

* add create_backend to backends

* temporary workaround to tbb not linking correctly with gcc

* install codegen lib

* force tbb to link to the cpu backend so that it is available for codegen

* fix clang build error

* fix warning for codegen build

* update cuda header paths

* change error message for opening backend shared library

* set lib path

eef750df

02 Jun, 2018 1 commit
- Floating point comparison with ULP, adding close_f and all_close_f (#1068) · b8e28555
  Yixing Lao authored 6 years ago
  
  b8e28555
31 May, 2018 2 commits
- NGRAPH-1605 Sigmoid multiply fusion (#964) · 5a7d60a1
  Louis Feng authored 6 years ago
  
  5a7d60a1
- Add missing ops to serializer (#1060) · ba2cbdd6
  Robert Kimball authored 6 years ago
```
* update serializer for all new ops
```
  ba2cbdd6
30 May, 2018 2 commits
- Refactor CPUWorkspaceInsertion to simplify its use in MxNet (#988) · fa221c5f
  Nick Korovaiko authored 6 years ago
```
* refactor cpworkspaceinsertion for mxnet

* rename mxnet functions to adhere to ngraph naming convention

* define a member static const int in a cpp file to resolve a linking issue
```
  fa221c5f
- Fuse conv+bias bprop debug (#1038) · a1d78033
  Nishant Patel authored 6 years ago
  
  a1d78033
29 May, 2018 1 commit

[CS:GPU::Part 1] Add GPUShape type, conversion operators, and generalized shape helpers. (#1031) · d051f5fa

Chris Sullivan authored 6 years ago

* Added GPUShape and reworked Shape helpers to be
compatible with different shape types.
Shape is now implicitly convertable to GPUShape.

* Updated shape helpers signature and add conversion operators/constructors for GPUShape.

* Adjust row_major_strides to avoid reversed-copy.

* Moved declaration out of loop for clang.

* Moved gpu_shape to gpu transformer.

* Removed no longer necessary headers.

* Added stdexcept header to gpu_shape.hpp

* Changed check on 64bit shape to check if high bits are set.

* Added spacing between functions in GPUShape and boolean operators in shape.hpp.

* Template parameters are UPPER_SNAKE_CASE.

* Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
This should be 64bits wide regardless of the underlying value_type of the ShapeType.

* [CS:GPU::Part 2] Add GPUMemoryManager, GPUAllocator, and memory primitives. (#1034)

This is a big PR which introduces the GPUMemoryManager, GPUAllocator, and the concept of memory primitives.

A memory primitive is a closure which yields the device memory address for a reserved memory space. When a memory reservation is made, the request is recorded along with the data that should be copied (for kernel arguments, but not for workspace memory). The reservation does not yield an address eagerly but instead does so lazily by returning an index which can be used to look up the memory_primitive at runtime. This allows the GPUMemoryManager to delay resolution of the memory address until all reservations have been made.

Ideally, the temporary allocations needed by each kernel could be captured by the liveness lists in the GPU_External_Function. This way the pass::MemoryManager would capture these allocations along with the needed tensor allocations.

For now, rather than rearchitect the gpu_emitter and external function, we utilize the GPUMemoryManager, which maintains its own internal pass::MemoryManager, and the GPUAllocator. Liveness is handled by the GPUAllocator: all workspace allocation/reservations created in the same (or sub)scope as the GPUAllocator will persist until the GPUAllocator goes out of scope and deconstructs. At that time, the GPUAllocator will mark the requested temporary buffers as free, and their liveness will be removed (effectively). That way the next kernels that construct a GPUAllocator can reuse the workspace memory that was needed for previous kernels.

Additional notes:
* This PR updates the CUDAEmitter to exclusively utilize GPUShape instead of Shape.

Commits:
* Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusuable memory space for workspace allocations.

* Added GPUShape and reworked Shape helpers to be
compatible with different shape types.

* Coordinate->GPUShape

* Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.

* Refactor allocations to make use of primitive emitter. Now memory primitives are registered at compile time and the gpu memory address is resolved at runtime by invoking the primitive.

* Added const qualifier to data being copied in GPUAllocator::reserve_argspace

* Removed more replace_slice diffs.

* Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.

[CS:GPU::Part 3] Refactoring of several ops to use GPUMemoryManager (#1035)

This PR implements the new GPUMemoryManager and allocator for all the ops which were previously implemented but required allocations and copies for kernel arguments at runtime.

Limitations:
The convolution workspaces could not be added because the relevant descriptors were not available at compile time due to the codegen. If convolution is later added to the CUDNN emitter, the GPUAllocator can be used to avoid workspace allocation at runtime.

Commits:
* Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.

* Forgot to remove no longer necessary buffer freeing from op emitters.

[CS:GPU::Part4] Added op::ReplaceSlice and enabled respective tests. (#999)

This PR implements ReplaceSlice using the coordinate transformation strategy. A thread for each tensor element of the input tensor is chosen and it's position in the source tensor coordinate system is calculated. If it is within the source slice, the source is loaded and written out, otherwise the input tensor is loaded.

* Relevant tests are enabled.

* This op was refactored to utilize the new GPUAllocator and memory manager.

Commits:

* Updated replace_slice op to utilize GPUShape and GPUMemoryManager.

* Added back missing changes after timeline resolution.

* Fixed clang warnings and bug. The cudnn_handle was not initialized ahead of emission time and so any eager cudnn calls would fail.
To fix this, the cudnn and cublas handle creation was moved to the external function constructor.

* Changed row_major_strides to always return vector<size_t> to avoid overflow for tensors with many dimensions. Handle the conversion to 32 bits for GPU shapes with an explicit conversion constructor from vector<size_t>.

* During merge the allocation line from external_function was left out. Adding it back.

d051f5fa

26 May, 2018 1 commit
- Bug fix to graph control logic to always compute output tensors (#1053) · 2f776ef0
  Jayaram Bobba authored 6 years ago
```
* Bug fix to graph control logic to always compute output tensors

* Remove stale comments
```
  2f776ef0
25 May, 2018 3 commits
- Any op (#1036) · bff65fe3
  Nick Korovaiko authored 6 years ago
```
* add any op
```
  bff65fe3
- add gpu product (#1040) · 60523801
  Fenglei authored 6 years ago
```
* add gpu product

* enable test, change initial value for product
```
  60523801
- fix bug, add another test to catch the error, enable more tests (#1048) · 2177cf5b
  Fenglei authored 6 years ago
  
  2177cf5b
23 May, 2018 1 commit

LSTM fusion + RNN fusion across time slice's for single layer (#826) · 1d08f073

Pruthvi authored 6 years ago

* - Added pattren matcher for LSTM cell

* WIP added support to replace lstm cell instead of subgraph

* WIP LSTM pattern matcher, fuses recurrent cells

* WIP added RNN CPU op

* WIP mkldnn emmiter code for fprop RNN

* WIP RNN mkldnn integration
- Added mkldnn kernel for uni directional LSTM in the CPU emitter

* add a getter for root node

* recurrent graph rewrite

* fix perms, rename match_root -> get_match_root

* fix comp errors

* make match_root return the topmost match; fix tests

* - WIP GetOutputElement for handling multiple LSTM o/ps
- use RecurrentGraphRewrite for replacing node after matching LSTM cells

* WIP LSTM multi Output + debug prints

* moved LSTM fusion to cpu_fusion

* WIP added RNN superfused OP

* WIP towards RNN layer fusion

* WIP multiple output slicing RNN

* WIP RNN mulitple o/ps fusion across layer

* WIP corrected input params for fused RNN OP

* concat corrosponding param's across differnt LSTM to form inputs to RNN fused op

* i) Added  test case for RNN kernel ii) runs without error's

* refactored and moved LSTM class to standalone file

* Rename RNN -> Rnn , LSTM -> Lstm

* WIP replace lstm slices to the consumer op

* Slicing works on multiple RNN layers

* fixed all bugs

* - Added CPU RNN Recurrent Fusion
- Added CPU LSTM fusion
- removed debug code
- style fix

* - Added support to compute src_iter and dst_iter instead of taking zero_memory_desc
- Added unit test to compute one LSTM cell

*  changed RNN op signature to accept number of states in basic unit of RNN(GRU/LSTM/ vanilla RNN) cell

* added sanity checks for RNN op

* Fixed issue related to patching the graph while replacing the RNN sliced outputs

* Fixed issue to feed the input symbols in the order X0, X1, ...Xt to the RNN op

* Added unit test for multi layer RNN fusion

* Removed debug statements

* Added mulitlayered serialized graph ii) fixed compilation issue

* Addressed PR comments

* i) WIP MKLDNN layout for RNN Op ii) added test case for INTERPRETER v/s CPU Rnn results

* - Fixed bug w.r.to src_layer feature size in rnn mkldnn emitter code
- Refactored cpu_fusion rnn test case

* merge origin/master with branch pruthvi/lstm_fusion

* style fix

* Added test case for multiple RNN layers

* i) make rnn as mkldnn op if it meets the constraints ii) assert if rnn is not mkldnn op

* fix unit test failure

* - Added support to reliabily identify the hiddent state and input symbols from the nodes collected by Pattern matcher
- Fixed failing unit tests

* style fix

* - removed "node type" dependency to replace the intermediate LSTM outputs

* Addressed PR comments

* Fix unit test

* - added MKLDNN emitter for LSTM op
- graph pass to concat LSTM input recurrent state tensors
- CPU layout assignment for LSTM Op
- Fixed bug in rnn/lstm unit test's
- made changes to use replace_output instead of replace_node for replacing matched graph nodes in LSTM/RNN fusion pass

(cherry picked from commit d16fc709265cc0a73e60c6d5f6d2878e7b908aca)

* style fix

* Renamed passes and style fixes

1d08f073

21 May, 2018 1 commit
- Batch norm folding (#992) · 0d125c51
  Jayaram Bobba authored 6 years ago
```
* Batch norm folding

* Addressed PR feedback

* Style fixes

* Style fix
```
  0d125c51
18 May, 2018 1 commit

Enable reverse_sequence for Interpreter (#977) · cd59bfe4

Nick Korovaiko authored 6 years ago

* use reference kernel for reverse_sequence for int

* move tests

* resolve CI errors

* TEST to NGRAPH_TEST

cd59bfe4

16 May, 2018 1 commit
- give frontends some flexibility over fusions they would like to run (#1010) · 57fd873d
  Nick Korovaiko authored 6 years ago
```
* give frontends some flexibility over fusions they would like to run

* address jbobbas feedback
```
  57fd873d
14 May, 2018 3 commits
- parse inline comments for manifest (#1006) · f84b795e
  Yixing Lao authored 6 years ago
  
  f84b795e
- Build NNP with ngraph as a library (#1005) · cc5ddd2f
  Yixing Lao authored 6 years ago
```
* Enable NNP reverse build, clean up ngraph repo

* clean mkldnn cmake
```
  cc5ddd2f
- Bob/numeric representation (#1004) · d4d56986
  Robert Kimball authored 6 years ago
```
* Generate binary representable floats for input data
```
  d4d56986
11 May, 2018 4 commits
- remove visualiazeTree passes in AlgebraicSimplification tests (#998) · 00c4830e
  Nick Korovaiko authored 6 years ago
  
  00c4830e
- move nop elimination pass to nGraph and add broadcast elimination (#995) · ddf3f4f0
  Adam Straw authored 6 years ago
```
* move nop elimination pass to nGraph and add broadcast elimination

* fix pad test bug

* remove graph visualizer and clean up includes in nop eliminate test

* code format
```
  ddf3f4f0
- Add type_prop tests for ReverseSequence (#978) · 516167f7
  Nick Korovaiko authored 6 years ago
```
* type tests for reverse_sequence

* remove commented out code
```
  516167f7
- Fix broken build dependency. (#997) · 175d51fa
  Sang Ik Lee authored 6 years ago
  
  175d51fa
10 May, 2018 1 commit
- Remove prints in test_control (#991) · c842d353
  Yixing Lao authored 6 years ago
  
  c842d353