1. 19 Jun, 2018 1 commit
    • Minor bug fix in function outlining (#1056) · 5203a301
      Jayaram Bobba authored
      * Move to depth-first serialization of graph for better cache behavior
      
      * Added comment
      
      * Force 64 byte stack alignment to avoid crashes from unaligned AVX loads/stores
      
      * Revert "Force 64 byte stack alignment to avoid crashes from unaligned AVX loads/stores"
      
      This reverts commit 84346420fbd0fbd5d05a4a1e8f5fae12bdc7348b.
      
      * revert to breadth-first serialization
  2. 18 Jun, 2018 6 commits
  3. 17 Jun, 2018 2 commits
  4. 16 Jun, 2018 3 commits
    • gpu reverse sequence (#1109) · bdfcf5b4
      Fenglei authored
      * add reverse_sequence
      
      * fix bugs, compiled
      
      * fix index bug
      
      * fix bug and clang format
      
      * correct function name
      
      * clang format
      
      * remove extra ;
      
      * remove tests from skip list
      
      * add backward support, skip tests
      
      * add back template<> line
      
      * remove unnecessary lines in kernel
    • Strided Convolution (#1058) · 94844d13
      Nick Korovaiko authored
      * optimized strided convolutions
      
      * clean up debug messages
      
      * format fixes
      
      * more tests
      
      * even more tests
      
      * adapt to resnet-50.v1
      
      * fix format errors; remove changes from diff PRs
    • enable cse for reduction ops (#1030) · 656dfa55
      Nick Korovaiko authored
      * enable cse for reduction ops
      
      * reduction tests
  5. 15 Jun, 2018 8 commits
  6. 14 Jun, 2018 3 commits
  7. 13 Jun, 2018 6 commits
    • CPU Direct Execution: Implement Ceiling · c829a9c7
      Jaikrishnan Menon authored
      Also, formatting fixes
    • CPU Direct Execution: Implement Relu · b33fc6a2
      Jaikrishnan Menon authored
    • Jaikrishnan Menon's avatar
      9d0a6998
    • Ubuntu 18 build support (#1101) · 838ba3f1
      Robert Kimball authored
      * backend libraries now found in tree
      
      dynamically read header search paths
      
      fix running from install
    • Group Convolution (#1041) · 4a2c3c9c
      Nick Korovaiko authored
      *  group conv init
      
      * add GroupConvolution op; refine checks in fusion logic
      
      * add an emitter, cpu assignment
      
      * cpu_layout
      
      * add checks to algebraic simplification
      
      * updating emitter logic for groupconvolution
      
      * working before refactoring
      
      * moving primitive creation logic to mkldnn_emitter
      
      * group convolution graph test
      
      * rename an opt
      
      * address jbobba's feedback
    • gpu deconvolution (#1099) · 40069d27
      Fenglei authored
      * add pad_dilation function
      
      * add dilation to gpu_emitter
      
      * add CoordinateDiff constructor to GPUShape
      
      * remove unnecessary cast
      
      * working version for forward
      
      * forward working
      
      * forward test all pass
      
      * deconvolution forward
      
      * backward data dilation
      
      * forward test passed
      
      * initial to 0
      
      * fix bug for get_padded_shape and clang format
      
      * code style, change variable names
      
      * refactor convolution conditions
      
      * fix bug padding_below_diff
      
      * change pad_dilation to pad_dynamic, compare to pad
      
      * remove passed convolution test from skip list, clang format
      
      * change pad to use GPUShape
  8. 12 Jun, 2018 2 commits
    • CUDA softmax kernel and broadcast kernel support for multiple non-consecutive axes (#1070) · 83e6aa5f
      Chris Sullivan authored
      * Added op::ReplaceSlice and enabled respective tests.
      
      * div64 -> division_by_invariant_multiplication
      
      * Added GPUMemoryManager for aggregating memory allocations and copies into a single operation for kernel arguments, and a reusable memory space for workspace allocations.
      
      * Added GPUShape and reworked Shape helpers to be
      compatible with different shape types.
      Shape is now implicitly convertible to GPUShape.
      
      * Updated shape helpers signature and add conversion operators/constructors for GPUShape.
      
      * Removed several unnecessary static_casts now that GPUShape is utilized. GPUTensorViewWrapper had a few functions returning std::vector<size_t> instead of Shape/Strides. These were updated as well to take advantage of GPUShape conversion operators.
      
      * Forgot to fix lambda for workspace allocations to match that of argspace allocations.
      
      * Adjust row_major_strides to avoid reversed-copy.
      
      * Moved declaration out of loop for clang.
      
      * Moved gpu_shape to gpu transformer.
      
      * Removed no longer necessary headers.
      
      * Added stdexcept header to gpu_shape.hpp
      
      * Coordinate->GPUShape
      
      * Refactored replace_slice into CudaKernelBuilder. Simplified allocations using new GPUAllocator and GPUMemoryManager.
      
      * Refactor allocations to make use of primitive emitter.
      Now memory primitives are registered at compile time and
      the gpu memory address is resolved at runtime by invoking
      the primitive.
      
      * Changed check on 64bit shape to check if high bits are set.
      
      * Added const qualifier to data being copied in GPUAllocator::reserve_argspace
      
      * Replaced runtime host to device memcpys with GPUAllocator reservations in order to move them to compile time.
      
      * Forgot to remove no longer necessary buffer freeing from op emitters.
      
      * Removed replace slice.
      
      * Removed more replace_slice diffs.
      
      * Updated replace_slice op to utilize GPUShape and GPUMemoryManager.
      
      * Added back missing changes after timeline resolution.
      
      * Added spacing between functions in GPUShape and boolean operators in shape.hpp.
      
      * Template parameters are UPPER_SNAKE_CASE.
      
      * Added unit tests for GPUMemoryManager and added checks that ensure the
      device memory is allocated prior to address resolution by the memory_primitives.
      Also exposed the allocation size of the memory manager.
      
      * Return type of shape_size should be large enough to encapsulate the full stride of the tensor.
      This should be 64bits wide regardless of the underlying value_type of the ShapeType.
      
      * Upstreaming changes to shape_size (which returns size_t).
      
      * cuDNN softmax impl. for all axis activation.
      
      * Added catch for per-axis activations.
      
      * Removed commented headers.
      
      * Added explicit function for queueing kernel argument data rather than inline in the reservation function per @fengleitian recommendation.
      
      * Add softmax cuda kernel. It relies on atomic memory addition to global
      memory, this will add contention and should be optimized in the
      future. A multilevel reduction can be found in
      cs/gpu_softmax_cuda_shfl but it requires some further engineering.
      
      * Refactored reduce coordinate transform code into a helper and applied it to broadcast.
      Broadcast added to CUDAEmitter, now supports multiple non-consecutive axes.
      
      * Removed change to data_types variable and updated/removed comments.
      
      * Refactored softmax into the emission of two fused elementwise collective ops.
      Added fused elementwise + collective kernels. Softmax is then just the combination of exp_sum_reduce + div_broadcast.
      
      * Added default param to GPUAllocator::reserve_workspace to request memory initialization for each invocation of the memory primitive.
      
      * GPU workspace memory is zero initialized by default but can be turned off if desired.
      
      * Added template parameter to CUDAEmitter::build_elementwise, REDUCE_OP_TYPE,
      to specify the ngraph op type to use for the reduction in the fused ew_collective kernel.
      
      * Renamed variables and updated a comment.
      
      * Removed outdated softmax kernel to avoid confusion. Can be added later when atomic reduce is replaced.
      
      * Clang complained about lack of explicit destructor for AxisSet. Since cuda_emitter doesn't need AxisSet specifically, switch to std::set<size_t>.
      This also has the benefit that in the future, if we wish to emit kernels without ngraph core (for example in a standalone binary via a
      serialized graph manifest), we don't depend on AxisSet.
      
      * softmax -> broadcast in build_broadcast.
      
      * Separate elementwise and elementwise_collective.
    • Replace Check (#1097) · 692101a7
      Nick Korovaiko authored
  9. 11 Jun, 2018 2 commits
  10. 08 Jun, 2018 2 commits
    • Optimized eigen kernel for spatial mean (#1094) · 0b95efa6
      Jayaram Bobba authored
      * Optimized eigen kernel for 2D reduction on a 4D tensor used for spatial mean
      
      * revert change to serializer
    • Jmenon/dexec (#1092) · abb68627
      Jaikrishnan Menon authored
      * CPU: Direct Execution
      Part 1 with bare minimum infrastructure
      
      * Refactor: Move build related functionality to a separate TU
      and external function method
      
      * Add TU back after merge
      
      * Remove an assert
      
      * Remove commented-out code
  11. 07 Jun, 2018 2 commits
  12. 06 Jun, 2018 3 commits