Commit db610006 authored by Leona C, committed by Scott Cyphers

Fix broken link and update details about mxnet (#2684)

* Update MXNet bridge info page

* Fix warning from docbuild due to heading

* Update page on distributed training
.. distr/index.rst:
################################
Distributed training with nGraph
################################
.. important:: Distributed training is not officially supported as of version
   |version|; however, some configuration options have worked for nGraph
   devices in testing environments.

Why distributed training?
=========================
A tremendous amount of data is required to train DNNs in diverse areas -- from
computer vision to natural language processing. Meanwhile, computation used in
AI training has been increasing exponentially. Even though significant
improvements have been made in algorithms and hardware, using one machine
to train a very large :term:`NN` is usually not optimal. The use of
multiple nodes, then, becomes important for making deep learning training
feasible with large datasets.

Data parallelism is the most popular parallel architecture for accelerating
deep learning with large datasets. The first algorithm we support is `based
on the synchronous`_ :term:`SGD` method; it partitions the dataset among
workers, where each worker executes the same neural network model. For every
iteration, the nGraph backend computes the gradients in back-propagation,
aggregates the gradients across all workers, and then updates the weights.
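The iteration described above can be sketched in a few lines of
framework-agnostic Python; the worker count, learning rate, and toy loss
below are illustrative only, not part of the nGraph API:

```python
# One iteration of synchronous data-parallel SGD, sketched in pure Python.
# The toy loss, shard contents, and learning rate are illustrative only.

def local_gradient(weights, shard):
    # Stand-in for back-propagation on one worker's shard: the gradient of
    # the squared error 0.5 * (w - x)**2, averaged over the shard.
    return [sum(w - x for x in shard) / len(shard) for w in weights]

def sync_sgd_step(weights, shards, lr=0.1):
    # Every worker holds an identical copy of the model and computes
    # gradients on its own shard of the mini-batch ...
    grads = [local_gradient(weights, shard) for shard in shards]
    # ... the gradients are averaged across all workers ...
    n = len(shards)
    avg = [sum(g[i] for g in grads) / n for i in range(len(weights))]
    # ... and the same update is applied everywhere, keeping models in sync.
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [0.0]
shards = [[1.0, 2.0], [3.0, 4.0]]   # dataset partitioned across two workers
weights = sync_sgd_step(weights, shards)
print(weights)                       # [0.25]
```

Because every worker applies the same averaged gradient, the model replicas
stay bit-identical across iterations without any central coordinator.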
How? (Generic frameworks)
=========================
* :doc:`../core/constructing-graphs/distribute-train`
The essential operation for synchronizing gradients across all workers in
data-parallel training is ``allreduce``, chosen for its simplicity and its
scalability relative to parameter servers. The ``AllReduce`` op is one of
the nGraph Library's core ops. To enable gradient synchronization for a
network, we simply inject the ``AllReduce`` op into the computation graph
between the autodiff computation and the optimizer update, which then
becomes part of the nGraph graph; the nGraph backend handles the rest.
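To make the aggregation step concrete, here is a pure-Python sketch of the
ring-allreduce pattern commonly used to implement ``allreduce``; it
simulates the per-worker message passing inside a single process and is not
the nGraph implementation:

```python
# Single-process simulation of ring allreduce (reduce-scatter + allgather).
# Illustrative sketch only, not the nGraph or MLSL implementation.

def ring_allreduce(grads):
    """Sum one equal-length gradient vector per worker, ring-style.

    Each worker exchanges only 2 * (n - 1) chunks of size len/n with its
    ring neighbors, so per-worker bandwidth stays roughly constant as
    workers are added; this is why allreduce scales better than a central
    parameter server.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    csize = size // n
    buf = [list(g) for g in grads]          # each worker's working copy

    def sl(c):                              # slice covering chunk c
        return slice(c * csize, (c + 1) * csize)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [((i - step) % n, buf[i][sl((i - step) % n)])
                 for i in range(n)]         # snapshot before applying
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            buf[dst][sl(c)] = [a + b for a, b in zip(buf[dst][sl(c)], data)]

    # Phase 2, allgather: circulate the reduced chunks until every worker
    # has all of them.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, buf[i][sl((i + 1 - step) % n)])
                 for i in range(n)]
        for i, (c, data) in enumerate(sends):
            buf[(i + 1) % n][sl(c)] = data
    return buf

grads = [[1.0, 2.0], [3.0, 4.0]]            # one gradient vector per worker
print(ring_allreduce(grads))                # [[4.0, 6.0], [4.0, 6.0]]
```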
Data scientists with locally scalable rack or cloud-based resources will
likely find it worthwhile to experiment with different modes or variations
of distributed training. Deployments using the nGraph Library with supported
backends can be configured to train with data parallelism, and will soon
work with model parallelism. Distributing workloads is increasingly
important, as more data and bigger models mean the need to work with larger
and larger datasets, or with models having many layers that aren't designed
to fit on a single device. See
:doc:`../core/constructing-graphs/distribute-train` for details.
Distributed training with data parallelism splits the data across worker
nodes, while each node keeps a copy of the same model; during each
iteration, the gradients are aggregated across all workers with an op that
performs "allreduce" and then applied to update the weights.
Using multiple machines helps to scale and speed up deep learning. With
large mini-batch training, for example, ResNet-50 can be trained on the
ImageNet-1k dataset to *Top-5* classifier accuracy in minutes using
thousands of CPU nodes; see `arxiv.org/abs/1709.05011`_.
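One detail of such large mini-batch recipes worth noting: the effective
batch size grows with every worker added, so the learning rate is commonly
scaled up with it (the linear-scaling rule). The base values below are
purely illustrative:

```python
# Back-of-envelope arithmetic for the linear learning-rate scaling rule
# often used in large mini-batch training. Base values are hypothetical.
base_batch, base_lr = 32, 0.1
workers = 64

effective_batch = base_batch * workers   # samples consumed per update
scaled_lr = base_lr * workers            # scale lr with the batch size

print(effective_batch)                   # 2048
```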
MXNet
=====
We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify
the SGD update op so that the nGraph graph contains the ``allreduce`` op and
generates the corresponding collective-communication kernels for different
backends. We use `Intel MLSL`_ for CPU backends.
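The push/pull pattern behind a KVStore can be sketched in a few lines; the
class below is a hypothetical single-process toy for illustration, not
MXNet's KVStore API:

```python
# A toy, single-process key-value store illustrating the push/pull pattern;
# this is a hypothetical sketch, not the MXNet KVStore API.

class ToyKVStore:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}

    def init(self, key, value):
        self.weights[key] = value

    def push(self, key, grads_from_workers):
        # Aggregate gradients from all workers (the role the injected
        # allreduce op plays in the nGraph graph), then apply plain SGD.
        g = sum(grads_from_workers) / len(grads_from_workers)
        self.weights[key] -= self.lr * g

    def pull(self, key):
        # Workers read back the updated, globally consistent weights.
        return self.weights[key]

kv = ToyKVStore(lr=0.1)
kv.init("w", 1.0)
kv.push("w", [0.5, 1.5])   # two workers each push a gradient
print(kv.pull("w"))        # 0.9
```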
The figure below shows preliminary results for ResNet-50 ImageNet-1k
training in MXNet on 1, 2, 4 (and 8, if available) nodes; the x-axis is the
number of nodes, and the y-axis is throughput (images/sec).

.. TODO add figure graphics/distributed-training-ngraph-backends.png
TensorFlow
==========
We plan to support the same approach in nGraph-TensorFlow; it is still a
work in progress. Meanwhile, users can combine Horovod with the current
nGraph TensorFlow bridge, where the allreduce op is placed on the CPU
instead of on the nGraph device.
.. TODO add figure: bar chart of preliminary ResNet-50 ImageNet-1k training
   results in TensorFlow on 1, 2, 4 (and 8, if available) nodes, with the
   number of nodes on the x-axis and throughput (images/sec) on the y-axis.
Future work
===========
Model parallelism with support for more communication ops is in the works.
For more general parallelism, such as model parallel training, we plan to
add more collective communication ops, such as ``allgather``, ``scatter``,
and ``gather``, in the future. See also:
:doc:`../../core/passes/list-of-passes`.
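The semantics of the collectives mentioned above can be sketched in
single-process Python; real implementations (e.g. in MLSL or MPI) exchange
these values between processes:

```python
# Single-process sketches of the planned collective ops; illustrative only.

def allgather(worker_values):
    # Every worker ends up with the concatenation of all workers' values.
    gathered = [v for vals in worker_values for v in vals]
    return [list(gathered) for _ in worker_values]

def scatter(root_values, n_workers):
    # The root splits its data into one contiguous shard per worker.
    k = len(root_values) // n_workers
    return [root_values[i * k:(i + 1) * k] for i in range(n_workers)]

def gather(worker_values):
    # Inverse of scatter: the root collects every worker's shard in order.
    return [v for vals in worker_values for v in vals]

print(scatter([1, 2, 3, 4], 2))      # [[1, 2], [3, 4]]
print(allgather([[1], [2]]))         # [[1, 2], [1, 2]]
```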
.. _arxiv.org/abs/1709.05011: https://arxiv.org/abs/1709.05011
.. _based on the synchronous: https://arxiv.org/abs/1602.06709
.. _Intel MLSL: https://github.com/intel/MLSL/releases
.. frameworks/mxnet_integ.rst:
MXNet\* bridge
===============
* See the `README`_ on the nGraph-MXNet repo.
* See the `Integration Guide`_ on the nGraph-MXNet repo.
* **Testing inference latency**: See the :doc:`validated/testing-latency`
  doc for a fully-documented example of how to compile and test latency with
  an MXNet-supported model.
* **Training**: For experimental or alternative approaches to distributed
training methodologies, including data parallel training, see the
MXNet-relevant sections of the docs on :doc:`../distr/index` and
:doc:`How to <../core/constructing-graphs/index>` topics like :doc:`../core/constructing-graphs/distribute-train`.
.. note:: The nGraph-MXNet bridge is designed to be used with trained models
only; it does not support distributed training.
.. _README: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/README.md
.. _Integration Guide: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md
@@ -28,7 +28,7 @@ Inputs
Outputs (in place)
-------
------------------
+-----------------+-------------------------+--------------------------------+
| Name | Element Type | Shape |
@@ -26,7 +26,7 @@ Not currently a comprehensive list.
* :doc:`batch_norm_training`
* :doc:`batch_norm_training_backprop`
* :doc:`broadcast`
* :doc:`broadcastdistributed`
* :doc:`broadcast_distributed`
* :doc:`ceiling`
* :doc:`concat`
* :doc:`constant`
@@ -3,6 +3,19 @@ ngraph.exceptions
.. automodule:: ngraph.exceptions
.. rubric:: Exceptions
.. autosummary::
@@ -14,16 +14,20 @@ ngraph.ops
absolute
acos
add
argmax
argmin
asin
atan
avg_pool
batch_norm
broadcast
broadcast_to
ceiling
concat
constant
convert
convolution
convolution_backprop_data
cos
cosh
divide
@@ -31,14 +35,16 @@ ngraph.ops
equal
exp
floor
function_call
get_output_element
greater
greater_eq
less
less_eq
log
logical_and
logical_not
logical_or
lrn
max
max_pool
maximum
@@ -52,7 +58,6 @@ ngraph.ops
parameter
power
prod
reduce
relu
replace_slice
reshape
@@ -68,6 +73,7 @@ ngraph.ops
sum
tan
tanh
topk