Commit f78133d2 authored by L.S. Cook's avatar L.S. Cook Committed by Scott Cyphers

Doc distributed training (#1104)

* editing how to execute computation file for clarity and linenos

* Add placeholder for runtime docs

* Update section on backends, interpreter, and FPGA options

* add updated master to fix python_ci

* Weird autosummary issue reverted

* Clarify new section

* fix up docs

* Update pattern matcher doc based on Nik's presentation slides WIP

* Update doc structure and examples

* remove old folder

* Fix broken Tensorview refs

* new section on distr training

* updated index w/drafted outline

* . helping people document code more efficiently

* edit WIP branch

* WIP editing

* WIP editing

* init distributed doc

* PR review edits

* modify dist doc and dist mnist_mlp

* Finish PR review comment fixes so far

* Improving distributed training docs

* Fix build error now that we have documented interface backends use

* update example build and run

* update how-to distributed training doc

* Editing distr train docs

* Reword section to avoid strange doc build error

* rebuild for zero errors for CI

* split patternmatcher PR

* PR feedback added

* Add more help and detail for MXNet and neon distr

* Resolve merge conflicts due to patternmatcher doc split

* Resolve merge conflicts due to patternmatcher doc split

* Resolve build errors manually

* These files are already added to the branch

* fix style

* update with glossary def and link to Intel paper on synchronous SGD

* fix link to sgd

* remove comm_rank in dist example
parent dcdaf26e
......@@ -17,3 +17,13 @@
add_executable(mnist_mlp mnist_loader.cpp mnist_mlp.cpp)
add_dependencies(mnist_mlp ngraph cpu_backend)
target_link_libraries(mnist_mlp ngraph cpu_backend)
if (NGRAPH_DISTRIBUTED_ENABLE)
    # The distributed example needs MPI headers and libraries for the AllReduce op.
    find_package(MPI REQUIRED)
    add_definitions(-DNGRAPH_DISTRIBUTED)
    include_directories(SYSTEM ${MPI_C_INCLUDE_PATH} ${MPI_CXX_INCLUDE_PATH})
    link_directories(${MPI_C_LIBRARIES} ${MPI_CXX_LIBRARIES})
    link_libraries(${MPI_CXX_LIBRARIES})
    add_executable(dist_mnist_mlp mnist_loader.cpp dist_mnist_mlp.cpp)
    add_dependencies(dist_mnist_mlp ngraph cpu_backend)
    target_link_libraries(dist_mnist_mlp ngraph cpu_backend)
endif()
.. distr/index:
Distributed Training in nGraph
==============================
Why distributed training?
-------------------------
Training deep neural networks -- in diverse areas from computer vision to
natural language processing -- requires a tremendous amount of data, and the
computation used in AI training has been increasing exponentially. Even though
significant improvements have been made in algorithms and hardware, training a
very large neural network or model on a single machine is usually not
practical. The use of multiple nodes, then, becomes important for making deep
learning training feasible with large datasets.
Data parallelism is the most popular parallel architecture for accelerating
deep learning with large datasets. The first algorithm we support is based on
the `synchronous`_ :term:`SGD` method; it partitions the dataset among workers,
and each worker executes the same neural network model. For every iteration,
the nGraph backend computes the gradients in back-propagation, aggregates the
gradients across all workers, and then updates the weights.
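
For reference, the synchronous SGD update described above can be written as
follows; this formula is included purely for illustration (the notation --
:math:`N` workers, learning rate :math:`\eta`, and per-worker loss
:math:`L_k` -- is ours, not from the nGraph sources):

.. math::

   w_{t+1} = w_t - \eta \cdot \frac{1}{N} \sum_{k=1}^{N} \nabla L_k(w_t)
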
How? (Generic frameworks)
-------------------------
The essential operation for synchronizing gradients across all workers in data
parallel training is "allreduce", favored over parameter servers for its
simplicity and scalability. The AllReduce op is one of the nGraph Library's
core ops. To enable gradient synchronization for a network, we simply inject
the AllReduce op into the computation graph, connecting the graph for the
autodiff computation and optimizer update (which then becomes part of the
nGraph graph). The nGraph Backend handles the rest.
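
As a rough sketch of what that injection looks like (this snippet is for
illustration and is not taken from the example sources; the helper name
``synchronize_gradient`` and the gradient node ``delta`` are hypothetical, and
it assumes the single-input ``ngraph::op::AllReduce`` constructor):

.. code-block:: cpp

   #include <memory>

   #include <ngraph/ngraph.hpp>

   // Wrap a back-propagated gradient in AllReduce so that each worker receives
   // the gradient aggregated across all workers before the weight update.
   std::shared_ptr<ngraph::Node> synchronize_gradient(
       const std::shared_ptr<ngraph::Node>& delta)
   {
       return std::make_shared<ngraph::op::AllReduce>(delta);
   }

The returned node is then connected to the optimizer update in place of the raw
gradient, so the rest of the graph is unchanged.
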
Data scientists with locally-scalable rack or cloud-based resources will likely
find it worthwhile to experiment with different modes or variations of
distributed training. Deployments using the nGraph Library with supported
backends can be configured to train with data parallelism, and will soon work
with model parallelism. Distributing workloads is increasingly important, as
more data and bigger models mean that the ability to
:doc:`distribute training <../howto/distribute-train>` is needed to work with
larger and larger datasets, or with models having so many layers that they
cannot fit on a single device.
Distributed training with data parallelism splits the data, and each worker
node has the same model; during each iteration, the gradients are aggregated
across all workers with an op that performs "allreduce", and then applied to
update the weights.
Using multiple machines helps to scale and speed up deep learning. With large
mini-batch training, `one could train ResNet-50 with Imagenet-1k data`_ to
*Top 5* accuracy in minutes using thousands of CPU nodes. See also:
`arxiv.org/pdf/1709.05011.pdf`_.
MXNet
-----
We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify the
SGD update op so that the nGraph graph contains the allreduce op and generates
the corresponding collective communication kernels for different backends. We
are using OpenMPI for CPU backends and plan to integrate `Intel MLSL`_ in the
future.

The figure below is a bar chart of preliminary results from ResNet-50 I1K
training in MXNet on 1, 2, 4 (and, if available, 8) nodes; the x-axis is the
number of nodes and the y-axis is throughput (images/sec).
.. TODO add figure graphics/distributed-training-ngraph-backends.png
TensorFlow
----------
We plan to support the same approach in nGraph-TensorFlow; it is still a work
in progress. Meanwhile, users can use Horovod with the current nGraph
TensorFlow integration, where the allreduce op is placed on the CPU instead of
on the nGraph device.

Figure: a bar chart of preliminary results from ResNet-50 I1K training in
TensorFlow on 1, 2, 4 (and, if available, 8) nodes; the x-axis is the number
of nodes and the y-axis is throughput (images/sec).
Future work
-----------
Model parallelism, which requires support for additional communication ops, is
in the works. To enable more general forms of parallelism, such as model
parallelism, we plan to add further collective communication ops, including
allgather, scatter, and gather.
.. _synchronous: https://arxiv.org/pdf/1602.06709.pdf
.. _one could train ResNet-50 with Imagenet-1k data: https://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/
.. _arxiv.org/pdf/1709.05011.pdf: https://arxiv.org/pdf/1709.05011.pdf
.. _Intel MLSL: https://github.com/intel/MLSL/releases
\ No newline at end of file
......@@ -94,6 +94,10 @@ Compile MXNet with nGraph
$ python example/image-classification/train_mnist.py
#. (Optional) For experimental or alternative approaches to distributed training
   methodologies, including data parallel training, see :doc:`distr/index` and
   the :doc:`How to <howto/index>` article on :doc:`howto/distribute-train`.
.. _tensorflow_intg:
......@@ -104,7 +108,6 @@ See the `ngraph tensorflow bridge README`_ for how to install the `DSO`_ for the
nGraph-TensorFlow bridge.
.. _neon_intg:
neon |trade|
......@@ -168,6 +171,9 @@ system that already has an ``ngraph_dist`` installed.
(neon_venv)$ python cifar10_conv.py
#. (Optional) For experimental or alternative approaches to distributed training
   methodologies, including data parallel training, see :doc:`distr/index` and
   the :doc:`How to <howto/index>` article on :doc:`howto/distribute-train`.
......
digraph G {
Label_0 -> Max_2
Constant_1 -> Max_2
Label_0 [shape=ellipse color=black]
Constant_1 [shape=ellipse color=black]
Max_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Negative_1 -> Negative_2;
Parameter_0 [shape=box color=blue]
Negative_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Abs_1 -> Negative_2 -> Negative_3;
Parameter_0 [shape=box color=blue]
Abs_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Parameter_1 -> Add_2
Add_2 -> Abs_3 -> Negative_4 -> Negative_5
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Add_2 [shape=ellipse color=black]
Abs_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
Negative_5 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Parameter_1 -> Add_2
Add_2 -> Negative_3 -> Negative_4
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Add_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Sub_2
Parameter_1 -> Sub_2
Sub_2 -> Negative_3 -> Negative_4
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Sub_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
}
digraph G {
Parameter_1 -> Negative_2 -> Negative_3;
Parameter_1 [shape=box color=blue]
Negative_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
}
digraph G {
Label_0 -> Negative_1 -> Negative_2;
Label_0 [shape=ellipse color=black]
Negative_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Add_2
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_3
Constant_1 -> Broadcast_2
Broadcast_2 -> Add_3
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Broadcast_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Broadcast_2
Constant_1 -> Add_3
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Broadcast_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
}
digraph G {
Constant_1 -> Skip_2
Label_3 -> Add_4
Skip_2 -> Add_4
Constant_1 [shape=ellipse color=black]
Skip_2 [shape=ellipse color=black]
Label_3 [shape=ellipse color=black]
Add_4 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Add_2
Add_2 -> Add_3
Constant_2 -> Add_3
Add_3 -> Add_4
Constant_3 -> Add_4
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Constant_2 [shape=ellipse color=black]
Constant_3 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
Add_4 [shape=ellipse color=black]
}
digraph G {
Label_0 -> Add_2
Constant_1 -> Add_2
Label_0 [shape=ellipse color=black]
Constant_1 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
}
......@@ -165,3 +165,8 @@ Glossary
gates. These gates help avoid the problem of exploding or vanishing
gradients that occur in the traditional RNN.
SGD
:abbr:`Stochastic Gradient Descent (SGD)`, also known as incremental
gradient descent, is an iterative method for optimizing a differentiable
objective function.
\ No newline at end of file
.. howto/distribute-train.rst
Train using multiple nGraph CPU backends with data parallelism
===============================================================
In the :doc:`previous section <../howto/derive-for-training>`, we described the
steps needed to create a "trainable" nGraph model. Here we demonstrate how to
train a data parallel model by distributing the graph across devices.
To use this mode of training, first install a supported version of `OpenMPI`_
(1.10 or newer).
Next, create an nGraph build with the cmake flag ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
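
For example, from a build directory (the directory layout shown here is an
assumption; adjust the paths to your own checkout):

.. code-block:: console

   $ cd ngraph/build
   $ cmake .. -DNGRAPH_DISTRIBUTED_ENABLE=TRUE
   $ make -j
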
To deploy data-parallel training across multiple nodes or devices, the
``AllReduce`` op should be added after the steps needed to complete the
:doc:`backpropagation <../howto/derive-for-training>`.
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 188-191
Since we are using OpenMPI in this example, we also need to initialize and
finalize MPI.
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 112
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 295
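
The two single-line includes above are not reproduced on this page; as a rough
sketch (assumed for illustration, not copied from ``dist_mnist_mlp.cpp``),
initializing and finalizing MPI around the training code looks like this:

.. code-block:: cpp

   #include <mpi.h>

   int main(int argc, char* argv[])
   {
       // Initialize MPI before building and running the distributed graph.
       MPI_Init(&argc, &argv);

       // ... build the model, inject AllReduce, and run the training loop ...

       // Finalize MPI once training is complete.
       MPI_Finalize();
       return 0;
   }
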
Finally, to run the training on two nGraph devices, invoke :command:`mpirun`.
This will run on a single machine and launch two processes.
.. code-block:: console
$ mpirun -np 2 dist_mnist_mlp
.. _OpenMPI: https://www.open-mpi.org/software/ompi/v3.1
......@@ -26,9 +26,11 @@ usually named ``<some_model>.onnx`` or ``<some_model>.onnx.pb``. These
or ``.onnx.pb`` formatted file, you should be able to run the inference
without needing to dig into anything from the "Frameworks" sections. You
will, however, need to have completed the steps outlined in
our :doc:`../install` guide. If you intend to use nGraph for
:doc:`distributed-training`, you will need a build that has been compiled with
the additional cmake flag: ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
To demonstrate functionality, we'll use an already-serialized CIFAR10 model
trained via ResNet20. Remember that this model has already been trained and
exported from a framework such as Caffe2, PyTorch or CNTK; we are simply going
to build an nGraph representation of the model, execute it, and produce some
......
......@@ -11,6 +11,7 @@ How to
operator.rst
update.rst
derive-for-training.rst
distribute-train.rst
import.rst
The "How to" articles in this section explain how to do specific tasks with
......
......@@ -147,6 +147,7 @@ Contents
framework-integration-guides.rst
frameworks/index.rst
programmable/index.rst
distr/index.rst
python_api/index.rst
project/index.rst
......@@ -163,3 +164,4 @@ Indices and tables
.. _ngraph onnx companion tool: https://github.com/NervanaSystems/ngraph-onnx
.. _Movidius: https://www.movidius.com/
.. _contributions: https://github.com/NervanaSystems/ngraph#how-to-contribute