Commit f78133d2 authored by L.S. Cook, committed by Scott Cyphers

Doc distributed training (#1104)

* editing how to execute computation file for clarity and linenos

* Add placeholder for runtime docs

* Update section on backends, interpreter, and FPGA options

* add updated master to fix python_ci

* Weird autosummary issue reverted

* Clarify new section

* fix up docs

* Update pattern matcher doc based on Nik's presentation slides WIP

* Update doc structure and examples

* remove old folder

* Fix broken Tensorview refs

* new section on distr training

* updated index w/drafted outline

* . helping people document code more efficiently

* edit WIP branch

* WIP editing

* WIP editing

* init distributed doc

* PR review edits

* modify dist doc and dist mnist_mlp

* Finish PR review comment fixes so far

* Improving distributed training docs

* Fix build error now that we have documented inteface backends use

* update example build and run

* update how-to distributed training doc

* Editing distr train docs

* Reword section to avoid strange doc build error

* rebuild for zero errors for CI

* split patternmatcher PR

* PR feedback added

* Add more help and detail for MXNet and neon distr

* Resolve merge conflicts due to patternmatcher doc split

* Resolve merge conflicts due to patternmatcher doc split

* Resolve build errors manually

* These files are already added to the branch

* fix style

* update with glossary def and link to Intel paper on synchronous SGD

* fix link to sgd

* remove comm_rank in dist example
parent dcdaf26e
...@@ -17,3 +17,13 @@ ...@@ -17,3 +17,13 @@
add_executable(mnist_mlp mnist_loader.cpp mnist_mlp.cpp) add_executable(mnist_mlp mnist_loader.cpp mnist_mlp.cpp)
add_dependencies(mnist_mlp ngraph cpu_backend) add_dependencies(mnist_mlp ngraph cpu_backend)
target_link_libraries(mnist_mlp ngraph cpu_backend) target_link_libraries(mnist_mlp ngraph cpu_backend)
if (NGRAPH_DISTRIBUTED_ENABLE)
find_package(MPI REQUIRED)
add_definitions(-DNGRAPH_DISTRIBUTED)
include_directories(SYSTEM ${MPI_C_INCLUDE_PATH} ${MPI_CXX_INCLUDE_PATH})
link_directories(${MPI_C_LIBRARIES} ${MPI_CXX_LIBRARIES})
link_libraries(${MPI_CXX_LIBRARIES})
add_executable(dist_mnist_mlp mnist_loader.cpp dist_mnist_mlp.cpp)
add_dependencies(dist_mnist_mlp ngraph cpu_backend)
target_link_libraries(dist_mnist_mlp ngraph cpu_backend)
endif()
.. distr/index:
Distributed Training in nGraph
==============================
Why distributed training?
-------------------------
A tremendous amount of data is required to train deep neural networks in diverse
areas -- from computer vision to natural language processing. Meanwhile,
computation used in AI training has been increasing exponentially. And even
though significant improvements have been made in algorithms and hardware,
using one machine to train a very large neural network / model is usually not
optimal. The use of multiple nodes, then, becomes important for making deep
learning training feasible with large datasets.
Data parallelism is the most popular parallel architecture to accelerate deep
learning with large datasets. The first algorithm we support is based on the
`synchronous`_ :term:`SGD` method, and partitions the dataset among workers
where each worker executes the same neural network model. For every iteration,
the nGraph backend computes the gradients in back-propagation, aggregates the
gradients across all workers, and then updates the weights.
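The iteration described above can be sketched without any framework. The following is a minimal simulation only, not nGraph code: the one-weight model ``y = w * x``, the helper names, and the learning rate are all assumptions chosen for illustration. It shows each simulated worker computing a gradient on its own shard, the gradients being averaged, and the same update being applied everywhere.

```python
# Hypothetical, framework-free simulation of synchronous data-parallel SGD.
# Model (assumed for illustration): y = w * x, loss = mean squared error.

def shard(data, num_workers):
    """Split data into num_workers equally sized contiguous shards."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def local_gradient(w, samples):
    """Gradient of the mean squared error over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)


def sync_sgd_step(w, data, num_workers, lr):
    """One iteration: local gradients, average across workers, shared update."""
    grads = [local_gradient(w, s) for s in shard(data, num_workers)]
    avg_grad = sum(grads) / num_workers  # the aggregation step
    return w - lr * avg_grad


# Eight samples drawn from y = 3x, split across four simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w = 0.0
for _ in range(200):
    w = sync_sgd_step(w, data, num_workers=4, lr=0.01)
print(w)
```

Because the shards are equally sized, averaging the per-worker gradients gives the same value as computing the gradient over the whole batch on one machine, which is why every worker ends each iteration with identical weights.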
How? (Generic frameworks)
-------------------------
The essential operation for data-parallel training is "allreduce", which
synchronizes gradients across all workers and is favored over parameter servers
for its simplicity and scalability. The AllReduce op is one of the nGraph
Library's core ops. To enable gradient synchronization for a network, we simply
inject the AllReduce op into the computation graph, connecting the graph for the
autodiff computation and optimizer update (which then becomes part of the nGraph
graph). The nGraph Backend will handle the rest.
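To make the semantics of allreduce concrete, here is a small sketch that simulates it on in-memory lists. It is not the nGraph AllReduce op or an MPI call; the function name ``allreduce_sum`` and the choice of a sum reduction are assumptions for illustration. The point is only that every worker contributes a buffer and every worker receives the same reduced result.

```python
# Hypothetical simulation of allreduce with a sum reduction.
# Each inner list stands for the gradient buffer held by one worker.

def allreduce_sum(worker_buffers):
    """Return one copy of the elementwise sum for every participating worker."""
    length = len(worker_buffers[0])
    if any(len(buf) != length for buf in worker_buffers):
        raise ValueError("all workers must contribute buffers of the same length")
    total = [sum(buf[i] for buf in worker_buffers) for i in range(length)]
    # Every worker receives its own copy of the reduced buffer.
    return [list(total) for _ in worker_buffers]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three simulated workers
reduced = allreduce_sum(grads)
# Dividing by the worker count turns the summed gradients into an average.
averaged = [[g / len(grads) for g in buf] for buf in reduced]
print(reduced[0], averaged[0])
```

Since no single process collects the gradients on behalf of the others, there is no central parameter server in this scheme.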
Data scientists with locally-scalable rack or cloud-based resources will likely
find it worthwhile to experiment with different modes or variations of
distributed training. Deployments using nGraph Library with supported backends
can be configured to train with data parallelism and will soon work with model
parallelism. Distributing workloads is increasingly important: more data and
bigger models call for the ability to
:doc:`distribute training <../howto/distribute-train>` across larger and larger
datasets, or to work with models having many layers that aren't designed to fit
on a single device.
Distributed training with data parallelism splits the data and each worker
node has the same model; during each iteration, the gradients are aggregated
across all workers with an op that performs "allreduce", and applied to update
the weights.
Using multiple machines helps to scale and speed up deep learning. With large
mini-batch training, `one could train ResNet-50 with Imagenet-1k data`_ to the
*Top 5* classifier in minutes using thousands of CPU nodes. See also:
`arxiv.org/pdf/1709.05011.pdf`_.
MXNet
-----
We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify
the SGD update op so the nGraph graph will contain the allreduce op and generate
corresponding collective communication kernels for different backends. We are
using OpenMPI for CPU backends and plan to integrate `Intel MLSL`_ in the future.
The figure below shows a bar chart with preliminary results from ResNet-50
I1K training in MXNet on 1, 2, 4 (and 8, if available) nodes; the x-axis is the
number of nodes and the y-axis is the throughput (images/sec).
.. TODO add figure graphics/distributed-training-ngraph-backends.png
TensorFlow
----------
We plan to support the same in nGraph-TensorFlow; this is still a work in
progress. Meanwhile, users can still use Horovod with the current
nGraph-TensorFlow, where the allreduce op is placed on the CPU instead of on the
nGraph device.
Figure: a bar chart showing preliminary results from ResNet-50 I1K training in
TF on 1, 2, 4 (and 8, if available) nodes; the x-axis is the number of nodes and
the y-axis is the throughput (images/sec).
Future work
-----------
Model parallelism is in the works. To support it and other more general forms
of parallelism, we plan to add more collective communication ops, such as
allgather, scatter, and gather.
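For readers unfamiliar with the collectives named above, the following sketch simulates their usual semantics on in-memory lists. These are illustrative helpers only, with hypothetical names and signatures; they are not nGraph ops and do not reflect how any backend would implement them.

```python
# Hypothetical simulations of three collective communication patterns.
# Index i of each returned list is what worker (rank) i holds afterwards.

def allgather(worker_values):
    """Every worker receives all workers' values, in rank order."""
    return [list(worker_values) for _ in worker_values]


def gather(worker_values, root=0):
    """Only the root receives the rank-ordered values; others receive None."""
    return [list(worker_values) if rank == root else None
            for rank in range(len(worker_values))]


def scatter(root_chunks):
    """The root holds one chunk per worker; worker i receives chunk i."""
    return [chunk for chunk in root_chunks]


values = ["a", "b", "c"]  # one value per simulated worker
print(allgather(values))
print(gather(values, root=1))
print(scatter([[0, 1], [2, 3], [4, 5]]))
```

Unlike allreduce, these patterns move data without combining it, which is what splitting a single model across devices typically requires.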
.. _synchronous: https://arxiv.org/pdf/1602.06709.pdf
.. _one could train ResNet-50 with Imagenet-1k data: https://blog.surf.nl/en/imagenet-1k-training-on-intel-xeon-phi-in-less-than-40-minutes/
.. _arxiv.org/pdf/1709.05011.pdf: https://arxiv.org/pdf/1709.05011.pdf
.. _Intel MLSL: https://github.com/intel/MLSL/releases
\ No newline at end of file
...@@ -94,6 +94,10 @@ Compile MXNet with nGraph ...@@ -94,6 +94,10 @@ Compile MXNet with nGraph
$ python example/image-classification/train_mnist.py $ python example/image-classification/train_mnist.py
#. (Optional) For experimental or alternative approaches to distributed training
methodologies, including data parallel training, see the :doc:`distr/index`
and :doc:`How to <howto/index>` articles on :doc:`howto/distribute-train`.
.. _tensorflow_intg: .. _tensorflow_intg:
...@@ -104,7 +108,6 @@ See the `ngraph tensorflow bridge README`_ for how to install the `DSO`_ for the ...@@ -104,7 +108,6 @@ See the `ngraph tensorflow bridge README`_ for how to install the `DSO`_ for the
nGraph-TensorFlow bridge. nGraph-TensorFlow bridge.
.. _neon_intg: .. _neon_intg:
neon |trade| neon |trade|
...@@ -168,6 +171,9 @@ system that already has an ``ngraph_dist`` installed. ...@@ -168,6 +171,9 @@ system that already has an ``ngraph_dist`` installed.
(neon_venv)$ python cifar10_conv.py (neon_venv)$ python cifar10_conv.py
#. (Optional) For experimental or alternative approaches to distributed training
methodologies, including data parallel training, see the :doc:`distr/index`
and :doc:`How to <howto/index>` articles on :doc:`howto/distribute-train`.
......
digraph G {
Label_0 -> Max_2
Constant_1 -> Max_2
Label_0 [shape=ellipse color=black]
Constant_1 [shape=ellipse color=black]
Max_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Negative_1 -> Negative_2;
Parameter_0 [shape=box color=blue]
Negative_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Abs_1 -> Negative_2 -> Negative_3;
Parameter_0 [shape=box color=blue]
Abs_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Parameter_1 -> Add_2
Add_2 -> Abs_3 -> Negative_4 -> Negative_5
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Add_2 [shape=ellipse color=black]
Abs_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
Negative_5 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Parameter_1 -> Add_2
Add_2 -> Negative_3 -> Negative_4
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Add_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Sub_2
Parameter_1 -> Sub_2
Sub_2 -> Negative_3 -> Negative_4
Parameter_0 [shape=box color=blue]
Parameter_1 [shape=box color=blue]
Sub_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
Negative_4 [shape=ellipse color=black]
}
digraph G {
Parameter_1 -> Negative_2 -> Negative_3;
Parameter_1 [shape=box color=blue]
Negative_2 [shape=ellipse color=black]
Negative_3 [shape=ellipse color=black]
}
digraph G {
Label_0 -> Negative_1 -> Negative_2;
Label_0 [shape=ellipse color=black]
Negative_1 [shape=ellipse color=black]
Negative_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Add_2
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_3
Constant_1 -> Broadcast_2
Broadcast_2 -> Add_3
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Broadcast_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Broadcast_2
Constant_1 -> Add_3
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Broadcast_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
}
digraph G {
Constant_1 -> Skip_2
Label_3 -> Add_4
Skip_2 -> Add_4
Constant_1 [shape=ellipse color=black]
Skip_2 [shape=ellipse color=black]
Label_3 [shape=ellipse color=black]
Add_4 [shape=ellipse color=black]
}
digraph G {
Parameter_0 -> Add_2
Constant_1 -> Add_2
Add_2 -> Add_3
Constant_2 -> Add_3
Add_3 -> Add_4
Constant_3 -> Add_4
Parameter_0 [shape=box color=blue]
Constant_1 [shape=ellipse color=black]
Constant_2 [shape=ellipse color=black]
Constant_3 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
Add_3 [shape=ellipse color=black]
Add_4 [shape=ellipse color=black]
}
digraph G {
Label_0 -> Add_2
Constant_1 -> Add_2
Label_0 [shape=ellipse color=black]
Constant_1 [shape=ellipse color=black]
Add_2 [shape=ellipse color=black]
}
...@@ -165,3 +165,8 @@ Glossary ...@@ -165,3 +165,8 @@ Glossary
gates. These gates help avoid the problem of exploding or vanishing gates. These gates help avoid the problem of exploding or vanishing
gradients that occur in the traditional RNN. gradients that occur in the traditional RNN.
SGD
:abbr:`Stochastic Gradient Descent (SGD)`, also known as incremental
gradient descent, is an iterative method for optimizing a differentiable
objective function.
\ No newline at end of file
.. howto/distribute-train.rst
Train using multiple nGraph CPU backends with data parallelism
==============================================================
In the :doc:`previous section <../howto/derive-for-training>`, we described the
steps needed to create a "trainable" nGraph model. Here we demonstrate how to
train a data parallel model by distributing the graph across devices.
To use this mode of training, first install a supported version of `OpenMPI`_
(1.10 or newer).
Next, create an nGraph build with the cmake flag ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
To deploy data-parallel training across multiple nodes or devices, add the
``AllReduce`` op after the steps needed to complete the
:doc:`backpropagation <../howto/derive-for-training>`.
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 188-191
Also, because we are using OpenMPI in this example, we need to initialize and
finalize MPI.
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 112
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 295
Finally, to run the training on two nGraph devices, invoke :command:`mpirun`.
This will run on a single machine and launch two processes.
.. code-block:: console
$ mpirun -np 2 dist_mnist_mlp
.. _OpenMPI: https://www.open-mpi.org/software/ompi/v3.1
...@@ -26,9 +26,11 @@ usually named ``<some_model>.onnx`` or ``<some_model>.onnx.pb``. These ...@@ -26,9 +26,11 @@ usually named ``<some_model>.onnx`` or ``<some_model>.onnx.pb``. These
or ``.onnx.pb`` formatted file, you should be able to run the inference or ``.onnx.pb`` formatted file, you should be able to run the inference
without needing to dig into anything from the "Frameworks" sections. You without needing to dig into anything from the "Frameworks" sections. You
will, however, need to have completed the steps outlined in will, however, need to have completed the steps outlined in
our :doc:`../install` guide. our :doc:`../install` guide. If you intend to use nGraph for :doc:`distributed-training`,
you will need a build that has been compiled with the additional
cmake flag: ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
To demonstrate functionality, we'll use an already serialized CIFAR10 model To demonstrate functionality, we'll use an already-serialized CIFAR10 model
trained via ResNet20. Remember that this model has already been trained and trained via ResNet20. Remember that this model has already been trained and
exported from a framework such as Caffe2, PyTorch or CNTK; we are simply going exported from a framework such as Caffe2, PyTorch or CNTK; we are simply going
to build an nGraph representation of the model, execute it, and produce some to build an nGraph representation of the model, execute it, and produce some
......
...@@ -11,6 +11,7 @@ How to ...@@ -11,6 +11,7 @@ How to
operator.rst operator.rst
update.rst update.rst
derive-for-training.rst derive-for-training.rst
distribute-train.rst
import.rst import.rst
The "How to" articles in this section explain how to do specific tasks with The "How to" articles in this section explain how to do specific tasks with
......
...@@ -147,6 +147,7 @@ Contents ...@@ -147,6 +147,7 @@ Contents
framework-integration-guides.rst framework-integration-guides.rst
frameworks/index.rst frameworks/index.rst
programmable/index.rst programmable/index.rst
distr/index.rst
python_api/index.rst python_api/index.rst
project/index.rst project/index.rst
...@@ -163,3 +164,4 @@ Indices and tables ...@@ -163,3 +164,4 @@ Indices and tables
.. _ngraph onnx companion tool: https://github.com/NervanaSystems/ngraph-onnx .. _ngraph onnx companion tool: https://github.com/NervanaSystems/ngraph-onnx
.. _Movidius: https://www.movidius.com/ .. _Movidius: https://www.movidius.com/
.. _contributions: https://github.com/NervanaSystems/ngraph#how-to-contribute .. _contributions: https://github.com/NervanaSystems/ngraph#how-to-contribute