Commit db610006 authored by Leona C, committed by Scott Cyphers

Fix broken link and update details about mxnet (#2684)

* Update MXNet bridge info page

* Fix warning from docbuild due to heading

* Update page on distributed training
parent 199ec73e
.. distr/index.rst:

################################
Distributed training with nGraph
################################

.. important:: Distributed training is not officially supported as of version
   |version|; however, some configuration options have worked for nGraph
   devices in testing environments.

Why distributed training?
=========================
A tremendous amount of data is required to train DNNs in diverse areas -- from
computer vision to natural language processing. Meanwhile, computation used in
AI training has been increasing exponentially. Even with significant
improvements in algorithms and hardware, training a very large :term:`NN` on
one machine is usually not practical. Using multiple nodes therefore becomes
important for making deep learning training feasible with large datasets.

Data parallelism is the most popular parallel architecture to accelerate deep
learning with large datasets. The first algorithm we support is `based on the
synchronous`_ :term:`SGD` method, and partitions the dataset among workers
where each worker executes the same neural network model. In every iteration,
the nGraph backend computes the gradients in back-propagation, aggregates them
across all workers, and then updates the weights.
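
For intuition, the loop below is a minimal, self-contained sketch of this
synchronous-SGD pattern written with ``mpi4py`` and NumPy on a toy
linear-regression problem. It is not nGraph code; it only illustrates the
gradient-aggregation step that the frameworks below express with an AllReduce
op.

.. code-block:: python

    # Conceptual illustration only (not nGraph code): synchronous SGD on a toy
    # linear-regression problem, with gradients summed across workers via MPI.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(seed=rank)    # each worker sees different data
    true_w = np.array([1.0, -2.0, 3.0, 0.5])
    w = np.zeros(4)                           # every worker starts from the same model

    for step in range(100):
        # Local mini-batch for this worker.
        x = rng.standard_normal((32, 4))
        y = x @ true_w
        local_grad = 2.0 * x.T @ (x @ w - y) / len(x)

        # The "allreduce" step: aggregate gradients across all workers.
        global_grad = np.empty_like(local_grad)
        comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

        # An identical update on every worker keeps the replicas in sync.
        w -= 0.05 * global_grad / size

    if rank == 0:
        print("recovered weights:", w)

Run, for example, with ``mpirun -np 4 python sync_sgd.py``.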
How? (Generic frameworks)
=========================

* :doc:`../core/constructing-graphs/distribute-train`

The essential operation for synchronizing gradients across all workers in data
parallel training is ``allreduce``, favored for its simplicity and scalability
over parameter servers. The AllReduce op is one of the nGraph Library’s core
ops. To enable gradient synchronization for a network, we simply inject the
AllReduce op into the computation graph, connecting the graph for the autodiff
computation and optimizer update (which then becomes part of the nGraph graph).
The nGraph Backend will handle the rest.
Data scientists with locally-scalable rack or cloud-based resources will likely
find it worthwhile to experiment with different modes or variations of
distributed training. Deployments using nGraph Library with supported backends
can be configured to train with data parallelism and will soon work with model
parallelism. Distributing workloads is increasingly important, as more data and
bigger models mean that the ability to :doc:`../core/constructing-graphs/distribute-train`
is needed to work with larger and larger datasets, or with models having many
layers that aren't designed to fit on a single device.
Distributed training with data parallelism splits the data and each worker
node has the same model; during each iteration, the gradients are aggregated
across all workers with an op that performs "allreduce", and applied to update
the weights.

Using multiple machines helps to scale and speed up deep learning. With large
mini-batch training, one could train ResNet-50 with Imagenet-1k data to the
*Top 5* classifier in minutes using thousands of CPU nodes. See
`arxiv.org/abs/1709.05011`_.
MXNet
=====
We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify
the SGD update op so that the nGraph graph contains the allreduce op and
generates the corresponding collective communication kernels for different
backends. We use `Intel MLSL`_ for CPU backends.
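
As a rough illustration, the sketch below uses MXNet's standard distributed
``KVStore`` API on a toy model. Whether the nGraph-enabled build is driven
through exactly this entry point is an assumption, and the script must be
launched with MXNet's distributed launcher (for example ``tools/launch.py``)
rather than run directly.

.. code-block:: python

    import mxnet as mx

    # 'dist_sync' aggregates gradients across all workers every iteration
    # (the allreduce behaviour described above). Use 'local' for single-node tests.
    kv = mx.kv.create('dist_sync')

    # Toy network and synthetic data so the sketch is self-contained.
    data = mx.sym.Variable('data')
    net = mx.sym.FullyConnected(data, num_hidden=10)
    net = mx.sym.SoftmaxOutput(net, name='softmax')

    X = mx.nd.random.uniform(shape=(1000, 100))
    y = mx.nd.floor(mx.nd.random.uniform(low=0, high=10, shape=(1000,)))
    train_iter = mx.io.NDArrayIter(X, y, batch_size=32)

    mod = mx.mod.Module(symbol=net, context=mx.cpu())
    mod.fit(train_iter,
            optimizer='sgd',
            optimizer_params={'learning_rate': 0.1},
            kvstore=kv,          # gradients are synchronized via the KVStore
            num_epoch=2)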
The figure below shows a bar chart with preliminary results from ResNet-50 I1K
training in MXNet on 1, 2, 4, (and 8, if available) nodes; the x-axis is the
number of nodes and the y-axis is throughput (images/sec).
.. TODO add figure graphics/distributed-training-ngraph-backends.png
TensorFlow
==========
We plan to support the same approach in nGraph-TensorFlow; it is still a work
in progress. Meanwhile, users can use Horovod with the current nGraph
TensorFlow bridge, where the allreduce op is placed on the CPU instead of on
the nGraph device.
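
A minimal sketch of that interim Horovod path might look like the following;
the ``ngraph_bridge`` import name and the toy model are assumptions for
illustration, and the script would be launched with ``mpirun``.

.. code-block:: python

    import tensorflow as tf
    import horovod.tensorflow as hvd
    import ngraph_bridge  # assumed import name for the nGraph-TensorFlow bridge

    hvd.init()

    # Toy model on random data, just to show the wiring.
    features = tf.random.uniform([32, 64])
    labels = tf.zeros([32], dtype=tf.int32)
    logits = tf.layers.dense(features, 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Horovod wraps the optimizer so each step's gradients are allreduced
    # across workers; with nGraph, that allreduce currently runs on the CPU.
    opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01))
    train_op = opt.minimize(loss)

    hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # start all replicas from rank 0's weights
    with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
        for _ in range(10):
            sess.run(train_op)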
Figure: a bar chart showing preliminary results for ResNet-50 I1K training in
TensorFlow on 1, 2, 4, (and 8, if available) nodes; the x-axis is the number of
nodes and the y-axis is throughput (images/sec).
Future work
===========

More communication ops support is in the works. See also:
:doc:`../../core/passes/list-of-passes`.

.. _arxiv.org/abs/1709.05011: https://arxiv.org/format/1709.05011
.. _based on the synchronous: https://arxiv.org/format/1602.06709
.. _Intel MLSL: https://github.com/intel/MLSL/releases
.. frameworks/mxnet_integ.rst:

MXNet\* bridge
===============

* See the nGraph-MXNet `Integration Guide`_ on the nGraph-MXNet repo.

* **Testing inference latency**: See the :doc:`validated/testing-latency`
  doc for a fully-documented example of how to compile and test latency with
  an MXNet-supported model.

.. note:: The nGraph-MXNet bridge is designed to be used with trained models
   only; it does not support distributed training.

.. _Integration Guide: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md
...@@ -28,7 +28,7 @@ Inputs
Outputs (in place)
------------------
+-----------------+-------------------------+--------------------------------+
| Name            | Element Type            | Shape                          |
...
...@@ -26,7 +26,7 @@ Not currently a comprehensive list.
* :doc:`batch_norm_training`
* :doc:`batch_norm_training_backprop`
* :doc:`broadcast`
* :doc:`broadcast_distributed`
* :doc:`ceiling`
* :doc:`concat`
* :doc:`constant`
...
...@@ -3,6 +3,19 @@ ngraph.exceptions
.. automodule:: ngraph.exceptions
.. rubric:: Exceptions
.. autosummary::
...
...@@ -14,16 +14,20 @@ ngraph.ops
absolute
acos
add
argmax
argmin
asin
atan
avg_pool
batch_norm
broadcast
broadcast_to
ceiling
concat
constant
convert
convolution
convolution_backprop_data
cos
cosh
divide
...@@ -31,14 +35,16 @@ ngraph.ops
equal
exp
floor
function_call
get_output_element
greater
greater_eq
less
less_eq
log
logical_and
logical_not
logical_or
lrn
max
max_pool
maximum
...@@ -52,7 +58,6 @@ ngraph.ops
parameter
power
prod
reduce
relu
replace_slice
reshape
...@@ -68,6 +73,7 @@ ngraph.ops
sum
tan
tanh
topk
...