Commit db610006 authored by Leona C, committed by Scott Cyphers

Fix broken link and update details about mxnet (#2684)

* Update MXNet bridge info page

* Fix warning from docbuild due to heading

* Update page on distributed training
.. distr/index.rst:
################################
Distributed training with nGraph
################################
.. important:: Distributed training is not officially supported as of version
   |version|; however, some configuration options have worked for nGraph
   devices in testing environments.

Why distributed training?
=========================
A tremendous amount of data is required to train DNNs in diverse areas -- from
computer vision to natural language processing. Meanwhile, computation used in
AI training has been increasing exponentially. Even though significant
improvements have been made in algorithms and hardware, using one machine
to train a very large :term:`NN` is usually not optimal. The use of
multiple nodes, then, becomes important for making deep learning training
feasible with large datasets.

Data parallelism is the most popular parallel architecture for accelerating
deep learning with large datasets. The first algorithm we support is `based
on the synchronous`_ :term:`SGD` method; it partitions the dataset among
workers, where each worker executes the same neural network model. For every
iteration, the nGraph backend computes the gradients in back-propagation,
aggregates the gradients across all workers, and then updates the weights.
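The iteration described above can be sketched in a few lines of
framework-agnostic Python; the worker count, learning rate, and toy loss
below are illustrative only, not part of the nGraph API:

```python
# One iteration of synchronous data-parallel SGD, sketched in pure Python.
# The toy loss, shard contents, and learning rate are illustrative only.

def local_gradient(weights, shard):
    # Stand-in for back-propagation on one worker's shard: the gradient of
    # the squared error 0.5 * (w - x)**2, averaged over the shard.
    return [sum(w - x for x in shard) / len(shard) for w in weights]

def sync_sgd_step(weights, shards, lr=0.1):
    # Every worker holds an identical copy of the model and computes
    # gradients on its own shard of the mini-batch ...
    grads = [local_gradient(weights, shard) for shard in shards]
    # ... the gradients are averaged across all workers ...
    n = len(shards)
    avg = [sum(g[i] for g in grads) / n for i in range(len(weights))]
    # ... and the same update is applied everywhere, keeping models in sync.
    return [w - lr * g for w, g in zip(weights, avg)]

weights = [0.0]
shards = [[1.0, 2.0], [3.0, 4.0]]   # dataset partitioned across two workers
weights = sync_sgd_step(weights, shards)
print(weights)                       # [0.25]
```

Because every worker applies the same averaged gradient, the model replicas
stay bit-identical across iterations without any central coordinator.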
How? (Generic frameworks)
=========================
* :doc:`../core/constructing-graphs/distribute-train`
The essential operation for synchronizing gradients across all workers in
data-parallel training is ``allreduce``, chosen for its simplicity and its
scalability relative to parameter servers. The ``AllReduce`` op is one of
the nGraph Library's core ops. To enable gradient synchronization for a
network, we simply inject the ``AllReduce`` op into the computation graph
between the autodiff computation and the optimizer update, which then
becomes part of the nGraph graph; the nGraph backend handles the rest.
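To make the aggregation step concrete, here is a pure-Python sketch of the
ring-allreduce pattern commonly used to implement ``allreduce``; it
simulates the per-worker message passing inside a single process and is not
the nGraph implementation:

```python
# Single-process simulation of ring allreduce (reduce-scatter + allgather).
# Illustrative sketch only, not the nGraph or MLSL implementation.

def ring_allreduce(grads):
    """Sum one equal-length gradient vector per worker, ring-style.

    Each worker exchanges only 2 * (n - 1) chunks of size len/n with its
    ring neighbors, so per-worker bandwidth stays roughly constant as
    workers are added; this is why allreduce scales better than a central
    parameter server.
    """
    n = len(grads)
    size = len(grads[0])
    assert size % n == 0, "vector length must divide evenly into n chunks"
    csize = size // n
    buf = [list(g) for g in grads]          # each worker's working copy

    def sl(c):                              # slice covering chunk c
        return slice(c * csize, (c + 1) * csize)

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [((i - step) % n, buf[i][sl((i - step) % n)])
                 for i in range(n)]         # snapshot before applying
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            buf[dst][sl(c)] = [a + b for a, b in zip(buf[dst][sl(c)], data)]

    # Phase 2, allgather: circulate the reduced chunks until every worker
    # has all of them.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, buf[i][sl((i + 1 - step) % n)])
                 for i in range(n)]
        for i, (c, data) in enumerate(sends):
            buf[(i + 1) % n][sl(c)] = data
    return buf

grads = [[1.0, 2.0], [3.0, 4.0]]            # one gradient vector per worker
print(ring_allreduce(grads))                # [[4.0, 6.0], [4.0, 6.0]]
```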
Data scientists with locally scalable rack or cloud-based resources will
likely find it worthwhile to experiment with different modes or variations
of distributed training. Deployments using the nGraph Library with supported
backends can be configured to train with data parallelism, and will soon
work with model parallelism. Distributing workloads is increasingly
important, as more data and bigger models mean the need to work with larger
and larger datasets, or with models having many layers that aren't designed
to fit on a single device. See
:doc:`../core/constructing-graphs/distribute-train` for details.
Distributed training with data parallelism splits the data across worker
nodes, while each node keeps a copy of the same model; during each
iteration, the gradients are aggregated across all workers with an op that
performs "allreduce" and then applied to update the weights.
Using multiple machines helps to scale and speed up deep learning. With
large mini-batch training, for example, ResNet-50 can be trained on the
ImageNet-1k dataset to *Top-5* classifier accuracy in minutes using
thousands of CPU nodes; see `arxiv.org/abs/1709.05011`_.
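One detail of such large mini-batch recipes worth noting: the effective
batch size grows with every worker added, so the learning rate is commonly
scaled up with it (the linear-scaling rule). The base values below are
purely illustrative:

```python
# Back-of-envelope arithmetic for the linear learning-rate scaling rule
# often used in large mini-batch training. Base values are hypothetical.
base_batch, base_lr = 32, 0.1
workers = 64

effective_batch = base_batch * workers   # samples consumed per update
scaled_lr = base_lr * workers            # scale lr with the batch size

print(effective_batch)                   # 2048
```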
MXNet
=====
We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify
the SGD update op so that the nGraph graph contains the ``allreduce`` op and
generates the corresponding collective-communication kernels for different
backends. We use `Intel MLSL`_ for CPU backends.
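The push/pull pattern behind a KVStore can be sketched in a few lines; the
class below is a hypothetical single-process toy for illustration, not
MXNet's KVStore API:

```python
# A toy, single-process key-value store illustrating the push/pull pattern;
# this is a hypothetical sketch, not the MXNet KVStore API.

class ToyKVStore:
    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}

    def init(self, key, value):
        self.weights[key] = value

    def push(self, key, grads_from_workers):
        # Aggregate gradients from all workers (the role the injected
        # allreduce op plays in the nGraph graph), then apply plain SGD.
        g = sum(grads_from_workers) / len(grads_from_workers)
        self.weights[key] -= self.lr * g

    def pull(self, key):
        # Workers read back the updated, globally consistent weights.
        return self.weights[key]

kv = ToyKVStore(lr=0.1)
kv.init("w", 1.0)
kv.push("w", [0.5, 1.5])   # two workers each push a gradient
print(kv.pull("w"))        # 0.9
```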
The figure below shows preliminary results for ResNet-50 ImageNet-1k
training in MXNet on 1, 2, 4 (and 8, if available) nodes; the x-axis is the
number of nodes, and the y-axis is throughput (images/sec).

.. TODO add figure graphics/distributed-training-ngraph-backends.png
TensorFlow
==========
We plan to support the same approach in nGraph-TensorFlow; it is still a
work in progress. Meanwhile, users can combine Horovod with the current
nGraph TensorFlow bridge, where the allreduce op is placed on the CPU
instead of on the nGraph device.
.. TODO add figure: bar chart of preliminary ResNet-50 ImageNet-1k training
   results in TensorFlow on 1, 2, 4 (and 8, if available) nodes, with the
   number of nodes on the x-axis and throughput (images/sec) on the y-axis.
Future work
===========
Model parallelism with support for more communication ops is in the works.
For more general parallelism, such as model parallel training, we plan to
add more collective communication ops, such as ``allgather``, ``scatter``,
and ``gather``, in the future. See also:
:doc:`../../core/passes/list-of-passes`.
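The semantics of the collectives mentioned above can be sketched in
single-process Python; real implementations (e.g. in MLSL or MPI) exchange
these values between processes:

```python
# Single-process sketches of the planned collective ops; illustrative only.

def allgather(worker_values):
    # Every worker ends up with the concatenation of all workers' values.
    gathered = [v for vals in worker_values for v in vals]
    return [list(gathered) for _ in worker_values]

def scatter(root_values, n_workers):
    # The root splits its data into one contiguous shard per worker.
    k = len(root_values) // n_workers
    return [root_values[i * k:(i + 1) * k] for i in range(n_workers)]

def gather(worker_values):
    # Inverse of scatter: the root collects every worker's shard in order.
    return [v for vals in worker_values for v in vals]

print(scatter([1, 2, 3, 4], 2))      # [[1, 2], [3, 4]]
print(allgather([[1], [2]]))         # [[1, 2], [1, 2]]
```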
.. _arxiv.org/abs/1709.05011: https://arxiv.org/abs/1709.05011
.. _based on the synchronous: https://arxiv.org/abs/1602.06709
.. _Intel MLSL: https://github.com/intel/MLSL/releases
.. frameworks/mxnet_integ.rst:
MXNet\* bridge
===============
* See the `README`_ on the nGraph-MXNet repo.
* See the `Integration Guide`_ on the nGraph-MXNet repo.
* **Testing inference latency**: See the :doc:`validated/testing-latency`
  doc for a fully-documented example of how to compile and test latency with
  an MXNet-supported model.
* **Training**: For experimental or alternative approaches to distributed
training methodologies, including data parallel training, see the
MXNet-relevant sections of the docs on :doc:`../distr/index` and
:doc:`How to <../core/constructing-graphs/index>` topics like :doc:`../core/constructing-graphs/distribute-train`.
.. note:: The nGraph-MXNet bridge is designed to be used with trained models
only; it does not support distributed training.
.. _README: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/README.md
.. _Integration Guide: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md
@@ -28,7 +28,7 @@ Inputs
Outputs (in place)
-------
------------------
+-----------------+-------------------------+--------------------------------+
| Name | Element Type | Shape |
@@ -26,7 +26,7 @@ Not currently a comprehensive list.
* :doc:`batch_norm_training`
* :doc:`batch_norm_training_backprop`
* :doc:`broadcast`
* :doc:`broadcastdistributed`
* :doc:`broadcast_distributed`
* :doc:`ceiling`
* :doc:`concat`
* :doc:`constant`
@@ -3,6 +3,19 @@ ngraph.exceptions
.. automodule:: ngraph.exceptions
.. rubric:: Exceptions
.. autosummary::
@@ -14,16 +14,20 @@ ngraph.ops
absolute
acos
add
argmax
argmin
asin
atan
avg_pool
batch_norm
broadcast
broadcast_to
ceiling
concat
constant
convert
convolution
convolution_backprop_data
cos
cosh
divide
@@ -31,14 +35,16 @@ ngraph.ops
equal
exp
floor
function_call
get_output_element
greater
greater_eq
less
less_eq
log
logical_and
logical_not
logical_or
lrn
max
max_pool
maximum
@@ -52,7 +58,6 @@ ngraph.ops
parameter
power
prod
reduce
relu
replace_slice
reshape
@@ -68,6 +73,7 @@ ngraph.ops
sum
tan
tanh
topk