.. distr/index.rst: 

##############################
Distributed Training in nGraph
##############################


.. important:: Distributed training is not officially supported in version |version|;
   however, some configuration options have worked for nGraph devices with mixed or 
   limited success in testing environments.


Why distributed training?
=========================

A tremendous amount of data is required to train DNNs in diverse areas -- from 
computer vision to natural language processing. Meanwhile, computation used in 
AI training has been increasing exponentially. And even though significant 
improvements have been made in algorithms and hardware, using one machine to 
train a very large :term:`NN` is usually not optimal. The use of multiple nodes, 
then, becomes important for making deep learning training feasible with large 
datasets.   

Data parallelism is the most popular parallel architecture for accelerating deep 
learning with large datasets. The first algorithm we support is `based on the 
synchronous`_ :term:`SGD` method; it partitions the dataset among workers, and 
each worker executes the same neural network model. For every iteration, the 
nGraph backend computes the gradients in back-propagation, aggregates the 
gradients across all workers, and then updates the weights.
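
In one synchronous-SGD iteration, each of the :math:`N` workers computes a 
local gradient on its shard of the mini-batch, the gradients are summed with 
``allreduce``, and every worker applies the same averaged update. The notation 
below is ours, but this is the standard formulation of the method cited above:

.. math::

   w_{t+1} = w_t - \frac{\eta}{N} \sum_{k=1}^{N} \nabla L_k(w_t)

where :math:`L_k` is the loss evaluated on worker :math:`k`'s shard of the data 
and :math:`\eta` is the learning rate.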

How? (Generic frameworks)
=========================

* :doc:`../howto/distribute-train`

The essential operation for data-parallel training is ``allreduce``, which 
synchronizes gradients across all workers and is favored over parameter servers 
for its simplicity and scalability. The AllReduce op is one of the nGraph 
Library's core ops. To enable gradient synchronization for a network, we simply 
inject the AllReduce op into the computation graph between the autodiff 
computation and the optimizer update (which then becomes part of the nGraph 
graph); the nGraph Backend handles the rest.
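
The placement is easiest to see outside of the graph. The sketch below uses 
plain NumPy and ``mpi4py`` (not the nGraph API) to show an allreduce of the 
gradients sitting between back-propagation and the weight update, which is 
where the injected AllReduce op sits in the nGraph graph; the toy model, data, 
and gradient computation are stand-ins.

.. code-block:: python

   # Run with, e.g.: mpirun -np 4 python sync_sgd_sketch.py
   # Illustrative only: plain NumPy + MPI, not the nGraph AllReduce op itself.
   import numpy as np
   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   num_workers = comm.Get_size()

   # Toy linear model y = X @ w with squared loss; each worker sees its own shard.
   rng = np.random.default_rng(seed=comm.Get_rank())
   X = rng.standard_normal((32, 10)).astype(np.float32)  # local mini-batch shard
   y = rng.standard_normal(32).astype(np.float32)
   w = np.zeros(10, dtype=np.float32)                    # same init on every worker
   lr = 0.01

   for step in range(100):
       # 1. Forward + backprop on the local shard (autodiff in a real framework).
       grad = 2.0 * X.T @ (X @ w - y) / len(y)

       # 2. Allreduce: sum gradients across workers, then average.
       #    This is the synchronization the injected AllReduce op provides.
       comm.Allreduce(MPI.IN_PLACE, grad, op=MPI.SUM)
       grad /= num_workers

       # 3. Optimizer update, identical on every worker, so weights stay in sync.
       w -= lr * grad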

Data scientists with locally-scalable rack or cloud-based resources will likely 
find it worthwhile to experiment with different modes or variations of 
distributed training. Deployments using nGraph Library with supported backends 
can be configured to train with data parallelism and will soon work with model 
parallelism. Distributing workloads is increasingly important: as data and 
models grow, the ability to :doc:`distribute and train 
<../howto/distribute-train>` makes it possible to work with larger and larger 
datasets, or with models having many layers that aren't designed to fit on a 
single device.

Distributed training with data parallelism splits the data so that each worker 
node has the same model; during each iteration, the gradients are aggregated 
across all workers with an op that performs ``allreduce``, and the result is 
applied to update the weights.

Using multiple machines helps to scale and speed up deep learning. With large 
mini-batch training, one could train ResNet-50 with ImageNet-1k data to the 
*Top 5* classifier in minutes using thousands of CPU nodes. See 
`arxiv.org/abs/1709.05011`_. 


MXNet
=====

We implemented a KVStore in MXNet\* (KVStore is unique to MXNet) to modify 
the SGD update op so that the nGraph graph contains the allreduce op and 
generates the corresponding collective communication kernels for the different 
backends. We are using `Intel MLSL`_ for CPU backends.
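
From the user's side, distributed MXNet training is driven by a ``dist_sync`` 
KVStore; the Gluon-style sketch below shows standard MXNet usage. Whether a 
particular nGraph-MXNet build routes these updates through the nGraph allreduce 
path described above is an assumption here, so treat this as an illustration 
rather than a verified nGraph configuration.

.. code-block:: python

   # Launch across nodes with MXNet's usual tooling, e.g.:
   #   python tools/launch.py -n 2 --launcher ssh -H hosts python train.py
   # Sketch of stock MXNet data-parallel training via a dist_sync KVStore.
   import mxnet as mx
   from mxnet import autograd, gluon

   kv = mx.kv.create('dist_sync')          # synchronous data-parallel updates

   net = gluon.nn.Dense(10)
   net.initialize(mx.init.Xavier())
   trainer = gluon.Trainer(net.collect_params(), 'sgd',
                           {'learning_rate': 0.01}, kvstore=kv)
   loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

   # Placeholder batch; in practice each worker reads its own shard of the data.
   data = mx.nd.random.uniform(shape=(32, 784))
   label = mx.nd.zeros((32,))

   with autograd.record():
       loss = loss_fn(net(data), label)
   loss.backward()
   trainer.step(batch_size=data.shape[0])  # gradients aggregated via the KVStore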

The figure below shows a bar chart with preliminary results from ResNet-50 
ImageNet-1K training in MXNet on 1, 2, 4, (and 8, if available) nodes; the 
x-axis is the number of nodes and the y-axis is the throughput (images/sec).


.. TODO add figure graphics/distributed-training-ngraph-backends.png
   

TensorFlow
==========

We plan to support the same approach in nGraph-TensorFlow; it is still a work 
in progress. Meanwhile, users can use Horovod with the current nGraph 
TensorFlow bridge, in which case the allreduce op is placed on the CPU instead 
of on the nGraph device.
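
For the Horovod route, a minimal TF-1.x-style sketch following Horovod's 
standard usage is shown below. The ``ngraph_bridge`` import name is our 
assumption about how the nGraph TensorFlow bridge is enabled, so it is left 
commented out; adjust it to match your build.

.. code-block:: python

   # Run with, e.g.: horovodrun -np 4 python train_hvd.py
   # Standard Horovod (TF 1.x style) usage; not a verified nGraph configuration.
   import numpy as np
   import tensorflow as tf
   import horovod.tensorflow as hvd
   # import ngraph_bridge  # assumed import for the nGraph-TensorFlow bridge

   hvd.init()

   x = tf.placeholder(tf.float32, shape=[None, 784])
   y = tf.placeholder(tf.int64, shape=[None])
   logits = tf.layers.dense(x, 10)
   loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

   # Scale the learning rate by the number of workers, then wrap the optimizer
   # so gradients are allreduced (on CPU in this setup) before being applied.
   opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
   opt = hvd.DistributedOptimizer(opt)
   train_op = opt.minimize(loss)

   # Broadcast initial variables from rank 0 so all workers start identically.
   hooks = [hvd.BroadcastGlobalVariablesHook(0)]
   with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
       # Placeholder data; each worker would normally read its own shard.
       batch_x = np.random.rand(32, 784).astype("float32")
       batch_y = np.random.randint(0, 10, size=32)
       sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
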
Figure: a bar chart showing preliminary results from ResNet-50 ImageNet-1K 
training in TensorFlow on 1, 2, 4, (and 8, if available) nodes; the x-axis is 
the number of nodes and the y-axis is the throughput (images/sec).


Future work
===========

Model parallelism with support for more communication ops is in the works. To 
enable more general parallelism, such as model parallelism, we plan to add more 
collective communication ops, such as allgather, scatter, and gather, in the 
future. 


.. _arxiv.org/abs/1709.05011: https://arxiv.org/abs/1709.05011
.. _based on the synchronous: https://arxiv.org/abs/1602.06709
.. _Intel MLSL: https://github.com/intel/MLSL/releases