.. howto/distribute-train.rst 


Distribute training across multiple nGraph backends 
===================================================

.. important:: Distributed training is not officially supported in version 
    |version|; however, the following configuration options have worked with 
    mixed or limited success when tested on nGraph devices.

In the :doc:`previous section <../constructing-graphs/derive-for-training>`, 
we described the steps needed to create a "trainable" nGraph model. Here we 
demonstrate how to train a data parallel model by distributing the graph to 
more than one device.

Frameworks can implement distributed training with nGraph versions prior to 
``0.13``:

* Use ``-DNGRAPH_DISTRIBUTED_ENABLE=OMPI`` to enable distributed training 
  with OpenMPI. This flag requires that OpenMPI already be installed on the 
  system; if it is not present, install `OpenMPI`_ version ``2.1.1`` or 
  later before compiling.

* Use ``-DNGRAPH_DISTRIBUTED_ENABLE=MLSL`` to enable the 
  :abbr:`Intel® Machine Learning Scaling Library (MLSL)` option for Linux* OS:

  .. note:: The Intel® MLSL option applies to Intel® Architecture CPUs 
     (``CPU``) and ``Interpreter`` backends only. For all other backends, 
     ``OpenMPI`` is presently the only supported option. We recommend the 
     use of `Intel MLSL`_ for CPU backends to avoid an extra download step.

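For example, a build configured for OpenMPI-based distributed training might 
be set up as follows (a minimal sketch; the build directory name and ``make`` 
parallelism are illustrative, not prescribed):

.. code-block:: console

   $ mkdir build && cd build
   $ cmake .. -DNGRAPH_DISTRIBUTED_ENABLE=OMPI
   $ make -j4
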
To deploy data-parallel training, the ``AllReduce`` op should be added after the 
steps needed to complete the :doc:`backpropagation <../constructing-graphs/derive-for-training>`; 
the new code is highlighted below:

.. literalinclude:: ../../../../examples/mnist_mlp/dist_mnist_mlp.cpp
   :language: cpp
   :lines: 178-194
   :emphasize-lines: 8-11

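The highlighted lines wrap each parameter's gradient in an ``AllReduce`` op 
before the weight update is applied. A minimal sketch of that pattern follows, 
assuming a distributed-enabled build; the helper name ``allreduce_grad`` and 
the ``delta_W`` parameter are illustrative, not part of the example source:

.. code-block:: cpp

   #include <memory>
   #include <ngraph/ngraph.hpp>

   using namespace ngraph;

   // Sketch: given a node that computes a parameter's local gradient,
   // return a node that computes the gradient aggregated across all
   // participating processes, so every device applies the same update.
   std::shared_ptr<Node> allreduce_grad(const std::shared_ptr<Node>& delta_W)
   {
       return std::make_shared<op::AllReduce>(delta_W);
   }
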
See the `full code`_ in the ``examples`` folder: ``/doc/examples/mnist_mlp/dist_mnist_mlp.cpp``.

Finally, to run the training using two nGraph devices, invoke ``mpirun``:

.. code-block:: console

   $ mpirun -np 2 dist_mnist_mlp

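Here ``-np 2`` instructs ``mpirun`` to launch two copies of the 
``dist_mnist_mlp`` binary; each process trains on its own portion of the data, 
and the ``AllReduce`` op aggregates the gradients so that every device applies 
the same weight updates.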

.. _Intel MLSL: https://github.com/intel/MLSL/releases
.. _OpenMPI: https://www.open-mpi.org/software/ompi/v2.1/  
.. _full code: https://github.com/NervanaSystems/ngraph/blob/master/doc/examples/mnist_mlp/dist_mnist_mlp.cpp