.. generic-frameworks.rst


Activating nGraph on generic frameworks
========================================

This section details some of the *configuration options* and *environment 
variables* that can be used to tune for optimal performance when your system 
already has a version of nGraph installed with one of our supported backends. 

.. csv-table::
   :header: "Backend", "Current nGraph support", "Future nGraph support"
   :widths: 35, 10, 10

   Intel® Architecture Processors (CPUs), Yes, Yes
   Intel® Nervana™ Neural Network Processor™ (NNPs), Yes, Yes
   NVIDIA\* CUDA (GPUs), Yes, Some 
   :abbr:`Field Programmable Gate Arrays (FPGA)` (FPGAs), Coming soon, Yes
   `Movidius`_, Not yet, Yes
   Other, Not yet, Ask


Regardless of the framework, after the :doc:`../install`, a good first step is 
to make the nGraph libraries available to the framework. On Linux\* systems, 
that tends to look something like: 

.. code-block:: console

   export NGRAPH_CPP_BUILD_PATH=$HOME/ngraph_dist/
   export LD_LIBRARY_PATH=$HOME/ngraph_dist/lib/


Training Deep Neural Networks
==============================

Before tweaking various environment variables, be aware that how the computation 
gets executed depends on the data layout that the model is using. ``NHWC`` and 
``NCHW`` are the two most common layouts in Deep Learning models. Your ultimate 
runtime can vary greatly, even when all other factors are exactly the same, when 
this detail is overlooked.

For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.

* **N** -- Number of images per batch
* **C** -- Number of channels in the image (3 for RGB, 1 for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image

For example, a batch of 32 RGB images at a resolution of 224x224 pixels has 
the shape ``(32, 3, 224, 224)`` in ``NCHW`` and ``(32, 224, 224, 3)`` in 
``NHWC``.

MKL-DNN
-------

The following `KMP options`_ were originally optimized for `MKLDNN`_ projects 
running models with the ``NCHW`` data layout; however, other configurations can 
be explored. MKL-DNN is automatically enabled as part of an nGraph build; you do 
*not* need to install MKL-DNN separately to use these configuration settings.

* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait 
  after completing the execution of a parallel region before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical 
  processing units. 
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of 
  OpenMP* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
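
As a starting point (these values are illustrative, not prescriptive; the 
optimal settings depend on your model and machine), commonly published 
recommendations for ``NCHW`` CNN workloads on MKL-DNN look like:

.. code-block:: console

   export KMP_BLOCKTIME=1
   export KMP_AFFINITY=granularity=fine,compact,1,0
   export KMP_SETTINGS=true
   # replace 16 with the number of physical cores on your system
   export OMP_NUM_THREADS=16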


nGraph-enabled Intel® Xeon®
===========================

The list below includes recommendations on data layout, parameters, and 
application configuration to achieve best performance running DNN workloads on 
Intel® Xeon® (CPU processor) systems.

Threading 
---------

The number of threads set by ``OMP_NUM_THREADS`` should not exceed the number of 
physical cores. The threads should be pinned to their respective physical cores 
and activated as follows:

* When ``HT=off``, ``KMP_AFFINITY=granularity=fine,compact``

* When ``HT=on``, ``KMP_AFFINITY=granularity=fine,compact,1,0``
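
To confirm where threads actually land, the ``verbose`` modifier of 
``KMP_AFFINITY`` prints the resulting thread-to-core bindings at startup; a 
quick check might look like:

.. code-block:: console

   # hyper-threading on; 'verbose' reports each OpenMP thread's binding
   export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
   # match the number of physical cores (28 here is just an example)
   export OMP_NUM_THREADS=28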


Memory allocation 
-----------------

Buffer pointers should be aligned on a 64-byte boundary. The NUMA policy should 
be configured for local memory allocation (``numactl --localalloc``).
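
For instance, a training script (``train.py`` is a hypothetical placeholder 
here) could be launched with local allocation enforced:

.. code-block:: console

   # allocate memory on the NUMA node where each thread runs
   numactl --localalloc python train.py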

Convolution shapes
^^^^^^^^^^^^^^^^^^

* When **running inference, or training for forward-propagation and weight 
  updates**, for best performance:
  
  - the number of input channels should be 1, 3, or a multiple of the SIMD 
    width (8 for AVX2 systems, 16 for AVX512 systems). 
  - the number of output channels should be a multiple of the SIMD width (8 for 
    AVX2 systems, 16 for AVX512 systems).

* When **training with backward propagation**, the number of input and output 
  channels should be a multiple of the SIMD width (8 for AVX2 systems, 16 for 
  AVX512 systems); in addition:
  
  - padding should not exceed :math:`0.5x`, where :math:`x` is the kernel size.
  - kernel width should be less than 14.
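
The relevant SIMD width can be inferred from the CPU flags reported by the 
system; a quick check on Linux (a sketch, assuming GNU grep) is:

.. code-block:: console

   # prints avx512f on AVX512 systems (width 16) and avx2 on AVX2 systems (width 8)
   lscpu | grep -oE 'avx512f|avx2' | sort -u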


``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^

The best resource for this configuration option is the `gnu.org site`_. 
``OMP_NUM_THREADS`` defaults to the number of logical cores. To check the 
number of cores on your system, you can run the following on the command line 
to see the details of your CPU: 

.. code-block:: console

   $ lscpu
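
Because the thread count should generally match the number of *physical* rather 
than logical cores, one way to compute and apply that count (a sketch, assuming 
standard Linux tooling) is:

.. code-block:: console

   # count unique (core, socket) pairs, i.e. physical cores, and export the result
   export OMP_NUM_THREADS=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)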


Intra-op and inter-op parallelism 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``

Some frameworks, like TensorFlow\*, use these settings to improve performance; 
however, they are often not sufficient on their own to achieve optimal 
performance. Framework-based adjustments cannot access the underlying NUMA 
configuration on multi-socket Intel® Xeon® processor-based platforms, which is a 
key requirement for many kinds of inference-engine computations. See the next 
section on NUMA performance to learn more about this performance feature 
available to systems utilizing nGraph. 


NUMA performance 
~~~~~~~~~~~~~~~~~

NUMA stands for :abbr:`Non-Uniform Memory Access (NUMA)`: every CPU can reach 
all of the system's memory, but accessing memory attached to its own socket is 
faster than accessing memory attached to a remote socket. 

Without knowledge of the CPU socket and NUMA configuration, simple thread 
affinity (as in the case of a thread pool) does not lead to optimal performance. 
In fact, it can sometimes drastically decrease throughput; a core from socket 
0 might have to continually access cache lines from the memory bank of socket 1, 
increasing bandwidth demands on the Intel® Ultra Path Interconnect (Intel® UPI). 
The situation is exacerbated by the larger numbers of sockets found in 4-, 8-, 
and 16-socket systems. We believe that users need to be aware of system-level 
optimizations, in addition to framework-specific configuration parameters, to 
achieve the best performance for NN workloads on CPU platforms. 
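
When that level of control is needed, a process can be bound to a single socket 
so that compute and memory stay local; a minimal sketch (the node numbers and 
the ``infer.py`` script are placeholders) follows:

.. code-block:: console

   # inspect the NUMA topology of the machine
   numactl --hardware

   # bind both CPU and memory to NUMA node 0 to avoid cross-socket traffic
   numactl --cpunodebind=0 --membind=0 python infer.py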


.. _KMP options: https://software.intel.com/en-us/node/522691
.. _MKLDNN: https://github.com/intel/mkl-dnn
.. _gnu.org site: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html
.. _Movidius: https://www.movidius.com/