.. frameworks/other.rst:

.. _fw_other: 

Integrating other frameworks
============================

This section details some of the *configuration options* and some of the 
*environment variables* that can be used to tune for optimal performance when 
your system already has a version of nGraph installed with one or more of our 
supported :doc:`../backends/index`.

Regardless of the framework, after the :doc:`../buildlb` step, a good place 
to start usually involves making the libraries available to the framework. On 
Linux\* systems built on Intel® Architecture, that command tends to look 
something like: 

.. code-block:: console

   export NGRAPH_CPP_BUILD_PATH=path/to/ngraph_dist/
   export LD_LIBRARY_PATH=path/to/ngraph_dist/lib/
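
As a quick sanity check before importing a framework bridge, you can confirm 
that these variables are visible to the process. The following is a minimal 
sketch (``missing_ngraph_vars`` is a hypothetical helper, not part of nGraph):

.. code-block:: python

   import os

   def missing_ngraph_vars(env=os.environ):
       """Return the names of required nGraph env vars that are unset."""
       required = ("NGRAPH_CPP_BUILD_PATH", "LD_LIBRARY_PATH")
       return [name for name in required if not env.get(name)]

   if __name__ == "__main__":
       for name in missing_ngraph_vars():
           print("warning: {} is not set".format(name))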


Find or display version
-----------------------

If you're working with the :doc:`../python_api/index`, the following command 
may be useful:

.. code-block:: console

   python3 -c "import ngraph as ng; print('nGraph version: ',ng.__version__)";

To manually build a newer version than is available from the latest `PyPI`_
(:abbr:`Python Package Index (PyPI)`), see our nGraph Python API `BUILDING.md`_ 
documentation.


Activate logtrace-related environment variables
-----------------------------------------------

Another configuration option is to activate ``NGRAPH_CPU_DEBUG_TRACER``,
a runtime environment variable that supports extra logging and debug detail. 

This is a useful tool for data scientists interested in outputs from logtrace 
files that can, for example, help in tracking down model convergences. It can 
also help engineers who might want to add their new ``Backend`` to an existing 
framework to compare intermediate tensors/values to references from a CPU 
backend.

To activate this tool, set the ``env`` var ``NGRAPH_CPU_DEBUG_TRACER=1``.
It will dump ``trace_meta.log`` and ``trace_bin_data.log``. The names of the 
logfiles can be customized.

To customize the names of the logfiles, set these variables:

:: 

  NGRAPH_TRACER_LOG = "meta.log"
  NGRAPH_BIN_TRACER_LOG = "bin.log"

The meta log (``trace_meta.log``) contains::
 
  kernel_name, serial_number_of_op, tensor_id, symbol_of_in_out, num_elements, shape, binary_data_offset, mean_of_tensor, variance_of_tensor

An example line from a unit test might look like::

  K=Add S=0 TID=0_0 >> size=4 Shape{2, 2} bin_data_offset=8 mean=1.5 var=1.25

Each binary log (``trace_bin_data.log``) line contains::

  tensor_id, binary data (tensor data)

A reference for the implementation of parsing these logfiles can also be found 
in the unit test for this feature.
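
If you want to inspect the meta log programmatically, the line format shown 
above can be parsed with a regular expression. The following is an 
illustrative sketch (``parse_meta_line`` is not part of nGraph; the field 
interpretation follows the example line above):

.. code-block:: python

   import re

   META_LINE = re.compile(
       r"K=(?P<kernel>\w+)\s+S=(?P<serial>\d+)\s+TID=(?P<tid>\S+)\s+"
       r"(?P<symbol>>>|<<)\s+size=(?P<size>\d+)\s+Shape\{(?P<shape>[^}]*)\}\s+"
       r"bin_data_offset=(?P<offset>\d+)\s+mean=(?P<mean>\S+)\s+var=(?P<var>\S+)"
   )

   def parse_meta_line(line):
       """Parse one trace_meta.log line into a dict, or return None."""
       m = META_LINE.match(line.strip())
       if m is None:
           return None
       d = m.groupdict()
       return {
           "kernel": d["kernel"],
           "serial": int(d["serial"]),
           "tensor_id": d["tid"],
           "symbol": d["symbol"],  # the in/out marker
           "num_elements": int(d["size"]),
           "shape": tuple(int(x) for x in d["shape"].split(",") if x.strip()),
           "offset": int(d["offset"]),
           "mean": float(d["mean"]),
           "var": float(d["var"]),
       }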


FMV
---

FMV stands for Function Multi-Versioning; it provides a number of generic 
ways to patch or bring architecture-based optimizations to the 
:abbr:`Operating System (OS)` that is handling your ML environment. See 
the `GCC wiki for details`_.

If your nGraph build is a Neural Network configured on Clear Linux\* OS 
for Intel® Architecture, and it includes at least one older CPU, the 
`following article may be helpful`_.


Training Deep Neural Networks
-----------------------------

Before tweaking various environment variables, be aware that how the computation 
gets executed depends upon the ordering of the data format that the model is 
using. ``NHWC`` and ``NCHW`` are the two most common layouts in Deep Learning 
models. Your ultimate runtime can vary greatly -- even when all other factors 
are exactly the same -- when this detail is overlooked.

For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.

* **N** -- Number of images per batch
* **C** -- Channel of the image (expressed as a number like 3 for RGB and 1 
  for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image
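
To make the layout difference concrete, here is a minimal NumPy sketch 
showing the same batch shape expressed in both orders and the transpose 
that converts between them:

.. code-block:: python

   import numpy as np

   # A batch of 32 RGB images, 224 x 224 pixels, in NCHW order.
   nchw = np.zeros((32, 3, 224, 224), dtype=np.float32)

   # The same data reordered to NHWC: move the channel axis last.
   nhwc = nchw.transpose(0, 2, 3, 1)

   print(nchw.shape)  # (32, 3, 224, 224)
   print(nhwc.shape)  # (32, 224, 224, 3)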


Intel® Math Kernel Library for Deep Neural Networks 
---------------------------------------------------

.. important:: Intel® MKL-DNN is automatically enabled as part of an
   nGraph default :doc:`build <../buildlb>`; you do *not* need to add it 
   separately or as an additional component to be able to use these 
   configuration settings.

The following `KMP`_ options were originally optimized for models using the 
Intel® `MKL-DNN`_ to train models with the ``NCHW`` data layout; however, other 
configurations can be explored.    

* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait 
  after completing the execution of a parallel region, before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical 
  processing units. A useful article that explains more about how to use this 
  option for various CPU backends is here: https://web.archive.org/web/20190401182248/https://www.nas.nasa.gov/hecc/support/kb/Using-Intel-OpenMP-Thread-Affinity-for-Pinning_285.html
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of 
  OpenMP\* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
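
Because these variables are read when the OpenMP\* runtime initializes, they 
must be in place before the process that uses it starts. A minimal sketch of 
composing such an environment (the ``openmp_env`` helper and the values shown 
are illustrative starting points, not tuned recommendations):

.. code-block:: python

   import os

   def openmp_env(threads, blocktime_ms=1):
       """Compose an environment with the KMP/OMP settings above."""
       return dict(os.environ,
                   KMP_BLOCKTIME=str(blocktime_ms),
                   KMP_AFFINITY="granularity=fine,compact,1,0",
                   KMP_SETTINGS="true",
                   OMP_NUM_THREADS=str(threads))

   # Pass the environment to the child process at launch time, e.g.:
   #   subprocess.run(["python3", "train.py"], env=openmp_env(threads=8))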


nGraph-enabled Intel® Xeon® 
---------------------------

The list below includes recommendations on data layout, parameters, and 
application configuration to achieve best performance running DNN workloads on 
Intel® Xeon® (CPU processor) systems.

Threading
^^^^^^^^^

The number of threads set by ``OMP_NUM_THREADS`` ought not exceed the number of 
physical cores. The threads should be pinned to their respective physical cores 
and activated as follows:

* When ``HT=off``, ``KMP_AFFINITY=compact,granularity=fine``

* When ``HT=on``, ``KMP_AFFINITY=compact,1,0,granularity=fine``
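
The two cases above can be captured in a small helper; how you detect 
hyperthreading is left to your deployment (this function is a sketch for 
illustration, not part of nGraph):

.. code-block:: python

   def kmp_affinity(hyperthreading_on):
       """Return the KMP_AFFINITY setting recommended above."""
       if hyperthreading_on:
           return "compact,1,0,granularity=fine"
       return "compact,granularity=fine"

   print(kmp_affinity(False))  # compact,granularity=fine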


Memory allocation
^^^^^^^^^^^^^^^^^

Buffer pointers should be aligned on 64-byte boundaries, and the NUMA policy 
should be configured for local memory allocation (``numactl --localalloc``). 
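
If you prepare input buffers yourself, alignment can be enforced by 
over-allocating and offsetting into the buffer. A NumPy sketch of the idea 
(``aligned_empty`` is a hypothetical helper, not an nGraph API):

.. code-block:: python

   import numpy as np

   def aligned_empty(shape, dtype=np.float32, alignment=64):
       """Allocate an array whose data pointer is 64-byte aligned."""
       nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
       buf = np.empty(nbytes + alignment, dtype=np.uint8)
       offset = (-buf.ctypes.data) % alignment
       return buf[offset:offset + nbytes].view(dtype).reshape(shape)

   a = aligned_empty((128, 128))
   print(a.ctypes.data % 64)  # 0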



Convolution shapes
^^^^^^^^^^^^^^^^^^

* When **running inference, or training for forward-propagation and weight 
  updates**, for best performance:
  
  - the number of input channels should be 1, 3, or a multiple of SIMD-width (8 
    for AVX2 systems, 16 for AVX512 systems). 
  - the number of output channels should be a multiple of SIMD-width (8 for AVX2 
    systems, 16 for AVX512 systems).

* When **training with backward propagation**, the number of input and output 
  channels should be a multiple of SIMD-width (8 for AVX2 systems, 16 for 
  AVX512 systems); in addition:
  
  - padding should not exceed :math:`0.5x` where :math:`x` is the kernel size.
  - kernel width should be less than 14.
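
The channel-count recommendations can be encoded in a small shape check; the 
helper below is a sketch for illustration (``simd_width`` is 8 for AVX2 
systems, 16 for AVX512 systems):

.. code-block:: python

   def check_conv_channels(in_channels, out_channels, simd_width=16):
       """Return warnings for channel counts off the fast path above."""
       warnings = []
       if in_channels not in (1, 3) and in_channels % simd_width != 0:
           warnings.append("input channels should be 1, 3, or a "
                           "multiple of %d" % simd_width)
       if out_channels % simd_width != 0:
           warnings.append("output channels should be a multiple "
                           "of %d" % simd_width)
       return warnings

   print(check_conv_channels(3, 64))  # []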


``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^

The best resource for this configuration option is the Intel® OpenMP\* docs 
at the following link: `Intel OpenMP documentation`_. ``OMP_NUM_THREADS`` 
defaults to the number of logical cores. To check the number of cores on your 
system, you can run the following on the command-line to see the details 
of your CPU:

.. code-block:: console

   $ lscpu
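
The logical-core count that ``OMP_NUM_THREADS`` defaults to can also be read 
programmatically:

.. code-block:: python

   import os

   # os.cpu_count() reports *logical* cores; on a hyperthreaded system
   # this is typically twice the number of physical cores.
   print("logical cores:", os.cpu_count())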


Intra-op and inter-op parallelism 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``

Some frameworks, like TensorFlow\*, use these settings to improve performance; 
however, they are often not sufficient for optimal performance. Framework-based 
adjustments cannot access the underlying NUMA configuration in multi-socket 
Intel® Xeon® processor-based platforms, which is a key requirement for 
many kinds of inference-engine computations. See the next section on NUMA 
performance to learn more about this performance feature available to systems 
utilizing nGraph. 

NUMA performance 
~~~~~~~~~~~~~~~~~

NUMA stands for :abbr:`Non-Uniform Memory Access (NUMA)`. It indicates how each 
CPU can access memory attached to each socket. 

Without "knowledge" of the CPU socket and NUMA configuration, simple thread 
affinity (as in the case of a thread pool) does not lead to optimal 
performance. In fact, it can sometimes decrease throughput prohibitively: a 
core from socket 0 might have to continually access cache lines from the 
memory bank of socket 1, increasing bandwidth demands on the Intel® 
Ultra-Path Interconnect (Intel® UPI). The situation is exacerbated by the 
larger numbers of sockets found in 4, 8, and 16-socket systems. We believe 
that users need to be aware of system-level optimizations in addition to 
framework-specific configuration parameters to 
achieve the best performance for NN workloads on CPU platforms. The nGraph 
Compiler stack runs on transformers handled by Intel® Architecture (IA), and 
thus can make more efficient use of the underlying hardware.

.. _PyPI: https://pypi.org/project/ngraph-core
.. _KMP: https://software.intel.com/en-us/node/522691
.. _MKL-DNN: https://github.com/intel/mkl-dnn
.. _Intel OpenMP documentation: https://www.openmprtl.org/documentation
.. _BUILDING.md: https://github.com/NervanaSystems/ngraph/blob/master/python/BUILDING.md
.. _GCC wiki for details: https://gcc.gnu.org/wiki/FunctionMultiVersioning
.. _following article may be helpful: https://clearlinux.org/documentation/clear-linux/tutorials/fmv