.. frameworks/generic-configs.rst:


Configurations available to any framework
#########################################


Enabling Deep Learning paradigms  
================================

Framework architects or engineers who can't quite find what they need among 
the existing DL tools may need to build something new off a "stock" framework, 
or something entirely from scratch. For this category of developer, we have 
:doc:`documented several ways <../howto/index>` you can incorporate built-in 
compiler support for users of your framework; this includes out-of-box support 
for things like Intel® MKL-DNN and PlaidML when your framework supports nGraph 
as a "backend" or engine. 

   .. important:: nGraph does not provide an interface for "users" of frameworks 
      (for example, we cannot dictate or control how TensorFlow\* or MXNet\* present 
      interfaces to users). Please keep in mind that designing and documenting 
      the :abbr:`UI (User Interface)` of a framework is entirely in the realm 
      of the framework owner or developer and beyond the scope of the nGraph 
      Compiler stack. However, any framework can be designed to make direct use 
      of nGraph Compiler stack-based features and then expose an accompanying UI, 
      output message, or other detail to a user.
 
The nGraph :abbr:`IR (Intermediate Representation)` is a format that can 
understand inputs from a framework. Today, there are two primary tasks that 
can be accomplished in the “bridge code” space of the nGraph IR: 

#. Compiling a dataflow graph.
#. Executing a pre-compiled graph. 

See the :doc:`../framework-integration-guides` for how we built bridges with our 
initially-supported frameworks. For more in-depth help in writing things like 
graph optimizations and bridge code, we provide articles on how to 
:doc:`../fusion/index`, and programmatically :doc:`../howto/execute` that can 
target various compute resources using nGraph when a framework provides some 
inputs to be computed.

.. note:: Configuration options can be added manually on the command line or via 
   scripting. Please keep in mind that fine-tuning of parameters is as much of 
   an art as it is a science; there are virtually limitless ways to do so and 
   our documentation provides only a sampling.  

Integrating nGraph with new frameworks
======================================

This section details some of the *configuration options* and some of the 
*environment variables* that can be used to tune for optimal performance when 
your system already has a version of nGraph installed with one of our supported
backends. 

.. csv-table::
   :header: "Backend", "Current nGraph support", "Future nGraph support"
   :widths: 35, 10, 10

   Intel® Architecture Processors (CPUs), Yes, Yes
   Intel® Nervana™ Neural Network Processor™ (NNPs), Yes, Yes
   NVIDIA\* CUDA (GPUs), Yes, Some 
   Field Programmable Gate Arrays (FPGAs), Coming soon, Yes
   `Movidius`_, Not yet, Yes
   Other, Not yet, Ask


Regardless of the framework, after the :doc:`../buildlb` step, a good place 
to start usually involves making the libraries available to the framework. On 
Linux\* systems built on Intel® Architecture, that command tends to look 
something like: 

.. code-block:: console

   export NGRAPH_CPP_BUILD_PATH=path/to/ngraph_dist/
   export LD_LIBRARY_PATH=path/to/ngraph_dist/lib/
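
After exporting these variables, a quick sanity check is to confirm that the 
nGraph shared libraries are where the framework will look for them; the exact 
file names depend on how your ``ngraph_dist`` was built:

.. code-block:: console

   $ ls ${NGRAPH_CPP_BUILD_PATH}/lib/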



FMV
===

FMV stands for Function Multi-Versioning, and it can also provide a 
number of generic ways to patch or bring architecture-based optimizations to 
the :abbr:`OS (Operating System)` that is handling your ML environment. See 
the `GCC wiki for details`_.

If your nGraph build targets a neural network configured on Clear Linux* OS 
for Intel® Architecture, and your system includes at least one older CPU, the 
`following article may be helpful`_.
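
Whether FMV (or any other architecture-based optimization) pays off depends on 
which instruction-set extensions your CPUs actually expose. One quick way to 
check on Linux is to look at the flags reported in ``/proc/cpuinfo``; for 
example, to list the available AVX variants:

.. code-block:: console

   $ grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u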



Training Deep Neural Networks
==============================

Before tweaking various environment variables, be aware that how the computation 
gets executed depends upon the ordering of the data format that the model is 
using. ``NHWC`` and ``NCHW`` are the two most common layouts in Deep Learning 
models. Your ultimate runtime can vary greatly -- even when all other factors 
are exactly the same -- when this detail is overlooked.

For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.

* **N** -- Number of images per batch
* **C** -- Channel of the image (expressed as a number like 3 for RGB and 1 
  for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image

Intel® Math Kernel Library for Deep Neural Networks 
---------------------------------------------------

The following `KMP options`_ were originally optimized for models using the 
Intel® `MKL-DNN`_ to train models with the ``NCHW`` data layout; however, other 
configurations can be explored. MKL-DNN is automatically enabled as part of an 
nGraph compilation; you do *not* need to add MKL-DNN separately or as an 
additional component to be able to use these configuration settings. A sample 
export sequence is sketched after the list below.

* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait 
  after completing the execution of a parallel region, before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical 
  processing units. 
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of 
  OpenMP\* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
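
As a starting point only (the best values depend on your model and your 
hardware), a typical export sequence for an MKL-DNN-backed training run might 
look like the following. The thread count is a placeholder for the number of 
physical cores on your system, and the affinity string assumes hyper-threading 
is enabled (see the threading notes below for the alternative):

.. code-block:: console

   $ export KMP_BLOCKTIME=1
   $ export KMP_AFFINITY=compact,1,0,granularity=fine
   $ export KMP_SETTINGS=true
   $ export OMP_NUM_THREADS=<number of physical cores>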


nGraph-enabled Intel® Xeon® 
============================

The list below includes recommendations on data layout, parameters, and 
application configuration to achieve best performance running DNN workloads on 
Intel® Xeon® (CPU processor) systems.

Threading 
---------

The number of threads set by ``OMP_NUM_THREADS`` should not exceed the number 
of physical cores. The threads should be pinned to their respective physical 
cores and activated as follows (a sample is sketched after this list):

* When ``HT=off``, ``KMP_AFFINITY=compact,granularity=fine``

* When ``HT=on``, ``KMP_AFFINITY=compact,1,0,granularity=fine``
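
A minimal sketch of putting this into practice: check whether hyper-threading 
is enabled (a ``Thread(s) per core`` value greater than 1 means it is on), and 
then export whichever affinity setting matches:

.. code-block:: console

   $ lscpu | grep 'Thread(s) per core'
   $ export KMP_AFFINITY=compact,granularity=fine       # HT=off
   $ export KMP_AFFINITY=compact,1,0,granularity=fine   # HT=on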


Memory allocation 
-----------------

Buffer pointers should be aligned on 64-byte boundaries. NUMA policy should be 
configured for local memory allocation (``numactl --localalloc``). 
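
For example, to launch a training run so that memory is allocated from the 
local NUMA node of the CPU each thread runs on (``run_training.sh`` is a 
placeholder for your own launch script):

.. code-block:: console

   $ numactl --localalloc ./run_training.sh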



Convolution shapes
^^^^^^^^^^^^^^^^^^

* When **running inference, or training for forward-propagation and weight 
  updates**, for best performance:
  
  - the number of input channels should be 1, 3, or a multiple of SIMD-width (8 
    for AVX2 systems, 16 for AVX512 systems). 
  - the number of output channels should be a multiple of SIMD-width (8 for AVX2 
    systems, 16 for AVX512 systems).

* When **training backward propagation**, the number of input and output 
  channels should be a multiple of SIMD-width (8 for AVX2 systems, 16 for 
  AVX512 systems); additionally:
  
  - padding should not exceed :math:`0.5x` where :math:`x` is the kernel size.
  - kernel width should be less than 14.


``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^

The best resource for this configuration option is the `gnu.org site`_. 
``OMP_NUM_THREADS`` defaults to the number of logical cores. To check the 
number of cores on your system, you can run the following on the command-line to 
see the details of your CPU: 

.. code-block:: console

   $ lscpu
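
Because ``OMP_NUM_THREADS`` defaults to logical cores while the recommendation 
above is based on physical cores, it can be useful to count physical cores 
directly. One way to do this is to count the unique core/socket pairs that 
``lscpu`` reports:

.. code-block:: console

   $ lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l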


Intra-op and inter-op parallelism 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``

Some frameworks, like TensorFlow\*, use these settings to improve performance; 
however, they are often not sufficient for optimal performance. Framework-based 
adjustments cannot access the underlying NUMA configuration in multi-socket 
Intel® Xeon® processor-based platforms, which is a key requirement for 
many kinds of inference-engine computations. See the next section on NUMA 
performance to learn more about this performance feature available to systems 
utilizing nGraph. 

NUMA performance 
~~~~~~~~~~~~~~~~~

NUMA stands for Non-Uniform Memory Access. It indicates how each CPU can access 
memory attached to each socket. 

Without the "knowledge" of CPU socket and NUMA configuration, a simple thread 
affinity (as in the case of a thread pool) does not lead to optimal performance. 
In fact, it can sometimes prohibitively decrease throughput; a core from socket 
0 might have to continually access cache lines from the memory bank of socket 1, 
increasing bandwidth demands on the Intel® Ultra-Path Interconnect (Intel® UPI). 
This situation is exacerbated with the larger number of sockets found in 4, 8, 
and 16-socket systems. We believe that users need to be aware of system-level 
optimizations in addition to framework-specific configuration parameters to 
achieve the best performance for NN workloads on CPU platforms. The nGraph 
Compiler stack uses transformers that are tuned for Intel® Architecture (IA), 
and thus can make more efficient use of the underlying hardware.
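
To see how many NUMA nodes your system has, and which CPUs and how much memory 
belong to each node, you can inspect the topology directly:

.. code-block:: console

   $ numactl --hardware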




.. _KMP options: https://software.intel.com/en-us/node/522691
.. _MKL-DNN: https://github.com/intel/mkl-dnn
.. _gnu.org site: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html
.. _Movidius: https://www.movidius.com/
.. _GCC wiki for details: https://gcc.gnu.org/wiki/FunctionMultiVersioning
.. _following article may be helpful: https://clearlinux.org/documentation/clear-linux/tutorials/fmv