Commit ec26acf2 authored by L.S. Cook, committed by Robert Kimball

New PR with framework DO docs only (#896)

parent 37fca35c
.. framework-integration-guides:
#############################
Framework Integration Guides
#############################
###############################
Integrate Supported Frameworks
###############################
* :ref:`neon_intg`
* :ref:`mxnet_intg`
@@ -110,7 +110,7 @@ Backprop
--------
We want to reduce the loss by adjusting the weights. We compute the
asjustments using the reverse mode autodiff algorithm, commonly
adjustments using the reverse mode autodiff algorithm, commonly
referred to as "backprop" because of the way it is implemented in
interpreted frameworks. In nGraph, we augment the loss computation
with computations for the weight adjustments. This allows the
@@ -18,7 +18,7 @@ nGraph components. The recipes are all framework agnostic; in other words,
if an entity (framework or user) wishes to make use of target-based computational
resources, it can either:
* Do the tasks programatically through the framework, or
* Do the tasks programmatically through a framework, or
* Provide a serialized model that can be imported to run on one of the nGraph
backends.
@@ -33,14 +33,14 @@ that use custom backends. For example, we know that GPU resources can be useful
backends for *some* kinds of algorithmic operations while they impose inherent
limitations or slow down others.
One of our goals with the nGraph++ library is to enable developers with tools to
One of our goals with the nGraph library is to enable developers with tools to
quickly build programs that access and process data from a breadth of edge and
networked devices. This might mean bringing compute resources closer to edge
devices, or it might mean programmatically adjusting a model or the compute
resources it requires, at an unknown or arbitray time after it has been deemed
resources it requires, at an unknown or arbitrary time after it has been deemed
to be trained well enough.
To get started, we've provided a basic example for how to execute a
To get started, we've provided a basic example for how to :doc:`execute` a
computation that can run on an nGraph backend; this is analogous to a
framework bridge. We also provide a larger example for training and
evaluating a simple MNIST MLP model.
@@ -142,8 +142,9 @@ Contents
graph-basics.rst
howto/index.rst
ops/index.rst
framework-integration-guides.rst
project/index.rst
framework-integration-guides.rst
optimize/index.rst
Indices and tables
.. generic-frameworks.rst
Activating nGraph on generic frameworks
========================================
This section details some of the *configuration options* and some of the
*environment variables* that can be used to tune for optimal performance when
your system already has a version of nGraph installed with one of our supported
backends.
.. csv-table::
   :header: "Backend", "Current nGraph support", "Future nGraph support"
   :widths: 35, 10, 10

   Intel® Architecture Processors (CPUs), Yes, Yes
   Intel® Nervana™ Neural Network Processor™ (NNPs), Yes, Yes
   NVIDIA\* CUDA (GPUs), Yes, Some
   :abbr:`Field Programmable Gate Arrays (FPGA)` (FPGAs), Coming soon, Yes
   `Movidius`_, Not yet, Yes
   Other, Not yet, Ask

Regardless of the framework, after the :doc:`../install`, a good place to start
is to make the nGraph libraries available to the framework. On Linux\*
systems, the commands tend to look something like:

.. code-block:: console

   export NGRAPH_CPP_BUILD_PATH=$HOME/ngraph_dist/
   export LD_LIBRARY_PATH=$HOME/ngraph_dist/lib/
Training Deep Neural Networks
==============================
Before tweaking various environment variables, be aware that how the computation
gets executed depends upon the ordering of the data layout that the model is
using. ``NHWC`` and ``NCHW`` are the two most common layouts in Deep Learning
models. Your runtime can vary greatly -- even when all other factors are
exactly the same -- when this detail is overlooked.
For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.
* **N** -- Number of images per batch
* **C** -- Channel of the image (expressed as a number like 3 for RGB and 1
  for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image

For example, a batch of 32 RGB images that are 224 pixels square has the shape
``32 x 3 x 224 x 224`` in ``NCHW`` order and ``32 x 224 x 224 x 3`` in ``NHWC``
order.
MKL-DNN
-------
The following `KMP options`_ were originally optimized for `MKLDNN`_ projects
running models with the ``NCHW`` data layout; however, other configurations can
be explored. MKL-DNN is automatically enabled as part of an nGraph build; you do
*not* need to add MKL-DNN separately or as an additional component to be able to
use these configuration settings.
* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait
after completing the execution of a parallel region, before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical
processing units.
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of
OpenMP* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
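
As a rough sketch of how these variables can be combined, the following exports
could be issued in the shell before launching a framework process. The specific
values shown are illustrative starting points only, not tuned recommendations,
and ``KMP_AFFINITY`` is covered in more detail in the Threading section below.

.. code-block:: console

   export KMP_BLOCKTIME=1      # illustrative value; tune for your workload
   export KMP_SETTINGS=true
   export OMP_NUM_THREADS=16   # illustrative value; match your physical core count
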
nGraph-enabled Intel® Xeon®
===========================
The list below includes recommendations on data layout, parameters, and
application configuration to achieve best performance running DNN workloads on
Intel® Xeon® (CPU processor) systems.
Threading
---------
The number of threads set by ``OMP_NUM_THREADS`` ought not exceed the number of
physical cores. The threads should be pinned to their respective physical cores
and activated as follows:
* When ``HT=off``, ``KMP_AFFINITY=compact,granularity=fine``
* When ``HT=on``, ``KMP_AFFINITY=compact,1,0,granularity=fine``
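
As a sketch, the recommendations above translate into shell settings like the
following; the thread count shown is a placeholder for the number of physical
cores on your system.

.. code-block:: console

   # HT=off: pin one thread per physical core
   export OMP_NUM_THREADS=16   # placeholder: number of physical cores
   export KMP_AFFINITY=compact,granularity=fine

   # HT=on: use this affinity setting instead
   # export KMP_AFFINITY=compact,1,0,granularity=fine
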
Memory allocation
-----------------
Buffer pointers should be aligned on a 64-byte boundary. The NUMA policy should
be configured for local memory allocation (``numactl --localalloc``).
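
For example (a minimal sketch; ``my_training_app`` is a hypothetical placeholder
for your actual framework or training command), the local-allocation policy can
be applied when launching the workload:

.. code-block:: console

   $ numactl --localalloc ./my_training_app
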
Convolution shapes
^^^^^^^^^^^^^^^^^^
* When **running inference, or training for forward-propagation and weight
  updates**, for best performance:

  - the number of input channels should be 1, 3, or a multiple of SIMD-width
    (8 for AVX2 systems, 16 for AVX512 systems), and
  - the number of output channels should be a multiple of SIMD-width (8 for
    AVX2 systems, 16 for AVX512 systems).

* When **training with backward propagation**, for best performance:

  - the number of input and output channels should be a multiple of SIMD-width
    (8 for AVX2 systems, 16 for AVX512 systems),
  - padding should not exceed :math:`0.5x`, where :math:`x` is the kernel size, and
  - kernel width should be less than 14.

For example, on an AVX512 system (SIMD-width 16), a convolution layer with 3
input channels and 64 output channels satisfies these guidelines for inference,
while one with 24 output channels does not.
``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^
The best resource for this configuration option is the `gnu.org site`_.
``OMP_NUM_THREADS`` defaults to the number of logical cores. To check the
number of cores on your system, you can run the following on the command
line to see the details of your CPU:

.. code-block:: console

   $ lscpu
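
If you intend to pin threads to physical cores only (see the Threading
recommendations above), one way to derive that count on Linux is to count the
unique core/socket pairs that ``lscpu`` reports; this snippet is a sketch, not
an officially recommended command:

.. code-block:: console

   $ export OMP_NUM_THREADS=$(lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l)
   $ echo $OMP_NUM_THREADS
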
Intra-op and inter-op parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``
Some frameworks, like TensorFlow, use these settings to improve performance;
however, they are often not sufficient to achieve optimal performance.
Framework-based adjustments cannot access the underlying NUMA configuration of
multi-socket Intel® Xeon® processor-based platforms, which is a key requirement
for many kinds of inference-engine computations. See the next section on NUMA
performance to learn more about this performance feature available to systems
using nGraph.
NUMA performance
~~~~~~~~~~~~~~~~~
NUMA stands for :abbr:`Non-Uniform Memory Access (NUMA)`. It indicates how each
CPU can access memory attached to each socket.
Without the "knowledge" of CPU socket and NUMA configuration, a simple thread
affinity (as in the case of a thread pool) does not lead to optimal performance.
In fact, it can sometimes prohibitively decrease throughput; a core from socket
0 might have to continually access cache lines from the memory bank of socket 1,
increasing bandwidth demands on the Intel® Ultra-Path Interconnect (Intel® UPI).
This situation is exacerbated by the larger numbers of sockets found in 4-, 8-,
and 16-socket systems. We believe that users need to be aware of system-level
optimizations in addition to framework-specific configuration parameters to
achieve the best performance for NN workloads on CPU platforms.
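
To see how cores and memory banks are laid out across sockets, and to bind a
process to a single NUMA node, the standard ``numactl`` utility can be used.
This is a sketch; ``my_training_app`` is again a hypothetical placeholder for
your actual launch command.

.. code-block:: console

   $ numactl --hardware
   $ numactl --cpunodebind=0 --membind=0 ./my_training_app
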
.. _KMP options: https://software.intel.com/en-us/node/522691
.. _MKLDNN: https://github.com/intel/mkl-dnn
.. _gnu.org site: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html
.. _Movidius: https://www.movidius.com/
.. optimize/index:
#############################
Integrate Generic Frameworks
#############################
This section is written for framework architects or engineers who want
to optimize a generic, brand-new, or less widely supported framework. Here we
provide some of what we have learned from the work we've done in developing
"framework direct optimizations (DO)" and custom bridge code, such as
that for our `ngraph tensorflow bridge`_ code.

.. important:: This section contains articles for framework owners or developers
   who want to incorporate the nGraph library directly into their framework and
   optimize for some specific compute-time characteristic.

.. toctree::
   :maxdepth: 1

   generic.rst
When using a framework to run a model or deploy an algorithm on nGraph
devices, there are some additional configuration options that can be
incorporated -- manually on the command line or via scripting -- to improve
performance. Fine-tuning an nGraph-enabled device is as much of an art as it
is a science; there are virtually limitless ways to do so.
Since a framework is typically designed around some feature, such as fast
training using image data, inference on a mobile device, or support for voice
and speech pattern recognition, a framework cannot optimize for all
possibilities at the same time.
In general, the larger and more complex a framework is, the harder it becomes
to navigate and extract the best performance; configuration options that are
enabled by "default" from the framework side can sometimes slow down compilation
without the developer being any the wiser. Sometimes only `a few small`_
adjustments can increase performance. Likewise, a minimalistic framework that
is designed around one specific kind of model can sometimes offer significant
performance-improvement opportunities by lowering overhead.
Right now the preferred way for a data scientist to get better performance is
to shop around and select the framework that is "already" designed or optimized
for some characteristic or trait of the model they want to build, test, tweak,
or run. One challenge for the framework developer, then, is to differentiate from
the pack by providing a means for the data scientist to obtain reproducible
results. The other challenge is to provide sufficient documentation, or at least
sufficient hints, on how to do any "fine-tuning" for specific use cases.
In creating the :doc:`direct optimizations <../framework-integration-guides>`
we've shared with the developer community, our `engineering teams carefully tune the workload to extract best performance`_
from a specific :abbr:`DL (Deep Learning)` model embedded in a specific framework
that is training a specific dataset. Our forks of the frameworks adjust the code
and/or explain how to set the parameters that achieve reproducible results.
Some of the ways we attempt to improve performance include:

* Testing and recording the results of various system-level configuration options
  and of enabled or disabled flags,
* Compiling with a mix of custom environment variables,
* Finding semi-related comparisons for benchmarking [#1]_,
* Tuning lower levels of the system so that the machine-learning algorithm can
  learn faster or more accurately than it did on previous runs, and
* Incorporating various :doc:`../ops/index` to build graphs more efficiently.
This approach, however, is obviously not a scalable solution for developers on
the framework side who are trying to support multiple use cases. Nor is it ideal
for teams looking to pivot or innovate multi-layer solutions based on something
**other than training speed**, such as accuracy or precision. Chasing
performance improvements does eventually yield a diminishing
:abbr:`Return on Investment (ROI)`, though it is up to the framework
developer to decide when that point has been reached for each of their customers.
For these reasons, we're providing some of the more commonly used options for
fine-tuning various code deployments to the nGraph-enabled devices we
currently support. Watch this section as we enable new devices and post new
updates.
.. rubric:: Footnotes
.. [#1] Benchmarking performance of DL systems is a young discipline; it is a
   good idea to be vigilant for results based on atypical distortions in the
   configuration parameters. Every topology is different, and performance
   increases or slowdowns can be attributed to multiple causes.
.. _ngraph tensorflow bridge: http://ngraph.nervanasys.com/docs/latest/framework-integration-guides.html#tensorflow
.. _engineering teams carefully tune the workload to extract best performance: https://ai.intel.com/accelerating-deep-learning-training-inference-system-level-optimizations
.. _a few small: https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-phi
.. _Movidius: https://www.movidius.com/
\ No newline at end of file