Commit ec26acf2 authored by L.S. Cook, committed by Robert Kimball

New PR with framework DO docs only (#896)

parent 37fca35c
.. framework-integration-guides:
#############################
Framework Integration Guides
#############################
###############################
Integrate Supported Frameworks
###############################
* :ref:`neon_intg`
* :ref:`mxnet_intg`
@@ -110,7 +110,7 @@ Backprop
--------
We want to reduce the loss by adjusting the weights. We compute the
asjustments using the reverse mode autodiff algorithm, commonly
adjustments using the reverse mode autodiff algorithm, commonly
referred to as "backprop" because of the way it is implemented in
interpreted frameworks. In nGraph, we augment the loss computation
with computations for the weight adjustments. This allows the
@@ -18,7 +18,7 @@ nGraph components. The recipes are all framework agnostic; in other words,
if an entity (framework or user) wishes to make use of target-based computational
resources, it can either:
* Do the tasks programatically through the framework, or
* Do the tasks programmatically through a framework, or
* Provide a serialized model that can be imported to run on one of the nGraph
backends.
@@ -33,14 +33,14 @@ that use custom backends. For example, we know that GPU resources can be useful
backends for *some* kinds of algorithmic operations while they impose inherent
limitations or slow down others.
One of our goals with the nGraph++ library is to enable developers with tools to
One of our goals with the nGraph library is to enable developers with tools to
quickly build programs that access and process data from a breadth of edge and
networked devices. This might mean bringing compute resources closer to edge
devices, or it might mean programmatically adjusting a model or the compute
resources it requires, at an unknown or arbitray time after it has been deemed
resources it requires, at an unknown or arbitrary time after it has been deemed
to be trained well enough.
To get started, we've provided a basic example for how to execute a
To get started, we've provided a basic example for how to :doc:`execute` a
computation that can run on an nGraph backend; this is analogous to a
framework bridge. We also provide a larger example for training and
evaluating a simple MNIST MLP model.
@@ -142,8 +142,9 @@ Contents
graph-basics.rst
howto/index.rst
ops/index.rst
framework-integration-guides.rst
project/index.rst
framework-integration-guides.rst
optimize/index.rst
Indices and tables
.. generic-frameworks.rst
Activating nGraph on generic frameworks
========================================
This section details some of the *configuration options* and some of the
*environment variables* that can be used to tune for optimal performance when
your system already has a version of nGraph installed with one of our supported
backends.
.. csv-table::
   :header: "Backend", "Current nGraph support", "Future nGraph support"
   :widths: 35, 10, 10

   Intel® Architecture Processors (CPUs), Yes, Yes
   Intel® Nervana™ Neural Network Processor™ (NNPs), Yes, Yes
   NVIDIA\* CUDA (GPUs), Yes, Some
   :abbr:`Field Programmable Gate Arrays (FPGA)` (FPGAs), Coming soon, Yes
   `Movidius`_, Not yet, Yes
   Other, Not yet, Ask

Regardless of the framework, after the :doc:`../install`, a good place to start
is to make the nGraph libraries available to the framework. On Linux\*
systems, the commands tend to look something like:

.. code-block:: console

   export NGRAPH_CPP_BUILD_PATH=$HOME/ngraph_dist/
   export LD_LIBRARY_PATH=$HOME/ngraph_dist/lib/
Training Deep Neural Networks
==============================
Before tweaking various environment variables, be aware that how the computation
gets executed depends upon the ordering of the data layout that the model is
using. ``NHWC`` and ``NCHW`` are the two most common layouts in Deep Learning
models. Your runtime can vary greatly -- even when all other factors are
exactly the same -- when this detail is overlooked.
For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.
* **N** -- Number of images per batch
* **C** -- Channel of the image (expressed as a number like 3 for RGB and 1
  for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image

For example, a batch of 32 RGB images that are 224 pixels square has the shape
``32 x 3 x 224 x 224`` in ``NCHW`` order and ``32 x 224 x 224 x 3`` in ``NHWC``
order.
MKL-DNN
-------
The following `KMP options`_ were originally optimized for `MKLDNN`_ projects
running models with the ``NCHW`` data layout; however, other configurations can
be explored. MKL-DNN is automatically enabled as part of an nGraph build; you do
*not* need to add MKL-DNN separately or as an additional component to be able to
use these configuration settings.
* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait
after completing the execution of a parallel region, before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical
processing units.
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of
OpenMP* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
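
As a rough sketch of how these variables can be combined, the following exports
could be issued in the shell before launching a framework process. The specific
values shown are illustrative starting points only, not tuned recommendations,
and ``KMP_AFFINITY`` is covered in more detail in the Threading section below.

.. code-block:: console

   export KMP_BLOCKTIME=1      # illustrative value; tune for your workload
   export KMP_SETTINGS=true
   export OMP_NUM_THREADS=16   # illustrative value; match your physical core count
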
nGraph-enabled Intel® Xeon®
===========================
The list below includes recommendations on data layout, parameters, and
application configuration to achieve best performance running DNN workloads on
Intel® Xeon® (CPU processor) systems.
Threading
---------
The number of threads set by ``OMP_NUM_THREADS`` ought not exceed the number of
physical cores. The threads should be pinned to their respective physical cores
and activated as follows:
* When ``HT=off``, ``KMP_AFFINITY=compact,granularity=fine``
* When ``HT=on``, ``KMP_AFFINITY=compact,1,0,granularity=fine``
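
As a sketch, the recommendations above translate into shell settings like the
following; the thread count shown is a placeholder for the number of physical
cores on your system.

.. code-block:: console

   # HT=off: pin one thread per physical core
   export OMP_NUM_THREADS=16   # placeholder: number of physical cores
   export KMP_AFFINITY=compact,granularity=fine

   # HT=on: use this affinity setting instead
   # export KMP_AFFINITY=compact,1,0,granularity=fine
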
Memory allocation
-----------------
Buffer pointers should be aligned on a 64-byte boundary. The NUMA policy should
be configured for local memory allocation (``numactl --localalloc``).
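
For example (a minimal sketch; ``my_training_app`` is a hypothetical placeholder
for your actual framework or training command), the local-allocation policy can
be applied when launching the workload:

.. code-block:: console

   $ numactl --localalloc ./my_training_app
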
Convolution shapes
^^^^^^^^^^^^^^^^^^
* When **running inference, or training for forward-propagation and weight
  updates**, for best performance:

  - the number of input channels should be 1, 3, or a multiple of SIMD-width
    (8 for AVX2 systems, 16 for AVX512 systems), and
  - the number of output channels should be a multiple of SIMD-width (8 for
    AVX2 systems, 16 for AVX512 systems).

* When **training with backward propagation**, for best performance:

  - the number of input and output channels should be a multiple of SIMD-width
    (8 for AVX2 systems, 16 for AVX512 systems),
  - padding should not exceed :math:`0.5x`, where :math:`x` is the kernel size, and
  - kernel width should be less than 14.

For example, on an AVX512 system (SIMD-width 16), a convolution layer with 3
input channels and 64 output channels satisfies these guidelines for inference,
while one with 24 output channels does not.
``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^
The best resource for this configuration option is the `gnu.org site`_.
``OMP_NUM_THREADS`` defaults to the number of logical cores. To check the
number of cores on your system, you can run the following on the command
line to see the details of your CPU:

.. code-block:: console

   $ lscpu
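
If you intend to pin threads to physical cores only (see the Threading
recommendations above), one way to derive that count on Linux is to count the
unique core/socket pairs that ``lscpu`` reports; this snippet is a sketch, not
an officially recommended command:

.. code-block:: console

   $ export OMP_NUM_THREADS=$(lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l)
   $ echo $OMP_NUM_THREADS
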
Intra-op and inter-op parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``
Some frameworks, like TensorFlow, use these settings to improve performance;
however, they are often not sufficient to achieve optimal performance.
Framework-based adjustments cannot access the underlying NUMA configuration of
multi-socket Intel® Xeon® processor-based platforms, which is a key requirement
for many kinds of inference-engine computations. See the next section on NUMA
performance to learn more about this performance feature available to systems
using nGraph.
NUMA performance
~~~~~~~~~~~~~~~~~
NUMA stands for :abbr:`Non-Uniform Memory Access (NUMA)`. It indicates how each
CPU can access memory attached to each socket.
Without the "knowledge" of CPU socket and NUMA configuration, a simple thread
affinity (as in the case of a thread pool) does not lead to optimal performance.
In fact, it can sometimes prohibitively decrease throughput; a core from socket
0 might have to continually access cache lines from the memory bank of socket 1,
increasing bandwidth demands on the Intel® Ultra-Path Interconnect (Intel® UPI).
This situation is exacerbated by the larger numbers of sockets found in 4-, 8-,
and 16-socket systems. We believe that users need to be aware of system-level
optimizations in addition to framework-specific configuration parameters to
achieve the best performance for NN workloads on CPU platforms.
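
To see how cores and memory banks are laid out across sockets, and to bind a
process to a single NUMA node, the standard ``numactl`` utility can be used.
This is a sketch; ``my_training_app`` is again a hypothetical placeholder for
your actual launch command.

.. code-block:: console

   $ numactl --hardware
   $ numactl --cpunodebind=0 --membind=0 ./my_training_app
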
.. _KMP options: https://software.intel.com/en-us/node/522691
.. _MKLDNN: https://github.com/intel/mkl-dnn
.. _gnu.org site: https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html
.. _Movidius: https://www.movidius.com/
.. optimize/index:
#############################
Integrate Generic Frameworks
#############################
This section is written for framework architects or engineers who want
to optimize a generic, brand-new, or less widely supported framework. Here we
provide some of what we have learned from the work we've done in developing
"framework direct optimizations (DO)" and custom bridge code, such as
that for our `ngraph tensorflow bridge`_ code.

.. important:: This section contains articles for framework owners or developers
   who want to incorporate the nGraph library directly into their framework and
   optimize for some specific compute-time characteristic.

.. toctree::
   :maxdepth: 1

   generic.rst
When using a framework to run a model or deploy an algorithm on nGraph
devices, there are some additional configuration options that can be
incorporated -- manually on the command line or via scripting -- to improve
performance. Fine-tuning an nGraph-enabled device is as much of an art as it
is a science; there are virtually limitless ways to do so.
Since a framework is typically designed around some feature, such as fast
training using image data, inference on a mobile device, or support for voice
and speech pattern recognition, a framework cannot optimize for all
possibilities at the same time.
In general, the larger and more complex a framework is, the harder it becomes
to navigate and extract the best performance; configuration options that are
enabled by "default" from the framework side can sometimes slow down compilation
without the developer being any the wiser. Sometimes only `a few small`_
adjustments can increase performance. Likewise, a minimalistic framework that
is designed around one specific kind of model can sometimes offer significant
performance-improvement opportunities by lowering overhead.
Right now the preferred way for a data scientist to get better performance is
to shop around and select the framework that is "already" designed or optimized
for some characteristic or trait of the model they want to build, test, tweak,
or run. One challenge for the framework developer, then, is to differentiate from
the pack by providing a means for the data scientist to obtain reproducible
results. The other challenge is to provide sufficient documentation, or at least
sufficient hints, on how to do any "fine-tuning" for specific use cases.
In creating the :doc:`direct optimizations <../framework-integration-guides>`
we've shared with the developer community, our `engineering teams carefully tune the workload to extract best performance`_
from a specific :abbr:`DL (Deep Learning)` model embedded in a specific framework
that is training a specific dataset. Our forks of the frameworks adjust the code
and/or explain how to set the parameters that achieve reproducible results.
Some of the ways we attempt to improve performance include:

* Testing and recording the results of various system-level configuration options
  and of enabled or disabled flags,
* Compiling with a mix of custom environment variables,
* Finding semi-related comparisons for benchmarking [#1]_,
* Tuning lower levels of the system so that the machine-learning algorithm can
  learn faster or more accurately than it did on previous runs, and
* Incorporating various :doc:`../ops/index` to build graphs more efficiently.
This approach, however, is obviously not a scalable solution for developers on
the framework side who are trying to support multiple use cases. Nor is it ideal
for teams looking to pivot or innovate multi-layer solutions based on something
**other than training speed**, such as accuracy or precision. Chasing
performance improvements does eventually yield a diminishing
:abbr:`Return on Investment (ROI)`, though it is up to the framework
developer to decide when that point has been reached for each of their customers.
For these reasons, we're providing some of the more commonly used options for
fine-tuning various code deployments to the nGraph-enabled devices we
currently support. Watch this section as we enable new devices and post new
updates.
.. rubric:: Footnotes
.. [#1] Benchmarking performance of DL systems is a young discipline; it is a
   good idea to be vigilant for results based on atypical distortions in the
   configuration parameters. Every topology is different, and performance
   increases or slowdowns can be attributed to multiple causes.
.. _ngraph tensorflow bridge: http://ngraph.nervanasys.com/docs/latest/framework-integration-guides.html#tensorflow
.. _engineering teams carefully tune the workload to extract best performance: https://ai.intel.com/accelerating-deep-learning-training-inference-system-level-optimizations
.. _a few small: https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-phi
.. _Movidius: https://www.movidius.com/
\ No newline at end of file