Commit 8a5a4c89 authored by Leona C, committed by Sang Ik Lee

Robust debugging docs (#4060)

* Robust debugging docs

* Add section on nbench and address comments from review

* Collaborate with Gauri to revise profiling section

* Revise and PR feedback

* Move note

* Fix wording

* Order sections more logically and fix a comma

* Phrasing fix on nbench_tf summary

* More prominent notice of experimental debug flags

* Better description for diagnostic tools

* Remove miscellaneous framework support

* clean up section

* Remove deprecated links

* Update sitemap to not use a page title

* Useful descriptions

* PR feedback

* Not a flag

* Prebuilt MLIR compile flag available

* Remove duplicate flag

* update pass manager example

* Meta documentation note in release notes

* Ensure docs build with lastest upstream ops

* Transpose op doc fixes

* Better intra-doc links

* Commas in csv format are important

* Final review with Gauri

* Remove dupes, CPU-specific envvars

* changes re: Comments from Gauri
Co-authored-by: Robert Kimball <robert.kimball@intel.com>
Co-authored-by: Sang Ik Lee <sang.ik.lee@intel.com>
parent 8e46ff86
......@@ -1414,15 +1414,16 @@ input[type="radio"][disabled], input[type="checkbox"][disabled] {
border-left-width: 0;
}
.wy-table thead, .rst-content table.docutils thead, .rst-content table.field-list thead {
color: #000;
text-align: left;
color: #1b1b1e;
text-align: center;
vertical-align: bottom;
white-space: nowrap;
}
.wy-table thead th, .rst-content table.docutils thead th, .rst-content table.field-list thead th {
font-family: Nunito, 'Nunito Sans', sans;
font-variant: small-caps;
border-bottom: solid 1px #c1c7d7;
font-family: monospace;
font-size: 1.33em;
background-color: #3e4451;
color: #e0e0e0;
}
.wy-table td, .rst-content table.docutils td, .rst-content table.field-list td {
background-color: transparent;
......@@ -1491,7 +1492,6 @@ input[type="radio"][disabled], input[type="checkbox"][disabled] {
.wy-table-horizontal td, .wy-table-horizontal th {
border-width: 0 0 1px 0;
border-bottom: 1px solid #e1e4e5;
font-variant: small-caps;
}
.wy-table-horizontal tbody > tr:last-child td {
border-bottom-width: 0;
......@@ -1512,8 +1512,8 @@ input[type="radio"][disabled], input[type="checkbox"][disabled] {
.wy-table-responsive table th {
white-space: pre-wrap;
font-family: Nunito, 'Nunito Sans', sans;
font-variant: small-caps;
font-family: monospace;
}
......@@ -2310,7 +2310,7 @@ div[class^='highlight'] pre {
margin-bottom: 12px;
}
.rst-content .toc-backref {
color: #7b7064;
color: #1b1b1e;
}
.rst-content .align-right {
float: right;
......@@ -3000,7 +3000,7 @@ footer span.commit code, footer span.commit .rst-content tt, .rst-content footer
}
.rst-footer-buttons {
*zoom: 1;
zoom: 1;
}
.rst-footer-buttons:before, .rst-footer-buttons:after {
display: table;
......
......@@ -32,12 +32,12 @@ steps and the code below.
#. Create a "pass manager" object (line 1)
#. Populate it with the desired pass or passes (lines 2-4)
#. Invoke the pass manager with a pointer to your unoptimized graph, and
it will return a pointer to an optimized graph (lines 5-6)
it will return a pointer to an optimized graph (lines 5-8)
.. literalinclude:: ../../../../../test/cpu_fusion.cpp
.. literalinclude:: ../../../../../test/pass_memory_layout.cpp
:language: cpp
:lines: 2085-2092
:lines: 222-230
:linenos:
nGraph Core includes a large library of hardware-agnostic passes useful
......
......@@ -13,4 +13,3 @@ Working with Frameworks
onnx_integ.rst
paddle_integ.rst
tensorflow_connect.rst
other/index.rst
.. frameworks/other/index.rst:
.. _fw_other:
.. contents::
Integrating other frameworks
============================
This section details some of the *configuration options* and some of the
*environment variables* that can be used to tune for optimal performance when
your system already has a version of nGraph installed with one or more of our
supported :doc:`../../backends/index`.
Regardless of the framework, after the :doc:`../../buildlb` step, a good place
to start usually involves making the libraries available to the framework. On
Linux\* systems built on Intel® Architecture, that command tends to look
something like:
.. code-block:: console
export NGRAPH_CPP_BUILD_PATH=path/to/ngraph_dist/
export LD_LIBRARY_PATH=path/to/ngraph_dist/lib/
Find or display version
-----------------------
If you're working with the :doc:`../../python_api/index`, the following command
may be useful:
.. code-block:: console
python3 -c "import ngraph as ng; print('nGraph version: ',ng.__version__)";
To manually build a newer version than is available from the latest `PyPI`_
(:abbr:`Python Package Index (PyPI)`), see our nGraph Python API `BUILDING.md`_
documentation.
Activate logtrace-related environment variables
-----------------------------------------------
Another configuration option is to activate ``NGRAPH_CPU_DEBUG_TRACER``,
a runtime environment variable that supports extra logging and debug detail.
This is a useful tool for data scientists interested in outputs from logtrace
files that can, for example, help in tracking down model convergences. It can
also help engineers who might want to add their new ``Backend`` to an existing
framework to compare intermediate tensors/values to references from a CPU
backend.
To activate this tool, set the ``env`` var ``NGRAPH_CPU_DEBUG_TRACER=1``.
It will dump ``trace_meta.log`` and ``trace_bin_data.log``. The names of the
logfiles can be customized.
To specify the names of logs with those flags:
::
NGRAPH_TRACER_LOG="meta.log"
NGRAPH_BIN_TRACER_LOG="bin.log"
The meta_log contains::
kernel_name, serial_number_of_op, tensor_id, symbol_of_in_out, num_elements, shape, binary_data_offset, mean_of_tensor, variance_of_tensor
A line example from a unit-test might look like::
K=Add S=0 TID=0_0 >> size=4 Shape{2, 2} bin_data_offset=8 mean=1.5 var=1.25
The binary_log line contains::
tensor_id, binary data (tensor data)
A reference for the implementation of parsing these logfiles can also be found
in the unit test for this feature.
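As an illustration of the format described above, a minimal parser for one ``trace_meta.log`` line might look like the following sketch. It is based only on the example line shown here; the authoritative field layout is defined by the unit test referenced above, and ``parse_meta_line`` is a hypothetical helper, not part of nGraph.

```python
import re

def parse_meta_line(line):
    """Parse one trace_meta.log line into a dict (illustrative sketch;
    field layout inferred from the example line in the docs)."""
    m = re.match(
        r"K=(?P<kernel>\S+) S=(?P<serial>\d+) TID=(?P<tid>\S+) (?P<sym>\S+) "
        r"size=(?P<size>\d+) Shape\{(?P<shape>[^}]*)\} "
        r"bin_data_offset=(?P<offset>\d+) mean=(?P<mean>\S+) var=(?P<var>\S+)",
        line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {
        "kernel": d["kernel"],                       # kernel_name
        "serial": int(d["serial"]),                  # serial_number_of_op
        "tensor_id": d["tid"],
        "in_out": d["sym"],                          # symbol_of_in_out
        "num_elements": int(d["size"]),
        "shape": tuple(int(x) for x in d["shape"].split(",")) if d["shape"] else (),
        "offset": int(d["offset"]),                  # binary_data_offset
        "mean": float(d["mean"]),
        "variance": float(d["var"]),
    }

rec = parse_meta_line(
    "K=Add S=0 TID=0_0 >> size=4 Shape{2, 2} bin_data_offset=8 mean=1.5 var=1.25")
```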
FMV
---
FMV stands for Function Multi-Versioning, a GCC feature that allows a binary
to carry multiple architecture-specific versions of a function and select the
best one at runtime. It provides a generic way to bring architecture-based
optimizations to the :abbr:`Operating System (OS)` that is handling your ML
environment. See the `GCC wiki for details`_.
If your nGraph build is a Neural Network configured on Clear Linux\* OS
for Intel® Architecture, and it includes at least one older CPU, the
`following article may be helpful`_.
Training Deep Neural Networks
-----------------------------
Before tweaking various environment variables, be aware that how the computation
gets executed depends on the data layout that the model is using. ``NHWC`` and
``NCHW`` are common layouts in Deep Learning models. Your ultimate
runtime can vary greatly -- even when all other factors are exactly the same --
when this detail is overlooked.
For CPU (and most cuDNN) backends, the preferred layout is currently ``NCHW``.
* **N** -- Number of images per batch
* **C** -- Channel of the image (expressed as a number like 3 for RGB and 1
for grayscale)
* **H** -- Height of the image
* **W** -- Width of the image
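For example, converting a shape between the two layouts is just a reordering of dimensions; the sketch below is illustrative only (``nhwc_to_nchw`` is a hypothetical helper, not an nGraph API).

```python
def nhwc_to_nchw(shape):
    """Reorder an (N, H, W, C) shape tuple into (N, C, H, W).

    Only the logical shape changes here; in a real tensor the data
    itself must also be transposed to match the new layout.
    """
    n, h, w, c = shape
    return (n, c, h, w)

# A batch of 32 RGB images, 224x224, in NHWC:
nchw = nhwc_to_nchw((32, 224, 224, 3))  # (32, 3, 224, 224)
```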
Intel® Math Kernel Library for Deep Neural Networks
---------------------------------------------------
.. important:: Intel® MKL-DNN is automatically enabled as part of an
nGraph default :doc:`build <../../buildlb>`; you do *not* need to add it
separately or as an additional component to be able to use these
configuration settings.
The following `KMP`_ options were originally tuned for training models with
the ``NCHW`` data layout using Intel® `MKL-DNN`_; however, other
configurations can be explored.
* ``KMP_BLOCKTIME`` Sets the time, in milliseconds, that a thread should wait
after completing the execution of a parallel region, before sleeping.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical
processing units. A useful article that explains more about how to use this
option for various CPU backends is here: https://web.archive.org/web/20190401182248/https://www.nas.nasa.gov/hecc/support/kb/Using-Intel-OpenMP-Thread-Affinity-for-Pinning_285.html
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of
OpenMP\* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
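A starting configuration might be set from Python before the framework is imported; the values below are assumptions to tune for your own system and model, not recommendations for every workload.

```python
import os

# Illustrative KMP/OMP starting point for an NCHW model on a CPU backend.
# These must be set before the OpenMP runtime initializes, i.e. before the
# framework itself is imported.
os.environ["KMP_BLOCKTIME"] = "1"            # ms a thread waits before sleeping
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["KMP_SETTINGS"] = "TRUE"          # print OpenMP env vars at startup
os.environ["OMP_NUM_THREADS"] = "4"          # should not exceed physical cores
```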
nGraph-enabled Intel® Xeon®
---------------------------
The list below includes recommendations on data layout, parameters, and
application configuration to achieve best performance running DNN workloads on
Intel® Xeon® (CPU processor) systems.
Threading
---------
The number of threads set by ``OMP_NUM_THREADS`` should not exceed the number of
physical cores. The threads should be pinned to their respective physical cores
and activated as follows:
* When ``HT=off``, ``KMP_AFFINITY=compact,granularity=fine``
* When ``HT=on``, ``KMP_AFFINITY=compact,1,0,granularity=fine``
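These two cases can be captured in a small helper (``kmp_affinity`` is a hypothetical function for illustration):

```python
def kmp_affinity(hyperthreading_on):
    """Return the recommended KMP_AFFINITY value for pinning threads to
    physical cores. The extra ",1,0" permute/offset fields place
    consecutive threads on distinct physical cores when each core
    exposes two hardware threads (HT on)."""
    if hyperthreading_on:
        return "compact,1,0,granularity=fine"
    return "compact,granularity=fine"
```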
Memory allocation
-----------------
Buffer pointers should be aligned on 64-byte boundaries. NUMA policy should be
configured for local memory allocation (``numactl --localalloc``).
Convolution shapes
^^^^^^^^^^^^^^^^^^
* When **running inference, or training for forward-propagation and weight
updates**, for best performance:
- the number of input channels should be 1, 3, or a multiple of SIMD-width (8
for AVX2 systems, 16 for AVX512 systems).
- the number of output channels should be a multiple of SIMD-width (8 for AVX2
systems, 16 for AVX512 systems).
* When **training backward propagation**, the number of input and output
channels should be a multiple of SIMD-width (8 for AVX2 systems, 16 for AVX512
systems); additionally:
- padding should not exceed :math:`0.5x` where :math:`x` is the kernel size.
- kernel width should be less than 14.
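The channel-count guidance above can be expressed as a quick check (an illustrative sketch; ``conv_channels_ok`` is a hypothetical helper, and ``simd_width`` is 8 on AVX2 systems and 16 on AVX512 systems):

```python
def conv_channels_ok(in_channels, out_channels, simd_width, backward=False):
    """Check the convolution channel-count guidance above."""
    out_ok = out_channels % simd_width == 0
    if backward:
        # Backward propagation: input channels must also be a SIMD-width multiple.
        in_ok = in_channels % simd_width == 0
    else:
        # Inference / forward + weight updates: 1, 3, or a SIMD-width multiple.
        in_ok = in_channels in (1, 3) or in_channels % simd_width == 0
    return in_ok and out_ok
```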
``OMP_NUM_THREADS``
^^^^^^^^^^^^^^^^^^^
The best resource for this configuration option is the Intel® OpenMP\* docs
at the following link: `Intel OpenMP documentation`_. ``OMP_NUM_THREADS``
defaults to the number of logical cores. To check the number of cores on your
system, you can run the following on the command-line to see the details
of your CPU:
.. code-block:: console
$ lscpu
Intra-op and inter-op parallelism
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``
Some frameworks, like TensorFlow\*, use these settings to improve performance;
however, they are often not sufficient for optimal performance. Framework-based
adjustments cannot access the underlying NUMA configuration in multi-socket
Intel® Xeon® processor-based platforms, which is a key requirement for
many kinds of inference-engine computations. See the next section on NUMA
performance to learn more about this performance feature available to systems
utilizing nGraph.
NUMA performance
~~~~~~~~~~~~~~~~~
NUMA stands for :abbr:`Non-Uniform Memory Access (NUMA)`. It indicates how each
CPU can access memory attached to each socket.
Without the "knowledge" of CPU socket and NUMA configuration, a simple thread
affinity (as in the case of thread pool) does not lead to optimal performance.
In fact, it can sometimes prohibitively decrease throughput; a core from socket
0 might have to continually access cache lines from the memory bank of socket 1,
increasing bandwidth demands on the Intel® Ultra-Path Interconnect (Intel® UPI).
This situation is exacerbated by the larger numbers of sockets found in 4-, 8-,
and 16-socket systems. We believe that users need to be aware of system-level
optimizations, in addition to framework-specific configuration parameters, to
achieve the best performance for NN workloads on CPU platforms. The nGraph
Compiler stack applies such optimizations in its transformers for Intel®
Architecture (IA), and thus can make more efficient use of the underlying
hardware.
.. _PyPI: https://pypi.org/project/ngraph-core
.. _KMP: https://software.intel.com/en-us/node/522691
.. _MKL-DNN: https://github.com/intel/mkl-dnn
.. _Intel OpenMP documentation: https://www.openmprtl.org/documentation
.. _Movidius: https://www.movidius.com/
.. _BUILDING.md: https://github.com/NervanaSystems/ngraph/blob/master/python/BUILDING.md
.. _GCC wiki for details: https://gcc.gnu.org/wiki/FunctionMultiVersioning
.. _following article may be helpful: https://clearlinux.org/documentation/clear-linux/tutorials/fmv
......@@ -46,7 +46,6 @@ nGraph Compiler Stack Documentation
frameworks/tensorflow_connect.rst
frameworks/onnx_integ.rst
frameworks/paddle_integ.rst
frameworks/other/index.rst
.. toctree::
:maxdepth: 1
......@@ -68,8 +67,7 @@ nGraph Compiler Stack Documentation
:caption: Backend Support
Basic Concepts <backends/index.rst>
backends/plaidml-ng-api/index.rst
Integrating Other Backends <backends/cpp-api.rst>
Adding New Backends <backends/cpp-api.rst>
.. toctree::
......@@ -89,9 +87,14 @@ nGraph Compiler Stack Documentation
.. toctree::
:maxdepth: 1
:caption: Debugging Graphs
inspection/index.rst
:caption: Diagnostics
inspection/debug_core.rst
inspection/debug_tf.rst
inspection/debug_onnx.rst
inspection/debug_paddle.rst
inspection/viz_tools.rst
inspection/profiling.rst
.. toctree::
......
.. inspection/debug_core.rst:
.. contents::
.. _debug_core:
Diagnostics
###########
.. important:: Many of the following flags are experimental and subject to change.
Build nGraph with various compile flags and environment variables to diagnose performance
and memory issues. See also :doc:`profiling`.
Compile Flags
=============
.. csv-table::
:header: "Compile Flag", "Description", "Default Value"
:widths: 20, 35, 5
:escape: ~
``NGRAPH_CODE_COVERAGE_ENABLE``, Enable code coverage data collection, ``FALSE``
``NGRAPH_DEBUG_ENABLE``, Enable output for ``NGRAPH_DEBUG`` statements, ``FALSE``
``NGRAPH_DEPRECATED_ENABLE``, Enable compiler deprecation pragmas for deprecated APIs (recommended only for development use), ``FALSE``
``NGRAPH_DEX_ONLY``, Build CPU DEX without codegen, ``FALSE``
``NGRAPH_DISTRIBUTED_ENABLE``, Enable distributed training using MLSL/OpenMPI, ``OFF``
``NGRAPH_DISTRIBUTED_MLSL_ENABLE``, Use MLSL, ``OFF``
``NGRAPH_DOC_BUILD_ENABLE``, Automatically build documentation, ``OFF``
``NGRAPH_FAST_MATH_ENABLE``, Enable fast math, ``ON``
``NGRAPH_HALIDE``, ,``OFF``
``NGRAPH_INTERPRETER_ENABLE``, Control the building of the ``INTERPRETER`` backend, ``TRUE``
``NGRAPH_INTERPRETER_STATIC_LIB_ENABLE``, Enable building the INTERPRETER backend as a static library, ``FALSE``
``NGRAPH_JSON_ENABLE``, Enable JSON based serialization and tracing features, ``TRUE``
``NGRAPH_LIB_VERSIONING_ENABLE``, Enable shared library versioning, ``FALSE``
``NGRAPH_MLIR_ENABLE``, Control the building of MLIR backend, ``FALSE``
``NGRAPH_NOP_ENABLE``, Control the building of the NOP backend, ``TRUE``
``NGRAPH_ONNX_IMPORT_ENABLE``, Enable ONNX importer, ``FALSE``
``NGRAPH_PLAIDML_ENABLE``, Enable the PlaidML backend, ``${PLAIDML_FOUND}``
``NGRAPH_PYTHON_BUILD_ENABLE``, Enable build of ``NGRAPH`` python package wheel, ``FALSE``
``NGRAPH_STATIC_LIB_ENABLE``, Enable building ``NGRAPH`` as a static library, ``FALSE``
``NGRAPH_TBB_ENABLE``, Only if (``NGRAPH_CPU_ENABLE``) Control usage of TBB for CPU backend, ``TRUE``
``NGRAPH_TOOLS_ENABLE``, Control the building of tools, ``TRUE``
``NGRAPH_UNIT_TEST_ENABLE``, Control the building of unit tests, ``TRUE``
``NGRAPH_USE_PREBUILT_LLVM``, Use a precompiled LLVM, ``FALSE``
``NGRAPH_USE_PREBUILT_MLIR``, Use the `precompiled MLIR`_, ``FALSE``
Environment Variables
=====================
.. important:: Many of the following flags are experimental and subject to change.
.. csv-table::
:header: "Environment Variable", "Description"
:widths: 20, 35
:escape: ~
``NGRAPH_DISABLE_LOGGING``, Disable printing all logs irrespective of build type
``NGRAPH_DISABLED_FUSIONS``, Disable specified fusions; specified as a ``;``-separated list and supports regex
``NGRAPH_ENABLE_REPLACE_CHECK``, Enables strict type checking in the copy constructor ``copy_with_new_args``
``NGRAPH_ENABLE_SERIALIZE_TRACING``, Generates one ``json`` file per pass to run with ``nbench`` for localized execution rather than whole-stack execution
``NGRAPH_ENABLE_TRACING``, Enables creating graph execution timelines to be viewed in ``chrome://tracing``; see also :doc:`viz_tools`
``NGRAPH_ENABLE_VISUALIZE_TRACING``, Enables creating a visual graph for each pass (``.svg`` files by default); see also :doc:`viz_tools`
``NGRAPH_FAIL_MATCH_AT``, Allows one to specify node name patterns to abort pattern matching at particular nodes. Helps debug an offending fusion
``NGRAPH_GTEST_INFO``, Enables printing info about a specific test
``NGRAPH_INTER_OP_PARALLELISM``, See :ref:`interop_intraop`
``NGRAPH_INTRA_OP_PARALLELISM``, See :ref:`interop_intraop`
``NGRAPH_PASS_ATTRIBUTES``, Specify pass-specific attributes as a semicolon-separated list to be enabled or disabled. Naming of pass attributes is up to the backends; see also `pass config`_
``NGRAPH_PASS_ENABLES``, Specify a semi-colon separated list to enable or disable a pass on core or backend. This will override the default enable/disable values
``NGRAPH_PROFILE_PASS_ENABLE``, Dump the name and execution time of each pass; shows per-pass time taken to compile
``NGRAPH_PROVENANCE_ENABLE``, Enable adding provenance info to nodes. This will also be added to serialized files.
``NGRAPH_SERIALIZER_OUTPUT_SHAPES``, Enable adding output shapes in the serialized graph
``NGRAPH_VISUALIZE_EDGE_JUMP_DISTANCE``, Calculated in code; helps prevent *long* edges between two nodes very far apart
``NGRAPH_VISUALIZE_EDGE_LABELS``, Set it to 1 in ``~/.bashrc``; adds a label to a graph edge when ``NGRAPH_ENABLE_VISUALIZE_TRACING=1``
``NGRAPH_VISUALIZE_TREE_OUTPUT_SHAPES``, Set it to 1 in ``~/.bashrc``; adds the output shape of a node when ``NGRAPH_ENABLE_VISUALIZE_TRACING=1``
``NGRAPH_VISUALIZE_TREE_OUTPUT_TYPES``, Set it to 1 in ``~/.bashrc``; adds the output type of a node when ``NGRAPH_ENABLE_VISUALIZE_TRACING=1``
``NGRAPH_VISUALIZE_TRACING_FORMAT``, Default format is ``.svg``. See also :doc:`viz_tools`
``OMP_NUM_THREADS``, See: `OpenMPI Runtime Library Documentation`_
.. _debug_tracer:
Debug Tracer
------------
Another diagnostic configuration option is to activate ``NGRAPH_CPU_DEBUG_TRACER``,
a runtime environment variable that supports extra logging and debug detail.
This is a useful tool for data scientists interested in outputs from logtrace
files that can, for example, help in tracking down model convergences. It can
also help engineers who might want to add their new ``Backend`` to an existing
framework to compare intermediate tensors/values to references from a CPU
backend.
To activate this tool, set the ``env`` var ``NGRAPH_CPU_DEBUG_TRACER=1``.
It will dump ``trace_meta.log`` and ``trace_bin_data.log``. The names of the
logfiles can be customized.
To specify the names of logs with those flags:
::
NGRAPH_TRACER_LOG="meta.log"
NGRAPH_BIN_TRACER_LOG="bin.log"
.. _interop_intraop:
Intra-op and inter-op parallelism
---------------------------------
* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``
Some frameworks, like TensorFlow\*, use these settings to improve performance;
however, they are often not sufficient for optimal performance. Framework-based
adjustments cannot access the underlying NUMA configuration in multi-socket
Intel® Xeon® processor-based platforms, which is a key requirement for
many kinds of inference-engine computations.
The meta_log contains::
kernel_name, serial_number_of_op, tensor_id, symbol_of_in_out, num_elements, shape, binary_data_offset, mean_of_tensor, variance_of_tensor
A line example from a unit-test might look like::
K=Add S=0 TID=0_0 >> size=4 Shape{2, 2} bin_data_offset=8 mean=1.5 var=1.25
The binary_log line contains::
tensor_id, binary data (tensor data)
A reference for the implementation of parsing these logfiles can also be found
in the unit test for this feature.
.. _pass config: https://github.com/NervanaSystems/ngraph/blob/a4a3031bb40f19ec28704f76de39762e1f27e031/src/ngraph/pass/pass_config.cpp#L54
.. _OpenMPI Runtime Library Documentation: https://www.openmprtl.org/documentation
.. _precompiled MLIR: https://github.com/IntelAI/mlir
\ No newline at end of file
.. inspection/debug_onnx:
.. _debug_onnx:
Debug ONNX
==========
.. note:: These flags are all disabled by default
.. csv-table::
:header: "Flag", "Description"
:widths: 20, 35
:escape: ~
``ONNXRUNTIME_NGRAPH_DUMP_OPS``, Dumps ONNX ops
``ONNXRUNTIME_NGRAPH_LRU_CACHE_SIZE``, Modify LRU cache size (``NGRAPH_EP_LRU_CACHE_DEFAULT_SIZE 500``)
\ No newline at end of file
.. inspection/debug_paddle.rst:
.. _debug_paddle:
Debug PaddlePaddle\*
====================
PaddlePaddle has its `own env vars`_.
.. _own env vars: https://github.com/PaddlePaddle/Paddle/blob/cdd46d7e022add8de56995e681fa807982b02124/python/paddle/fluid/__init__.py#L161-L227
\ No newline at end of file
.. inspection/debug_tf:
.. _debug_tf:
Debug TensorFlow\*
==================
.. note:: These flags are all disabled by default
For profiling with TensorFlow\* and ``nbench``, see :ref:`nbench_tf`.
.. csv-table::
:header: "Flag", "Description"
:widths: 20, 35
:escape: ~
``NGRAPH_ENABLE_SERIALIZE=1``,Generate nGraph-level serialized graphs
``NGRAPH_TF_VLOG_LEVEL=5``, Generate ngraph-tf logging info for different passes
``NGRAPH_TF_LOG_PLACEMENT=1``, Generate op placement log at stdout
``NGRAPH_TF_DUMP_CLUSTERS=1``, Dump Encapsulated TF Graphs formatted as ``NGRAPH_cluster_<cluster_num>``
``NGRAPH_TF_DUMP_GRAPHS=1``,"Dump TF graphs for different passes: precapture, capture, unmarked, marked, clustered, declustered, encapsulated"
``TF_CPP_MIN_VLOG_LEVEL=1``, Enable TF CPP logs
``NGRAPH_TF_DUMP_DECLUSTERED_GRAPHS=1``, Dump graphs with final clusters assigned. Use this to view TF computation graph with colored nodes indicating clusters
``NGRAPH_TF_USE_LEGACY_EXECUTOR``, This flag will be obsolete soon.
.. inspection/index:
.. _inspection:
Visualization Tools
###################
nGraph provides serialization and deserialization facilities, along with the
ability to create image formats or a PDF.
When visualization is enabled, ``svg`` files for your graph get generated. The
default format can be adjusted by setting the ``NGRAPH_VISUALIZE_TRACING_FORMAT``
flag to another format, like PNG or PDF.
.. note:: Large graphs are usually not legible with formats like PDF.
Large graphs may require additional work to get into a human-readable format.
On the back end, very long edges will need to be cut to make (for example) a
hard-to-render training graph tractable. This can be a tedious process, so
incorporating the help of a rendering engine or third-party tool like those
listed below may be useful.
.. Additional scripts
.. ==================
.. We have provided a script to convert the `most common default output`_, nGraph
.. ``JSON``, to an output that is better able to handle detailed graphs; however,
.. we do not offer user support for this script. The script will produce a
.. ``.graphml`` file that can be imported and inspected with third-party tools
.. like:
#. `Gephi`_
#. `Cytoscape`_
.. #. `Netron`_ support tentatively planned to come soon
.. _CMakeLists.txt: https://github.com/NervanaSystems/ngraph/blob/master/CMakeLists.txt
.. _most common default output: https://github.com/NervanaSystems/ngraph/contrib/tools/graphml/ngraph_json_to_graphml.py
.. _visualize_tree.cpp: https://github.com/NervanaSystems/ngraph/blob/master/src/ngraph/pass/visualize_tree.cpp
.. _Netron: https://github.com/lutzroeder/netron/blob/master/README.md
.. _Gephi: https://gephi.org
.. _Cytoscape: https://cytoscape.org
:orphan:
.. _inspection:
Debug Tools
###########
.. toctree::
:maxdepth: 1
debug_core.rst
debug_tf.rst
debug_onnx.rst
debug_paddle.rst
viz_tools.rst
profiling.rst
.. inspection/profiling.rst:
.. _profiling:
Performance testing with ``nbench``
###################################
The nGraph Compiler stack includes the ``nbench`` tool, which provides
additional methods of assessing or debugging performance issues.
If you follow the build process under :doc:`../buildlb`, the
``NGRAPH_TOOLS_ENABLE`` flag defaults to ``ON`` and automatically
builds ``nbench``. As its name suggests, ``nbench`` can be used
to benchmark any nGraph-serialized model with a given backend.
To benchmark an already-serialized nGraph ``.json`` model with, for
example, a ``CPU`` backend, run ``nbench`` as follows.
.. code-block:: console
$ cd ngraph/build/src/tools
$ nbench/nbench -b CPU -i 1 -f <serialized_json file>
Samples for testing can be found under ``ngraph/test/models``.
.. _nbench:
``nbench``
==========
.. code-block:: none
Benchmark an nGraph JSON model with a given backend.
SYNOPSIS
nbench [-f <filename>] [-b <backend>] [-i <iterations>]
OPTIONS
-f|--file Serialized model file
-b|--backend Backend to use (default: CPU)
-d|--directory Directory to scan for models. All models are benchmarked.
-i|--iterations Iterations (default: 10)
-s|--statistics Display op statistics
-v|--visualize Visualize a model (WARNING: requires Graphviz installed)
--timing_detail Gather detailed timing
-w|--warmup_iterations Number of warm-up iterations
--no_copy_data Disable copy of input/result data every iteration
--dot Generate Graphviz dot file
--double_buffer Double buffer inputs and outputs
.. _nbench_tf:
Use ``nbench`` to ease end-to-end debugging for TensorFlow\*
------------------------------------------------------------
Rather than run a TensorFlow\* model "end-to-end" all the time,
developers who notice a problem with performance or memory usage
can generate a unique serialized model for debugging by using
``NGRAPH_ENABLE_SERIALIZE=1``. This serialized model can then be
run and re-run with ``nbench`` to efficiently experiment with any
changes in ``ngraph`` space; developers can make changes and test
changes without the overhead of a complete end-to-end compilation
for each change.
Find or display version
-----------------------
If you're working with the :doc:`../../python_api/index`, the following command
may be useful:
.. code-block:: console
python3 -c "import ngraph as ng; print('nGraph version: ',ng.__version__)";
To manually build a newer version than is available from the latest `PyPI`_
(:abbr:`Python Package Index (PyPI)`), see our nGraph Python API `BUILDING.md`_
documentation.
.. _PyPI: https://pypi.org/project/ngraph-core/
.. _BUILDING.md: https://github.com/NervanaSystems/ngraph/blob/master/python/BUILDING.md
.. inspection/viz_tools.rst:
.. _viz_tools:
General Visualization Tools
###########################
nGraph provides serialization and deserialization facilities, along with the
ability to create image formats or a PDF.
``NGRAPH_ENABLE_VISUALIZE_TRACING=1`` enables visualization and generates graph
visualization files.
.. note:: Using ``NGRAPH_ENABLE_VISUALIZE_TRACING=1`` will affect performance.
When visualization is enabled, ``svg`` files for your graph get generated. The
default format can be adjusted by setting the ``NGRAPH_VISUALIZE_TRACING_FORMAT``
flag to another format, like PNG or PDF.
.. note:: Large graphs are usually not legible with formats like PDF.
Large graphs may require additional work to get into a human-readable format.
On the back end, very long edges will need to be cut to make (for example) a
hard-to-render training graph tractable. This can be a tedious process, so
incorporating the help of a rendering engine or third-party tool like one
listed below may be useful.
#. `Gephi`_
#. `Cytoscape`_
#. `Netron`_
.. Additional scripts
.. ==================
.. We have provided a script to convert the `most common default output`_, nGraph
.. ``JSON``, to an output that is better able to handle detailed graphs; however,
.. we do not offer user support for this script. The script will produce a
.. ``.graphml`` file that can be imported and inspected with third-party tools
.. like those listed above.
.. _most common default output: https://github.com/NervanaSystems/ngraph/contrib/tools/graphml/ngraph_json_to_graphml.py
.. _Netron: https://github.com/lutzroeder/netron/blob/master/README.md
.. _Gephi: https://gephi.org
.. _Cytoscape: https://cytoscape.org
......@@ -21,22 +21,22 @@ matrix transposition, and also more general cases on higher-rank tensors.
Inputs
------
+-----------------+-------------------------+---------------------------------------------+
| Name | Element Type | Shape |
+=================+=========================+=============================================+
| ``arg`` | Any | Any |
+-----------------+-------------------------+---------------------------------------------+
| ``input_order`` | ``element::i64`` | ``[n]``, where `n`` is the rank of ``arg``. |
+-----------------+-------------------------+---------------------------------------------+
+-----------------+-------------------------+----------------------------------------------+
| Name | Element Type | Shape |
+=================+=========================+==============================================+
| ``arg`` | Any | Any |
+-----------------+-------------------------+----------------------------------------------+
| ``input_order`` | ``element::i64`` | ``[n]``, where ``n`` is the rank of ``arg``. |
+-----------------+-------------------------+----------------------------------------------+
Outputs
-------
+-----------------+-------------------------+-------------------------------------------------------------------------------+
| Name | Element Type | Shape |
+=================+=========================+===============================================================================+
| ``output`` | Same as ``arg`` | ``P(ShapeOf(arg))``, where `P` is the permutation supplied for `input_order`. |
+-----------------+-------------------------+-------------------------------------------------------------------------------+
+-----------------+-------------------------+---------------------------------------------------------------------------------+
| Name | Element Type | Shape |
+=================+=========================+=================================================================================+
| ``output`` | Same as ``arg`` | ``P(ShapeOf(arg))``, where *P* is the permutation supplied for ``input_order``. |
+-----------------+-------------------------+---------------------------------------------------------------------------------+
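The output-shape rule in the table above can be sketched in a few lines (``transpose_shape`` is a hypothetical helper for illustration, not part of the nGraph API):

```python
def transpose_shape(arg_shape, input_order):
    """Apply the permutation input_order to arg_shape.

    input_order must contain every integer in [0, n-1] exactly once,
    where n is the rank of arg.
    """
    assert sorted(input_order) == list(range(len(arg_shape)))
    return tuple(arg_shape[axis] for axis in input_order)
```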
The input ``input_order`` must be a vector of shape ``[n]``, where ``n`` is the
rank of ``arg``, and must contain every integer in the range ``[0,n-1]``. This
......@@ -69,6 +69,6 @@ Not yet implemented.
C++ Interface
=============
.. doxygenclass:: ngraph::op::v0::Transpose
.. doxygenclass:: ngraph::op::v1::Transpose
:project: ngraph
:members:
......@@ -26,6 +26,7 @@ Core updates for |version|
Latest documentation updates
----------------------------
+ Better debugging documentation
+ Dynamic Shapes and APIs
+ Provenance
+ Add linkages and overview for quantization APIs
......
......@@ -6,7 +6,6 @@
:maxdepth: 1
introduction
tutorials/index.rst
* :ref:`Framework Support <framework_support>`
......@@ -63,12 +62,18 @@
frameworks/validated/list.rst
* :ref:`Debugging Graphs <inspection>`
* :ref:`Diagnostics <inspection>`
.. toctree::
:maxdepth: 1
inspection/index.rst
inspection/debug_core.rst
inspection/debug_tf.rst
inspection/debug_onnx.rst
inspection/debug_paddle.rst
inspection/viz_tools.rst
inspection/profiling.rst
* :ref:`Contribution <contribution_guide>`
......