Commit fd0ed37c authored by Leona C's avatar Leona C Committed by Sang Ik Lee

Reorganize doc folders for core-related doc on fusion, graph rewrite, and compiler passes (#2466)

* Cleaner API doc reference for compile call

* Add a useful table for nGraph namespaces

* Remove layout namespace

* Show exploding kernel problem on illustration like IEEE preso

* WIP branch for new documentation restructuring that is a huge pain

* Fix the doc reorg mess

* Fix underline

* List of passes disclaimer note

* Update disclaimers on README

* More cleanup of doc reorg

* Update core docs

* Update overview on core

* Add PR feedback

* Get rid of all the gazillion of doc build errors from rearranging stuff

* Add section on tutorials

* Update branch

* Cleanup intro

* Add better detail to overview
parent 12b5f085
......@@ -10,7 +10,6 @@
## Quick start
To begin using nGraph with popular frameworks to accelerate deep learning
workloads on CPU for inference, please refer to the links below.
......@@ -20,11 +19,21 @@ workloads on CPU for inference, please refer to the links below.
| MXNet* 1.3 | [Pip install](https://github.com/NervanaSystems/ngraph-mxnet#Installation) or [Build from source](https://github.com/NervanaSystems/ngraph-mxnet#building-with-ngraph-support)| 18 [Validated workloads]
| ONNX 1.3 | [Pip install](https://github.com/NervanaSystems/ngraph-onnx#installation) | 14 [Validated workloads]
:exclamation: :exclamation: :exclamation: Note that the ``pip`` package option
works only with Ubuntu 16.04 or greater and Intel® Xeon® CPUs. CPUs without
Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will not run these
packages; the alternative is to build from source. Wider support for other
CPUs will be offered starting in early 2019 :exclamation: :exclamation: :exclamation:
#### Python wheels for nGraph
The Python wheels for nGraph have been tested and are supported on the following
64-bit systems:
* Ubuntu 16.04 or later
* CentOS 7.6
* Debian 10
* macOS 10.14.3 (Mojave)
:exclamation: Note that the ``pip`` package option works only with Intel® Xeon® CPUs.
CPUs without Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will not run
these packages; the alternative is to build from source. Wider support for other
CPUs will be offered in later releases.
Frameworks using the nGraph Compiler stack to execute workloads have shown
[**up to 45X**](https://ai.intel.com/ngraph-compiler-stack-beta-release/)
......
.. backend-support/index.rst
Transformers, PlaidML
###############################
About backends
##############
* :ref:`what_is_backend`
* :ref:`hybrid_transformer`
* :ref:`cpu_backend`
* :ref:`plaidml_backend`
* :ref:`gpu_backend`
What is a backend?
------------------
.. _what_is_backend:
What's a backend?
-----------------
Backends are responsible for function execution and value allocation. They
can be used to :doc:`carry out a programmed computation<../howto/execute>`
from a framework by using a CPU or GPU; or they can be used with an *Interpreter*
mode, which is primarily intended for testing, to analyze a program, or for a
framework developer to develop customizations. Experimental APIs to support
In the nGraph Compiler stack, what we call a *backend* is responsible for
function execution and value allocation. A backend can be used to
:doc:`carry out a programmed computation<../core/constructing-graphs/execute>`
from a framework on a CPU or GPU; or it can be used with an *Interpreter* mode,
which is primarily intended for testing, to analyze a program, or to help a
framework developer customize targeted solutions. Experimental APIs to support
current and future nGraph Backends are also available; see, for example, the
section on :ref:`plaidml_backend`.
section on the :ref:`plaidml_backend`.
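As a condensed sketch of how this looks in practice (the complete, working
example lives in :doc:`../core/constructing-graphs/execute`; the compile/call
signatures shown here are assumptions to check against the current C++ API
reference):

.. code-block:: cpp

   #include <ngraph/ngraph.hpp>

   using namespace ngraph;

   // Build a trivial function: f(a, b) = a + b
   auto a = std::make_shared<op::Parameter>(element::f32, Shape{2, 2});
   auto b = std::make_shared<op::Parameter>(element::f32, Shape{2, 2});
   auto f = std::make_shared<Function>(a + b, ParameterVector{a, b});

   // Select a backend by name; "INTERPRETER" or "GPU" could be used instead
   auto backend = runtime::Backend::create("CPU");

   // Backend-managed storage for the inputs and the result
   auto t_a = backend->create_tensor(element::f32, Shape{2, 2});
   auto t_b = backend->create_tensor(element::f32, Shape{2, 2});
   auto t_r = backend->create_tensor(element::f32, Shape{2, 2});
   // (input data would be copied into t_a and t_b via their write() method)

   // Compile the function, then execute it on the chosen backend
   auto exec = backend->compile(f);
   exec->call_with_validate({t_r}, {t_a, t_b});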
.. _hybrid_transformer:
......@@ -28,7 +31,7 @@ section on :ref:`plaidml_backend`.
Hybrid Transformer
==================
Coming soon
More detail coming soon
.. _cpu_backend:
......@@ -36,7 +39,7 @@ Coming soon
CPU Backend
===========
Coming soon
More detail coming soon
.. _gpu_backend:
......@@ -44,7 +47,7 @@ Coming soon
GPU Backend
===========
Coming soon
More detail coming soon
.. _plaidml_backend:
......
......@@ -61,7 +61,7 @@ The process documented here will work on Ubuntu\* 16.04 (LTS) or on Ubuntu
give ownership of that directory to your user. Creating such a placeholder
can be useful if you'd like to have a local reference for APIs and
documentation, or if you are a developer who wants to experiment with
how to :doc:`../howto/execute` using resources available through the
how to :doc:`core/constructing-graphs/execute` using resources available through the
code base.
.. code-block:: console
......@@ -131,7 +131,7 @@ The process documented here will work on CentOS 7.4.
give ownership of that directory to your user. Creating such a placeholder
can be useful if you'd like to have a local reference for APIs and
documentation, or if you are a developer who wants to experiment with
how to :doc:`../howto/execute` using resources available through the
how to :doc:`core/constructing-graphs/execute` using resources available through the
code base.
.. code-block:: console
......@@ -231,7 +231,7 @@ can help you get started with training a model on a supported framework.
For the latter case, if you've followed a tutorial from `ONNX`_, and you have an
exported, serialized model, you can skip the section on frameworks and go directly
to our :doc:`../howto/import` documentation.
to our :doc:`core/constructing-graphs/import` documentation.
Please keep in mind that both of these are under continuous development, and will
be updated frequently in the coming months. Stay tuned!
......
......@@ -80,20 +80,20 @@ We begin by building the graph, starting with the input parameter
``X``. We also define a fully-connected layer, including parameters for
weights and bias:
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 127-135
Repeat the process for the next layer,
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 138-146
and normalize everything with a ``softmax``.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 148-150
......@@ -107,7 +107,7 @@ We use cross-entropy to compute the loss. nGraph does not currently have a core
op for cross-entropy, so we implement it directly, adding clipping to prevent
underflow.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 154-166
......@@ -123,7 +123,7 @@ because of the way it is implemented in interpreted frameworks. In nGraph, we
augment the loss computation with computations for the weight adjustments. This
allows the calculations for the adjustments to be further optimized.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 169-172
......@@ -136,16 +136,16 @@ update computation for ``N`` will be given by the node
auto update = loss->backprop_node(N, delta);
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 177-181
The different update nodes will share intermediate computations. So to
get the updated values for the weights as computed with the specified
:doc:`backend <../backend-support/index>`:
:doc:`backend <../../backend-support/index>`:
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 182-215
......@@ -165,7 +165,7 @@ use the same nodes in different functions, nGraph currently does not
allow the same nodes to be compiled in different functions, so we
compile clones of the nodes.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 216-224
......@@ -8,9 +8,10 @@ Distribute training across multiple nGraph backends
however, the following configuration options have worked for nGraph devices
with mixed or limited success in testing.
In the :doc:`previous section <../howto/derive-for-training>`, we described the
steps needed to create a "trainable" nGraph model. Here we demonstrate how to
train a data parallel model by distributing the graph to more than one device.
In the :doc:`previous section <../constructing-graphs/derive-for-training>`,
we described the steps needed to create a "trainable" nGraph model. Here we
demonstrate how to train a data parallel model by distributing the graph to
more than one device.
Frameworks can implement distributed training with nGraph versions prior to
`0.13`:
......@@ -35,12 +36,12 @@ Finally, to run the training using two nGraph devices, invoke
$ mpirun
To deploy data-parallel training, the ``AllReduce`` op should be added after the
steps needed to complete the :doc:`backpropagation <../howto/derive-for-training>`;
steps needed to complete the :doc:`backpropagation <../constructing-graphs/derive-for-training>`;
the new code is highlighted below:
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 180-196
:lines: 178-194
:emphasize-lines: 8-11
See the `full code`_ in the ``examples`` folder ``/doc/examples/mnist_mlp/dist_mnist_mlp.cpp``.
......
......@@ -60,7 +60,7 @@ Every node has zero or more *inputs*, zero or more *outputs*, and zero or more
*attributes*.
The specifics for each ``type`` permitted on a core ``Op``-specific basis can be
discovered in our :doc:`../ops/index` docs. For our purpose to
discovered in our :doc:`../../ops/index` docs. For our purpose to
:ref:`define a computation <define_cmp>`, nodes should be thought of as essentially
immutable; that is, when constructing a node, we need to supply all of its
inputs. We get this process started with ops that have no inputs, since any op
......@@ -71,7 +71,7 @@ They receive their values from outside of the graph, so they have no inputs.
They have attributes for the element type and the shape of the tensor that will
be passed to them.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 25-29
......@@ -81,7 +81,7 @@ shape ``(2, 3)`` and a row-major element layout.
To create a graph for ``(a + b) * c``, first make an ``op::Add`` node with inputs
from ``a`` and ``b``, and an ``op::Multiply`` node from the add node and ``c``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 31-32
......@@ -94,7 +94,7 @@ type and shape of its unique output.
Once the graph is built, we need to package it in a ``Function``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 35-36
......@@ -126,12 +126,12 @@ There are two backends for the CPU: the optimized ``"CPU"`` backend, which uses
the `Intel MKL-DNN`_, and the ``"INTERPRETER"`` backend, which runs reference
versions of kernels that favor implementation clarity over speed. The
``"INTERPRETER"`` backend can be slow, and is primarily intended for testing.
See the documentation on :doc:`runtime options for various backends <../backend-support/index>`
See the documentation on :doc:`runtime options for various backends <../../backend-support/index>`
for additional details.
To continue with our original example and select the ``"CPU_Backend"``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 38-39
......@@ -168,14 +168,14 @@ Backends are responsible for managing storage. If the storage is off-CPU, caches
are used to minimize copying between device and CPU. We can allocate storage for
the three parameters and the return value.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 41-46
Each tensor is a shared pointer to a :term:`Tensorview`, which is the interface
backends implement for tensor use. When there are no more references to the
tensor view, it will be freed when convenient for the backend. See the
:doc:`../backend-support/cpp-api` documentation for details on how to work
:doc:`../../backend-support/cpp-api` documentation for details on how to work
with ``Tensor``.
......@@ -186,7 +186,7 @@ Initialize the inputs
Next we need to copy some data into the tensors.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 48-55
......@@ -201,7 +201,7 @@ Invoke the computation
To invoke the function, we simply pass argument and resultant tensors to the
call frame:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 57-58
......@@ -213,7 +213,7 @@ Access the outputs
We can use the ``read`` method to access the result:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 60-77
......@@ -222,7 +222,7 @@ We can use the ``read`` method to access the result:
Put it all together
===================
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:linenos:
:caption: "The (a + b) * c example for executing a computation on nGraph"
......
......@@ -22,13 +22,12 @@ usually named ``<some_model>.onnx`` or ``<some_model>.onnx.pb``. These
`tutorials from ONNX`_ describe how to turn trained models into an
``.onnx`` export.
.. important:: If you landed on this page and you already have an ``.onnx``
or ``.onnx.pb`` formatted file, you should be able to run the inference
without needing to dig into anything from the "Frameworks" sections. You
will, however, need to have completed the steps outlined in
our :doc:`../buildlb` guide. If you intend to build nGraph for : doc:`distributed-training`,
you will need to build that has already been compiled with the additional
cmake flag: ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
.. important:: If you landed on this page and you already have an ``.onnx`` or
an ``.onnx.pb`` formatted file, you should be able to run the inference without
needing to dig into anything from the "Frameworks" sections. You will, however,
need to have completed the steps outlined in our :doc:`../../buildlb` guide.
If you intend to build nGraph for distributed training, you will need to
follow the instructions in the :doc:`../../distr/index` documentation.
To demonstrate functionality, we'll use an already-serialized CIFAR10 model
trained via ResNet20. Remember that this model has already been trained and
......@@ -154,7 +153,7 @@ specify the relative path to the location of the ``.onnx`` file.
Enable ONNX and load an ONNX file from disk
--------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 17-19
......@@ -162,7 +161,7 @@ Enable ONNX and load an ONNX file from disk
Convert an ONNX model to an ngraph model
-------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 22-23
......@@ -189,7 +188,7 @@ input parameters for the computation which generates the output.
Using ngraph_api, create a callable computation object
-------------------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 27-29
......@@ -197,14 +196,14 @@ Using ngraph_api, create a callable computation object
Load or create an image
------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 32-33
Run ResNet inference on picture
---------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 36-37
......@@ -212,7 +211,7 @@ Run ResNet inference on picture
Put it all together
===================
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 17-37
:caption: "Demo sample code to run inference with nGraph"
......
......@@ -12,10 +12,12 @@ Constructing Graphs
update.rst
derive-for-training.rst
distribute-train.rst
import.rst
import.rst
Using the Python API <../../python_api/index.rst>
The "How to" articles in this section explain how to do specific tasks with
nGraph components. The recipes are all framework agnostic; in other words,
The "How to" articles in this section explain how to build or construct graphs
with nGraph components. The recipes are all framework agnostic; in other words,
if an entity (framework or user) wishes to make use of target-based computational
resources, it can either:
......
......@@ -10,13 +10,13 @@ building of graphs.
Several C++ operators are overloaded to simplify graph construction.
For example, the following:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 32-32
can be simplified to:
.. literalinclude:: ../../../examples/abc_operator/abc_operator.cpp
.. literalinclude:: ../../../../examples/abc_operator/abc_operator.cpp
:language: cpp
:lines: 31
......
......@@ -15,7 +15,7 @@ An example from C++
Let's start with a simple C++ example, a function ``count`` that
returns how many times it has already been called:
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 20-24
:caption: update.cpp
......@@ -27,13 +27,13 @@ convert this to use a stateless function, define a function that
takes the current value of ``counter`` as an argument and returns the
updated value.
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 26-29
To use this version of counting,
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 36-48
......
.. fusion/optimize-graphs:
.. fusion/index.rst:
Pattern matcher
###############
Optimize Graphs
===============
.. toctree::
:maxdepth: 1
with nGraph Compiler fusions
----------------------------
overview.rst
graph-rewrite.rst
The nGraph Compiler is an optimizing compiler. As such, it provides a way to
capture a given :term:`function graph` and perform a series of optimization
passes over that graph. The result is a semantically-equivalent graph that, when
executed using any :doc:`backend <../backend-support/index>`, has optimizations
executed using any :doc:`backend <../../backend-support/index>`, has optimizations
inherent at the hardware level: superior runtime characteristics to increase
training performance or reduce inference latency.
......@@ -41,7 +43,30 @@ then inspecting the transformed graph.
Optimization passes can be programmed ahead of time if you know or can predict
what your graph will look like when it's ready to be executed (in other words:
which `ops` can be automatically translated into :doc:`nGraph Core ops <../ops/index>`).
which `ops` can be automatically translated into :doc:`nGraph Core ops <../../ops/index>`).
The ``Interpreter`` is simply a backend providing reference implementations of
nGraph ops in C++, with a focus on simplicity over performance.
Example
-------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. In other words, the original expression
:math:`a + b * 1` is exactly equivalent to the expression :math:`a + b`, so we
can eliminate this extra multiplication step.
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and then ``replace`` those
expressions with just their multiplicands (for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into an optimized version (via ``pass/graph_rewrite.hpp``).
\ No newline at end of file
.. fusion/overview.rst
Overview: Optimize graphs with nGraph Compiler fusions
-------------------------------------------------------
The nGraph Compiler is an optimizing compiler. As such, it provides a way to
capture a given :term:`function graph` and perform a series of optimization
passes over that graph. The result is a semantically-equivalent graph that, when
executed using any :doc:`backend <../../backend-support/index>`, has
hardware-agnostic *and* hardware-specific optimizations, providing superior
runtime characteristics to increase training performance or reduce inference
latency.
There are several ways to describe what happens when we capture and translate
the framework's output of ops into an nGraph graph. :term:`Fusion` is the term
we shall use in our documentation; the action also can be described as:
*combining*, *folding*, *squashing*, *collapsing*, or *merging* of graph
functions.
Optimization passes may include algebraic simplifications, domain-specific
simplifications, and fusion. Most passes share the same mode of operation (or
the same operational structure) and consist of various stages (each one a
:term:`step`) where a developer can experiment with the intercepted or dynamic
graph. These steps may be cycled or recycled as needed:
#. Locate a list of potentially-transformable subgraphs in the given graph.
#. Transform the selected candidates into semantically-equivalent subgraphs
that execute faster, or with less memory (or both).
#. Verify that the optimization pass performs correctly and applies the expected
transformations by using the ``NGRAPH_SERIALIZE_TRACING`` option, which
serializes a graph in ``json`` format after each pass (see the sketch after this list).
#. Measure and evaluate your performance improvements with ``NGRAPH_CPU_TRACING``,
which produces timelines compatible with ``chrome://tracing``.
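Assuming these options are picked up as environment variables at runtime (the
exact names and mechanics may differ between releases, and ``my_model`` below
is only a placeholder binary), a quick check might look like:

.. code-block:: console

   $ NGRAPH_SERIALIZE_TRACING=1 ./my_model   # dump a json snapshot of the graph after each pass
   $ NGRAPH_CPU_TRACING=1 ./my_model         # emit a chrome://tracing-compatible timeline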
Optimizations can be experimented upon without using any backend by registering
a pass with pass manager (``Manager``), calling ``run_passes`` on a function, and
then inspecting the transformed graph.
Optimization passes can be programmed ahead of time if you know or can predict
what your graph will look like when it's ready to be executed (in other words:
which `ops` can be automatically translated into :doc:`nGraph Core ops <../../ops/index>`).
The ``Interpreter`` is simply a backend providing reference implementations of
nGraph ops in C++, with a focus on simplicity over performance.
Example
-------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. This is the same as saying:
:math:`b * 1 = b`
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and to then ``transform``,
``simplify``, or ``replace`` those expressions with just their multiplicands
(for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into a condensed version (via ``pass/graph_rewrite.hpp``).
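As a rough sketch of that flow (the header paths and exact ``Manager`` calls
here are assumptions; check the headers under ``ngraph/pass/`` for the current
API), the multiply-by-one graph above could be built and simplified like this:

.. code-block:: cpp

   #include <ngraph/ngraph.hpp>
   #include <ngraph/pass/algebraic_simplification.hpp>
   #include <ngraph/pass/manager.hpp>

   using namespace ngraph;

   // Build the graph for a + b * 1
   auto a   = std::make_shared<op::Parameter>(element::f32, Shape{4});
   auto b   = std::make_shared<op::Parameter>(element::f32, Shape{4});
   auto one = op::Constant::create(element::f32, Shape{4}, {1, 1, 1, 1});
   auto f   = std::make_shared<Function>(a + b * one, ParameterVector{a, b});

   // Register and run the algebraic simplification pass
   pass::Manager pass_manager;
   pass_manager.register_pass<pass::AlgebraicSimplification>();
   pass_manager.run_passes(f);

   // The multiply-by-one node has now been folded away; the function
   // computes a + b directly.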
......@@ -4,11 +4,57 @@
Overview
========
What follows here is a table of all documented namespaces with brief descriptions:
.. figure:: ../graphics/whole-stack.png
:alt: The whole stack
The whole nGraph Compiler stack
The nGraph Compiler stack consists of bridges, core, and backends. We'll examine
each of these briefly to get started.
A framework bridge is a component that sits between a framework like TensorFlow
or MXNet and the nGraph Core "frontend" API. A framework bridge does two things:
first, it translates a framework's operations into graphs in nGraph's in-memory
:abbr:`Intermediate Representation (IR)`; second, it executes the nGraph IR
graphs via the backend execution interface.
The details of bridge implementation vary from framework to framework, but there
are some common patterns. A fairly typical flow for a graph-based framework is
illustrated here; it consists of two phases: a **clustering** phase and a
**translation** phase.
.. figure:: ../graphics/translation-flow-to-ng-fofx.png
:alt: Translation flow to an nGraph function
Translation flow to an nGraph function
The clustering phase operates on the original framework's graph. During this
stage, we look for maximal subgraphs containing nodes that can be translated
to data flow functions in nGraph. The ability to capture subgraphs of the original
graph means that we maintain interoperability with the native framework runtime.
Any node that is not placed in a cluster can still be handled by the native
framework. On the other hand, identifying maximal subgraphs means that we can
avoid unnecessary handoffs between the native framework runtime and nGraph;
minimizing these handoffs is good for performance.
In the second phase, called translation, we cut out each cluster subgraph,
translate it into an nGraph Function, and replace the cluster subgraph with a
stand-in node called an "encapsulation node" that holds a pointer to the nGraph
``Function``. Later, at runtime, those functions will be invoked when the
framework asks us to execute the encapsulation node.
It’s worth noting that backends have total freedom to rewrite the nGraph
Functions: they can do it for the sake of structural or algorithmic optimization
of the graph, for easy integration with kernel libraries, or for any or no
reason at all.
Namespaces in nGraph
--------------------
What follows here is a table of all documented namespaces with brief
descriptions:
Namespace List
--------------
.. csv-table::
:header: "Namespace", "Description", "Location in Repo", "Docs"
......@@ -27,7 +73,3 @@ Namespace List
.. _Ndescriptor: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/descriptor
.. _Nop: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/op
.. _Nruntime: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/runtime
.. fusion/index.rst:
.. core/passes/list-of-passes:
Pattern matcher
###############
* :ref:`overview`
* :ref:`passes_list`
* :ref:`more_detail`
* :ref:`passes_examples`
* :doc:`optimize-graphs`
.. _overview:
Generic graph optimizers: Optimization passes
=============================================
The pass manager infrastructure in nGraph makes it easy to reuse and mix the
generic optimization passes. It also permits you to roll your own device-specific
optimizations; that is, the same unified interface and APIs may be used to
cover both things.
Invoking these passes is fairly straightforward:
#. Create a "pass manager" object.
#. Populate it with the desired passes.
#. Pass to it a pointer to your unoptimized graph, and it’ll return a pointer
to an optimized graph.
nGraph Core includes a large library of hardware-agnostic passes -- passes useful
for almost any kind of hardware backend. Some of these passes should be familiar
to people who are comfortable with classical compiler designs. Others, like the
reshape/transpose elimination and sinking passes, are quite specific to deep
learning.
Let’s take a look at some of these passes.
List of passes
==============
.. csv-table::
:header: "Pass Name", "More Detail"
:widths: 29, 31
:escape: ~
``AlgebraicSimplification``, :ref:`algebraic_simpl`
``AssignLayout``, Coming Soon
``CallGraphPass``, Coming Soon
``CommonFunctionCollection``, Coming Soon
``CommonSubexpressionElimination``, :ref:`common_subex_elim`
``ConstantFolding``, :ref:`constant_fold`
``CoreFusion``, Coming Soon
``DumpSorted``, Coming Soon
``FunctionPass``, Coming Soon
``GetOutputElementElimination``, Coming Soon
``GraphRewrite``, Coming Soon
``LikeReplacement``, Coming Soon
``Liveness``, Coming Soon
``Manager``, Coming Soon
``ManagerState``, Coming Soon
``MemoryLayout``, Coming Soon
``MemoryManager``, Coming Soon
``MemoryVisualize``, Coming Soon
``ModulePass``, Coming Soon
``NodePass``, Coming Soon
``NopElimination``, Coming Soon
``PassBase``, Coming Soon
``PassConfig``, Coming Soon
``PrefixReshapeElimination``, Coming Soon
``PropagateCacheability``, Coming Soon
``RecurrentGraphRewrite``, Coming Soon
``ReshapeElimination``, :ref:`reshape_transpose_elim`
``ReshapeSinking``, :ref:`reshape_transpose_sink`
``Serialization``, Coming Soon
``ValidateGraph``, Coming Soon
``VisualizeTree``, Coming Soon
``ZeroDimTensorElimination``, Coming Soon
.. important:: All of the above passes are currently available; more detailed
documentation for each pass may be a :abbr:`Work In Progress (WIP)`.
.. _passes_list:
List of Passes
==============
.. _algebraic_simpl:
* :ref:`algebraic_simpl`
* :ref:`common_subex_elim`
* :ref:`constant_fold`
* :ref:`reshape_transpose_elim`
* :ref:`reshape_transpose_sink`
``Algebraic Simplification``
----------------------------
.. figure:: ../../graphics/algebraic-simpl.png
:width: 650px
.. _algebraic_simpl:
Algebraic simplification
Algebraic Simplification
------------------------
The **Algebraic Simplification** pass implements what amounts to a "grab bag" of
algebraic simplification rules. It does some basic things like rewrite "zero
......@@ -60,10 +65,8 @@ times x" to simply "zero", or "zero plus x" to plain "x".
It can also do a number of tricks more specific to deep learning. For example,
if we discover that a tensor is being sliced up by adjacent segments, only to
have those slices concatenated back together again, we can skip the slicing and
concatting altogether.
Or, if a tensor is being padded, but the actual width of the padding is zero
all around, we can skip the padding step entirely.
concatting altogether. Or, if a tensor is being padded, but the actual width of
the padding is zero all around, we can skip the padding step entirely.
Several other transformations like this are implemented in the algebraic
simplification pass. And while none of these transformations might seem
......@@ -71,33 +74,34 @@ particularly impressive on their own, when everything comes together the
results of this pass often yield improvement even on the initial graph straight
out of the bridge. This pass is also quite important as a "glue" pass that can
be used to clean up and/or re-simplify after other passes have done their own
tricks.
tricks. See :doc:`passes` for an example of how effective this can be.
.. _common_subex_elim:
Common Subexpression Elimination
--------------------------------
``Common Subexpression Elimination``
-------------------------------------
.. _constant_fold:
Constant Folding
----------------
``Constant Folding``
--------------------
.. _core_fusion:
Core Fusion
-----------
``Core Fusion``
---------------
.. _reshape_transpose_elim:
Reshape/Transpose Elimination
-----------------------------
``Reshape Elimination``
-----------------------
The pass called **Reshape/Transpose Elimination** will find and optimize where
This pass, also known as **Reshape/Transpose Elimination**, will find and optimize cases where
we can "push" two ``Transpose`` ops through a matrix multiplication. For example,
if you have two matrices (say, *foo* and *bar*), both of these matrices will be
transposed (to produce *foo.t* and *bar.t*, respectively), after which *foo.t*
......@@ -120,8 +124,8 @@ them both out of the graph.
.. _reshape_transpose_sink:
``Reshape/Transpose Sinking``
-----------------------------
``Reshape Sinking``
-------------------
......@@ -130,76 +134,4 @@ them both out of the graph.
.. _elementzero_tensor_elim:
``Zero-Element Tensor Elimination``
-----------------------------------
.. _more_detail:
More detail
-----------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. This is the same as saying:
:math:`b * 1 = b`
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and to then ``transform``,
``simplify``, or ``replace`` those expressions with just their multiplicands
(for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into a condensed version (via ``pass/graph_rewrite.hpp``).
Let's consider each in more detail and many ways they can help the graph
optimizer.
.. toctree::
:maxdepth: 1
graph-rewrite.rst
passes-that-use-matcher.rst
optimize-graphs.rst
.. _passes_examples:
Examples of Passes
==================
The effectiveness of these passes is more striking to look at in terms of an
actual input graph, such as one from the framework bridge.
*Figure 0* shows an excerpt from ``MobileNet v1``, a topology which makes heavy
use of group convolution.
.. _figure-mobilenet-gc:
.. figure:: ../graphics/mobilenet-group-conv.png
:width: 700px
:alt:
Figure 0: Each of these grouped convolution complexes -- the
operations within the rectangles on the left -- is very wide; each is too
wide to fit legibly on the illustration.
The group convolution fusion is able to replace each of those giant subgraphs
with a single CPU group convolution node. This ends up being a win in several
ways:
* sheer node count,
* mappability to MKL-DNN (which has an accelerated group convolution implementation),
* elimination of unnecessary temporaries, and so on.
\ No newline at end of file
-----------------------------------
\ No newline at end of file
......@@ -131,6 +131,6 @@ Equivalent to ``"A(BC)+A"`` in regexes
.. |image11| image:: mg/fusion_pattern.png
.. |image12| image:: mg/rp_graph1.png
.. |image13| image:: mg/rp_pattern.png
\ No newline at end of file
.. |image11| image:: ../fusion/mg/fusion_pattern.png
.. |image12| image:: ../fusion/mg/rp_graph1.png
.. |image13| image:: ../fusion/mg/rp_pattern.png
\ No newline at end of file
.. core/passes:
Compiler passes
===============
.. toctree::
:maxdepth: 1
:caption: Compiler passes
list-of-passes.rst
passes-that-use-matcher.rst
Overview: Generic graph optimization passes
-------------------------------------------
The pass manager infrastructure in nGraph makes it easy to reuse and mix the
generic optimization passes. It also permits you to roll your own device-specific
optimizations; that is, the same unified interface and APIs may be used to
cover both things.
Invoking these passes is fairly straightforward:
#. Create a "pass manager" object.
#. Populate it with the desired pass(es).
#. Invoke the pass manager with a pointer to your unoptimized graph; it will
return a pointer to an optimized graph (see the sketch below).
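To make these steps concrete, here is a minimal sketch. The pass names are
taken from the list of passes documented in this section; ``my_function`` (a
``std::shared_ptr<Function>`` holding the unoptimized graph) and the exact
registration calls are assumptions to verify against the current headers.

.. code-block:: cpp

   pass::Manager pass_manager;

   // Populate the manager with hardware-agnostic passes, in the order they should run
   pass_manager.register_pass<pass::NopElimination>();
   pass_manager.register_pass<pass::AlgebraicSimplification>();
   pass_manager.register_pass<pass::CommonSubexpressionElimination>();
   pass_manager.register_pass<pass::ConstantFolding>();

   // Run the whole pipeline over the unoptimized function
   pass_manager.run_passes(my_function);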
nGraph Core includes a large library of hardware-agnostic passes useful
for almost any kind of hardware backend. Some of these passes are likely familiar
to people who are comfortable with classical compiler designs. Others, like the
reshape/transpose elimination and sinking passes, are quite specific to deep
learning.
Example of Passes
-----------------
The effectiveness of graph-level optimization with nGraph is more striking to look
at in terms of an actual input graph, such as one from the framework bridge.
*Figure A* shows an excerpt from ``MobileNet v1``, a topology which makes heavy
use of group convolution.
.. _figure-mobilenet-gc:
.. figure:: ../../graphics/mobilenet-group-conv.png
:width: 700px
:alt:
Figure A: Each of these grouped convolution complexes -- the
operations within the rectangles on the left -- is very wide; each is too
wide to fit legibly on the illustration.
The group convolution fusion is able to replace each of those giant subgraphs
with a single CPU group convolution node. This ends up being a win in several
ways:
* sheer node count,
* mappability to MKL-DNN (which has an accelerated group convolution implementation),
* elimination of unnecessary temporaries, and so on.
\ No newline at end of file
.. frameworks/index.rst:
.. TODO update CODEOWNERS for this new structure
Current framework integrations
==============================
.. toctree::
:maxdepth: 1
tensorflow_integ.rst
mxnet_integ.rst
onnx_integ.rst
......@@ -22,7 +18,7 @@ cloned from one of our GitHub repos and built to connect to nGraph device backen
all the while maintaining the framework's programmatic or user interface. Bridges
currently exist for the TensorFlow\* and MXNet\* frameworks.
.. figure:: ../graphics/bridge-to-graph-compiler.png
.. figure:: ../graphics/whole-stack.png
:width: 733px
:alt: JIT compiling of a computation
......
......@@ -42,12 +42,10 @@ nGraph Compiler stack
:caption: nGraph Core
core/overview.rst
Pattern matcher <fusion/index.rst>
core/fusion/index.rst
nGraph Core Ops <ops/index.rst>
More about Ops <ops/about.rst>
Graph construction <howto/index.rst>
Using the Python API <python_api/index.rst>
Compiler passes <fusion/graph-rewrite.rst>
core/constructing-graphs/index.rst
core/passes/passes.rst
buildlb.rst
......@@ -75,6 +73,12 @@ nGraph Compiler stack
diagnostics/visualize.rst
diagnostics/debug.rst
.. toctree::
:maxdepth: 1
:caption: Tutorials
tutorials/index.rst
.. toctree::
:maxdepth: 1
......@@ -84,7 +88,7 @@ nGraph Compiler stack
project/contribution-guide.rst
project/index.rst
glossary.rst
project/doc-contributor-README.rst
Indices and tables
......
.. ops/about.rst:
##############
About Core Ops
##############
An ``Op``'s primary role is to function as a node in a directed acyclic graph:
the dependency graph of a computation.
*Core ops* are ops that are available and generally useful to all framework
bridges and that can be compiled by all transformers. A framework bridge may
define framework-specific ops to simplify graph construction, provided that the
bridge can enable every transformer to replace all such ops with equivalent
clusters or subgraphs composed of core ops. Similarly, transformers may define
transformer-specific ops to represent kernels or other intermediate operations.
If a framework supports extending the set of ops it offers, a bridge may even
expose transformer-specific ops to the framework user.
.. figure:: ../graphics/tablengraphops.png
:width: 535px
:alt: Operations Available in the nGraph IR
Operations Available in the nGraph IR
.. important:: Our design philosophy is that the graph is not a script for
running kernels; rather, our compilation will match ``ops`` to appropriate
kernels for the backend(s) in use. Thus, we expect the addition of new Core
ops to be infrequent, and that most functionality will instead be added with
new functions that build sub-graphs from existing core ops.
It is easiest to define a new op by adapting an existing op; a structural
sketch follows the list below. Some of the tasks that must be performed are:
- Op constructor:
* Checking type-consistency of arguments
* Specifying the result type for a call
- Serializer/Deserializer
- Transformer handlers:
* Interpreter (reference) implementation of behavior. The
implementation should favor clarity over efficiency.
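Purely as a structural sketch (the base-class constructor arguments and method
names shown here are assumptions; adapt a real op from the repository rather
than copying this verbatim), a new op roughly takes this shape:

.. code-block:: cpp

   class MyOp : public ngraph::op::Op
   {
   public:
       // Constructor: wires up the argument and checks type-consistency
       MyOp(const std::shared_ptr<ngraph::Node>& arg)
           : Op("MyOp", ngraph::check_single_output_args({arg}))
       {
           constructor_validate_and_infer_types();
       }

       // Specify the result type for a call; this example is shape-preserving
       void validate_and_infer_types() override
       {
           set_output_type(0, get_input_element_type(0), get_input_shape(0));
       }

       // Needed so the op can be cloned when functions are copied
       std::shared_ptr<ngraph::Node>
           copy_with_new_args(const ngraph::NodeVector& new_args) const override
       {
           return std::make_shared<MyOp>(new_args.at(0));
       }
   };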
......@@ -6,6 +6,8 @@ List of Core ``ops``
Not currently a comprehensive list.
:ref:`more_about`
.. hlist::
:columns: 3
......@@ -143,3 +145,51 @@ Not currently a comprehensive list.
subtract.rst
tan.rst
tanh.rst
.. _more_about:
More about Core Ops
-------------------
An ``Op``'s primary role is to function as a node in a directed acyclic graph:
the dependency graph of a computation.
*Core ops* are ops that are available and generally useful to all framework
bridges and that can be compiled by all transformers. A framework bridge may
define framework-specific ops to simplify graph construction, provided that the
bridge can enable every transformer to replace all such ops with equivalent
clusters or subgraphs composed of core ops. Similarly, transformers may define
transformer-specific ops to represent kernels or other intermediate operations.
If a framework supports extending the set of ops it offers, a bridge may even
expose transformer-specific ops to the framework user.
.. figure:: ../graphics/tablengraphops.png
:width: 535px
:alt: Operations Available in the nGraph IR
Operations Available in the nGraph IR
.. important:: Our design philosophy is that the graph is not a script for
running kernels; rather, our compilation will match ``ops`` to appropriate
kernels for the backend(s) in use. Thus, we expect the addition of new Core
ops to be infrequent, and that most functionality will instead be added with
new functions that build sub-graphs from existing core ops.
It is easiest to define a new op by adapting an existing op. Some of the tasks
that must be performed are:
- Op constructor:
* Checking type-consistency of arguments
* Specifying the result type for a call
- Serializer/Deserializer
- Transformer handlers:
* Interpreter (reference) implementation of behavior. The
implementation should favor clarity over efficiency.
\ No newline at end of file
......@@ -10,8 +10,8 @@ optimizing an :abbr:`Artificial Neural Network (ANN)` (often abbreviated :term:`
to run graph-based computations for training, inference, testing, or validation.
Because today's NNs make use of many custom-purpose devices (FPGAs, GPUs, CPUs,
and custom silicon), having such a standard simplifies what would otherwise be
an enormously complex and difficult-to-scale pipeline (:ref:`Figure 3 <figure-3>`)
from "training with your favorite framework using GPUs" (:ref:`Figure 4 <figure-4>`),
an enormously complex and difficult-to-scale pipeline (:ref:`Figure D <figure-D>`)
from "training with your favorite framework using GPUs" (:ref:`Figure E <figure-E>`),
to deploying that (now) pre-trained model in a datacenter or production
environment, where infrastructure owners or software developers renting anything
in a datacenter ought to be mutually concerned with **efficiency per-watt**, to
......@@ -30,35 +30,35 @@ library unique to that vendor's hardware. For example, after integration, a
kernel library can run operations that it is "familiar" with optimally; however,
the graph itself within any larger :term:`NN` won't be optimal.
.. _figure-0:
.. _figure-A:
.. figure:: ../graphics/framework-to-kernel-lib.png
:width: 555px
:alt:
Figure 0: Lack of graph-level optimization makes framework-to-kernel library
Figure A: Lack of graph-level optimization makes framework-to-kernel library
integration enormously inefficient. The computation graph above represents
the computation: "A plus B times C".
.. _figure-1:
.. _figure-B:
.. figure:: ../graphics/framework-to-graph-opt.png
:width: 555px
:alt:
Figure 1: Notice that an operation on the constant B (in this case a ``Broadcast``)
Figure B: Notice that an operation on the constant B (in this case a ``Broadcast``)
can be done at compile time. This is an example of constant folding, and it
is not available to a device-based kernel library.
.. _figure-2:
.. _figure-C:
.. figure:: ../graphics/ngraph-algebraic-simp.png
:width: 555px
:alt:
Figure 2: Finally notice that the constant has value "zero" thus the add is an
Figure C: Finally, notice that the constant has the value "zero"; thus the add is an
*identity* operation and can be eliminated. This is an example of **Algebraic
simplification**, and it is not available to a device-based kernel library.
......@@ -78,7 +78,7 @@ A typical network is constructed using some kind of language-based API, which
translates the network or :abbr:`DL (Deep Learning)` model (statically or
dynamically) into serialized graphs. Those graphs can then be passed through a
compilation process (the *Graph optimization or compilation* step in
*Figure 3* below), where various graph-level optimizations, like constant folding
*Figure D* below), where various graph-level optimizations, like constant folding
or fusion can happen. These processes require unique vendor-provided libraries
to communicate with a driver (possibly through OpenCL\*, CUDA\*, or SYCL\*), to
compile and execute an implementation (kernel) for a specific
......@@ -89,25 +89,25 @@ each component. Note that optimizing for any one on its own usually requires
engineering expertise that can be highly specialized to that component, and that
the terms have been simplified for illustrative purposes.
.. _figure-3:
.. _figure-D:
.. figure:: ../graphics/components-dl-stack.png
:width: 700px
:alt: A simplified DL stack
Figure 3: Components of a DL stack, simplified for illustrative purposes.
Figure D: Components of a DL stack, simplified for illustrative purposes.
There are many deep learning frameworks, each with its own strengths and user
bases. A setup that is common to many DL practitioners is shown in the
illustration below.
.. _figure-4:
.. _figure-E:
.. figure:: ../graphics/a-common-stack.png
:width: 700px
:alt: A common implementation
Figure 4: A commonly-implemented stack uses TensorFlow\* as the frontend.
Figure E: A commonly-implemented stack uses TensorFlow\* as the frontend.
The input is either optimized via Grappler, or executed directly via TensorFlow.
In either case, when targeting an Nvidia\* GPU, cuDNN is called to select an
optimal kernel for the operation; cuDNN then relies on CUDA\* or direct access
......@@ -121,13 +121,13 @@ memory layout, its feature set, etc. Each of these connections, then, represents
significant work for what will ultimately be a brittle setup that is enormously
expensive to maintain.
.. _figure-5:
.. _figure-F:
.. figure:: ../graphics/dl-current-state.png
:width: 700px
:alt: Scalability matters
Figure 5: The number of kernels necessary to achieve optimal performance is
Figure F: The number of kernels necessary to achieve optimal performance is
bounded by the product of the number of chip designs one wishes to support,
the number of data types supported, the number of operations, and the
cardinality of each parameter for each operation.
......@@ -148,22 +148,32 @@ hardware coverage and optimization automatically. Any hardware that supports
LLVM, OpenCL, OpenGL, CUDA or Metal can be supported automatically with PlaidML
and nGraph.
.. _figure-6:
.. _figure-G:
.. figure:: ../graphics/graph-compilers-at-a-glance.png
:width: 700px
:alt: Overview of various graph and tensor compilers.
Figure 6: Overview of various graph and tensor compilers.
Figure G: Overview of various graph and tensor compilers.
.. _figure-7:
.. _figure-H:
.. figure:: ../graphics/tensor-compilers-at-a-glance.png
:width: 700px
:alt: A closer look at tensor compilers.
Figure 7: A closer look at tensor compilers.
Figure H: A closer look at tensor compilers.
Other notable efforts
----------------------
A few other notable efforts in compiler projects include:
* **TVM** https://github.com/dmlc/tvm
* **XLA** https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
* **Glow** https://arxiv.org/pdf/1805.00907.pdf
......
......@@ -29,7 +29,7 @@ the following categories:
In our tests, the optimized workloads can perform up to 45X faster than native
frameworks, and we expect performance gains for other workloads due to our
powerful :doc:`../fusion/index` feature.
powerful :doc:`../core/fusion/index` feature.
See also our recent `API changes`_
......
.. tutorials/index:
##########
Tutorials
##########
Coming soon
.. toctree::
:maxdepth: 1