Commit b1a22df1 authored by L.S. Cook's avatar L.S. Cook Committed by Scott Cyphers

Architecture and feature docs2 (#2038)

* editing docs

* more doc updates

* Cleanup theme, update backends for PlaidML, remove stale font

* Add PlaidML description and doc update that should have been added with PR 1888

* Latest release doc updates

* Add PlaidML description and doc update for PR 1888
* Update glossary with tensor description and quantization def
* Refactor landing page with QuickStart guides
* Add better details about nGraph features and roadmap

* Placeholder detail for comparison section

* Add section link

* order sections alphabetically for now

* update compiler illustration

* Address feedback from doc review

* Update illustration wording

* Formatting and final edits

* keep tables consistent

* Clarify doc on bridge and compiler docs

* yay for more feedback and improvements

* edit with built doc

* Fix typo

* Another phase of PR review editing

* Final review comment resolved

* Update README with different sort of navigation options

* Remove unnecessary repetition

* Add links to announcement blogs for previous contributions to ONNX and PyTorch projects

* Better link

* Add syllabus for perf criterion

* Editing and readability on README and add page for performance-validated workloads

* Update README

* Update illustrations with detail pertinent to br

* Documenting diagram and updating about features faqs doc

* Latest Beta doc updates

* clarify wording on arch doc

* Update deprecated INSTALL.md file and CONTRIB.md

* Legacy framework support for neon removed; instead show how nGraph enables custom or customizable frameworks

* nGraph Compiler stack beta

* Add markdown version of some docs as requested

* update full ng stack diagram

* Make sure diagrams work after moving them to tld

* Update ABOUT tld info doc with PR review feedback
parent 6eefbce4
About nGraph Compiler stack
===========================
nGraph Compiler stack architecture
----------------------------------
The diagram below represents our current Beta release stack. Please note
that the stack diagram is simplified to show how nGraph executes deep
learning workloads with two hardware backends; however, many other
deep learning frameworks and backends are currently functional.
![Simplified stack diagram for nGraph Compiler and components Beta](doc/sphinx/source/graphics/stackngrknl.png)
Starting from the top of the diagram, we present a simplified view of how
the nGraph Intermediate Representation (IR) receives a graph from a
framework such as TensorFlow\* or MXNet\* when there is a corresponding
"Bridge" or import method, such as from NNVM or via
[ONNX](http://onnx.ai). Once the graph is expressed in nGraph's
Core ops, components lower in the stack can begin parsing and
pattern-matching subgraphs for device-specific optimizations; these
are then encapsulated. This encapsulation is represented on the diagram
as the colored background between the `ngraph` kernel(s) and the
stack above.
Note that everything at or below the "Kernel APIs" and "Subgraph
APIs" gets executed "automatically" during training runs. In other
words, the accelerations are automatic: parts of the graph that
are not encapsulated default to the framework implementation when
executed. For example, if nGraph optimizes ResNet50 for TensorFlow,
the same optimization can be readily applied to the NNVM/MXNet
implementation of ResNet50. This works efficiently because the
nGraph Intermediate Representation (IR), which keeps the input
and output semantics of encapsulated subgraphs, rebuilds an
encapsulated subgraph that can efficiently use or re-use
operations. Such an approach significantly cuts down on the
time needed to compile; when we're not relying upon the framework's
ops alone, memory management and data layouts can be more efficiently
applied to the hardware backends in use.
The nGraph Core uses a strongly-typed and platform-neutral
Intermediate Representation (IR) to construct a "stateless" graph.
Each node, or `op`, in the graph corresponds to one step in
a computation, where each step produces zero or more tensor
outputs from zero or more tensor inputs.
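For illustration, here is a minimal sketch of constructing such a graph
with nGraph's C++ API. The names shown (`op::Parameter`, `op::Add`,
`op::ParameterVector`) reflect the Beta-era API and may differ in later
releases.

```
#include <ngraph/ngraph.hpp>

using namespace ngraph;

int main()
{
    // Each Parameter is an op with zero tensor inputs and one tensor output.
    auto a = std::make_shared<op::Parameter>(element::f32, Shape{2, 3});
    auto b = std::make_shared<op::Parameter>(element::f32, Shape{2, 3});

    // Add is one step in the computation: two tensor inputs, one output.
    auto sum = std::make_shared<op::Add>(a, b);

    // A Function is the stateless graph: results plus the parameters
    // they depend on.
    auto f = std::make_shared<Function>(NodeVector{sum},
                                        op::ParameterVector{a, b});
    return 0;
}
```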
After construction, our Hybrid transformer takes the IR, further
partitions it into subgraphs, and assigns them to the best-performing
backend. There are two hardware backends shown in the stack diagram
to demonstrate nGraph's graph partitioning. The Hybrid transformer
assigns complex operations (subgraphs) to the Intel® Nervana™ Neural
Network Processor (NNP), or to a different CPU backend to expedite
the computation, and the remaining operations default to CPU. In the
future, we will further expand the capabilities of Hybrid transformer
by enabling more features, such as localized cost modeling and memory
sharing, when the next generation of NNP (Neural Network Processor)
is released. In the meantime, your deep learning software engineering
or modeling can be confidently built upon this stable anchorage.
The Intel® Architecture (IA) transformer provides two modes that
reduce compilation time and have already proven useful for training,
deploying, and retraining a deep learning workload in production.
For example, in our tests, DEX mode reduced ResNet50 compilation
time by 30X.
We are excited to continue our work in enabling distributed training,
and we plan to expand support to 256 nodes in Q4 ‘18. Additionally, we
are testing model parallelism in addition to data parallelism.
In this Beta release, nGraph via bridge code supports only Just-In-Time
(JiT) compilation; the ONNX importer is likewise limited to what nGraph
itself can support. While nGraph currently has very limited support
for dynamic graphs, it is possible to get dynamic graphs
working. Future releases will add better support and use-case
examples for features such as Ahead-of-Time compilation.
Features
--------
The nGraph Intermediate Representation (IR) contains a combination
of device-specific and non-device-specific optimizations:
- **Fusion** -- Fuse multiple ops to decrease memory usage.
- **Data layout abstraction** -- Make abstraction easier and faster
with nGraph translating element order to work best for a given or
available device.
- **Data reuse** -- Save results and reuse for subgraphs with the
same input.
- **Graph scheduling** -- Run similar subgraphs in parallel via
multi-threading.
- **Graph partitioning** -- Partition subgraphs to run on different
devices to speed up computation; make better use of spare CPU cycles
with nGraph.
- **Memory management** -- Reduce peak memory usage by intercepting a
graph with a "saved checkpoint," which also enables data auditing.
Current nGraph Compiler full stack
----------------------------------
![nGraph Compiler full stack diagram](doc/sphinx/source/graphics/full-ngstck.png)
In addition to IA and NNP transformers, the nGraph Compiler stack has
transformers for multiple GPU types and an upcoming Intel deep learning
accelerator. To support the growing number of transformers, we plan to expand
the capabilities
of the hybrid transformer with a cost model and memory sharing. With these new
features, even if nGraph has multiple backends targeting the same hardware, it
will partition the graph into multiple subgraphs and determine the best way to
execute each subgraph.
Contributor Guidelines
======================
https://ngraph.nervanasys.com/docs/latest/project/code-contributor-README.html
FAQs
----
### Why nGraph?
We developed nGraph to simplify the realization of optimized deep learning
performance across frameworks and hardware platforms. The value we're offering
to the developer community is empowerment: we are confident that Intel®
Architecture already provides the best computational resources available
for the breadth of ML/DL tasks.
### How do I connect a framework?
The nGraph Library manages framework bridges for some of the more widely-known
frameworks. A bridge acts as an intermediary between the nGraph core and the
framework, and the result is a function that can be compiled from a framework.
A fully-compiled function that makes use of bridge code thus becomes a
"function graph", or what we sometimes call an **nGraph graph**.
Low-level nGraph APIs are not accessible *dynamically* via bridge code; this
is the nature of stateless graphs. However, do note that a graph with a
"saved" checkpoint can be "continued" to run from a previously-applied
checkpoint, or it can be loaded as a static graph for further inspection.
For a more detailed dive into how custom bridge code can be implemented, see our
documentation on [Working with other frameworks]. To learn how TensorFlow and MXNet
currently make use of custom bridge code, see [Integrate supported frameworks].
![JiT compiling for computation](doc/sphinx/source/graphics/bridge-to-graph-compiler.png)
Although we only directly support a few frameworks at this time, we provide
documentation to help developers and engineers create custom solutions.
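To make the compile-and-run flow concrete, here is a hedged sketch of
executing a compiled function graph on a backend. It assumes a `Function`
`f` built as in the earlier sketch; the calls shown
(`runtime::Backend::create`, `create_tensor`, `call_with_validate`)
reflect the Beta-era runtime API and may differ in later releases.

```
#include <ngraph/ngraph.hpp>

#include <memory>
#include <vector>

using namespace ngraph;

// Run a function graph of two {2,3} float inputs on the CPU backend.
void run(std::shared_ptr<Function> f)
{
    auto backend = runtime::Backend::create("CPU");

    // Allocate backend tensors for the two inputs and one output.
    Shape shape{2, 3};
    auto ta = backend->create_tensor(element::f32, shape);
    auto tb = backend->create_tensor(element::f32, shape);
    auto tr = backend->create_tensor(element::f32, shape);

    // Copy host data into the input tensors.
    std::vector<float> va(6, 1.0f), vb(6, 2.0f), vr(6);
    ta->write(va.data(), 0, va.size() * sizeof(float));
    tb->write(vb.data(), 0, vb.size() * sizeof(float));

    // Compile and execute the graph, then read the result back to the host.
    backend->call_with_validate(f, {tr}, {ta, tb});
    tr->read(vr.data(), 0, vr.size() * sizeof(float));
}
```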
### How do I run an inference model?
Framework bridge code is *not* the only way to connect a model (function
graph) to nGraph's core ops. We've also built an importer for models that
have been exported from a framework and saved as a serialized file, such
as ONNX. To learn how to convert such serialized files to an nGraph
model, please see the "How to" documentation.
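As one concrete example, a function graph saved in nGraph's own JSON
serialization format can be reloaded directly. This is a minimal sketch
assuming the Beta-era `serializer.hpp` API; `"model.json"` is a
placeholder path, and ONNX files instead go through the separate
ngraph-onnx importer covered in the "How to" documentation.

```
#include <fstream>
#include <memory>

#include <ngraph/ngraph.hpp>
#include <ngraph/serializer.hpp>

// Reload a function graph previously saved in nGraph's JSON format.
std::shared_ptr<ngraph::Function> load_model()
{
    std::ifstream in("model.json"); // placeholder path
    return ngraph::deserialize(in);
}
```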
### What's next?
The Gold release is targeted for April 2019; it will feature broader workload
coverage, including support for quantized graphs, and more detail on our
advanced support for ``int8``. We developed nGraph to simplify the realization
of optimized deep learning performance across frameworks and hardware platforms.
You can read more about design decisions and what is tentatively in the pipeline
for development in our [arXiv paper](https://arxiv.org/pdf/1801.08058.pdf) from
the 2018 SysML conference.
[Working with other frameworks]: http://ngraph.nervanasys.com/docs/latest/frameworks/index.html
[Integrate supported frameworks]: http://ngraph.nervanasys.com/docs/latest/framework-integration-guides.html
Tested platforms currently known to work:

- Ubuntu 16.04 and 18.04
- CentOS 7.4
Ubuntu 16.04 Prerequisites
==========================
Our latest instructions for how to build the library are available
[in the documentation](https://ngraph.nervanasys.com/docs/latest/buildlb.html).
Compilers currently known to work are gcc-5.4.0, clang-3.9, and gcc-4.8.5.
Use `cmake -LH` after cloning the repo to see the currently-supported
build options. If you are using gcc-5.4.0 or clang-3.9, we recommend
adding the option `-DNGRAPH_USE_PREBUILT_LLVM=TRUE` to the `cmake`
command. This causes the build system to fetch a pre-built tarball of
LLVM+Clang from `llvm.org`, which substantially cuts down on build times.
At the least, we recommend something like:

```
$ cmake ../ -DCMAKE_INSTALL_PREFIX=~/ngraph_dist -DNGRAPH_USE_PREBUILT_LLVM=TRUE \
      -DNGRAPH_ONNX_IMPORT_ENABLE=ON
```
If you are using gcc-4.8, it may be necessary to add symlinks from `gcc` to
`gcc-4.8`, and from `g++` to `g++-4.8`, in your PATH, even if you
specify `CMAKE_C_COMPILER` and `CMAKE_CXX_COMPILER` when building. (You should
NOT supply the `-DNGRAPH_USE_PREBUILT_LLVM` flag in this case, because the
prebuilt tarball supplied on llvm.org is not compatible with a gcc-4.8
based build.)
CentOS 7.4 Prerequisites
========================
CentOS supplies an older version of CMake that is not compatible with
LLVM-5.0.1, which we build as an external dependency. There are two options:
1. (requires root privileges) install the `cmake3` package from EPEL, or
2. (does not require root privileges) build CMake (3.1 or newer) from source,
and run it from its build directory.
General Instructions
====================
These instructions assume that your system has been prepared in accordance
with the above prerequisites.
```
$ cd ngraph
$ mkdir build
$ cd build
$ cmake .. \
-DCMAKE_C_COMPILER=<path to C compiler> \
-DCMAKE_CXX_COMPILER=<path to C++ compiler>
$ make -j install
```
@@ -35,7 +35,7 @@ repeated use of the trademark / branding symbols.
* Intel® Xeon® (CPU processor)
* Intel® Architecture
* Intel® Architecture (IA)
* Intel® nGraph™
@@ -6,7 +6,6 @@ Integrate Supported Frameworks
* :ref:`mxnet_intg`
* :ref:`tensorflow_intg`
* :ref:`neon_intg`
A framework is "supported" when there is a framework :term:`bridge` that can be
cloned from one of our GitHub repos and built to connect to nGraph device backends,
@@ -46,80 +45,9 @@ See the `ngraph tensorflow bridge README`_ for how to install the `DSO`_ for the
nGraph-TensorFlow bridge.
.. _neon_intg:
neon |trade|
============
Use ``neon`` as a frontend for nGraph backends
-----------------------------------------------
``neon`` is an open source Deep Learning framework that has a history
of `being the fastest`_ framework `for training CNN-based models with GPUs`_.
Detailed info about neon's features and functionality can be found in the
`neon docs`_. This section covers installing neon on an existing
system that already has an ``ngraph_dist`` installed.
.. important:: As of version |version|, these instructions presume that your
system already has the library installed to the default location, as outlined
in our :doc:`buildlb` documentation.
#. Set the ``NGRAPH_CPP_BUILD_PATH`` and the ``LD_LIBRARY_PATH``. You can use
the ``env`` command to see if these paths have been set already and if they
have not, they can be set with something like:
.. code-block:: bash
export NGRAPH_CPP_BUILD_PATH=$HOME/ngraph_dist/
export LD_LIBRARY_PATH=$HOME/ngraph_dist/lib/
#. The neon framework uses the :command:`pip` package manager during installation;
install it with Python version 3.5 or higher:
.. code-block:: console
$ sudo apt-get install python3-pip python3-venv
$ python3 -m venv neon_venv
$ cd neon_venv
$ . bin/activate
(neon_venv) ~/frameworks$
#. Go to the "python" subdirectory of the ``ngraph`` repo we cloned during the
previous :doc:`buildlb`, and complete these actions:
.. code-block:: console
(neon_venv)$ cd /opt/libraries/ngraph/python
(neon_venv)$ git clone --recursive -b allow-nonconstructible-holders https://github.com/jagerman/pybind11.git
(neon_venv)$ export PYBIND_HEADERS_PATH=/opt/libraries/ngraph/python/pybind11
(neon_venv)$ pip install -U .
#. Finally, we're ready to install the ``neon`` integration:
.. code-block:: console
(neon_venv)$ git clone git@github.com:NervanaSystems/ngraph-neon
(neon_venv)$ cd ngraph-neon
(neon_venv)$ make install
#. To test a training example, you can run the following from ``ngraph-neon/examples/cifar10``:
.. code-block:: console
(neon_venv)$ python cifar10_conv.py
#. (Optional) For experimental or alternative approaches to distributed training
methodologies, including data parallel training, see the :doc:`distr/index`
and :doc:`How to <howto/index>` articles on :doc:`howto/distribute-train`.
.. _nGraph-MXNet: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md
.. _MXNet: http://mxnet.incubator.apache.org
.. _DSO: http://csweb.cs.wfu.edu/%7Etorgerse/Kokua/More_SGI/007-2360-010/sgi_html/ch03.html
.. _ngraph-neon python README: https://github.com/NervanaSystems/ngraph/blob/master/python/README.md
.. _ngraph neon repo's README: https://github.com/NervanaSystems/ngraph-neon/blob/master/README.md
.. _neon docs: https://github.com/NervanaSystems/neon/tree/master/doc
.. _being the fastest: https://github.com/soumith/convnet-benchmarks
.. _for training CNN-based models with GPUs: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack
.. _ngraph tensorflow bridge README: https://github.com/NervanaSystems/ngraph-tf
@@ -5,7 +5,7 @@ Working with other frameworks
##############################
An engineer may want to work with a deep learning framework that does not yet
have bridge code written. For non-supported or "generic" frameworks, it is
expected that engineers will use the nGraph library to create custom bridge code,
and/or to design and document a user interface (UI) with specific runtime
options for whatever custom use case they need.
.. framework/index:
############################
Integrate Other Frameworks
############################
In this section, written for framework architects or engineers who want
to optimize brand new, generic, or less widely-supported frameworks, we provide
@@ -20,6 +19,7 @@ work and custom bridge code, such as that for our `ngraph tensorflow bridge`_.
:maxdepth: 1
generic.rst
validation-testing.rst
.. frameworks/validation-testing:
Validation and testing
######################
* **Validating** -- To provide optimizations with nGraph, we first
confirm that a given workload is "validated" as being functional;
that is, we can successfully load its serialized graph as an nGraph
:term:`function graph`. Below is a list of 14 workloads we have tested
successfully.
.. csv-table::
:header: "Workload", "Validated"
:widths: 27, 53
:escape: ~
DenseNet-121, Functional
Inception-v1, Functional
Inception-v2, Functional
ResNet-50, Functional
Shufflenet, Functional
SqueezeNet, Functional
VGG-19, Functional
ZFNet-512, Functional
MNIST, Functional
Emotion-FERPlus, Functional
BVLC AlexNet, Functional
BVLC GoogleNet, Functional
BVLC CaffeNet, Functional
BVLC R-CNN ILSVRC13, Functional
* **Testing & Performance Optimizations** for workloads that have been
"validated" with nGraph are also available via the nGraph
:abbr:`Intermediate Representation (IR)`. For example, a common use
case for data scientists is to train a new model with a large dataset,
and so nGraph already has several accelerations available "out of the
box" for the workloads noted below.
TensorFlow
==========
.. csv-table::
:header: "TensorFlow Workloads", "Performance"
:widths: 27, 53
:escape: ~
Resnet50 v1 and v2, 50% of P40
Inception V3 and V4, 50% of P40
Inception-ResNetv2, 50% of P40
MobileNet v1, 50% of P40
SqueezeNet v1.1, 50% of P40
SSD-VGG16, 50% of P40
R-FCN, 50% of P40
Faster RCNN, 50% of P40
Yolo v2, 50% of P40
GNMT, Greater than or equal to :abbr:`Direct Optimization (DO)`
Transformer-LT, 50% of P40
Wide & Deep, 50% of P40
WaveNet, Functional
U-Net, Greater than DO
DRAW, 50% of P40
A3C, 50% of P40
MXNet
=====
.. csv-table::
:header: "MXNet Workloads", "Performance"
:widths: 27, 53
:escape: ~
Resnet50 v1 and v2, 50% of P40
DenseNet (121 161 169 201), 50% of P40
InceptionV3, 50% of P40
InceptionV4, 50% of P40
Inception-ResNetv2, 50% of P40
MobileNet v1, 50% of P40
SqueezeNet v1 and v1.1, 50% of P40
VGG16, Functional (No DO available)
Faster RCNN, 50% of P40
SSD-VGG16, 50% of P40
GNMT, Greater than or equal to :abbr:`Direct Optimization (DO)`
Transformer-LT, 50% of P40
Wide & Deep, 50% of P40
WaveNet, Functional
DeepSpeech2, 50% of P40
DCGAN, 50% of P40
A3C, Greater than or equal to DO
.. about:
Architecture, Features, FAQs
############################
* :ref:`architecture`
* :ref:`features`
* :ref:`faq`
* :ref:`whats_next`
.. _architecture:
nGraph Compiler stack architecture
==================================
The diagram below represents our current Beta release stack. Please note that
the stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently functional.
.. figure:: ../graphics/stackngrknl.png
:width: 771px
:alt: Current Beta release stack
Simplified stack diagram for nGraph Compiler and components Beta
Starting from the top of the diagram, we present a simplified view of how
the nGraph :abbr:`Intermediate Representation (IR)` can receive a graph from a
framework such as TensorFlow\* or MXNet\* when there is a corresponding
"Bridge" or import method, such as from NNVM or via `ONNX`_. Once the graph
is expressed in terms of the nGraph :doc:`../ops/index`, components lower in
the stack can parse and pattern-match subgraphs for device-specific
optimizations; these are then encapsulated. This encapsulation is represented
on the diagram as the colored background between the ``ngraph`` kernel(s) and
the stack above.
Note that everything at or below the "Kernel APIs" and "Subgraph APIs" gets
executed "automatically" during training runs. In other words, the accelerations
are automatic: parts of the graph that are not encapsulated default to the
framework implementation when executed. For example, if nGraph optimizes
ResNet50 for TensorFlow, the same optimization can be readily applied to the
NNVM/MXNet implementation of ResNet50. This works efficiently because the
nGraph :abbr:`Intermediate Representation (IR)`, which keeps the input and
output semantics of encapsulated subgraphs, rebuilds an encapsulated subgraph
that can efficiently use or re-use operations. Such an approach significantly
cuts down on the time needed to compile; when we're not relying upon the
framework's ops alone, memory management and data layouts can be more
efficiently applied to the hardware backends in use.
The :doc:`nGraph Core <../ops/index>` uses a strongly-typed and platform-neutral
:abbr:`Intermediate Representation (IR)` to construct a "stateless" graph. Each
node, or ``op``, in the graph corresponds to one :term:`step` in a computation,
where each step produces zero or more tensor outputs from zero or more tensor
inputs.
After construction, our Hybrid transformer takes the IR, further partitions it
into subgraphs, and assigns them to the best-performing backend. There are two
hardware backends shown in the stack diagram to demonstrate nGraph's graph
partitioning. The Hybrid transformer assigns complex operations (subgraphs) to
the Intel® Nervana™ :abbr:`Neural Network Processor (NNP)`, or to a different
CPU backend to expedite the computation, and the remaining operations default
to CPU. In the future, we will further expand the capabilities of Hybrid
transformer by enabling more features, such as localized cost modeling and
memory sharing, when the next generation of :abbr:`NNP (Neural Network Processor)`
is released. In the meantime, your deep learning software engineering or modeling
can be confidently built upon this stable anchorage.
The :abbr:`IA (Intel® Architecture)` transformer provides two modes that
reduce compilation time and have already proven useful for training,
deploying, and retraining a deep learning workload in production. For
example, in our tests, DEX mode reduced ResNet50 compilation time by 30X.
We are excited to continue our work in enabling distributed training, and we
plan to expand support to 256 nodes in Q4 ‘18. Additionally, we are testing
model parallelism in addition to data parallelism.
.. note:: In this Beta release, nGraph via Bridge code supports only :abbr:`Just In Time (JiT)`
compilation; the nGraph ONNX companion tool supports dynamic graphs and will
add additional support for Ahead of Time compilation in the official release.
.. _features:
Features
========
The nGraph :abbr:`Intermediate Representation (IR)` contains a combination of
device-specific and non-device-specific optimizations and compilations to
enable:

* **Fusion** -- Fuse multiple ``ops`` to decrease memory usage.
* **Data layout abstraction** -- Make abstraction easier and faster with nGraph
translating element order to work best for a given or available device.
* **Data reuse** -- Save results and reuse for subgraphs with the same input.
* **Graph scheduling** -- Run similar subgraphs in parallel via multi-threading.
* **Graph partitioning** -- Partition subgraphs to run on different devices to
speed up computation; make better use of spare CPU cycles with nGraph.
* **Memory management** -- Reduce peak memory usage by intercepting a graph
with a "saved checkpoint," which also enables data auditing.
* :abbr:`Direct EXecution mode (DEX)` or **DEX** -- Execute kernels for the
op directly instead of using codegen when traversing the computation graph.
.. important:: See :doc:`../ops/index` to learn the nGraph means for graph
   computations.
.. Our design philosophy is that the graph is not a script for running kernels;
rather, our compilation will match ``ops`` to appropriate available kernels
@@ -37,8 +109,6 @@ enable:
new Core ops should be infrequent and that most functionality instead gets
added with new functions that build sub-graphs from existing core ops.
.. _portable:
@@ -112,15 +182,17 @@ into their frameworks.
.. figure:: ../graphics/develop-without-lockin.png
.. _faq:
FAQs
====
Why nGraph?
-----------
The value we're offering to the developer community is empowerment: we are
confident that Intel® Architecture already provides the best computational
resources available for the breadth of ML/DL tasks.
How does it work?
------------------
@@ -197,7 +269,8 @@ our `arXiv paper`_ from the 2018 SysML conference.
.. _arXiv paper: https://arxiv.org/pdf/1801.08058.pdf
.. _ONNX: http://onnx.ai
.. _NNVM: http://
.. _nGraph ONNX companion tool: https://github.com/NervanaSystems/ngraph-onnx
.. _Intel® MKL-DNN: https://github.com/intel/mkl-dnn
.. _Movidius: https://developer.movidius.com/