Commit fd0ed37c authored by Leona C's avatar Leona C Committed by Sang Ik Lee

Reorganize doc folders for core-related doc on fusion, graph rewrite, and compiler passes (#2466)

* Cleaner API doc reference for compile call

* Add a useful table for nGraph namespaces

* Remove layout namespace

* Show exploding kernel problem on illustration like IEEE preso

* WIP branch for new documentation restructuring that is a huge pain

* Fix the doc reorg mess

* Fix underline

* List of passes disclaimer note

* Update disclaimers on README

* More cleanup of doc reorg

* Update core docs

* Update overview on core

* Add PR feedback

* Get rid of all the gazillion of doc build errors from rearranging stuff

* Add section on tutorials

* Update branch

* Cleanup intro

* Add better detail to overview
parent 12b5f085
......@@ -10,7 +10,6 @@
## Quick start
To begin using nGraph with popular frameworks to accelerate deep learning
workloads on CPU for inference, please refer to the links below.
......@@ -20,11 +19,21 @@ workloads on CPU for inference, please refer to the links below.
| MXNet* 1.3 | [Pip install](https://github.com/NervanaSystems/ngraph-mxnet#Installation) or [Build from source](https://github.com/NervanaSystems/ngraph-mxnet#building-with-ngraph-support)| 18 [Validated workloads]
| ONNX 1.3 | [Pip install](https://github.com/NervanaSystems/ngraph-onnx#installation) | 14 [Validated workloads]
:exclamation: :exclamation: :exclamation: Note that the ``pip`` package option
works only with Ubuntu 16.04 or greater and Intel® Xeon® CPUs. CPUs without
Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will not run these
packages; the alternative is to build from source. Wider support for other
CPUs will be offered starting in early 2019 :exclamation: :exclamation: :exclamation:
#### Python wheels for nGraph
The Python wheels for nGraph have been tested and are supported on the following
64-bit systems:
* Ubuntu 16.04 or later
* CentOS 7.6
* Debian 10
* macOS 10.14.3 (Mojave)
:exclamation: Note that the ``pip`` package option works only with Intel® Xeon® CPUs.
CPUs without Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will not run
these packages; the alternative is to build from source. Wider support for other
CPUs will be offered in later releases.
Frameworks using the nGraph Compiler stack to execute workloads have shown
[**up to 45X**](https://ai.intel.com/ngraph-compiler-stack-beta-release/)
......
.. backend-support/index.rst
Transformers, PlaidML
###############################
About backends
##############
* :ref:`what_is_backend`
* :ref:`hybrid_transformer`
* :ref:`cpu_backend`
* :ref:`plaidml_backend`
* :ref:`gpu_backend`
What is a backend?
------------------
.. _what_is_backend:
What's a backend?
-----------------
Backends are responsible for function execution and value allocation. They
can be used to :doc:`carry out a programmed computation<../howto/execute>`
from a framework by using a CPU or GPU; or they can be used with an *Interpreter*
mode, which is primarily intended for testing, to analyze a program, or for a
framework developer to develop customizations. Experimental APIs to support
In the nGraph Compiler stack, what we call a *backend* is responsible for
function execution and value allocation. A backend can be used to
:doc:`carry out a programmed computation<../core/constructing-graphs/execute>`
from a framework on a CPU or GPU; or it can be used with an *Interpreter* mode,
which is primarily intended for testing, to analyze a program, or to help a
framework developer customize targeted solutions. Experimental APIs to support
current and future nGraph Backends are also available; see, for example, the
section on :ref:`plaidml_backend`.
section on the :ref:`plaidml_backend`.
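As a condensed sketch of how this looks in practice (the complete, working
example lives in :doc:`../core/constructing-graphs/execute`; the compile/call
signatures shown here are assumptions to check against the current C++ API
reference):

.. code-block:: cpp

   #include <ngraph/ngraph.hpp>

   using namespace ngraph;

   // Build a trivial function: f(a, b) = a + b
   auto a = std::make_shared<op::Parameter>(element::f32, Shape{2, 2});
   auto b = std::make_shared<op::Parameter>(element::f32, Shape{2, 2});
   auto f = std::make_shared<Function>(a + b, ParameterVector{a, b});

   // Select a backend by name; "INTERPRETER" or "GPU" could be used instead
   auto backend = runtime::Backend::create("CPU");

   // Backend-managed storage for the inputs and the result
   auto t_a = backend->create_tensor(element::f32, Shape{2, 2});
   auto t_b = backend->create_tensor(element::f32, Shape{2, 2});
   auto t_r = backend->create_tensor(element::f32, Shape{2, 2});
   // (input data would be copied into t_a and t_b via their write() method)

   // Compile the function, then execute it on the chosen backend
   auto exec = backend->compile(f);
   exec->call_with_validate({t_r}, {t_a, t_b});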
.. _hybrid_transformer:
......@@ -28,7 +31,7 @@ section on :ref:`plaidml_backend`.
Hybrid Transformer
==================
Coming soon
More detail coming soon
.. _cpu_backend:
......@@ -36,7 +39,7 @@ Coming soon
CPU Backend
===========
Coming soon
More detail coming soon
.. _gpu_backend:
......@@ -44,7 +47,7 @@ Coming soon
GPU Backend
===========
Coming soon
More detail coming soon
.. _plaidml_backend:
......
......@@ -61,7 +61,7 @@ The process documented here will work on Ubuntu\* 16.04 (LTS) or on Ubuntu
give ownership of that directory to your user. Creating such a placeholder
can be useful if you'd like to have a local reference for APIs and
documentation, or if you are a developer who wants to experiment with
how to :doc:`../howto/execute` using resources available through the
how to :doc:`core/constructing-graphs/execute` using resources available through the
code base.
.. code-block:: console
......@@ -131,7 +131,7 @@ The process documented here will work on CentOS 7.4.
give ownership of that directory to your user. Creating such a placeholder
can be useful if you'd like to have a local reference for APIs and
documentation, or if you are a developer who wants to experiment with
how to :doc:`../howto/execute` using resources available through the
how to :doc:`core/constructing-graphs/execute` using resources available through the
code base.
.. code-block:: console
......@@ -231,7 +231,7 @@ can help you get started with training a model on a supported framework.
For the latter case, if you've followed a tutorial from `ONNX`_, and you have an
exported, serialized model, you can skip the section on frameworks and go directly
to our :doc:`../howto/import` documentation.
to our :doc:`core/constructing-graphs/import` documentation.
Please keep in mind that both of these are under continuous development, and will
be updated frequently in the coming months. Stay tuned!
......
......@@ -80,20 +80,20 @@ We begin by building the graph, starting with the input parameter
``X``. We also define a fully-connected layer, including parameters for
weights and bias:
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 127-135
Repeat the process for the next layer,
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 138-146
and normalize everything with a ``softmax``.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 148-150
......@@ -107,7 +107,7 @@ We use cross-entropy to compute the loss. nGraph does not currently have a core
op for cross-entropy, so we implement it directly, adding clipping to prevent
underflow.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 154-166
......@@ -123,7 +123,7 @@ because of the way it is implemented in interpreted frameworks. In nGraph, we
augment the loss computation with computations for the weight adjustments. This
allows the calculations for the adjustments to be further optimized.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 169-172
......@@ -136,16 +136,16 @@ update computation for ``N`` will be given by the node
auto update = loss->backprop_node(N, delta);
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 177-181
The different update nodes will share intermediate computations. So to
get the updated values for the weights as computed with the specified
:doc:`backend <../backend-support/index>`:
:doc:`backend <../../backend-support/index>`:
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 182-215
......@@ -165,7 +165,7 @@ use the same nodes in different functions, nGraph currently does not
allow the same nodes to be compiled in different functions, so we
compile clones of the nodes.
.. literalinclude:: ../../../examples/mnist_mlp/mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/mnist_mlp.cpp
:language: cpp
:lines: 216-224
......@@ -8,9 +8,10 @@ Distribute training across multiple nGraph backends
however, the following configuration options have worked for nGraph devices
with mixed or limited success in testing.
In the :doc:`previous section <../howto/derive-for-training>`, we described the
steps needed to create a "trainable" nGraph model. Here we demonstrate how to
train a data parallel model by distributing the graph to more than one device.
In the :doc:`previous section <../constructing-graphs/derive-for-training>`,
we described the steps needed to create a "trainable" nGraph model. Here we
demonstrate how to train a data parallel model by distributing the graph to
more than one device.
Frameworks can implement distributed training with nGraph versions prior to
`0.13`:
......@@ -35,12 +36,12 @@ Finally, to run the training using two nGraph devices, invoke
$ mpirun
To deploy data-parallel training, the ``AllReduce`` op should be added after the
steps needed to complete the :doc:`backpropagation <../howto/derive-for-training>`;
steps needed to complete the :doc:`backpropagation <../constructing-graphs/derive-for-training>`;
the new code is highlighted below:
.. literalinclude:: ../../../examples/mnist_mlp/dist_mnist_mlp.cpp
.. literalinclude:: ../../../../examples/mnist_mlp/dist_mnist_mlp.cpp
:language: cpp
:lines: 180-196
:lines: 178-194
:emphasize-lines: 8-11
See the `full code`_ in the ``examples`` folder ``/doc/examples/mnist_mlp/dist_mnist_mlp.cpp``.
......
......@@ -60,7 +60,7 @@ Every node has zero or more *inputs*, zero or more *outputs*, and zero or more
*attributes*.
The specifics for each ``type`` permitted on a core ``Op``-specific basis can be
discovered in our :doc:`../ops/index` docs. For our purpose to
discovered in our :doc:`../../ops/index` docs. For our purpose to
:ref:`define a computation <define_cmp>`, nodes should be thought of as essentially
immutable; that is, when constructing a node, we need to supply all of its
inputs. We get this process started with ops that have no inputs, since any op
......@@ -71,7 +71,7 @@ They receive their values from outside of the graph, so they have no inputs.
They have attributes for the element type and the shape of the tensor that will
be passed to them.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 25-29
......@@ -81,7 +81,7 @@ shape ``(2, 3)`` and a row-major element layout.
To create a graph for ``(a + b) * c``, first make an ``op::Add`` node with inputs
from ``a`` and ``b``, and an ``op::Multiply`` node from the add node and ``c``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 31-32
......@@ -94,7 +94,7 @@ type and shape of its unique output.
Once the graph is built, we need to package it in a ``Function``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 35-36
......@@ -126,12 +126,12 @@ There are two backends for the CPU: the optimized ``"CPU"`` backend, which uses
the `Intel MKL-DNN`_, and the ``"INTERPRETER"`` backend, which runs reference
versions of kernels that favor implementation clarity over speed. The
``"INTERPRETER"`` backend can be slow, and is primarily intended for testing.
See the documentation on :doc:`runtime options for various backends <../backend-support/index>`
See the documentation on :doc:`runtime options for various backends <../../backend-support/index>`
for additional details.
To continue with our original example and select the ``"CPU_Backend"``:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 38-39
......@@ -168,14 +168,14 @@ Backends are responsible for managing storage. If the storage is off-CPU, caches
are used to minimize copying between device and CPU. We can allocate storage for
the three parameters and the return value.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 41-46
Each tensor is a shared pointer to a :term:`Tensorview`, which is the interface
backends implement for tensor use. When there are no more references to the
tensor view, it will be freed when convenient for the backend. See the
:doc:`../backend-support/cpp-api` documentation for details on how to work
:doc:`../../backend-support/cpp-api` documentation for details on how to work
with ``Tensor``.
......@@ -186,7 +186,7 @@ Initialize the inputs
Next we need to copy some data into the tensors.
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 48-55
......@@ -201,7 +201,7 @@ Invoke the computation
To invoke the function, we simply pass argument and resultant tensors to the
call frame:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 57-58
......@@ -213,7 +213,7 @@ Access the outputs
We can use the ``read`` method to access the result:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 60-77
......@@ -222,7 +222,7 @@ We can use the ``read`` method to access the result:
Put it all together
===================
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:linenos:
:caption: "The (a + b) * c example for executing a computation on nGraph"
......
......@@ -22,13 +22,12 @@ usually named ``<some_model>.onnx`` or ``<some_model>.onnx.pb``. These
`tutorials from ONNX`_ describe how to turn trained models into an
``.onnx`` export.
.. important:: If you landed on this page and you already have an ``.onnx``
or ``.onnx.pb`` formatted file, you should be able to run the inference
without needing to dig into anything from the "Frameworks" sections. You
will, however, need to have completed the steps outlined in
our :doc:`../buildlb` guide. If you intend to build nGraph for : doc:`distributed-training`,
you will need to build that has already been compiled with the additional
cmake flag: ``-DNGRAPH_DISTRIBUTED_ENABLE=TRUE``.
.. important:: If you landed on this page and you already have an ``.onnx`` or
an ``.onnx.pb`` formatted file, you should be able to run the inference without
needing to dig into anything from the "Frameworks" sections. You will, however,
need to have completed the steps outlined in our :doc:`../../buildlb` guide.
If you intend to build nGraph for distributed training, you will need to
follow the instructions in the :doc:`../../distr/index` documentation.
To demonstrate functionality, we'll use an already-serialized CIFAR10 model
trained via ResNet20. Remember that this model has already been trained and
......@@ -154,7 +153,7 @@ specify the relative path to the location of the ``.onnx`` file.
Enable ONNX and load an ONNX file from disk
--------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 17-19
......@@ -162,7 +161,7 @@ Enable ONNX and load an ONNX file from disk
Convert an ONNX model to an ngraph model
-------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 22-23
......@@ -189,7 +188,7 @@ input parameters for the computation which generates the output.
Using ngraph_api, create a callable computation object
-------------------------------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 27-29
......@@ -197,14 +196,14 @@ Using ngraph_api, create a callable computation object
Load or create an image
------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 32-33
Run ResNet inference on picture
---------------------------------
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 36-37
......@@ -212,7 +211,7 @@ Run ResNet inference on picture
Put it all together
===================
.. literalinclude:: ../../../examples/onnx/onnx_example.py
.. literalinclude:: ../../../../examples/onnx/onnx_example.py
:language: python
:lines: 17-37
:caption: "Demo sample code to run inference with nGraph"
......
......@@ -12,10 +12,12 @@ Constructing Graphs
update.rst
derive-for-training.rst
distribute-train.rst
import.rst
import.rst
Using the Python API <../../python_api/index.rst>
The "How to" articles in this section explain how to do specific tasks with
nGraph components. The recipes are all framework agnostic; in other words,
The "How to" articles in this section explain how to build or construct graphs
with nGraph components. The recipes are all framework agnostic; in other words,
if an entity (framework or user) wishes to make use of target-based computational
resources, it can either:
......
......@@ -10,13 +10,13 @@ building of graphs.
Several C++ operators are overloaded to simplify graph construction.
For example, the following:
.. literalinclude:: ../../../examples/abc/abc.cpp
.. literalinclude:: ../../../../examples/abc/abc.cpp
:language: cpp
:lines: 32-32
can be simplified to:
.. literalinclude:: ../../../examples/abc_operator/abc_operator.cpp
.. literalinclude:: ../../../../examples/abc_operator/abc_operator.cpp
:language: cpp
:lines: 31
......
......@@ -15,7 +15,7 @@ An example from C++
Let's start with a simple C++ example, a function ``count`` that
returns how many times it has already been called:
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 20-24
:caption: update.cpp
......@@ -27,13 +27,13 @@ convert this to use a stateless function, define a function that
takes the current value of ``counter`` as an argument and returns the
updated value.
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 26-29
To use this version of counting,
.. literalinclude:: ../../../examples/update/update.cpp
.. literalinclude:: ../../../../examples/update/update.cpp
:language: cpp
:lines: 36-48
......
.. fusion/optimize-graphs:
.. fusion/index.rst:
Pattern matcher
###############
Optimize Graphs
===============
.. toctree::
:maxdepth: 1
with nGraph Compiler fusions
----------------------------
overview.rst
graph-rewrite.rst
The nGraph Compiler is an optimizing compiler. As such, it provides a way to
capture a given :term:`function graph` and perform a series of optimization
passes over that graph. The result is a semantically-equivalent graph that, when
executed using any :doc:`backend <../backend-support/index>`, has optimizations
executed using any :doc:`backend <../../backend-support/index>`, has optimizations
inherent at the hardware level: superior runtime characteristics to increase
training performance or reduce inference latency.
......@@ -41,7 +43,30 @@ then inspecting the transformed graph.
Optimization passes can be programmed ahead of time if you know or can predict
what your graph will look like when it's ready to be executed (in other words:
which `ops` can be automatically translated into :doc:`nGraph Core ops <../ops/index>`).
which `ops` can be automatically translated into :doc:`nGraph Core ops <../../ops/index>`).
The ``Interpreter`` is simply a backend providing reference implementations of
nGraph ops in C++, with a focus on simplicity over performance.
Example
-------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. In other words, the original expression
:math:`a + b * 1` is exactly equivalent to the expression :math:`a + b`, so we
can eliminate this extra multiplication step.
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and then ``replace`` those
expressions with just their multiplicands (for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into an optimized version (via ``pass/graph_rewrite.hpp``).
\ No newline at end of file
.. fusion/overview.rst
Overview: Optimize graphs with nGraph Compiler fusions
-------------------------------------------------------
The nGraph Compiler is an optimizing compiler. As such, it provides a way to
capture a given :term:`function graph` and perform a series of optimization
passes over that graph. The result is a semantically-equivalent graph that, when
executed using any :doc:`backend <../../backend-support/index>`, has
hardware-agnostic *and* hardware-specific optimizations, providing superior
runtime characteristics to increase training performance or reduce inference
latency.
There are several ways to describe what happens when we capture and translate
the framework's output of ops into an nGraph graph. :term:`Fusion` is the term
we shall use in our documentation; the action also can be described as:
*combining*, *folding*, *squashing*, *collapsing*, or *merging* of graph
functions.
Optimization passes may include algebraic simplifications, domain-specific
simplifications, and fusion. Most passes share the same mode of operation (or
the same operational structure) and consist of various stages (each one a
:term:`step`) where a developer can experiment with the intercepted or dynamic
graph. These steps may be cycled or recycled as needed:
#. Locate a list of potentially-transformable subgraphs in the given graph.
#. Transform the selected candidates into semantically-equivalent subgraphs
that execute faster, or with less memory (or both).
#. Verify that the optimization pass performs correctly and applies the expected
transformations by using the ``NGRAPH_SERIALIZE_TRACING`` option, which
serializes a graph in ``json`` format after each pass (see the sketch after this list).
#. Measure and evaluate your performance improvements with ``NGRAPH_CPU_TRACING``,
which produces timelines compatible with ``chrome://tracing``.
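Assuming these options are picked up as environment variables at runtime (the
exact names and mechanics may differ between releases, and ``my_model`` below
is only a placeholder binary), a quick check might look like:

.. code-block:: console

   $ NGRAPH_SERIALIZE_TRACING=1 ./my_model   # dump a json snapshot of the graph after each pass
   $ NGRAPH_CPU_TRACING=1 ./my_model         # emit a chrome://tracing-compatible timeline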
Optimizations can be experimented upon without using any backend by registering
a pass with pass manager (``Manager``), calling ``run_passes`` on a function, and
then inspecting the transformed graph.
Optimization passes can be programmed ahead of time if you know or can predict
what your graph will look like when it's ready to be executed (in other words:
which `ops` can be automatically translated into :doc:`nGraph Core ops <../../ops/index>`).
The ``Interpreter`` is simply a backend providing reference implementations of
nGraph ops in C++, with a focus on simplicity over performance.
Example
-------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. This is the same as saying:
:math:`b * 1 = b`
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and to then ``transform``,
``simplify``, or ``replace`` those expressions with just their multiplicands
(for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into a condensed version (via ``pass/graph_rewrite.hpp``).
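As a rough sketch of that flow (the header paths and exact ``Manager`` calls
here are assumptions; check the headers under ``ngraph/pass/`` for the current
API), the multiply-by-one graph above could be built and simplified like this:

.. code-block:: cpp

   #include <ngraph/ngraph.hpp>
   #include <ngraph/pass/algebraic_simplification.hpp>
   #include <ngraph/pass/manager.hpp>

   using namespace ngraph;

   // Build the graph for a + b * 1
   auto a   = std::make_shared<op::Parameter>(element::f32, Shape{4});
   auto b   = std::make_shared<op::Parameter>(element::f32, Shape{4});
   auto one = op::Constant::create(element::f32, Shape{4}, {1, 1, 1, 1});
   auto f   = std::make_shared<Function>(a + b * one, ParameterVector{a, b});

   // Register and run the algebraic simplification pass
   pass::Manager pass_manager;
   pass_manager.register_pass<pass::AlgebraicSimplification>();
   pass_manager.run_passes(f);

   // The multiply-by-one node has now been folded away; the function
   // computes a + b directly.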
......@@ -4,11 +4,57 @@
Overview
========
What follows here is a table of all documented namespaces with brief descriptions:
.. figure:: ../graphics/whole-stack.png
:alt: The whole stack
The whole nGraph Compiler stack
The nGraph Compiler stack consists of bridges, core, and backends. We'll examine
each of these briefly to get started.
A framework bridge is a component that sits between a framework like TensorFlow
or MXNet and the nGraph Core "frontend" API. A framework bridge does two things:
first, it translates a framework's operations into graphs in nGraph's in-memory
:abbr:`Intermediate Representation (IR)`; second, it executes the nGraph IR
graphs via the backend execution interface.
The details of bridge implementation vary from framework to framework, but there
are some common patterns. A fairly typical flow for a graph-based framework is
illustrated here; it consists of two phases: a **clustering** phase and a
**translation** phase.
.. figure:: ../graphics/translation-flow-to-ng-fofx.png
:alt: Translation flow to an nGraph function
Translation flow to an nGraph function
The clustering phase operates on the original framework's graph. During this
stage, we look for maximal subgraphs containing nodes that can be translated
to data flow functions in nGraph. The ability to capture subgraphs of the original
graph means that we maintain interoperability with the native framework runtime.
Any node that is not placed in a cluster can still be handled by the native
framework. On the other hand, identifying maximal subgraphs means that we can
avoid unnecessary handoffs between the native framework runtime and nGraph;
minimizing these handoffs is good for performance.
In the second phase, called translation, we cut out each cluster subgraph,
translate it into an nGraph Function, and replace the cluster subgraph with a
stand-in node called an "encapsulation node" that holds a pointer to the nGraph
``Function``. Later, at runtime, those functions will be invoked when the
framework asks us to execute the encapsulation node.
It’s worth noting that backends have total freedom to rewrite the nGraph
Functions: they can do it for the sake of structural or algorithmic optimization
of the graph, for easy integration with kernel libraries, or for any or no
reason at all.
Namespaces in nGraph
--------------------
What follows here is a table of all documented namespaces with brief
descriptions:
Namespace List
--------------
.. csv-table::
:header: "Namespace", "Description", "Location in Repo", "Docs"
......@@ -27,7 +73,3 @@ Namespace List
.. _Ndescriptor: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/descriptor
.. _Nop: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/op
.. _Nruntime: https://github.com/NervanaSystems/ngraph/tree/master/src/ngraph/runtime
.. fusion/index.rst:
.. core/passes/list-of-passes:
Pattern matcher
###############
* :ref:`overview`
* :ref:`passes_list`
* :ref:`more_detail`
* :ref:`passes_examples`
* :doc:`optimize-graphs`
.. _overview:
Generic graph optimizers: Optimization passes
=============================================
The pass manager infrastructure in nGraph makes it easy to reuse and mix the
generic optimization passes. It also permits you to roll your own device-specific
optimizations; that is, the same unified interface and APIs may be used to
cover both things.
Invoking these passes is fairly straightforward:
#. Create a "pass manager" object.
#. Populate it with the desired passes.
#. Pass to it a pointer to your unoptimized graph, and it’ll return a pointer
to an optimized graph.
nGraph Core includes a large library of hardware-agnostic passes -- passes useful
for almost any kind of hardware backend. Some of these passes should be familiar
to people who are comfortable with classical compiler designs. Others, like the
reshape/transpose elimination and sinking passes, are quite specific to deep
learning.
Let’s take a look at some of these passes.
List of passes
==============
.. csv-table::
:header: "Pass Name", "More Detail"
:widths: 29, 31
:escape: ~
``AlgebraicSimplification``, :ref:`algebraic_simpl`
``AssignLayout``, Coming Soon
``CallGraphPass``, Coming Soon
``CommonFunctionCollection``, Coming Soon
``CommonSubexpressionElimination``, :ref:`common_subex_elim`
``ConstantFolding``, :ref:`constant_fold`
``CoreFusion``, Coming Soon
``DumpSorted``, Coming Soon
``FunctionPass``, Coming Soon
``GetOutputElementElimination``, Coming Soon
``GraphRewrite``, Coming Soon
``LikeReplacement``, Coming Soon
``Liveness``, Coming Soon
``Manager``, Coming Soon
``ManagerState``, Coming Soon
``MemoryLayout``, Coming Soon
``MemoryManager``, Coming Soon
``MemoryVisualize``, Coming Soon
``ModulePass``, Coming Soon
``NodePass``, Coming Soon
``NopElimination``, Coming Soon
``PassBase``, Coming Soon
``PassConfig``, Coming Soon
``PrefixReshapeElimination``, Coming Soon
``PropagateCacheability``, Coming Soon
``RecurrentGraphRewrite``, Coming Soon
``ReshapeElimination``, :ref:`reshape_transpose_elim`
``ReshapeSinking``, :ref:`reshape_transpose_sink`
``Serialization``, Coming Soon
``ValidateGraph``, Coming Soon
``VisualizeTree``, Coming Soon
``ZeroDimTensorElimination``, Coming Soon
.. important:: All of the above passes are currently available; more detailed
documentation for each pass may be a :abbr:`Work In Progress (WIP)`.
.. _passes_list:
List of Passes
==============
.. _algebraic_simpl:
* :ref:`algebraic_simpl`
* :ref:`common_subex_elim`
* :ref:`constant_fold`
* :ref:`reshape_transpose_elim`
* :ref:`reshape_transpose_sink`
``Algebraic Simplification``
----------------------------
.. figure:: ../../graphics/algebraic-simpl.png
:width: 650px
.. _algebraic_simpl:
Algebraic simplification
Algebraic Simplification
------------------------
The **Algebraic Simplification** pass implements what amounts to a "grab bag" of
algebraic simplification rules. It does some basic things like rewrite "zero
......@@ -60,10 +65,8 @@ times x" to simply "zero", or "zero plus x" to plain "x".
It can also do a number of tricks more specific to deep learning. For example,
if we discover that a tensor is being sliced up by adjacent segments, only to
have those slices concatenated back together again, we can skip the slicing and
concatting altogether.
Or, if a tensor is being padded, but the actual width of the padding is zero
all around, we can skip the padding step entirely.
concatting altogether. Or, if a tensor is being padded, but the actual width of
the padding is zero all around, we can skip the padding step entirely.
Several other transformations like this are implemented in the algebraic
simplification pass. And while none of these transformations might seem
......@@ -71,33 +74,34 @@ particularly impressive on their own, when everything comes together the
results of this pass often yield improvement even on the initial graph straight
out of the bridge. This pass is also quite important as a "glue" pass that can
be used to clean up and/or re-simplify after other passes have done their own
tricks.
tricks. See :doc:`passes` for an example of how effective this can be.
.. _common_subex_elim:
Common Subexpression Elimination
--------------------------------
``Common Subexpression Elimination``
-------------------------------------
.. _constant_fold:
Constant Folding
----------------
``Constant Folding``
--------------------
.. _core_fusion:
Core Fusion
-----------
``Core Fusion``
---------------
.. _reshape_transpose_elim:
Reshape/Transpose Elimination
-----------------------------
``Reshape Elimination``
-----------------------
The pass called **Reshape/Transpose Elimination** will find and optimize where
This pass, also known as **Reshape/Transpose Elimination**, will find and optimize cases where
we can "push" two ``Transpose`` ops through a matrix multiplication. For example,
if you have two matrices (say, *foo* and *bar*), both of these matrices will be
transposed (to produce *foo.t* and *bar.t*, respectively), after which *foo.t*
......@@ -120,8 +124,8 @@ them both out of the graph.
.. _reshape_transpose_sink:
``Reshape/Transpose Sinking``
-----------------------------
``Reshape Sinking``
-------------------
......@@ -130,76 +134,4 @@ them both out of the graph.
.. _elementzero_tensor_elim:
``Zero-Element Tensor Elimination``
-----------------------------------
.. _more_detail:
More detail
-----------
Let us first consider a simple example. A user would like to execute a graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. This is the same as saying:
:math:`b * 1 = b`
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and to then ``transform``,
``simplify``, or ``replace`` those expressions with just their multiplicands
(for stage 2).
To make the work of an optimization pass writer easier, the nGraph Library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into a condensed version (via ``pass/graph_rewrite.hpp``).
Let's consider each in more detail and many ways they can help the graph
optimizer.
.. toctree::
:maxdepth: 1
graph-rewrite.rst
passes-that-use-matcher.rst
optimize-graphs.rst
.. _passes_examples:
Examples of Passes
==================
The effectiveness of these passes is more striking to look at in terms of an
actual input graph, such as one from the framework bridge.
*Figure 0* shows an excerpt from ``MobileNet v1``, a topology which makes heavy
use of group convolution.
.. _figure-mobilenet-gc:
.. figure:: ../graphics/mobilenet-group-conv.png
:width: 700px
:alt:
Figure 0: Each of these grouped convolution complexes -- the
operations within the rectangles on the left -- is very wide; each is too
wide to fit legibly on the illustration.
The group convolution fusion is able to replace each of those giant subgraphs
with a single CPU group convolution node. This ends up being a win in several
ways:
* sheer node count,
* mappability to MKL-DNN (which has an accelerated group convolution implementation),
* elimination of unnecessary temporaries, and so on.
\ No newline at end of file
-----------------------------------
\ No newline at end of file
......@@ -131,6 +131,6 @@ Equivalent to ``"A(BC)+A"`` in regexes
.. |image11| image:: mg/fusion_pattern.png
.. |image12| image:: mg/rp_graph1.png
.. |image13| image:: mg/rp_pattern.png
\ No newline at end of file
.. |image11| image:: ../fusion/mg/fusion_pattern.png
.. |image12| image:: ../fusion/mg/rp_graph1.png
.. |image13| image:: ../fusion/mg/rp_pattern.png
\ No newline at end of file
.. core/passes:
Compiler passes
===============
.. toctree::
:maxdepth: 1
:caption: Compiler passes
list-of-passes.rst
passes-that-use-matcher.rst
Overview: Generic graph optimization passes
-------------------------------------------
The pass manager infrastructure in nGraph makes it easy to reuse and mix the
generic optimization passes. It also permits you to roll your own device-specific
optimizations; that is, the same unified interface and APIs may be used to
cover both things.
Invoking these passes is fairly straightforward:
#. Create a "pass manager" object.
#. Populate it with the desired pass(es).
#. Invoke the pass manager with a pointer to your unoptimized graph; it will
return a pointer to an optimized graph (see the sketch below).
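To make these steps concrete, here is a minimal sketch. The pass names are
taken from the list of passes documented in this section; ``my_function`` (a
``std::shared_ptr<Function>`` holding the unoptimized graph) and the exact
registration calls are assumptions to verify against the current headers.

.. code-block:: cpp

   pass::Manager pass_manager;

   // Populate the manager with hardware-agnostic passes, in the order they should run
   pass_manager.register_pass<pass::NopElimination>();
   pass_manager.register_pass<pass::AlgebraicSimplification>();
   pass_manager.register_pass<pass::CommonSubexpressionElimination>();
   pass_manager.register_pass<pass::ConstantFolding>();

   // Run the whole pipeline over the unoptimized function
   pass_manager.run_passes(my_function);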
nGraph Core includes a large library of hardware-agnostic passes useful
for almost any kind of hardware backend. Some of these passes are likely familiar
to people who are comfortable with classical compiler designs. Others, like the
reshape/transpose elimination and sinking passes, are quite specific to deep
learning.
Example of Passes
-----------------
The effectiveness of graph-level optimization with nGraph is more striking to look
at in terms of an actual input graph, such as one from the framework bridge.
*Figure A* shows an excerpt from ``MobileNet v1``, a topology which makes heavy
use of group convolution.
.. _figure-mobilenet-gc:
.. figure:: ../../graphics/mobilenet-group-conv.png
:width: 700px
:alt:
Figure A: Each of these grouped convolution complexes -- the
operations within the rectangles on the left -- is very wide; each is too
wide to fit legibly on the illustration.
The group convolution fusion is able to replace each of those giant subgraphs
with a single CPU group convolution node. This ends up being a win in several
ways:
* sheer node count,
* mappability to MKL-DNN (which has an accelerated group convolution implementation),
* elimination of unnecessary temporaries, and so on.
\ No newline at end of file
.. frameworks/index.rst:
.. TODO update CODEOWNERS for this new structure
Current framework integrations
==============================
.. toctree::
:maxdepth: 1
tensorflow_integ.rst
mxnet_integ.rst
onnx_integ.rst
......@@ -22,7 +18,7 @@ cloned from one of our GitHub repos and built to connect to nGraph device backen
all the while maintaining the framework's programmatic or user interface. Bridges
currently exist for the TensorFlow\* and MXNet\* frameworks.
.. figure:: ../graphics/bridge-to-graph-compiler.png
.. figure:: ../graphics/whole-stack.png
:width: 733px
:alt: JIT compiling of a computation
......
......@@ -42,12 +42,10 @@ nGraph Compiler stack
:caption: nGraph Core
core/overview.rst
Pattern matcher <fusion/index.rst>
core/fusion/index.rst
nGraph Core Ops <ops/index.rst>
More about Ops <ops/about.rst>
Graph construction <howto/index.rst>
Using the Python API <python_api/index.rst>
Compiler passes <fusion/graph-rewrite.rst>
core/constructing-graphs/index.rst
core/passes/passes.rst
buildlb.rst
......@@ -75,6 +73,12 @@ nGraph Compiler stack
diagnostics/visualize.rst
diagnostics/debug.rst
.. toctree::
:maxdepth: 1
:caption: Tutorials
tutorials/index.rst
.. toctree::
:maxdepth: 1
......@@ -84,7 +88,7 @@ nGraph Compiler stack
project/contribution-guide.rst
project/index.rst
glossary.rst
project/doc-contributor-README.rst
Indices and tables
......
.. ops/about.rst:
##############
About Core Ops
##############
An ``Op``'s primary role is to function as a node in a directed acyclic graph:
the dependency graph of a computation.
*Core ops* are ops that are available and generally useful to all framework
bridges and that can be compiled by all transformers. A framework bridge may
define framework-specific ops to simplify graph construction, provided that the
bridge can enable every transformer to replace all such ops with equivalent
clusters or subgraphs composed of core ops. Similarly, transformers may define
transformer-specific ops to represent kernels or other intermediate operations.
If a framework supports extending the set of ops it offers, a bridge may even
expose transformer-specific ops to the framework user.
.. figure:: ../graphics/tablengraphops.png
:width: 535px
:alt: Operations Available in the nGraph IR
Operations Available in the nGraph IR
.. important:: Our design philosophy is that the graph is not a script for
running kernels; rather, our compilation will match ``ops`` to appropriate
kernels for the backend(s) in use. Thus, we expect the addition of new Core
ops to be infrequent, and that most functionality will instead be added with
new functions that build sub-graphs from existing core ops.
It is easiest to define a new op by adapting an existing op; a structural
sketch follows the list below. Some of the tasks that must be performed are:
- Op constructor:
* Checking type-consistency of arguments
* Specifying the result type for a call
- Serializer/Deserializer
- Transformer handlers:
* Interpreter (reference) implementation of behavior. The
implementation should favor clarity over efficiency.
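Purely as a structural sketch (the base-class constructor arguments and method
names shown here are assumptions; adapt a real op from the repository rather
than copying this verbatim), a new op roughly takes this shape:

.. code-block:: cpp

   class MyOp : public ngraph::op::Op
   {
   public:
       // Constructor: wires up the argument and checks type-consistency
       MyOp(const std::shared_ptr<ngraph::Node>& arg)
           : Op("MyOp", ngraph::check_single_output_args({arg}))
       {
           constructor_validate_and_infer_types();
       }

       // Specify the result type for a call; this example is shape-preserving
       void validate_and_infer_types() override
       {
           set_output_type(0, get_input_element_type(0), get_input_shape(0));
       }

       // Needed so the op can be cloned when functions are copied
       std::shared_ptr<ngraph::Node>
           copy_with_new_args(const ngraph::NodeVector& new_args) const override
       {
           return std::make_shared<MyOp>(new_args.at(0));
       }
   };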
......@@ -6,6 +6,8 @@ List of Core ``ops``
Not currently a comprehensive list.
:ref:`more_about`
.. hlist::
:columns: 3
......@@ -143,3 +145,51 @@ Not currently a comprehensive list.
subtract.rst
tan.rst
tanh.rst
.. _more_about:
More about Core Ops
-------------------
An ``Op``'s primary role is to function as a node in a directed acyclic graph:
the dependency graph of a computation.
*Core ops* are ops that are available and generally useful to all framework
bridges and that can be compiled by all transformers. A framework bridge may
define framework-specific ops to simplify graph construction, provided that the
bridge can enable every transformer to replace all such ops with equivalent
clusters or subgraphs composed of core ops. Similarly, transformers may define
transformer-specific ops to represent kernels or other intermediate operations.
If a framework supports extending the set of ops it offers, a bridge may even
expose transformer-specific ops to the framework user.
.. figure:: ../graphics/tablengraphops.png
:width: 535px
:alt: Operations Available in the nGraph IR
Operations Available in the nGraph IR
.. important:: Our design philosophy is that the graph is not a script for
running kernels; rather, our compilation will match ``ops`` to appropriate
kernels for the backend(s) in use. Thus, we expect the addition of new Core
ops to be infrequent, and that most functionality will instead be added with
new functions that build sub-graphs from existing core ops.
It is easiest to define a new op by adapting an existing op. Some of the tasks
that must be performed are:
- Op constructor:
* Checking type-consistency of arguments
* Specifying the result type for a call
- Serializer/Deserializer
- Transformer handlers:
* Interpreter (reference) implementation of behavior. The
implementation should favor clarity over efficiency.
\ No newline at end of file
......@@ -10,8 +10,8 @@ optimizing an :abbr:`Artificial Neural Network (ANN)` (often abbreviated :term:`
to run graph-based computations for training, inference, testing, or validation.
Because today's NNs make use of many custom-purpose devices (FPGAs, GPUs, CPUs,
and custom silicon), having such a standard simplifies what would otherwise be
an enormously complex and difficult-to-scale pipeline (:ref:`Figure 3 <figure-3>`)
from "training with your favorite framework using GPUs" (:ref:`Figure 4 <figure-4>`),
an enormously complex and difficult-to-scale pipeline (:ref:`Figure D <figure-D>`)
from "training with your favorite framework using GPUs" (:ref:`Figure E <figure-E>`),
to deploying that (now) pre-trained model in a datacenter or production
environment, where infrastructure owners or software developers renting anything
in a datacenter ought to be mutually concerned with **efficiency per-watt**, to
......@@ -30,35 +30,35 @@ library unique to that vendor's hardware. For example, after integration, a
kernel library can run operations that it is "familiar" with optimally; however,
the graph itself within any larger :term:`NN` won't be optimal.
.. _figure-0:
.. _figure-A:
.. figure:: ../graphics/framework-to-kernel-lib.png
:width: 555px
:alt:
Figure 0: Lack of graph-level optimization makes framework-to-kernel library
Figure A: Lack of graph-level optimization makes framework-to-kernel library
integration enormously inefficient. The computation graph above represents
the computation: "A plus B times C".
.. _figure-1:
.. _figure-B:
.. figure:: ../graphics/framework-to-graph-opt.png
:width: 555px
:alt:
Figure 1: Notice that an operation on the constant B (in this case a ``Broadcast``)
Figure B: Notice that an operation on the constant B (in this case a ``Broadcast``)
can be done at compile time. This is an example of constant folding, and it
is not available to a device-based kernel library.
.. _figure-2:
.. _figure-C:
.. figure:: ../graphics/ngraph-algebraic-simp.png
:width: 555px
:alt:
Figure 2: Finally notice that the constant has value "zero" thus the add is an
Figure C: Finally, notice that the constant has the value "zero"; thus the add is an
*identity* operation and can be eliminated. This is an example of **Algebraic
simplification**, and it is not available to a device-based kernel library.
......@@ -78,7 +78,7 @@ A typical network is constructed using some kind of language-based API, which
translates the network or :abbr:`DL (Deep Learning)` model (statically or
dynamically) into serialized graphs. Those graphs can then be passed through a
compilation process (the *Graph optimization or compilation* step in
*Figure 3* below), where various graph-level optimizations, like constant folding
*Figure D* below), where various graph-level optimizations, like constant folding
or fusion can happen. These processes require unique vendor-provided libraries
to communicate with a driver (possibly through OpenCL\*, CUDA\*, or SYCL\*), to
compile and execute an implementation (kernel) for a specific
......@@ -89,25 +89,25 @@ each component. Note that optimizing for any one on its own usually requires
engineering expertise that can be highly specialized to that component, and that
the terms have been simplified for illustrative purposes.
.. _figure-3:
.. _figure-D:
.. figure:: ../graphics/components-dl-stack.png
:width: 700px
:alt: A simplified DL stack
Figure 3: Components of a DL stack, simplified for illustrative purposes.
Figure D: Components of a DL stack, simplified for illustrative purposes.
There are many deep learning frameworks, each with its own strengths and user
bases. A setup that is common to many DL practitioners is shown in the
illustration below.
.. _figure-4:
.. _figure-E:
.. figure:: ../graphics/a-common-stack.png
:width: 700px
:alt: A common implementation
Figure 4: A commonly-implemented stack uses TensorFlow\* as the frontend.
Figure E: A commonly-implemented stack uses TensorFlow\* as the frontend.
The input is either optimized via Grappler, or executed directly via TensorFlow.
In either case, when targeting an Nvidia\* GPU, cuDNN is called to select an
optimal kernel for the operation; cuDNN then relies on CUDA\* or direct access
......@@ -121,13 +121,13 @@ memory layout, its feature set, etc. Each of these connections, then, represents
significant work for what will ultimately be a brittle setup that is enormously
expensive to maintain.
.. _figure-5:
.. _figure-F:
.. figure:: ../graphics/dl-current-state.png
:width: 700px
:alt: Scalability matters
Figure 5: The number of kernels necessary to achieve optimal performance is
Figure F: The number of kernels necessary to achieve optimal performance is
bounded by the product of the number of chip designs one wishes to support,
the number of data types supported, the number of operations, and the
cardinality of each parameter for each operation.
......@@ -148,22 +148,32 @@ hardware coverage and optimization automatically. Any hardware that supports
LLVM, OpenCL, OpenGL, CUDA or Metal can be supported automatically with PlaidML
and nGraph.
.. _figure-6:
.. _figure-G:
.. figure:: ../graphics/graph-compilers-at-a-glance.png
:width: 700px
:alt: Overview of various graph and tensor compilers.
Figure 6: Overview of various graph and tensor compilers.
Figure G: Overview of various graph and tensor compilers.
.. _figure-7:
.. _figure-H:
.. figure:: ../graphics/tensor-compilers-at-a-glance.png
:width: 700px
:alt: A closer look at tensor compilers.
Figure 7: A closer look at tensor compilers.
Figure H: A closer look at tensor compilers.
Other notable efforts
----------------------
A few other notable efforts in compiler projects include:
* **TVM** https://github.com/dmlc/tvm
* **XLA** https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
* **Glow** https://arxiv.org/pdf/1805.00907.pdf
......
......@@ -29,7 +29,7 @@ the following categories:
In our tests, the optimized workloads can perform up to 45X faster than native
frameworks, and we expect performance gains for other workloads due to our
powerful :doc:`../fusion/index` feature.
powerful :doc:`../core/fusion/index` feature.
See also our recent `API changes`_
......
.. tutorials/index:
##########
Tutorials
##########
Coming soon
.. toctree::
:maxdepth: 1