Commit a2732033 authored by L.S. Cook, committed by Scott Cyphers

Leona/patternmatchdoc (#1057)

* editing how to execute computation file for clarity and linenos

* Add placeholder for runtime docs

* Update section on backends, interpreter, and FPGA options

* add updated master to fix python_ci

* Weird autosummary issue reverted

* Clarify new section

* fix up docs

* Update pattern matcher doc based on Nik's presentation slides WIP

* Update doc structure and examples

* remove old folder

* Fix broken Tensorview refs

* Helping people document code more efficiently

* PR review edits

* Finish PR review comment fixes so far

* split patternmatcher PR

* small fixes to PM docs

* remove mark tags from source code

* Final PR cleanup edits
parent 7758cf5d
@@ -56,7 +56,7 @@ source_suffix = '.rst'
master_doc = 'index'
# General information about the project.
project = u'Intel® nGraph™ library'
project = u'Intel® nGraph Library'
copyright = '2018, Intel Corporation'
author = 'Intel Corporation'
......
.. generic-frameworks.rst
.. frameworks/generic.rst
Activating nGraph on generic frameworks
========================================
Activate nGraph |trade| on generic frameworks
=============================================
This section details some of the *configuration options* and some of the
*environment variables* that can be used to tune for optimal performance when
@@ -62,12 +62,12 @@ use these configuration settings.
* ``KMP_AFFINITY`` Enables the runtime library to bind threads to physical
processing units.
* ``KMP_SETTINGS`` Enables (``true``) or disables (``false``) the printing of
OpenMP* runtime library environment variables during program execution.
OpenMP\* runtime library environment variables during program execution.
* ``OMP_NUM_THREADS`` Specifies the number of threads to use.
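These variables are normally exported from the shell before launching the
framework. As a hedged illustration (the affinity string below is a common
starting point, not an nGraph requirement), they can also be set
programmatically from C++ before the OpenMP runtime initializes:

.. code-block:: cpp

   #include <cstdlib>

   // Set OpenMP tuning knobs for the current process; this must happen
   // before the OpenMP runtime initializes for the values to take effect.
   setenv("KMP_AFFINITY", "granularity=fine,compact,1,0", /*overwrite=*/1);
   setenv("KMP_SETTINGS", "true", 1);
   setenv("OMP_NUM_THREADS", "28", 1);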
nGraph-enabled Intel® Xeon®
===========================
nGraph-enabled Intel® Xeon®
============================
The list below includes recommendations on data layout, parameters, and
application configuration to achieve best performance running DNN workloads on
@@ -88,8 +88,10 @@ and activated as follows:
Memory allocation
-----------------
Buffer pointers should be aligned at the 64-byte boundary. NUMA policy should be
configured for local memory allocation (``numactl --localloc``)
Buffer pointers should be aligned on 64-byte boundaries. NUMA policy should be
configured for local memory allocation (``numactl --localalloc``).
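As a minimal sketch of the alignment requirement (assuming C++17's
``std::aligned_alloc``; any equivalent aligned allocator also works):

.. code-block:: cpp

   #include <cstdlib>

   // Allocate 1024 floats (4096 bytes, a multiple of 64) aligned on a
   // 64-byte boundary; std::aligned_alloc requires size % alignment == 0.
   float* buffer = static_cast<float*>(std::aligned_alloc(64, 1024 * sizeof(float)));
   // ... fill the buffer and hand it to the backend ...
   std::free(buffer);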
Convolution shapes
^^^^^^^^^^^^^^^^^^
@@ -129,13 +131,11 @@ Intra-op and inter-op parallelism
* ``intra_op_parallelism_threads``
* ``inter_op_parallelism_threads``
Some frameworks, like Tensorflow, use these settings to improve performance;
however, they are often not sufficient to achieve optimal performance.
Framework-based adjustments cannot access the underlying NUMA configuration in
multi-socket Intel Xeon processor-based platforms, which is a key requirement for
many kinds of inference-engine computations. See the next section on
NUMA performance to learn more about this performance feature available to systems
utilizing nGraph.
Some frameworks, like TensorFlow\*, use these settings to improve performance;
however, they are often not sufficient for optimal performance. Framework-based adjustments cannot access the underlying NUMA configuration in multi-socket
Intel Xeon processor-based platforms, which is a key requirement for many kinds
of inference-engine computations. See the next section on NUMA performance to
learn more about this performance feature available to systems utilizing nGraph.
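As a concrete illustration of the settings above, here is a hedged sketch
using TensorFlow's C++ API (the setter names come from the public
``ConfigProto`` interface and may differ across TensorFlow versions):

.. code-block:: cpp

   #include "tensorflow/core/public/session_options.h"

   // Pin the sizes of TensorFlow's two thread pools; reasonable values
   // depend on core count and should be measured per workload.
   tensorflow::SessionOptions options;
   options.config.set_intra_op_parallelism_threads(4);
   options.config.set_inter_op_parallelism_threads(2);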
NUMA performance
......
.. optimize/index:
.. framework/index:
#############################
Integrate Generic Frameworks
#############################
This section, written for framework architects or engineers who want
to optimize a generic, brand new or less widely-supported framework, we
provide some of our learnings from the work we've done in developing
"framework direct optimizations (DO)" and custom bridge code, such as
that for our `ngraph tensorflow bridge`_ code.
In this section, written for framework architects or engineers who want
to optimize brand new, generic, or less widely-supported frameworks, we provide
some of our learnings from our "framework Direct Optimization (framework DO)"
work and custom bridge code, such as that for our `ngraph tensorflow bridge`_.
.. important:: This section contains articles for framework owners or developers
who want to incorporate the nGraph library directly into their framework and
@@ -21,6 +22,7 @@ that for our `ngraph tensorflow bridge`_ code.
generic.rst
When using a framework to run a model or deploy an algorithm on nGraph
devices, there are some additional configuration options that can be
incorporated -- manually on the command line or via scripting -- to improve
@@ -48,8 +50,10 @@ the pack by providing a means for the data scientist to obtain reproducible
results. The other challenge is to provide sufficient documentation, or to
provide sufficient hints for how to do any "fine-tuning" for specific use cases.
How this has worked in creating the :doc:`the direct optimizations <../framework-integration-guides>`
we've shared with the developer community, our `engineering teams carefully tune the workload to extract best performance`_
Here is how this has worked in creating the
:doc:`direct optimizations <../framework-integration-guides>` we've shared
with the developer community: our engineering teams carefully
`tune the workload to extract best performance`_
from a specific :abbr:`DL (Deep Learning)` model embedded in a specific framework
that is training a specific dataset. Our forks of the frameworks adjust the code
and/or explain how to set the parameters that achieve reproducible results.
@@ -82,10 +86,11 @@ updates.
.. [#1] Benchmarking performance of DL systems is a young discipline; it is a
good idea to be vigilant for results based on atypical distortions in the
configuration parameters. Every topology is different, and performance
increases or slowdowns can be attributed to multiple means.
changes can be attributed to multiple causes. Also watch out for the word "theoretical" in comparisons; actual performance should not be
compared to theoretical performance.
.. _ngraph tensorflow bridge: http://ngraph.nervanasys.com/docs/latest/framework-integration-guides.html#tensorflow
.. _engineering teams carefully tune the workload to extract best performance: https://ai.intel.com/accelerating-deep-learning-training-inference-system-level-optimizations
.. _tune the workload to extract best performance: https://ai.intel.com/accelerating-deep-learning-training-inference-system-level-optimizations
.. _a few small: https://software.intel.com/en-us/articles/boosting-deep-learning-training-inference-performance-on-xeon-and-xeon-phi
.. _Movidius: https://www.movidius.com/
\ No newline at end of file
.. fusion/graph-rewrite.rst:
Using ``GraphRewrite`` to fuse ops
-----------------------------------
Exact pattern matching
~~~~~~~~~~~~~~~~~~~~~~
For the example of :math:`-(-A) = A`, graphs of varying complexity can be
created and rewritten with recipes for pattern-matching + graph-rewrite. To
get started, here is a simple example with a trivial graph, followed by a
more complex example:
|image3|
.. code-block:: cpp
Shape shape{};
auto a = make_shared<op::Parameter>(element::i32, shape);
auto absn = make_shared<op::Abs>(a);
auto neg1 = make_shared<op::Negative>(absn);
auto neg2 = make_shared<op::Negative>(neg1);
|image4|
.. code-block:: cpp
Shape shape{};
auto a = make_shared<op::Parameter>(element::i32, shape);
auto b = make_shared<op::Parameter>(element::i32, shape);
auto c = a + b;
auto absn = make_shared<op::Abs>(c);
auto neg1 = make_shared<op::Negative>(absn);
auto neg2 = make_shared<op::Negative>(neg1);
Label AKA ``.`` in regexes
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|image5|
For the code below, ``element::f32`` will still match the integer-typed
Graph1 and Graph2:
.. code-block:: cpp
// Note: element::f32 will still match the integer Graph1 and Graph2
auto lbl = std::make_shared<pattern::op::Label>(element::f32, Shape{});
auto neg1 = make_shared<op::Negative>(lbl);
auto neg2 = make_shared<op::Negative>(neg1);
Constructing labels from existing nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Double Negative w/ Add
^^^^^^^^^^^^^^^^^^^^^^
|image6|
.. code-block:: cpp
auto a = make_shared<op::Parameter>(element::i32, shape);
//`lbl` borrows the type and shape information from `a`
auto lbl = std::make_shared<pattern::op::Label>(a);
auto neg1 = make_shared<op::Negative>(lbl);
auto neg2 = make_shared<op::Negative>(neg1);
Double Negative w/ Sub
^^^^^^^^^^^^^^^^^^^^^^
|image7|
Predicates are of type ``std::function<bool(std::shared_ptr<Node>)>``
.. code-block:: cpp
//predicates are of type std::function<bool(std::shared_ptr<Node>)>
auto add_or_sub = [](std::shared_ptr<Node> n) {
    return std::dynamic_pointer_cast<op::Add>(n) != nullptr ||
           std::dynamic_pointer_cast<op::Subtract>(n) != nullptr;
};
auto lbl = std::make_shared<pattern::op::Label>(
element::f32,
Shape{},
add_or_sub
);
auto neg1 = make_shared<op::Negative>(lbl);
auto neg2 = make_shared<op::Negative>(neg1);
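As a hedged usage sketch (the candidate graph below is illustrative), the
predicate-carrying pattern can be handed to a ``pattern::Matcher`` and tested
against candidate roots:

.. code-block:: cpp

   // Build a candidate graph rooted at -(-(a + b)) and test it against
   // the pattern above; Add satisfies the add_or_sub predicate.
   auto a = make_shared<op::Parameter>(element::f32, Shape{});
   auto b = make_shared<op::Parameter>(element::f32, Shape{});
   auto sum = make_shared<op::Add>(a, b);
   auto candidate = make_shared<op::Negative>(make_shared<op::Negative>(sum));

   pattern::Matcher matcher(neg2);
   bool matched = matcher.match(candidate); // expected: true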
Passes that use Matcher
=======================
* CPUFusion (GraphRewrite)
* CoreFusion (GraphRewrite)
* ReshapeElimination (GraphRewrite)
* AlgebraicSimplification
* CPUPostLayoutOptimizations (GraphRewrite)
* CPURnnMatFusion
* and many more...
Register ``simplify_neg`` handler
---------------------------------
::
static std::unordered_map<std::type_index, std::function<bool(std::shared_ptr<Node>)>>
initialize_const_values_to_ops()
{
return std::unordered_map<std::type_index, std::function<bool(std::shared_ptr<Node>)>>({
{TI(op::Add), simplify_add},
{TI(op::Multiply), simplify_multiply},
{TI(op::Sum), simplify_sum},
{TI(op::Negative), simplify_neg}
});
}
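The handlers themselves are callables keyed by the op's type index. A
hypothetical sketch of what a ``simplify_neg`` handler could look like (the
actual handler in nGraph's ``AlgebraicSimplification`` pass may differ):

::

    // Fold Negative(Negative(x)) into x; returns true if the graph changed.
    static bool simplify_neg(std::shared_ptr<Node> n)
    {
        auto outer = std::dynamic_pointer_cast<op::Negative>(n);
        if (!outer)
        {
            return false;
        }
        auto inner = std::dynamic_pointer_cast<op::Negative>(outer->get_argument(0));
        if (!inner)
        {
            return false;
        }
        ngraph::replace_node(n, inner->get_argument(0));
        return true;
    }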
Add a fusion
~~~~~~~~~~~~
:math:`max(0, A) = Relu(A)`
Pattern for capturing
~~~~~~~~~~~~~~~~~~~~~
|image11|
:math:`max(0, A) = Relu(A)`
::
    namespace ngraph
    {
        namespace pass
        {
            class CoreFusion;
        }
    }

    class ngraph::pass::CoreFusion : public ngraph::pass::GraphRewrite
    {
    public:
        CoreFusion()
            : GraphRewrite()
        {
            construct_relu_pattern();
        }

        // This should go in a .cpp file.
        void construct_relu_pattern()
        {
            auto iconst0 = ngraph::make_zero(element::i32, Shape{});
            auto val = make_shared<pattern::op::Label>(iconst0);
            auto zero = make_shared<pattern::op::Label>(iconst0, nullptr, NodeVector{iconst0});

            auto broadcast_pred = [](std::shared_ptr<Node> n) {
                return static_cast<bool>(std::dynamic_pointer_cast<op::Broadcast>(n));
            };
            auto skip_broadcast = std::make_shared<pattern::op::Skip>(zero, broadcast_pred);
            auto max = make_shared<op::Maximum>(skip_broadcast, val);

            pattern::graph_rewrite_callback callback = [val, zero](pattern::Matcher& m) {
                NGRAPH_DEBUG << "In a callback for construct_relu_pattern against "
                             << m.get_match_root()->get_name();

                auto pattern_map = m.get_pattern_map();
                auto mzero = m.get_pattern_map()[zero];
                if (!ngraph::is_zero(mzero))
                {
                    NGRAPH_DEBUG << "zero constant = " << mzero->get_name() << " not equal to 0\n";
                    return false;
                }
                auto mpattern = m.get_match_root();

                auto cg = shared_ptr<Node>(new op::Relu(pattern_map[val]));
                ngraph::replace_node(m.get_match_root(), cg);
                return true;
            };

            auto m = make_shared<pattern::Matcher>(max, callback);
            this->add_matcher(m);
        }
    };
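To run the fusion, register the pass with a pass manager. A hedged usage
sketch (``pass::Manager`` is used the same way elsewhere in the nGraph docs):

::

    // f is a std::shared_ptr<ngraph::Function> to optimize.
    ngraph::pass::Manager pass_manager;
    pass_manager.register_pass<ngraph::pass::CoreFusion>();
    pass_manager.run_passes(f);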
Recurrent patterns
------------------
:math:`(((A + 0) + 0) + 0) = A`

Equivalent to ``"A(BC)+A"`` in regexes.
|image12|
|image13|
::
Shape shape{};
auto a = make_shared<op::Parameter>(element::i32, shape);
auto b = make_shared<op::Parameter>(element::i32, shape);
auto rpattern = std::make_shared<pattern::op::Label>(b);
auto iconst0 = ngraph::make_zero(element::i32, shape);
auto abs = make_shared<op::Abs>(a);
auto add1 = iconst0 + b;
auto add2 = iconst0 + add1;
auto add3 = iconst0 + add2;
auto padd = iconst0 + rpattern;
std::set<std::shared_ptr<pattern::op::Label>> empty_correlated_matches;
RecurrentMatcher rm(padd, rpattern, empty_correlated_matches, nullptr);
ASSERT_TRUE(rm.match(add3));
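After a successful match, the nodes bound to ``rpattern`` at each repetition
can be retrieved; a hedged sketch (method name assumed from the
``RecurrentMatcher`` interface as of this writing):

::

    // One binding per repetition of the `iconst0 + rpattern` sub-pattern.
    NodeVector bound = rm.get_bound_nodes_for_pattern(rpattern);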
.. |image3| image:: mg/pr1_graph2.png
.. |image4| image:: mg/pr1_graph3.png
.. |image5| image:: mg/pr1_pattern2.png
.. |image6| image:: mg/pr1_graph4.png
.. |image7| image:: mg/pr1_graph5.png
.. |image8| image:: mg/pr2_graph1.png
.. |image9| image:: mg/pr2_graph2.png
.. |image10| image:: mg/pr2_pattern2.png
.. |image11| image:: mg/fusion_pattern.png
.. |image12| image:: mg/rp_graph1.png
.. |image13| image:: mg/rp_pattern.png
\ No newline at end of file
.. fusion/index.rst:
Optimize Graphs
===============
with nGraph Compiler fusions
-----------------------------
The nGraph Compiler is an optimizing compiler. As such, it performs a series
of optimization passes over a given function graph to translate it into a
semantically-equivalent and inherently-optimized graph with superior runtime
characteristics for any of nGraph's current or future backends. Indeed, a
framework's capability to increase training performance or to reduce inference
latency by simply adding another device of *any* specialized form factor (CPU,
GPU, VPU, or FPGA) is one of the :doc:`key benefits <../project/about>` of
developing upon a framework that uses the nGraph Compiler.
In handling a :term:`function graph`, there are many ways to describe what
happens when we translate the framework's output of ops into an nGraph
graph. :term:`Fusion` is the term we shall use in our documentation, but the
action also can be described as: *combining*, *folding*, *collapsing*, or
*merging* of graph functions. The most common use case is to *fuse* a subgraph
from the function graph into :doc:`one of the nGraph Core ops <../ops/index>`.
Optimization passes may include algebraic simplifications, domain-specific
simplifications, and fusion. Most passes share the same mode of operation (or
the same operational structure) and consist of two stages:
#. Locating a list of potential transformation candidates (usually, subgraphs)
in the given graph.
#. Transforming the selected candidates into semantically-equivalent subgraphs
that run faster and/or with less memory.
Optimization passes can be programmed ahead of time if you know what your graph
will look like when it's ready to be executed, or the optimization passes can
be figured out manually with *Interpreter* mode on a stateless graph.
Let us first consider an example. A user would like to execute a simple graph
that describes the following arithmetic expression:
:math:`a + b * 1` or :math:`Add(a, Mul(b, 1))`
In the above expressions, `1` is an identity element; any element multiplied by
the identity element is equal to itself. This is the same as saying:
:math:`b * 1 = b`
The writer of an optimization pass which uses algebraic simplification would
probably want to first ``locate`` all multiplication expressions where
multiplicands are multiplied by `1` (for stage 1) and to then ``transform``,
``simplify``, or ``replace`` those expressions with just their multiplicands
(for stage 2).
To make the work of an optimization pass writer easier, the nGraph library
includes facilities that enable the *finding* of relevant candidates using
pattern matching (via ``pattern/matcher.hpp``), and the *transforming* of the
original graph into a condensed version (via ``pass/graph_rewrite.hpp``).
Let's consider each of these two facilities in more detail, along with the
many ways they can help the work of the optimization pass writer.
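To make the two stages concrete, here is a hedged sketch (illustrative names
only, not the library's actual simplification pass) of a matcher for the
``b * 1`` case above:

.. code-block:: cpp

   // Stage 1: a pattern for Multiply(b, 1). A production pass would also
   // verify in the callback that the matched constant really is 1.
   auto b = std::make_shared<pattern::op::Label>(element::f32, Shape{});
   auto one = op::Constant::create(element::f32, Shape{}, {1});
   auto mul = std::make_shared<op::Multiply>(b, one);

   // Stage 2: replace the matched `b * 1` subgraph with the multiplicand.
   pattern::graph_rewrite_callback callback = [b](pattern::Matcher& m) {
       auto pattern_map = m.get_pattern_map();
       ngraph::replace_node(m.get_match_root(), pattern_map[b]);
       return true;
   };
   auto simplify_mul_one = std::make_shared<pattern::Matcher>(mul, callback);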
.. toctree::
:maxdepth: 1
graph-rewrite.rst
@@ -33,6 +33,12 @@ Glossary
The Intel nGraph library uses a function graph to represent an
``op``'s parameters and results.
fusion
Fusion is the fusing, combining, merging, collapsing, or refactoring
of a graph's functional operations (``ops``) into one or more of
nGraph's core ops.
op
An op represents an operation. Ops are stateless and have zero
@@ -98,6 +104,14 @@ Glossary
Tensors are maps from *coordinates* to scalar values, all of the
same type, called the *element type* of the tensor.
Tensorview
The interface backends implement for tensor use. When there are no more
references to the tensor view, it will be freed when convenient for the
backend.
model description
A description of a program's fundamental operations that are
......
@@ -166,17 +166,16 @@ you switch between odd/even generations of variables on each update.
Backends are responsible for managing storage. If the storage is off-CPU, caches
are used to minimize copying between device and CPU. We can allocate storage for
the three parameters and the return value as follows:
the three parameters and the return value.
.. literalinclude:: ../../../examples/abc/abc.cpp
:language: cpp
:lines: 41-46
Each tensor is a shared pointer to a :doc:`../programmable/index/tensorview`,
the interface backends implement for tensor use. When there are no more references to the
Each tensor is a shared pointer to a :term:`Tensorview`, which is the interface
backends implement for tensor use. When there are no more references to the
tensor view, it will be freed when convenient for the backend. See the
:doc:`../programmable/index` documentation for details on ``TensorView ``.
:doc:`../programmable/index` documentation for details on ``TensorView``.
.. _initialize_inputs:
......
@@ -71,8 +71,9 @@ Python-based API. See the `ngraph onnx companion tool`_ to get started.
TensorFlow, Yes, Yes
MXNet, Yes, Yes
PaddlePaddle, Coming Soon, Yes
neon, none needed, Yes
PyTorch, Not yet, Yes
PyTorch, Coming Soon, Yes
CNTK, Not yet, Yes
Other, Not yet, Doable
@@ -140,13 +141,14 @@ Contents
install.rst
graph-basics.rst
fusion/index.rst
howto/index.rst
ops/index.rst
project/index.rst
framework-integration-guides.rst
optimize/index.rst
frameworks/index.rst
programmable/index.rst
python_api/index.rst
project/index.rst
......
.. about:
About
=====
Overview
========
Welcome to the documentation site for nGraph™, an open-source C++ Compiler,
Library, and runtime suite for running training and inference on
:abbr:`Deep Neural Network (DNN)` models. nGraph is framework-neutral and can be
targeted for programming and deploying :abbr:`Deep Learning (DL)` applications
on the most modern compute and edge devices.
Features
--------
:ref:`no-lockin`
:ref:`framework-flexibility`
.. _no-lockin:
Develop without lock-in
~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: ../graphics/develop-without-lockin.png
:width: 650px
Indeed, capabilities to increase training performance or to reduce inference
latency by simply adding another device of *any* specialized form factor --
whether it be more compute (CPU), GPU or VPU processing power, custom ASIC or
FPGA, or a yet-to-be-invented generation of NNP or accelerator -- are the key
benefits for framework developers working with nGraph. Our commitment to bake
flexibility into our ecosystem ensures developers' freedom to design user-facing
APIs for various hardware deployments directly into the framework.
Developers working on edge devices augmented by machine learning, on large
distributed training clusters, or on frameworks without restrictive lock-in
also benefit: nGraph lets their users switch or upgrade backends quickly and
easily.
Welcome to nGraph™, an open-source C++ compiler library for running and
training :abbr:`Deep Neural Network (DNN)` models. This project is
framework-neutral and can target a variety of modern devices or platforms.
.. figure:: ../graphics/ngraph-ecosystem.png
:width: 585px
@@ -14,11 +46,11 @@ nGraph currently supports :doc:`three popular <../framework-integration-guides>`
frameworks for :abbr:`Deep Learning (DL)` models through what we call
a :term:`bridge` that can be integrated during the framework's build time.
For developers working with other frameworks (even those not listed above),
we've created a :doc:`How to Guide <../howto/index>` so you can learn how to create
custom bridge code that can be used to :doc:`compile and run <../howto/execute>`
a training model.
we've created a :doc:`How to Guide <../howto/index>` so you can learn how to
create custom bridge code that can be used to
:doc:`compile and run <../howto/execute>` a training model.
We've recently added initial support for the `ONNX`_ format. Developers who
Additionally, we've recently added initial support for the `ONNX`_ format. Developers who
already have a "trained" model can use nGraph to bypass a lot of the
framework-based complexity and :doc:`../howto/import` to test or run it
on targeted and efficient backends with our user-friendly ``ngraph_api``.
@@ -29,17 +61,14 @@ about how to adapt models to train and run efficiently on different devices.
Supported platforms
--------------------
Initially-supported backends include:
* Intel® Architecture Processors (CPUs),
* Intel® Nervana™ Neural Network Processor™ (NNPs), and
* NVIDIA\* CUDA (GPUs).
Tentatively in the pipeline, we plan to add support for more backends,
including:
* :abbr:`Field-Programmable Gate Arrays (FPGAs)`
* `Movidius`_ compute stick
.. note:: The library code is under active development as we're continually
adding support for more kinds of DL models and ops, framework compiler
@@ -82,6 +111,8 @@ tensor outputs from zero or more tensor inputs. For a more detailed dive into
how this works, read our documentation on how to :doc:`../howto/execute`.
.. _framework-flexibility:
How do I connect it to a framework?
------------------------------------
......
.. project/index.rst
Project Docs
============
More about nGraph
==================
This section contains documentation about the project and how to contribute.
......