Unverified commit e5a69122, authored by Michał Karzyński, committed by GitHub

Update ABOUT.md (#2107)

nGraph Compiler stack architecture
----------------------------------
The diagram below represents our current Beta release stack. In the
diagram, nGraph components are colored in gray. Please note that the
stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently functional.

![](doc/sphinx/source/graphics/stackngrknl.png)

#### Bridge

Starting from the top of the stack, nGraph receives a computational
graph from a deep learning framework such as TensorFlow* or MXNet*.
The computational graph is converted to an nGraph internal
representation by a bridge created for the corresponding framework.

An nGraph bridge examines the whole graph to pattern-match subgraphs
that nGraph knows how to execute; these subgraphs are encapsulated.
Parts of the graph that are not encapsulated default to the framework
implementation when executed.
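The encapsulation step can be pictured with a small sketch. This is an illustration only, not the actual bridge API: the supported-op set and the greedy clustering policy are assumptions made for the example.

```python
# Illustrative sketch: greedily cluster consecutive ops that the backend
# claims to support; unsupported ops stay with the framework.
SUPPORTED = {"MatMul", "Add", "Relu"}  # assumed supported-op set

def encapsulate(ops):
    """Split a linear op sequence into (executor, ops) segments."""
    segments, current, supported = [], [], None
    for op in ops:
        ok = op in SUPPORTED
        if current and ok != supported:
            segments.append(("ngraph" if supported else "framework", current))
            current = []
        supported = ok
        current.append(op)
    if current:
        segments.append(("ngraph" if supported else "framework", current))
    return segments

print(encapsulate(["MatMul", "Add", "Relu", "Gather", "MatMul"]))
# [('ngraph', ['MatMul', 'Add', 'Relu']), ('framework', ['Gather']),
#  ('ngraph', ['MatMul'])]
```

A real bridge works on a dataflow graph rather than a linear sequence, but the idea is the same: supported regions become encapsulated subgraphs, and everything else stays with the framework.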
#### nGraph Core

nGraph uses a strongly-typed and platform-neutral Intermediate
Representation (IR) to construct a "stateless" computational graph.
Each node, or op, in the graph corresponds to one step in a
computation, where each step produces zero or more tensor outputs from
zero or more tensor inputs.
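A minimal sketch of such a stateless, strongly-typed graph might look like the following. The class and field names are invented for illustration and are not nGraph's actual C++ types; a single output per op is kept for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)        # frozen: nodes are immutable ("stateless")
class Tensor:
    dtype: str
    shape: tuple

@dataclass(frozen=True)
class Op:
    name: str                  # one step in the computation
    inputs: tuple = ()         # producer ops feeding this step
    output: Tensor = None      # single typed output, kept for brevity

# parameter -> matmul -> relu : a tiny stateless graph
a = Op("Parameter", (), Tensor("f32", (2, 3)))
b = Op("Parameter", (), Tensor("f32", (3, 4)))
mm = Op("MatMul", (a, b), Tensor("f32", (2, 4)))
out = Op("Relu", (mm,), Tensor("f32", (2, 4)))
print(out.output.shape)  # (2, 4)
```

Because each node carries its full type information and no hidden state, a transformer can analyze and rewrite the graph without consulting the originating framework.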
This allows nGraph to apply its state-of-the-art optimizations instead
of having to follow how a particular framework implements op execution,
memory management, data layouts, and so on.

In addition, using the nGraph IR allows faster optimization delivery
for many of the supported frameworks. For example, if nGraph optimizes
ResNet* for TensorFlow*, the same optimization can be readily applied
to the MXNet* or ONNX* implementations of ResNet*.
#### Hybrid Transformer

The Hybrid transformer takes the nGraph IR and partitions it into
subgraphs, which can then be assigned to the best-performing backend.
Two hardware backends are shown in the stack diagram to demonstrate
this graph partitioning. The Hybrid transformer assigns complex
operations (subgraphs) to the Intel® Nervana™ Neural Network Processor
(NNP) to expedite the computation, and the remaining operations default
to CPU. In the future, we will further expand the capabilities of the
Hybrid transformer by enabling more features, such as localized cost
modeling and memory sharing.

Once the subgraphs are assigned, the corresponding backend executes
the IR.
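The placement decision can be caricatured with a toy policy. The backend names, the op sets, and the notion of a "complex" op below are simplifications invented for the example, not the Hybrid transformer's actual cost model.

```python
# Toy placement policy: subgraphs containing any "complex" op go to the
# NNP backend; the rest default to CPU.
ACCELERATED = {"Convolution", "MatMul"}  # assumed "complex" ops

def place(subgraphs):
    """Map subgraph name -> backend, given each subgraph's op set."""
    return {name: ("NNP" if ops & ACCELERATED else "CPU")
            for name, ops in subgraphs.items()}

plan = place({"sg0": {"Convolution", "Add"}, "sg1": {"Reshape", "Concat"}})
print(plan)  # {'sg0': 'NNP', 'sg1': 'CPU'}
```

The planned cost-modeling features would replace a fixed rule like this with per-device cost estimates for each candidate subgraph.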
#### Backends

Focusing on the CPU backend: when the IR is passed to the Intel®
Architecture (IA) transformer, it can be executed in two modes: Direct
EXecution (DEX) and code generation (`codegen`).

In `codegen` mode, nGraph generates and compiles code which can either
call into highly optimized kernels such as those in MKL-DNN, or into
JITers such as Halide. Although our team wrote kernels for some
operations, nGraph also leverages existing kernel libraries such as
MKL-DNN, Eigen, and MLSL. The MLSL library is called when nGraph
executes distributed training.
At the time of the nGraph Beta release, nGraph achieved
state-of-the-art results for ResNet50 with 16 and 32 nodes for
TensorFlow* and MXNet*. We are excited to continue our work in
enabling distributed training, and we plan to expand to 256 nodes in
Q4 ’18. Additionally, we are testing model parallelism in addition to
data parallelism.
The other mode of execution is Direct EXecution (DEX). In DEX mode,
nGraph executes the operations by directly calling the associated
kernels as it walks through the IR, instead of compiling via `codegen`.
This mode reduces compilation time, and it is useful for training,
deploying, and retraining a deep learning workload in production. In
our tests, DEX mode reduced ResNet50 compilation time by 30X.
nGraph further tries to speed up the computation by leveraging
multi-threading and graph scheduling libraries such as OpenMP and
TBB Flow Graph.
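The contrast between the two modes can be sketched in a few lines. The kernels and the string-based "codegen" below are deliberately simplified stand-ins, not nGraph's actual machinery: `codegen` compiles the whole graph into one callable up front, while DEX dispatches kernel by kernel while walking the IR.

```python
# KERNELS stands in for a library of prebuilt, optimized kernels
# (MKL-DNN plays this role in the real stack).
KERNELS = {"Add": lambda x, y: x + y, "Mul": lambda x, y: x * y}

def dex_execute(ir, x, y):
    """DEX: walk the IR and directly call the kernel for each op."""
    for op in ir:
        x = KERNELS[op](x, y)
    return x

def codegen_compile(ir):
    """codegen: emit source for the whole graph, then compile it once."""
    body = "x"
    for op in ir:
        body = f"KERNELS[{op!r}]({body}, y)"
    return eval(f"lambda x, y: {body}", {"KERNELS": KERNELS})

fn = codegen_compile(["Add", "Mul"])                 # pay compile cost once
print(fn(2, 3), dex_execute(["Add", "Mul"], 2, 3))   # 15 15
```

Both paths compute the same result; DEX skips the compile step entirely, which is where the compilation-time savings reported above come from.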
Features
--------
nGraph performs a combination of device-specific and
non-device-specific optimizations:
- **Fusion** -- Fuse multiple ops to decrease memory usage.
- **Data layout abstraction** -- Make abstraction easier and faster
with nGraph translating element order to work best for a given or
available device.
- **Data reuse** -- Save results and reuse for subgraphs with the
same input.
- **Graph scheduling** -- Run similar subgraphs in parallel via
multi-threading.
- **Graph partitioning** -- Partition subgraphs to run on different
devices to speed up computation; make better use of spare CPU cycles
with nGraph.
- **Memory management** -- Prevent peak memory usage by intercepting
  a graph at a "saved checkpoint," and enable data auditing.
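The fusion feature above can be illustrated with a toy peephole pass. The op names and the single MatMul+Add rule are invented for the example; real fusion passes match many patterns over a dataflow graph.

```python
# Illustrative peephole fusion: merge an adjacent MatMul + Add pair into
# one fused op, eliminating the intermediate tensor between them.
def fuse(ops):
    fused, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] == "MatMul" and ops[i + 1] == "Add":
            fused.append("FusedMatMulAdd")   # one kernel, no temp buffer
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    return fused

print(fuse(["MatMul", "Add", "Relu", "MatMul", "Add"]))
# ['FusedMatMulAdd', 'Relu', 'FusedMatMulAdd']
```

Each fusion removes a write and a read of an intermediate tensor, which is the memory-usage saving the bullet refers to.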
Beta Limitations
----------------
In this Beta release, nGraph only supports Just In Time compilation,
but we plan to add support for Ahead of Time compilation in the official
release of nGraph. nGraph currently has limited support for dynamic graphs.
Current nGraph Compiler full stack
----------------------------------
In addition to IA and NNP transformers, the nGraph Compiler stack has transformers
for multiple GPU types and an upcoming Intel deep learning accelerator. To
support the growing number of transformers, we plan to expand the capabilities
of the hybrid transformer with a cost model and memory sharing. With these new
features, even if nGraph has multiple backends targeting the same hardware, it
will partition the graph into multiple subgraphs and determine the best way to
execute each subgraph.