Unverified commit e5a69122, authored by Michał Karzyński, committed by GitHub

Update ABOUT.md (#2107)

parent 79878d71
@@ -4,99 +4,118 @@ About nGraph Compiler stack
 nGraph Compiler stack architecture
 ----------------------------------

-The diagram below represents our current Beta release stack. Please note
-that the stack diagram is simplified to show how nGraph executes deep
-learning workloads with two hardware backends; however, many other
+The diagram below represents our current Beta release stack.
+In the diagram, nGraph components are colored in gray. Please note
+that the stack diagram is simplified to show how nGraph executes deep
+learning workloads with two hardware backends; however, many other
 deep learning frameworks and backends currently are functioning.

 ![](doc/sphinx/source/graphics/stackngrknl.png)

-Starting from the top of the diagram, we present a simplified view of
-the nGraph Intermediate Representation (IR). The nGraph IR is a format
-which works with a framework such as TensorFlow\* or MXNet\* when there
-is a corresponding "Bridge" or import method, such as from NNVM or via
-[ONNX](http://onnx.ai). Once the nGraph IR can begin using nGraph's
-Core ops, components lower in the stack can begin parsing and
-pattern-matching subgraphs for device-specific optimizations; these
-are then encapsulated. This encapsulation is represented on the diagram
-as the colored background between the `ngraph` kernel(s) and the the
-stack above.
-
-Note that everything at or below the "Kernel APIs" and "Subgraph
-APIs" gets executed "automatically" during training runs. In other
-words, the accelerations are automatic: parts of the graph that
-are not encapsulated default to framework implementation when
-executed. For example, if nGraph optimizes ResNet50 for TensorFlow,
-the same optimization can be readily applied to the NNVM/MXNet
-implementation of ResNet50. This works efficiently because the
-nGraph (IR) Intermediate Representation, which keeps the input
-and output semantics of encapsulated subgraphs, rebuilds an
-encapsulated subgraph that can efficiently make use or re-use
-of operations. Such an approach significantly cuts down on the
-time needed to compile; when we're not relying upon the framework's
-ops alone, memory management and data layouts can be more efficiently
-applied to the hardware backends in use.
-
-The nGraph Core uses a strongly-typed and platform-neutral (IR)
-Intermediate Representation to construct a "stateless" graph.
-Each node, or `op`, in the graph corresponds to one step in
-a computation, where each step produces zero or more tensor
-outputs from zero or more tensor inputs.
-
-After construction, our Hybrid transformer takes the IR, further
-partitions it into subgraphs, and assigns them to the best-performing
-backend. There are two hardware backends shown in the stack diagram
-to demonstrate nGraph's graph partitioning. The Hybrid transformer
-assigns complex operations (subgraphs) to the Intel® Nervana™ Neural
-Network Processor (NNP), or to a different CPU backend to expedite
-the computation, and the remaining operations default to CPU. In the
-future, we will further expand the capabilities of Hybrid transformer
-by enabling more features, such as localized cost modeling and memory
-sharing, when the next generation of NNP (Neural Network Processor)
-is released. In the meantime, your deep learning software engineering
-or modeling can be confidently built upon this stable anchorage.
-The Intel® Architecture IA (Intel® Architecture) transformer provides
-two modes that reduce compilation time, and have already been shown
-as useful for training, deploying, and retraining a deep learning
-workload in production. For example, in our tests, DEX mode reduced
-ResNet50 compilation time by 30X.
-We are excited to continue our work in enabling distributed training,
-and we plan to expand the nodes to 256 in Q4 ‘18. Additionally, we
+#### Bridge
+
+Starting from the top of the stack, nGraph receives a computational graph
+from a deep learning framework such as TensorFlow* or MXNet*. The
+computational graph is converted to an nGraph internal representation
+by a bridge created for the corresponding framework.
+
+An nGraph bridge examines the whole graph to pattern match subgraphs
+which nGraph knows how to execute, and these subgraphs are encapsulated.
+Parts of the graph that are not encapsulated will default to framework
+implementation when executed.
+
+#### nGraph Core
+
+nGraph uses a strongly-typed and platform-neutral
+`Intermediate Representation (IR)` to construct a "stateless"
+computational graph. Each node, or op, in the graph corresponds to
+one `step` in a computation, where each step produces zero or
+more tensor outputs from zero or more tensor inputs.
+
+This allows nGraph to apply its state of the art optimizations instead
+of having to follow how a particular framework implements op execution,
+memory management, data layouts, etc.
+
+In addition, using nGraph IR allows faster optimization delivery
+for many of the supported frameworks. For example, if nGraph optimizes
+ResNet* for TensorFlow*, the same optimization can be readily applied
+to MXNet* or ONNX* implementations of ResNet*.
+
+#### Hybrid Transformer
+
+Hybrid transformer takes the nGraph IR, and partitions it into
+subgraphs, which can then be assigned to the best-performing backend.
+There are two hardware backends shown in the stack diagram to demonstrate
+this graph partitioning. The Hybrid transformer assigns complex operations
+(subgraphs) to Intel® Nervana™ Neural Network Processor (NNP) to expedite the
+computation, and the remaining operations default to CPU. In the future,
+we will further expand the capabilities of Hybrid transformer
+by enabling more features, such as localized cost modeling and memory
+sharing.
+
+Once the subgraphs are assigned, the corresponding backend will
+execute the IR.
+
+#### Backends
+
+Focusing our attention on the CPU backend, when the IR is passed to
+the Intel® Architecture (IA) transformer, it can be executed in two modes:
+Direct EXecution (DEX) and code generation (`codegen`).
+In `codegen` mode, nGraph generates and compiles code which can
+either call into highly optimized kernels like MKL-DNN or JITers like Halide.
+Although our team wrote kernels for nGraph for some operations,
+nGraph leverages existing kernel libraries such as MKL-DNN, Eigen, and MLSL.
+MLSL library is called when nGraph executes distributed training.
+At the time of the nGraph Beta release, nGraph achieved state of the art
+results for ResNet50 with 16 nodes and 32 nodes for TensorFlow* and MXNet*.
+We are excited to continue our work in enabling distributed training,
+and we plan to expand to 256 nodes in Q4 ‘18. Additionally, we
 are testing model parallelism in addition to data parallelism.

-In this Beta release, nGraph via Bridge code supports only Just In
-Time (JiT) compilation; the ONNX importer does not support anything
-that nGraph cannot support. While nGraph currently has very limited
-support for dynamic graphs, it is possible to get dynamic graphs
-working. Future releases will add better support and use case
-examples for such things as Ahead of Time compilation.
+The other mode of execution is Direct EXecution (DEX). In DEX mode,
+nGraph can execute the operations by directly calling associated kernels
+as it walks though the IR instead of compiling via `codegen`. This mode
+reduces the compilation time, and it will be useful for training,
+deploying, and retraining a deep learning workload in production.
+In our tests, DEX mode reduced ResNet50 compilation time by 30X.
+nGraph further tries to speed up the computation by leveraging
+multi-threading and graph scheduling libraries such as OpenMP and
+TBB Flow Graph.

 Features
 --------

-The nGraph (IR) Intermediate Representation contains a combination
-of device-specific and non-device-specific optimization :
+nGraph performs a combination of device-specific and
+non-device-specific optimizations:

 - **Fusion** -- Fuse multiple ops to to decrease memory usage.
 - **Data layout abstraction** -- Make abstraction easier and faster
 with nGraph translating element order to work best for a given or
 available device.
 - **Data reuse** -- Save results and reuse for subgraphs with the
 same input.
 - **Graph scheduling** -- Run similar subgraphs in parallel via
 multi-threading.
 - **Graph partitioning** -- Partition subgraphs to run on different
 devices to speed up computation; make better use of spare CPU cycles
 with nGraph.
 - **Memory management** -- Prevent peak memory usage by intercepting
 a graph with or by a "saved checkpoint," and to enable data auditing.
 - **Data layout abstraction** -- Make abstraction easier and faster
 with nGraph translating element order to work best for whatever given
 or available device.

+Beta Limitations
+----------------
+
+In this Beta release, nGraph only supports Just In Time compilation,
+but we plan to add support for Ahead of Time compilation in the official
+release of nGraph. nGraph currently has limited support for dynamic graphs.
+
 Current nGraph Compiler full stack
 ----------------------------------

@@ -105,9 +124,9 @@ Current nGraph Compiler full stack
 In addition to IA and NNP transformers, nGraph Compiler stack has transformers
 for multiple GPU types and an upcoming Intel deep learning accelerator. To
 support the growing number of transformers, we plan to expand the capabilities
 of the hybrid transformer with a cost model and memory sharing. With these new
 features, even if nGraph has multiple backends targeting the same hardware, it
 will partition the graph into multiple subgraphs and determine the best way to
 execute each subgraph.
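
The Hybrid Transformer text in this change describes partitioning the nGraph IR into subgraphs and assigning each one to a backend, with complex operations going to the NNP and the remaining operations defaulting to CPU. The Python sketch below illustrates that idea only. Everything in it is hypothetical: the `Op` class, the `COMPLEX_OPS` set, and the `pick_backend` and `partition` helpers are not part of nGraph's API, and treating "complex" as "convolution or dot product" is an assumption standing in for whatever selection logic the real Hybrid transformer applies.

```python
from dataclasses import dataclass
from itertools import groupby

# Assumption for this sketch: which op kinds count as "complex".
# This is NOT nGraph's actual selection rule or cost model.
COMPLEX_OPS = {"Convolution", "Dot"}


@dataclass(frozen=True)
class Op:
    """One step of a stateless graph: zero or more inputs, referenced by name."""
    name: str
    kind: str
    inputs: tuple = ()


def pick_backend(op: Op) -> str:
    """Send complex ops to the accelerator; everything else defaults to CPU."""
    return "NNP" if op.kind in COMPLEX_OPS else "CPU"


def partition(ops):
    """Group consecutive ops (assumed to be in topological order) by backend.

    Each run of ops that share a backend becomes one subgraph.
    """
    return [
        (backend, [op.name for op in run])
        for backend, run in groupby(ops, key=pick_backend)
    ]


# A tiny hypothetical graph, already in topological order.
graph = [
    Op("x", "Parameter"),
    Op("w", "Parameter"),
    Op("conv", "Convolution", ("x", "w")),
    Op("dot", "Dot", ("conv", "w")),
    Op("relu", "Relu", ("dot",)),
]

for backend, names in partition(graph):
    print(backend, names)
```

For this five-op graph the script prints three subgraphs: `CPU ['x', 'w']`, `NNP ['conv', 'dot']`, and `CPU ['relu']`. Grouping only consecutive ops keeps the sketch short; a real partitioner would also have to weigh the cost of moving tensors between backends, which is the kind of cost modeling the text lists as future work.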