About nGraph Compiler stack
===========================
nGraph Compiler stack architecture
----------------------------------

The diagram below represents our current Beta release stack. In the
diagram, nGraph components are colored in gray. Please note that the
stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently functional.
![](doc/sphinx/source/graphics/stackngrknl.png)
#### Bridge

Starting from the top of the stack, nGraph receives a computational graph
from a deep learning framework such as TensorFlow\* or MXNet\*. The
computational graph is converted to an nGraph internal representation
by a bridge created for the corresponding framework.

An nGraph bridge examines the whole graph to pattern-match subgraphs
which nGraph knows how to execute; these subgraphs are then encapsulated.
Parts of the graph that are not encapsulated default to the framework
implementation when executed.
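
To make the conversion concrete, here is a heavily simplified, hypothetical
sketch of the kind of table-driven op translation a bridge performs. The
`translators` table and its contents are illustrative only, though
`op::Add`, `op::Dot`, and `NodeVector` are real nGraph types; real bridges
are considerably more involved.

```cpp
#include <ngraph/ngraph.hpp>
#include <functional>
#include <map>
#include <memory>
#include <string>

using namespace ngraph;

// Hypothetical translation table: framework op type -> builder that
// produces the equivalent nGraph op from already-converted inputs.
static const std::map<std::string,
    std::function<std::shared_ptr<Node>(const NodeVector&)>>
    translators{
        {"Add", [](const NodeVector& in) -> std::shared_ptr<Node> {
             return std::make_shared<op::Add>(in[0], in[1]);
         }},
        {"MatMul", [](const NodeVector& in) -> std::shared_ptr<Node> {
             return std::make_shared<op::Dot>(in[0], in[1]);
         }},
    };

// A framework node whose type is not in the table is left out of the
// encapsulated subgraph and falls back to the framework's own kernel.
```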

#### nGraph Core

nGraph uses a strongly-typed and platform-neutral Intermediate
Representation (IR) to construct a "stateless" computational graph. Each
node, or `op`, in the graph corresponds to one step in a computation,
where each step produces zero or more tensor outputs from zero or more
tensor inputs.
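
As a minimal sketch of this model, the following builds a small stateless
graph with the nGraph C++ API as we understand it at the Beta release; op
and type names may differ in other versions.

```cpp
#include <ngraph/ngraph.hpp>

using namespace ngraph;

int main() {
    // Build a stateless graph computing (a + b) * c over 2x2 float tensors.
    Shape shape{2, 2};
    auto a = std::make_shared<op::Parameter>(element::f32, shape);
    auto b = std::make_shared<op::Parameter>(element::f32, shape);
    auto c = std::make_shared<op::Parameter>(element::f32, shape);

    auto sum = std::make_shared<op::Add>(a, b);            // one step: two inputs, one output
    auto product = std::make_shared<op::Multiply>(sum, c); // consumes the previous step's output

    // A Function bundles the result op with the graph's parameters.
    auto f = std::make_shared<Function>(product, ParameterVector{a, b, c});
    return 0;
}
```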

This allows nGraph to apply its state-of-the-art optimizations instead
of having to follow how a particular framework implements op execution,
memory management, data layouts, and so on.

In addition, using the nGraph IR allows faster optimization delivery
for many of the supported frameworks. For example, if nGraph optimizes
ResNet\* for TensorFlow\*, the same optimization can be readily applied
to the MXNet\* or ONNX\* implementations of ResNet\*.

#### Hybrid Transformer

The Hybrid transformer takes the nGraph IR and partitions it into
subgraphs, which can then be assigned to the best-performing backend.
There are two hardware backends shown in the stack diagram to demonstrate
this graph partitioning. The Hybrid transformer assigns complex operations
(subgraphs) to the Intel® Nervana™ Neural Network Processor (NNP) to
expedite the computation, and the remaining operations default to CPU. In
the future, we will further expand the capabilities of the Hybrid
transformer by enabling more features, such as localized cost modeling
and memory sharing.

Once the subgraphs are assigned, the corresponding backend will
execute the IR.
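
The placement policy described above can be illustrated with a short,
purely hypothetical sketch; the `Subgraph` struct and `assign_backends`
helper are illustrative stand-ins, not the actual Hybrid transformer API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a partitioned subgraph of the nGraph IR.
struct Subgraph {
    bool nnp_supported;   // assumed to be filled in by capability queries
    std::string backend;  // device chosen for this subgraph
};

// Assign each subgraph to NNP when it can run there; otherwise fall
// back to the CPU backend, mirroring the policy described above.
void assign_backends(std::vector<Subgraph>& subgraphs) {
    for (auto& sg : subgraphs) {
        sg.backend = sg.nnp_supported ? "NNP" : "CPU";
    }
}
```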

#### Backends

Focusing our attention on the CPU backend, when the IR is passed to
the Intel® Architecture (IA) transformer, it can be executed in two modes:
Direct EXecution (DEX) and code generation (`codegen`).
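
For illustration, the sketch below runs the `Function` `f` (and reuses
`shape`) from the earlier nGraph Core example on the CPU backend, using
the runtime API roughly as it stood at the Beta release; names and
signatures may differ in other versions.

```cpp
// Continues the earlier example: execute `f` on the CPU backend.
auto backend = runtime::Backend::create("CPU");

// Allocate device tensors for the three inputs and the result.
auto ta = backend->create_tensor(element::f32, shape);
auto tb = backend->create_tensor(element::f32, shape);
auto tc = backend->create_tensor(element::f32, shape);
auto tr = backend->create_tensor(element::f32, shape);

// Validate and run the graph; whether DEX or codegen is used is
// decided by the backend's configuration, not by this call.
backend->call_with_validate(f, {tr}, {ta, tb, tc});
```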

In `codegen` mode, nGraph generates and compiles code which can
either call into highly optimized kernels like MKL-DNN or JITers like
Halide. Although our team wrote kernels for nGraph for some operations,
nGraph leverages existing kernel libraries such as MKL-DNN, Eigen, and
MLSL.

The MLSL library is called when nGraph executes distributed training.
At the time of the nGraph Beta release, nGraph achieved state-of-the-art
results for ResNet50 with 16 nodes and 32 nodes for TensorFlow\* and
MXNet\*. We are excited to continue our work in enabling distributed
training, and we plan to expand to 256 nodes in Q4 ‘18. Additionally, we
are testing model parallelism in addition to data parallelism.

The other mode of execution is Direct EXecution (DEX). In DEX mode,
nGraph can execute the operations by directly calling associated kernels
as it walks through the IR instead of compiling via `codegen`. This mode
reduces the compilation time, and it will be useful for training,
deploying, and retraining a deep learning workload in production.
In our tests, DEX mode reduced ResNet50 compilation time by 30X.
nGraph further tries to speed up the computation by leveraging
multi-threading and graph scheduling libraries such as OpenMP and
TBB Flow Graph.
Features
--------

nGraph performs a combination of device-specific and
non-device-specific optimizations:

- **Fusion** -- Fuse multiple ops to decrease memory usage (see the
  sketch after this list).
- **Data layout abstraction** -- Make abstraction easier and faster - **Data layout abstraction** -- Make abstraction easier and faster
  with nGraph translating element order to work best for whatever given
  or available device.
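
As one concrete example of how a fusion optimization is applied, the
sketch below runs nGraph's core fusion pass over a function through the
pass manager; pass and header names are as of the Beta release and should
be treated as illustrative.

```cpp
#include <ngraph/pass/core_fusion.hpp>
#include <ngraph/pass/manager.hpp>

using namespace ngraph;

// Apply graph-level fusion to an existing Function `f`: the pass
// manager rewrites matched op patterns into fused ops in place.
void optimize(std::shared_ptr<Function> f) {
    pass::Manager pass_manager;
    pass_manager.register_pass<pass::CoreFusion>();
    pass_manager.run_passes(f);
}
```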
Beta Limitations
----------------
In this Beta release, nGraph only supports Just In Time compilation,
but we plan to add support for Ahead of Time compilation in the official
release of nGraph. nGraph currently has limited support for dynamic graphs.
Current nGraph Compiler full stack
----------------------------------