About nGraph Compiler stack
===========================
nGraph Compiler stack architecture
----------------------------------

The diagram below represents our current Beta release stack. In the
diagram, nGraph components are colored in gray. Please note that the
stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently functional.
![](doc/sphinx/source/graphics/stackngrknl.png)
#### Bridge

Starting from the top of the stack, nGraph receives a computational graph
from a deep learning framework such as TensorFlow\* or MXNet\*. The
computational graph is converted to an nGraph internal representation
by a bridge created for the corresponding framework.

An nGraph bridge examines the whole graph to pattern-match subgraphs
which nGraph knows how to execute; these subgraphs are then encapsulated.
Parts of the graph that are not encapsulated default to the framework
implementation when executed.
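
To make the conversion concrete, here is a heavily simplified, hypothetical
sketch of the kind of table-driven op translation a bridge performs. The
`translators` table and its contents are illustrative only, though
`op::Add`, `op::Dot`, and `NodeVector` are real nGraph types; real bridges
are considerably more involved.

```cpp
#include <ngraph/ngraph.hpp>
#include <functional>
#include <map>
#include <memory>
#include <string>

using namespace ngraph;

// Hypothetical translation table: framework op type -> builder that
// produces the equivalent nGraph op from already-converted inputs.
static const std::map<std::string,
    std::function<std::shared_ptr<Node>(const NodeVector&)>>
    translators{
        {"Add", [](const NodeVector& in) -> std::shared_ptr<Node> {
             return std::make_shared<op::Add>(in[0], in[1]);
         }},
        {"MatMul", [](const NodeVector& in) -> std::shared_ptr<Node> {
             return std::make_shared<op::Dot>(in[0], in[1]);
         }},
    };

// A framework node whose type is not in the table is left out of the
// encapsulated subgraph and falls back to the framework's own kernel.
```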

#### nGraph Core

nGraph uses a strongly-typed and platform-neutral Intermediate
Representation (IR) to construct a "stateless" computational graph. Each
node, or `op`, in the graph corresponds to one step in a computation,
where each step produces zero or more tensor outputs from zero or more
tensor inputs.
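
As a minimal sketch of this model, the following builds a small stateless
graph with the nGraph C++ API as we understand it at the Beta release; op
and type names may differ in other versions.

```cpp
#include <ngraph/ngraph.hpp>

using namespace ngraph;

int main() {
    // Build a stateless graph computing (a + b) * c over 2x2 float tensors.
    Shape shape{2, 2};
    auto a = std::make_shared<op::Parameter>(element::f32, shape);
    auto b = std::make_shared<op::Parameter>(element::f32, shape);
    auto c = std::make_shared<op::Parameter>(element::f32, shape);

    auto sum = std::make_shared<op::Add>(a, b);            // one step: two inputs, one output
    auto product = std::make_shared<op::Multiply>(sum, c); // consumes the previous step's output

    // A Function bundles the result op with the graph's parameters.
    auto f = std::make_shared<Function>(product, ParameterVector{a, b, c});
    return 0;
}
```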

This allows nGraph to apply its state-of-the-art optimizations instead
of having to follow how a particular framework implements op execution,
memory management, data layouts, and so on.

In addition, using the nGraph IR allows faster optimization delivery
for many of the supported frameworks. For example, if nGraph optimizes
ResNet\* for TensorFlow\*, the same optimization can be readily applied
to the MXNet\* or ONNX\* implementations of ResNet\*.

#### Hybrid Transformer

The Hybrid transformer takes the nGraph IR and partitions it into
subgraphs, which can then be assigned to the best-performing backend.
There are two hardware backends shown in the stack diagram to demonstrate
this graph partitioning. The Hybrid transformer assigns complex operations
(subgraphs) to the Intel® Nervana™ Neural Network Processor (NNP) to
expedite the computation, and the remaining operations default to CPU. In
the future, we will further expand the capabilities of the Hybrid
transformer by enabling more features, such as localized cost modeling
and memory sharing.

Once the subgraphs are assigned, the corresponding backend will
execute the IR.
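
The placement policy described above can be illustrated with a short,
purely hypothetical sketch; the `Subgraph` struct and `assign_backends`
helper are illustrative stand-ins, not the actual Hybrid transformer API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a partitioned subgraph of the nGraph IR.
struct Subgraph {
    bool nnp_supported;   // assumed to be filled in by capability queries
    std::string backend;  // device chosen for this subgraph
};

// Assign each subgraph to NNP when it can run there; otherwise fall
// back to the CPU backend, mirroring the policy described above.
void assign_backends(std::vector<Subgraph>& subgraphs) {
    for (auto& sg : subgraphs) {
        sg.backend = sg.nnp_supported ? "NNP" : "CPU";
    }
}
```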

#### Backends

Focusing our attention on the CPU backend, when the IR is passed to
the Intel® Architecture (IA) transformer, it can be executed in two modes:
Direct EXecution (DEX) and code generation (`codegen`).
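
For illustration, the sketch below runs the `Function` `f` (and reuses
`shape`) from the earlier nGraph Core example on the CPU backend, using
the runtime API roughly as it stood at the Beta release; names and
signatures may differ in other versions.

```cpp
// Continues the earlier example: execute `f` on the CPU backend.
auto backend = runtime::Backend::create("CPU");

// Allocate device tensors for the three inputs and the result.
auto ta = backend->create_tensor(element::f32, shape);
auto tb = backend->create_tensor(element::f32, shape);
auto tc = backend->create_tensor(element::f32, shape);
auto tr = backend->create_tensor(element::f32, shape);

// Validate and run the graph; whether DEX or codegen is used is
// decided by the backend's configuration, not by this call.
backend->call_with_validate(f, {tr}, {ta, tb, tc});
```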

In `codegen` mode, nGraph generates and compiles code which can
either call into highly optimized kernels like MKL-DNN or JITers like
Halide. Although our team wrote kernels for nGraph for some operations,
nGraph leverages existing kernel libraries such as MKL-DNN, Eigen, and
MLSL.

The MLSL library is called when nGraph executes distributed training.
At the time of the nGraph Beta release, nGraph achieved state-of-the-art
results for ResNet50 with 16 nodes and 32 nodes for TensorFlow\* and
MXNet\*. We are excited to continue our work in enabling distributed
training, and we plan to expand to 256 nodes in Q4 ‘18. Additionally, we
are testing model parallelism in addition to data parallelism.

The other mode of execution is Direct EXecution (DEX). In DEX mode,
nGraph can execute the operations by directly calling associated kernels
as it walks through the IR instead of compiling via `codegen`. This mode
reduces the compilation time, and it will be useful for training,
deploying, and retraining a deep learning workload in production.
In our tests, DEX mode reduced ResNet50 compilation time by 30X.
nGraph further tries to speed up the computation by leveraging
multi-threading and graph scheduling libraries such as OpenMP and
TBB Flow Graph.
Features
--------

nGraph performs a combination of device-specific and
non-device-specific optimizations:

- **Fusion** -- Fuse multiple ops to decrease memory usage (see the
  sketch after this list).
- **Data layout abstraction** -- Make abstraction easier and faster - **Data layout abstraction** -- Make abstraction easier and faster
  with nGraph translating element order to work best for whatever given
  or available device.
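
As one concrete example of how a fusion optimization is applied, the
sketch below runs nGraph's core fusion pass over a function through the
pass manager; pass and header names are as of the Beta release and should
be treated as illustrative.

```cpp
#include <ngraph/pass/core_fusion.hpp>
#include <ngraph/pass/manager.hpp>

using namespace ngraph;

// Apply graph-level fusion to an existing Function `f`: the pass
// manager rewrites matched op patterns into fused ops in place.
void optimize(std::shared_ptr<Function> f) {
    pass::Manager pass_manager;
    pass_manager.register_pass<pass::CoreFusion>();
    pass_manager.run_passes(f);
}
```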
Beta Limitations
----------------
In this Beta release, nGraph only supports Just In Time compilation,
but we plan to add support for Ahead of Time compilation in the official
release of nGraph. nGraph currently has limited support for dynamic graphs.
Current nGraph Compiler full stack
----------------------------------