Commit 79878d71 authored by L.S. Cook, committed by Michał Karzyński

Architecture and feature docs (#2092)

parent 7b665771
# nGraph Compiler Stack

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/NervanaSystems/ngraph/blob/master/LICENSE) [![Build Status][build-status-badge]][build-status]

<div align="left">
  <h4>
    <a href="./ABOUT.md">Architecture & features</a> |
    <a href="./ecosystem-overview.md">Ecosystem</a> |
    <a href="https://ngraph.nervanasys.com/docs/latest/project/release-notes.html">Release notes</a> |
    <a href="https://ngraph.nervanasys.com/docs/latest">Documentation</a> |
    <a href="#How-to-contribute">Contribution guide</a>
  </h4>
</div>

## Quick start
To begin using nGraph with popular frameworks to accelerate deep learning
workloads on CPU for inference, please refer to the links below.
| Framework (Version) | Installation guide | Notes
|----------------------------|------------------------------------------|-----------------------------------
| TensorFlow* 1.12 | [Pip package] or [Build from source] | 17 [Validated workloads]
| MXNet* 1.4 | [Enable the module] or [Source compile] | 17 [Validated workloads]
| ONNX 1.3 | [Pip package] | 14 [Validated workloads]
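
For TensorFlow, enabling nGraph after installing the [Pip package] is a
one-line import. Below is a minimal sketch; the `ngraph_bridge` module name
follows the ngraph-tf project and should be treated as an assumption for your
installed version:

```python
# Sketch: route a TF 1.x computation through the nGraph bridge.
# Importing ngraph_bridge registers nGraph with TensorFlow, so the
# graph below is rewritten to execute on the nGraph CPU backend.
import tensorflow as tf
import ngraph_bridge  # assumption: module name from the ngraph-tf project

a = tf.constant([1.0, 2.0], name="a")
b = tf.constant([3.0, 4.0], name="b")
c = a + b

with tf.Session() as sess:  # TF 1.12 session-style API
    print(sess.run(c))      # -> [4. 6.]
```
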
Frameworks using the nGraph Compiler stack to execute workloads have shown
**up to 45X** performance boost when compared to native framework
implementations. We've also seen performance boosts running workloads that
are not included on the list of [Validated workloads], thanks to our
powerful subgraph pattern matching and the collaborative efforts

We strongly believe in providing freedom, performance, and ease-of-use to AI
developers.
The diagram below shows what deep learning frameworks and hardware targets
we support. More details on these current and future plans are in the [ecosystem]
section.
![nGraph wireframe][ngraph_wireframes_with_notice]
While the ecosystem shown above is all functioning, we have validated
performance for deep learning inference on CPU processors such as Intel® Xeon®.
Please refer to the [Release notes] to learn more. The Gold release
is targeted for April 2019; it will feature broader workload coverage,
including quantized graphs, and more detail on our advanced support for
``int8``.

Our documentation has extensive information about how to use the nGraph
Compiler stack to create an nGraph computational graph, integrate custom
frameworks, and interact with supported backends. If you wish to contribute to the
project, please don't hesitate to ask questions in [GitHub issues] after
reviewing our contribution guide below.

modifications are necessary, may provide feedback to guide you. When
accepted, your pull request will be merged into the repository.

![nGraph Compiler Stack][ngraph-compiler-stack-readme]
[Ecosystem]: ecosystem-overview
[Architecture and features]: https://ngraph.nervanasys.com/docs/latest/project/about.html
[Documentation]: https://ngraph.nervanasys.com/docs/latest
[build the Library]: https://ngraph.nervanasys.com/docs/latest/buildlb.html
[contrib guide]: https://ngraph.nervanasys.com/docs/latest/project/code-contributor-README.html
[pull request]: https://github.com/NervanaSystems/ngraph/pulls
[how to import]: https://ngraph.nervanasys.com/docs/latest/howto/import.html
[ngraph_wireframes_with_notice]: doc/sphinx/source/graphics/ngraph_wireframes_with_notice.png "nGraph wireframe"
[ngraph-compiler-stack-readme]: doc/sphinx/source/graphics/ngraph-compiler-stack-readme.png "nGraph Compiler Stack"
[build-status]: https://travis-ci.org/NervanaSystems/ngraph/branches
[build-status-badge]: https://travis-ci.org/NervanaSystems/ngraph.svg?branch=master
[PlaidML]: https://github.com/plaidml/plaidml
[Pip package]: https://github.com/NervanaSystems/ngraph-onnx#installing-ngraph-onnx
[Build from source]: https://github.com/NervanaSystems/ngraph-tf
[Enable the module]: https://github.com/NervanaSystems/ngraph/blob/mbrookhart/mxnet_tutorial/doc/sphinx/source/shared/mxnet_tutorial.rst
[Source compile]: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/NGRAPH_README.md
[nGraph-ONNX]: https://github.com/NervanaSystems/ngraph-onnx/blob/master/README.md
[nGraph-ONNX adaptable]: https://ai.intel.com/adaptable-deep-learning-solutions-with-ngraph-compiler-and-onnx/

# Install script for directory: /opt/libraries/ngraph/doc/sphinx

# Set the install prefix
if(NOT DEFINED CMAKE_INSTALL_PREFIX)
  set(CMAKE_INSTALL_PREFIX "/usr/local")
endif()
string(REGEX REPLACE "/$" "" CMAKE_INSTALL_PREFIX "${CMAKE_INSTALL_PREFIX}")

# Set the install configuration name.
if(NOT DEFINED CMAKE_INSTALL_CONFIG_NAME)
  if(BUILD_TYPE)
    string(REGEX REPLACE "^[^A-Za-z0-9_]+" ""
           CMAKE_INSTALL_CONFIG_NAME "${BUILD_TYPE}")
  else()
    set(CMAKE_INSTALL_CONFIG_NAME "")
  endif()
  message(STATUS "Install configuration: \"${CMAKE_INSTALL_CONFIG_NAME}\"")
endif()

# Set the component getting installed.
if(NOT CMAKE_INSTALL_COMPONENT)
  if(COMPONENT)
    message(STATUS "Install component: \"${COMPONENT}\"")
    set(CMAKE_INSTALL_COMPONENT "${COMPONENT}")
  else()
    set(CMAKE_INSTALL_COMPONENT)
  endif()
endif()

# Install shared libraries without execute permission?
if(NOT DEFINED CMAKE_INSTALL_SO_NO_EXE)
  set(CMAKE_INSTALL_SO_NO_EXE "1")
endif()

# Is this installation the result of a crosscompile?
if(NOT DEFINED CMAKE_CROSSCOMPILING)
  set(CMAKE_CROSSCOMPILING "FALSE")
endif()

if(CMAKE_INSTALL_COMPONENT)
  set(CMAKE_INSTALL_MANIFEST "install_manifest_${CMAKE_INSTALL_COMPONENT}.txt")
else()
  set(CMAKE_INSTALL_MANIFEST "install_manifest.txt")
endif()

string(REPLACE ";" "\n" CMAKE_INSTALL_MANIFEST_CONTENT
       "${CMAKE_INSTALL_MANIFEST_FILES}")
file(WRITE "/opt/libraries/ngraph/doc/sphinx/${CMAKE_INSTALL_MANIFEST}"
     "${CMAKE_INSTALL_MANIFEST_CONTENT}")

across all workers with an op that performs "allreduce", and applied to update
the weights.

Using multiple machines helps to scale and speed up deep learning. With large
mini-batch training, one could train ResNet-50 with Imagenet-1k data to the
*Top 5* classifier in minutes using thousands of CPU nodes. See
`arxiv.org/abs/1709.05011`_.
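
As a rough illustration of this data-parallel pattern, the sketch below uses
plain ``mpi4py`` and NumPy rather than the nGraph/MLSL machinery; the helper
name is hypothetical:

.. code-block:: python

   import numpy as np
   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   def sgd_step_with_allreduce(weights, local_grad, lr=0.01):
       """Average gradients across workers, then take one SGD step."""
       global_grad = np.empty_like(local_grad)
       comm.Allreduce(local_grad, global_grad, op=MPI.SUM)  # sum over workers
       global_grad /= comm.Get_size()                       # mean gradient
       # Every worker applies the same averaged update, keeping the
       # replicated weights in sync.
       return weights - lr * global_grad
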

communication collective ops such as allgather, scatter, gather, etc. in
the future.

.. _based on the synchronous: https://arxiv.org/abs/1602.06709
.. _arxiv.org/abs/1709.05011: https://arxiv.org/abs/1709.05011
.. _Intel MLSL: https://github.com/intel/MLSL/releases

Validation and testing
######################

We validated performance for the following TensorFlow* and MXNet* workloads:

TensorFlow
==========

.. csv-table::
   :header: "TensorFlow Workloads", "Type"
   :widths: 27, 53
   :escape: ~

   Resnet50 v1 and v2, Image recognition
   Inception V3 and V4, Image recognition
   Inception-ResNetv2, Image recognition
   MobileNet v1, Image recognition
   SqueezeNet v1.1, Image recognition
   DenseNet-121, Image recognition
   SSD-VGG16, Object detection
   SSD-MobileNetv1, Object detection
   Faster RCNN, Object detection
   Yolo v2, Object detection
   Wide & Deep, Recommender system
   NCF, Recommender system
   WaveNet, Speech generation
   U-Net, Image segmentation
   DCGAN, Generative adversarial network
   DRAW, Image generation
   A3C, Reinforcement learning

MXNet
=====

.. csv-table::
   :header: "MXNet Workloads", "Type"
   :widths: 27, 53
   :escape: ~

   Resnet50 v1 and v2, Image recognition
   DenseNet (121, 161, 169, 201), Image recognition
   InceptionV3, Image recognition
   InceptionV4, Image recognition
   Inception-ResNetv2, Image recognition
   MobileNet v1, Image recognition
   SqueezeNet v1 and v1.1, Image recognition
   VGG16, Image recognition
   Faster RCNN, Object detection
   SSD-VGG16, Object detection
   GNMT, Language translation
   Transformer-LT, Language translation
   Wide & Deep, Recommender system
   WaveNet, Speech generation
   DeepSpeech2, Speech recognition
   DCGAN, Generative adversarial network
   A3C, Reinforcement learning

ONNX
====

Additionally, we validated that the following workloads are functional through
the nGraph ONNX importer:

.. csv-table::
   :header: "Workload", "Type"
   :widths: 27, 53
   :escape: ~

   DenseNet-121, Image recognition
   Inception-v1, Image recognition
   Inception-v2, Image recognition
   ResNet-50, Image recognition
   Shufflenet, Image recognition
   SqueezeNet, Image recognition
   VGG-19, Image recognition
   ZFNet-512, Image recognition
   MNIST, Image recognition
   Emotion-FERPlus, Image recognition
   BVLC AlexNet, Image recognition
   BVLC GoogleNet, Image recognition
   BVLC CaffeNet, Image recognition
   BVLC R-CNN ILSVRC13, Object detection

.. important:: Please see Intel's `Optimization Notice`_ for details on disclaimers.
.. _Optimization Notice: https://software.intel.com/en-us/articles/optimization-notice
.. Notice revision #20110804: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Glossary
========

SGD
   :abbr:`Stochastic Gradient Descent (SGD)`, also known as incremental
   gradient descent, is an iterative method for optimizing a
   differentiable objective function.
validated
   To provide optimizations with nGraph, we first confirm that a given
   workload is "validated" as being functional; that is, we can
   successfully load its serialized graph as an nGraph :term:`function
   graph`.

nGraph Compiler stack architecture
==================================

The diagram below represents our current |release| release stack. Please
note that the stack diagram is simplified to show how nGraph executes deep
learning workloads with two hardware backends; however, many other deep
learning frameworks and backends are currently functional.

.. figure:: ../graphics/stackngrknl.png
   :width: 455px
   :alt: Current Beta release stack

   Simplified stack diagram for nGraph Compiler and components Beta

Starting from the top of the diagram, we present a simplified view of the nGraph
Intermediate Representation (IR). The nGraph IR is a format which works with a
framework such as TensorFlow* or MXNet* when there is a corresponding "Bridge"
or import method, such as from NNVM or via `ONNX`_. Once the nGraph IR can begin
using nGraph's Core ops, components lower in the stack can begin parsing and
pattern-matching subgraphs for device-specific optimizations; these are then
encapsulated. This encapsulation is represented on the diagram as the colored
background between the ``ngraph`` kernel(s) and the stack above.
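
For a concrete sense of how a computation graph is constructed from user code,
here is a small sketch with the nGraph Python API (following the pattern in
the project README; exact helper names may differ by version):

.. code-block:: python

   import numpy as np
   import ngraph as ng

   # Build a tiny nGraph function: model = (A + B) * C
   shape = [2, 2]
   A = ng.parameter(shape=shape, name='A', dtype=np.float32)
   B = ng.parameter(shape=shape, name='B', dtype=np.float32)
   C = ng.parameter(shape=shape, name='C', dtype=np.float32)
   model = (A + B) * C

   # Compile and execute on the CPU backend.
   runtime = ng.runtime(backend_name='CPU')
   computation = runtime.computation(model, A, B, C)
   result = computation(np.ones(shape, np.float32),
                        np.ones(shape, np.float32),
                        np.full(shape, 2.0, np.float32))
   print(result)  # [[4. 4.], [4. 4.]]
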
Note that everything at or below the "Kernel APIs" and "Subgraph APIs" gets
Note that everything at or below the **Kernel APIs** and **Subgraph APIs** gets
executed "automatically" during training runs. In other words, the accelerations
are automatic: parts of the graph that are not encapsulated default to framework
implementation when executed. For example, if nGraph optimizes ResNet50 for

Features
========

The nGraph :abbr:`Intermediate Representation (IR)` contains a combination of
device-specific and non-device-specific optimizations:

* **Fusion** -- Fuse multiple ops to to decrease memory usage "localities".
* **Fusion** -- Fuse multiple ops to to decrease memory usage.
* **Data layout abstraction** -- Make abstraction easier and faster with nGraph
translating element order to work best for a given or available device.
* **Data reuse** -- Save results and reuse for subgraphs with the same input.
added with new functions that build sub-graphs from existing core ops.
.. _portable:

Portable
========

# Framework & runtime support
One of nGraph’s key features is framework neutrality. We currently support
popular deep learning frameworks such as TensorFlow and MXNet with stable
bridges to pass computational graphs to nGraph. Additionally, the nGraph
Compiler has functional bridges to PaddlePaddle and PyTorch (via [ONNXIFI]).
For these frameworks, we have successfully tested functionality with a few
deep learning workloads, and we plan to bring stable support for them in
upcoming releases.
To further promote framework neutrality, the nGraph team has been actively
contributing to the ONNX project. Developers who already have a "trained"
DNN (Deep Neural Network) model can use nGraph to bypass significant
framework-based complexity and [import it] to test or run on targeted and
efficient backends with our user-friendly Python-based API.
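
As an illustration, importing a trained ONNX model and running it on the CPU
backend takes only a few lines with the Python API. This is a minimal sketch
following the ngraph-onnx README pattern; the importer module path and the
model filename are assumptions that may differ between releases:

```python
# Sketch: run a serialized ONNX model through nGraph on CPU.
# Assumes the nGraph and ngraph-onnx Python packages are installed;
# 'model.onnx' is a placeholder path.
import onnx
import ngraph as ng
from ngraph_onnx.onnx_importer.importer import import_onnx_model

model_proto = onnx.load('model.onnx')         # e.g. an exported ResNet-50
ng_function = import_onnx_model(model_proto)  # ONNX graph -> nGraph function

runtime = ng.runtime(backend_name='CPU')
infer = runtime.computation(ng_function)
# outputs = infer(input_tensor)               # run inference on NumPy inputs
```
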
nGraph is also integrated as a computation provider for [ONNX Runtime],
which is a runtime for [WinML] on Windows OS and Azure to accelerate DL
workloads.

The table below summarizes our current progress on supported frameworks.
If you are an architect of a framework wishing to take advantage of the speed
and multi-device support of the nGraph Compiler, please refer to the
[Framework integration guide] section.
| Framework & Runtime | Supported | Validated
|----------------------------|--------------------|-------------
| TensorFlow* 1.12 | :heavy_check_mark: | :heavy_check_mark:
| MXNet* 1.4 | :heavy_check_mark: | :heavy_check_mark:
| ONNX 1.3 | :heavy_check_mark: | :heavy_check_mark:
| ONNX Runtime | Functional | No
| PyTorch (via ONNXIFI) | Functional | No
| PaddlePaddle | Functional | No
## Hardware & backend support
The current release of nGraph primarily focuses on accelerating inference
performance on CPU. However, we are also working on adding support for more
hardware and backends. As with the frameworks, we believe in providing
freedom to AI developers to deploy their deep learning workloads to the
desired hardware without lock-in. We currently have functioning backends
for Intel, Nvidia*, and AMD* GPUs, either leveraging kernel libraries
such as clDNN and cuDNN directly or utilizing PlaidML to compile for codegen
and emit OpenCL, OpenGL, LLVM, CUDA, and Metal. Please refer to the
[Architecture and features] section to learn more about how we plan to take
advantage of both solutions using a hybrid transformer. We expect to have
stable support for the aforementioned GPUs in the early second half of 2019.
In a similar time frame, we plan to release multinode support.
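
To give a flavor of this portability, switching hardware targets is a
one-argument change in the Python API. A minimal sketch (backend names other
than `CPU` are illustrative and depend on how your nGraph build was
configured):

```python
import ngraph as ng

# The same computation can be compiled for different targets simply by
# naming a backend: 'CPU' is the stable choice in the current release,
# while others (e.g. 'GPU', 'PlaidML') are functional per the table below.
runtime = ng.runtime(backend_name='CPU')
```
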
Additionally, we are excited to provide support for our upcoming deep
learning accelerators, such as the NNP (Neural Network Processor), via the
nGraph Compiler stack; early adopters will be able to test them in 2019.

| Backend | Supported
|-----------------------------------------------|-------------------
| Intel® Architecture CPU | :heavy_check_mark:
| Intel® Architecture GPUs | Functional via clDNN and PlaidML
| AMD* GPUs | Functional via PlaidML
| Nvidia* GPUs | Functional via cuDNN and PlaidML
| Intel® Nervana™ Neural Network Processor (NNP)| Functional
| Upcoming DL accelerators | Functional and will be announced in the near future
[Architecture and features]: ./ABOUT.md
[Upcoming DL accelerators]: https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/vision-accelerator-design-product-brief.pdf
[import it]: http://ngraph.nervanasys.com/docs/latest/howto/import.html
[ONNXIFI]: https://github.com/onnx/onnx/blob/master/docs/ONNXIFI.md
[ONNX Runtime]:https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-build-deploy-onnx
[WinML]: http://docs.microsoft.com/en-us/windows/ai
[How to]: https://ngraph.nervanasys.com/docs/latest/howto/index.html
[Framework integration guide]: https://ngraph.nervanasys.com/docs/latest/frameworks/index.html