Commit 90ca4d87 authored by Leona C, committed by Scott Cyphers

Update README to fix broken TF link (#3192)

* Update ngtf bridge doc for generic versioning

* Many improvements, including previous group-based edits

- New section for docs on extra-experimental project-based features
- Team-based edits from PR 2994 added
- Reorganize GSG sections to be cleaner for getting started
- Fix link style warning generated by broadcast.hpp

* Update README with correct ngraph-bridge link on tf repo

* Fix comment so warning not generated on broadcasting op

* Add latest doc changes to ReleaseNotes

* Fix wording suggestion on PR review

* Update HE Transformer experimental backend
parent fde81260
@@ -14,8 +14,7 @@ workloads on CPU for inference, please refer to the links below.
| Framework (Version) | Installation guide | Notes
|----------------------------|----------------------------------------|-----------------------------------
| TensorFlow* 1.13.1 | [Pip install](https://github.com/NervanaSystems/ngraph-tf#option-1-use-a-pre-built-ngraph-tensorflow-bridge) or [Build from source](https://github.com/NervanaSystems/ngraph-tf#option-2-build-ngraph-bridge-from-source) | 20 [Validated workloads]
| MXNet* 1.3 | [Pip install](https://github.com/NervanaSystems/ngraph-mxnet#Installation) or [Build from source](https://github.com/NervanaSystems/ngraph-mxnet#building-with-ngraph-support)| 18 [Validated workloads]
| TensorFlow* | [Pip install](https://github.com/tensorflow/ngraph-bridge#use-pre-built-packages) or [Build from source](https://github.com/tensorflow/ngraph-bridge#build-ngraph-from-source) | 20 [Validated workloads]
| ONNX 1.4 | [Pip install](https://github.com/NervanaSystems/ngraph-onnx#installation) | 17 [Validated workloads]
@@ -73,7 +73,6 @@ See :ref:`ngraph_plaidml_backend` section on how to build the
nGraph-PlaidML.
Other integration paths
=======================
@@ -17,17 +17,6 @@ as a backend to ONNX with the add-on package `nGraph ONNX`_.
support.
Installation
------------
To prepare your environment to use nGraph and ONNX, install the Python packages
for nGraph, ONNX, and NumPy:
::
$ pip install ngraph-core onnx numpy
Importing an ONNX model
-----------------------
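Once installed, importing and running an ONNX model takes only a few lines. The
following is a rough sketch (assuming the ``ngraph-onnx`` package's
``import_onnx_model`` helper and a serialized ``model.onnx`` file; adjust names
to your setup):

.. code-block:: python

   import onnx
   import ngraph as ng
   from ngraph_onnx.onnx_importer.importer import import_onnx_model

   # Deserialize the ONNX protobuf and convert it into an nGraph function
   onnx_protobuf = onnx.load('model.onnx')
   ng_function = import_onnx_model(onnx_protobuf)

   # Compile the function for the CPU backend and run inference
   runtime = ng.runtime(backend_name='CPU')
   model = runtime.computation(ng_function)
   # model(input_tensor)  # call with an appropriately shaped NumPy array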
@@ -10,8 +10,7 @@ workloads:
* :ref:`tensorflow_valid`
* :ref:`mxnet_valid`
* :ref:`onnx_valid`
* :doc:`../../project/extras/testing_latency.rst`
.. _tensorflow_valid:
@@ -35,11 +35,11 @@ nGraph Compiler stack
For information about the releases, see the :doc:`../project/release-notes`.
The nGraph Library and Compiler stack are provided under the `Apache 2.0 license`_
(found in the LICENSE file in the project's `GitHub repo`_). It may also import
or reference packages, scripts, and other files that use licensing.
(found in the LICENSE file in the project's `repo`_). It may also import or reference
packages, scripts, and other files that use licensing.
.. _Apache 2.0 license: https://github.com/NervanaSystems/ngraph/blob/master/LICENSE
.. _GitHub repo: https://github.com/NervanaSystems/ngraph
.. _repo: https://github.com/NervanaSystems/ngraph
.. toctree::
@@ -94,7 +94,7 @@ or reference packages, scripts, and other files that use licensing.
project/contribution-guide.rst
project/doc-contributor-README.rst
project/index.rst
project/extras.rst
project/extras/index.rst
glossary.rst
.. only:: html
.. project/extras.rst
#######
Extras
#######
* :ref:`homomorphic_encryption`
* :ref:`distributed_training`
This section contains extra tools and tips for working with up-and-coming
features of the nGraph Compiler stack.
.. _homomorphic_encryption:
Homomorphic Encryption (HE)
===========================
* **Encryption with Intel® HE transformer for nGraph™**
* The `Intel HE_transformer`_ enables deep encryption with nGraph Backends.
* `Blog post`_ with `examples`_
.. note:: Some implementations using TensorFlow* may also work with the
`nGraph Bridge repo`_ if older versions of ``ngraph-tf`` are not
available.
.. _distributed_training:
.. project/extras/distributed_training.rst:
Distributed training with nGraph
================================
@@ -43,7 +11,7 @@ Distributed training with nGraph
How? (Generic frameworks)
-------------------------
See also: :doc:`../core/constructing-graphs/distribute-train`
See also: :doc:`../../core/constructing-graphs/distribute-train`
To synchronize gradients across all workers, the essential operation for data
parallel training, due to its simplicity and scalability over parameter servers,
@@ -58,7 +26,7 @@ find it worthwhile to experiment with different modes or variations of
distributed training. Deployments using nGraph Library with supported backends
can be configured to train with data parallelism and will soon work with model
parallelism. Distributing workloads is increasingly important, as more data and
bigger models mean the ability to :doc:`../core/constructing-graphs/distribute-train`
bigger models mean the ability to :doc:`../../core/constructing-graphs/distribute-train`
work with larger and larger datasets, or to work with models having many layers
that aren't designed to fit to a single device.
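To make the synchronization step concrete, the sketch below shows
allreduce-style gradient averaging with ``mpi4py`` and NumPy. This illustrates
the concept only; it is not nGraph's own distributed API:

.. code-block:: python

   import numpy as np
   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   # Each worker computes gradients on its own shard of the mini-batch ...
   local_grad = np.random.rand(1000).astype(np.float32)

   # ... then allreduce sums the gradients across all workers, and dividing
   # by the worker count yields the synchronized average on every worker.
   global_grad = np.empty_like(local_grad)
   comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
   global_grad /= comm.Get_size()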
@@ -72,11 +40,4 @@ mini-batch training, one could train ResNet-50 with Imagenet-1k data to the
*Top 5* classifier in minutes using thousands of CPU nodes. See
`arxiv.org/abs/1709.05011`_.
.. _nGraph Bridge repo: https://github.com/tensorflow/ngraph-bridge
.. _Intel HE_transformer: https://github.com/NervanaSystems/he-transformer
.. _Blog post: https://www.intel.ai/he-transformer-for-ngraph-enabling-deep-learning-on-encrypted-data/
.. _examples: https://github.com/NervanaSystems/he-transformer#examples
.. _arxiv.org/abs/1709.05011: https://arxiv.org/format/1709.05011
.. _based on the synchronous: https://arxiv.org/format/1602.06709
\ No newline at end of file
.. project/extras/homomorphic_encryption.rst:
Homomorphic Encryption (HE)
===========================
* **Encryption with Intel® HE transformer for nGraph™**
* The `Intel HE_transformer`_ is an experimental nGraph backend
which enables deep learning on encrypted data using homomorphic
encryption.
* `Blog post`_ with `examples`_
.. note:: Some implementations using TensorFlow* may also work with the
`nGraph Bridge repo`_ if older versions of ``ngraph-tf`` are not
available.
.. _Intel HE_transformer: https://github.com/NervanaSystems/he-transformer
.. _Blog post: https://www.intel.ai/he-transformer-for-ngraph-enabling-deep-learning-on-encrypted-data/
.. _examples: https://github.com/NervanaSystems/he-transformer#examples
.. _nGraph Bridge repo: https://github.com/tensorflow/ngraph-bridge
.. project/extras/index.rst
#######
Extras
#######
This section contains extra tools and tips for working with
previously-tested, or up-and-coming features of the nGraph
Compiler stack.
.. toctree::
:maxdepth: 1
homomorphic_encryption.rst
distributed_training.rst
testing_latency.rst
.. project/extras/testing_latency.rst:
Testing latency
===============
Many open-source DL frameworks provide a layer where experts in data science
can make use of optimizations contributed by machine learning engineers. Having
a common API benefits both: it simplifies deployment and makes it easier for ML
engineers working on advanced deep learning hardware to bring highly-optimized
performance to a wide range of models, especially in inference.
One DL framework with advancing efforts on graph optimizations is Apache
MXNet\*, where `Intel has contributed efforts showing`_ how to work with our
nGraph Compiler stack as an `experimental backend`_. Our approach provides
**more opportunities** to start working with different kinds of graph
optimizations **than would be available to the MXNet framework alone**, for
reasons outlined in our `introduction`_ documentation. Note that the
MXNet bridge requires trained models only; it does not support distributed
training.
.. figure:: ../../graphics/ngraph-mxnet-models.png
:width: 533px
:alt: Up to 45X faster
Up to 45X faster compilation with nGraph backend
Tutorial: Testing inference latency of ResNet-50-V2 with MXNet
--------------------------------------------------------------
This tutorial supports compiling MXNet with nGraph's CPU backend.
Begin by cloning MXNet from GitHub:
.. code-block:: console
git clone --recursive https://github.com/apache/incubator-mxnet
To compile, run:
.. code-block:: console
cd incubator-mxnet
make -j USE_NGRAPH=1
MXNet's build system will automatically download, configure, and build the
nGraph library, then link it into ``libmxnet.so``. Once this is complete, we
recommend creating a python3 virtual environment for testing, and then
installing MXNet into the virtual environment:
.. code-block:: console
python3 -m venv .venv
. .venv/bin/activate
cd python
pip install -e .
cd ../
Now we're ready to use nGraph to run any model on a CPU backend. Building MXNet
with nGraph automatically enables nGraph for your model scripts, and you
shouldn't need to do anything special. If you run into trouble, you can disable
nGraph by setting the ``MXNET_SUBGRAPH_BACKEND`` environment variable to an
empty value:
.. code-block:: console
export MXNET_SUBGRAPH_BACKEND=
If you do see trouble, please report it and we'll address it as soon as possible.
Running ResNet-50-V2 Inference
------------------------------
To show a working example, we'll demonstrate how MXNet may be used to run
ResNet-50 inference. For simplicity, we'll use the standard MXNet ResNet-50-V2
model from the `gluon model zoo`_, and we'll test with ``batch_size=1``.
Note that the nGraph-MXNet bridge supports static graphs only (dynamic graphs
are in the works); so for this example, we begin by converting the gluon model
into a static graph. Also note that any model with a saved checkpoint can be
considered a "static graph" in nGraph. For this example, we'll presume that the
model is pre-trained.
.. literalinclude:: ../../../../examples/subgraph_snippets/mxnet-gluon-example.py
:language: python
:lines: 17-32
To load the model into nGraph, we simply bind the symbol into an Executor.
.. literalinclude:: ../../../../examples/subgraph_snippets/mxnet-gluon-example.py
:language: python
:lines: 34-35
At binding, the MXNet Subgraph API finds nGraph, determines how to partition
the graph, and in the case of ResNet, sends the entire graph to nGraph for
compilation. This produces a single call to an `NNVM`_ ``NGraphSubgraphOp`` embedded
with the compiled model. At this point, we can test the model's performance.
.. literalinclude:: ../../../../examples/subgraph_snippets/mxnet-gluon-example.py
:language: python
:lines: 40-48
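Since the ``literalinclude`` snippets above live in the examples tree and are
not reproduced here, the following self-contained sketch shows an equivalent
export/bind/measure flow (the warm-up and iteration counts are arbitrary):

.. code-block:: python

   import time
   import mxnet as mx
   from mxnet.gluon.model_zoo import vision

   # Export the pretrained gluon model to a static symbol + params checkpoint
   net = vision.resnet50_v2(pretrained=True)
   net.hybridize()
   data = mx.nd.zeros((1, 3, 224, 224))
   net(data)                  # one forward pass builds the cached graph
   net.export('resnet50_v2')

   # Reload the checkpoint and bind it into an Executor; in a USE_NGRAPH=1
   # build, binding is where the graph is handed to nGraph for compilation
   sym, arg_params, aux_params = mx.model.load_checkpoint('resnet50_v2', 0)
   mod = mx.mod.Module(symbol=sym, label_names=None)
   mod.bind(for_training=False, data_shapes=[('data', (1, 3, 224, 224))])
   mod.set_params(arg_params, aux_params)

   # Measure average latency over repeated forward passes
   batch = mx.io.DataBatch([data])
   for _ in range(10):        # warm-up
       mod.forward(batch)
       mx.nd.waitall()
   start = time.time()
   for _ in range(100):
       mod.forward(batch)
       mx.nd.waitall()
   print('Average latency: %.2f ms' % ((time.time() - start) / 100 * 1000))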
.. _experimental backend: https://github.com/apache/incubator-mxnet/pull/12502
.. _Intel has contributed efforts showing: https://cwiki.apache.org/confluence/display/MXNET/MXNet+nGraph+integration+using+subgraph+backend+interface
.. _introduction: http://ngraph.nervanasys.com/docs/latest/project/introduction.html
.. _gluon model zoo: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/model_zoo/vision/resnet.py#L499
.. _subgraph acceleration API: https://cwiki.apache.org/confluence/display/MXNET/Unified+integration+with+external+backend+libraries
.. _NNVM: https://github.com/dmlc/nnvm
.. _nGraph-MXNet: https://github.com/NervanaSystems/ngraph-mxnet/blob/master/README.md
\ No newline at end of file
.. project/introduction.rst:
#######
Summary
#######
nGraph is an open-source graph compiler for :abbr:`Artificial Neural Networks (ANNs)`.
The nGraph Compiler stack provides an inherently efficient graph-based compilation
infrastructure designed to be compatible with many upcoming
:abbr:`Application-Specific Integrated Circuits (ASICs)`, like the Intel® Nervana™
Neural Network Processor (Intel® Nervana™ NNP), while also unlocking a massive
performance boost on any existing hardware targets for your neural network: both
GPUs and CPUs. Using its flexible infrastructure, you will find it becomes much
easier to create Deep Learning (DL) models that can adhere to the "write once,
run anywhere" mantra that enables your AI solutions to easily go from concept to
production to scale.
Frameworks using nGraph to execute workloads have shown `up to 45X`_ performance
boost compared to native implementations.
For a detailed overview, see below; for a more historical perspective, see
our `arXiv`_ paper.
Motivations
Introduction
############
Future developments in :abbr:`Artificial Intelligence (AI)` will increasingly
rely on better methods to accelerate the performance of deep learning workloads.
As :abbr:`Deep Learning (DL)` models become more complex, and as the volume of
data those models are expected to handle increases rapidly, the deployment of
scalable AI solutions becomes a greater challenge.
Today, two standard approaches to accelerate deep learning performance are:
#. **Design hardware solutions dedicated to deep learning computation** -- Many
companies, ranging from startups to established manufacturers such as
Intel, are actively developing :abbr:`Application Specific Integrated Circuits (ASICs)`
to accelerate the performance of deep learning for both training and
inference.
#. **Optimize software to accelerate performance** -- nGraph Compiler, an
open-source deep learning compiler, is Intel's solution to deliver performance
via software optimization. nGraph provides developers with a way to
accelerate workloads via software and to provide a significant increase
in performance for standard hardware targets such as CPUs and GPUs. For
deploying scalable AI solutions, nGraph uses kernel libraries, a popular
and effective method to improve deep learning performance. Where kernel
libraries are available and perform well, we use them.
Motivations
===========
Developers working to craft solutions with :abbr:`Artificial Intelligence (AI)`
face a steep learning curve in taking their concepts from design to
production. It can be challenging to create a :abbr:`Deep Learning (DL)` model
that maintains a minimum standard of consistency, as it must be continually
tweaked, adapted, or rewritten to use and optimize various parts of the stack
during its life cycle. For DL models that do reach production-ready status, an
entirely new set of problems emerges in how to scale and use larger and larger
datasets, data that must be encrypted, data-in-motion, and, of course, finding
the best compromises among speed, accuracy, and performance.
Two general approaches to advancing deep learning performance dominate the
industry today. The first is to design hardware dedicated exclusively to
handling compute for specialized kinds of :abbr:`Machine Learning (ML)` or
:abbr:`DL (Deep Learning)` operations; this approach essentially designs a
custom network infrastructure *around* specific problems AI is supposed to
solve. For example, many companies are actively developing specialized
:abbr:`Application-Specific Integrated Circuits (ASICs)` to speed up
training (one kind of ASIC) or to reduce inference latency (another kind
of ASIC) in their cloud-based or local data centers. This approach works
great for :abbr:`Cloud Service Providers (CSPs)` and others that have
considerable budgets to invest in researching and building new hardware;
however, it creates a significant burden on the developer who needs to
invest in adapting the context of their model for training and then for
inference, to figure out at least two data-cycle pipelines or deployment
scenarios, and to decide what trade-offs to make when and where.
The second approach to making deep learning more efficient is to design a
software stack that lets the :abbr:`Neural Network (NN)` adapt to whatever
compute resources are available and deliver performance via software
optimization. The nGraph Compiler stack is our solution to this second
approach: it provides an inherently efficient graph-based compilation
infrastructure designed to be compatible with many upcoming DL ASICs while
also unlocking a massive performance boost on any existing hardware targets
in a network, whether they are CPUs, GPUs, or other custom silicon. nGraph
provides optimization opportunities at the graph level, where the
network-to-device compilation can be managed with a series of "subgraphs"
that can be handled in either a static or a dynamic manner. With our
:doc:`../ops/index` and graph-based infrastructure for neural networks,
it's also possible to extract context semantics that make it much easier to
work with many of the new and emerging problems in Deep Learning including
larger datasets, data that must be encrypted, and data-in-motion. Our solution
also addresses the scalability issue with kernel libraries, the current
popular solution to accelerating deep learning performance.
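As a small, concrete illustration of graph-level construction, the sketch below
builds and runs the ``(A+B)*C`` graph discussed later using the ``ngraph-core``
Python API (names as documented for the pip package; treat this as a sketch
rather than a canonical example):

.. code-block:: python

   import numpy as np
   import ngraph as ng

   # Build a tiny computational graph: (A + B) * C
   A = ng.parameter(shape=[2, 2], dtype=np.float32, name='A')
   B = ng.parameter(shape=[2, 2], dtype=np.float32, name='B')
   C = ng.parameter(shape=[2, 2], dtype=np.float32, name='C')
   model = (A + B) * C

   # Compile for a backend and execute; the backend sees the whole graph,
   # so it can optimize it (e.g. fold A + B away if B were a zero constant)
   runtime = ng.runtime(backend_name='CPU')
   computation = runtime.computation(model, A, B, C)
   x = np.ones((2, 2), dtype=np.float32)
   print(computation(x, x, x))  # [[2. 2.] [2. 2.]]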
The current state-of-the-art software solution for speeding up deep learning
computation is to integrate kernel libraries like Intel® Math Kernel Library
for Deep Neural Networks (Intel® MKL DNN) and Nvidia\*'s CuDNN into deep
learning frameworks. These kernel libraries offer a runtime performance boost
on specific hardware targets through highly-optimized kernels and other
operator-level optimizations.
However, kernel libraries have three main problems:
#. Kernel libraries do not support graph-level optimizations.
#. Framework integration of kernel libraries does not scale.
#. There are too many kernels to write, and they require expert knowledge.
The nGraph Compiler stack is designed to address the first two problems. nGraph
applies graph-level optimizations by taking the computational graph from a deep
learning framework like TensorFlow\* and reconstructing it with the nGraph
:abbr:`Intermediate Representation (IR)`. The nGraph IR centralizes computational
The current :abbr:`State-of-the-Art (SoTA)` software solution for deep
learning computation is to integrate kernel libraries such as Intel®
:abbr:`Math Kernel Library for Deep Neural Networks (Intel® MKL DNN)`
and Nvidia\*'s CuDNN into deep learning frameworks. These kernel
libraries offer a performance boost during runtime on specific hardware
targets through highly-optimized kernels and other operator-level
optimizations.
However, kernel libraries have three main problems:
#. Kernel libraries do not support graph-level optimizations.
#. Framework integration of kernel libraries does not scale.
#. The number of required kernels keeps growing.
nGraph Compiler addresses the first two problems, and nGraph Compiler combined
with PlaidML addresses the third problem. nGraph applies graph-level
optimizations by taking the computational graph from a deep learning framework
such as TensorFlow\* and reconstructing it with nGraph's
:abbr:`IR (Intermediate Representation)`. nGraph IR centralizes computational
graphs from various frameworks and provides a unified way to connect backends
for targeted hardware. From here, PlaidML or one of the nGraph transformers can
generate code in various forms, including LLVM, OpenCL, OpenGL, CUDA, and Metal.
This generated code is where the low-level optimizations are automatically
applied. The result is a more efficient execution that does not require any
manual kernel integration work for most hardware targets.
What follows here is more detail about how our solution addresses these
problems.
for targeted hardware. To address the third problem, nGraph is integrated with
PlaidML, a tensor compiler, which generates code in LLVM, OpenCL, OpenGL, and
Metal. Low-level optimizations are automatically applied to the generated code,
resulting in a more efficient execution that does not require manual kernel
integration for most hardware targets.
The following three sections explore the main problems of kernel libraries in
more detail and describe how nGraph addresses them.
Problem: Absence of graph-level optimizations
---------------------------------------------
Problem 1: Kernel libraries do not support graph-level optimizations
--------------------------------------------------------------------
The diagram below illustrates a simple example of how a deep learning
framework, when integrated with a kernel library, is capable of running each
operation in a computational graph optimally, but the graph itself may not be
optimal:
The example diagram below shows how a deep learning framework, when integrated
with a kernel library, can optimally run each operation in a computational
graph, but the choice of operations in the graph may not be optimal.
.. _figure-A:
.. figure:: ../graphics/intro_graph_optimization.png
.. figure:: ../graphics/kernel-problem-1.png
:width: 555px
:alt:
The computation is constructed to execute ``(A+B)*C``, but in the context of
nGraph, we can further optimize the graph to be represented as ``A*C``. From the
first graph shown on the left, the operation on the constant ``B`` can be
computed at compile time (known as constant folding), and the graph can be
further simplified to the one on the right because the constant has a value of
zero. Without such graph-level optimizations, a deep learning framework with a
kernel library will compute all operations, and the resulting execution will be
suboptimal.
Problem: Reduced scalability
----------------------------
Integrating kernel libraries with frameworks is increasingly becoming
nontrivial due to the growing number of new deep learning accelerators.
For each new deep learning accelerator, a custom kernel library integration
must be implemented by a team of experts. This labor-intensive work is
further amplified if you want your DL accelerator to support a number of
different frameworks. The work must be revisited any time you upgrade or
expand your network's hardware. Each integration is unique to the framework
and its set of deep learning operators, its view on memory layout, its
feature set, etc.
nGraph solves this problem with nGraph bridges. A bridge takes a computational
graph and reconstructs it in the nGraph IR with a few primitive nGraph
operations. With the unified computational graph, kernel libraries no longer
need to be separately integrated to each deep learning framework. Instead, the
libraries only need to support nGraph primitive operations, and this approach
streamlines the integration process for the backend.
Problem: Increasing number of kernels
-------------------------------------
Kernel libraries need to be integrated with multiple deep learning frameworks,
and this arduous task becomes even harder due to the growing number of kernels
required to achieve optimal performance. The number of required kernels is the
product of the number of chip designs, data types, operations, and the
cardinality of each parameter for each operation. In the past, the number of
required kernels was limited, but as AI research and industry develop rapidly,
the number of required kernels is increasing exponentially.
:alt:
The computation is constructed to execute ``(A+B)*C``. With nGraph, we can
further optimize the graph to be represented as ``A*C``. From the first graph
shown on the left, the operation on the constant ``B`` can be computed at
compile time (an optimization known as *constant folding*). The graph can be
further simplified to the one on the right because the constant has a value of
zero (known as *algebraic simplification*). Without such graph-level
optimizations, a deep learning framework with a kernel library will compute
all operations, resulting in suboptimal execution.
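A toy pass over a nested-tuple expression shows the idea; this is illustrative
only and not nGraph's actual pass machinery:

.. code-block:: python

   # Expression nodes: ('const', value), ('var', name), ('add'|'mul', lhs, rhs)
   def simplify(node):
       op = node[0]
       if op in ('add', 'mul'):
           lhs, rhs = simplify(node[1]), simplify(node[2])
           # Constant folding: evaluate ops whose inputs are all constants
           if lhs[0] == 'const' and rhs[0] == 'const':
               fold = {'add': lambda a, b: a + b, 'mul': lambda a, b: a * b}
               return ('const', fold[op](lhs[1], rhs[1]))
           # Algebraic simplification: x + 0 -> x
           if op == 'add' and rhs == ('const', 0):
               return lhs
           return (op, lhs, rhs)
       return node

   # (A + B) * C with B = 0 reduces to A * C before anything executes
   expr = ('mul', ('add', ('var', 'A'), ('const', 0)), ('var', 'C'))
   print(simplify(expr))  # ('mul', ('var', 'A'), ('var', 'C'))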
Problem 2: Framework integration of kernel libraries does not scale
-------------------------------------------------------------------
Due to the growing number of new deep learning accelerators, integrating
kernel libraries with frameworks has become increasingly difficult. For
each new deep learning accelerator, a custom kernel library integration must
be implemented by a team of experts. This labor-intensive work is further
complicated by the number of frameworks, as illustrated in the following
diagram.
.. _figure-B:
.. figure:: ../graphics/intro_kernel_explosion.png
.. figure:: ../graphics/kernel-problem-2.png
:width: 555px
:alt:
Each of these connections represents significant work for what will
ultimately be a brittle setup that is enormously expensive to maintain.
:alt:
Each framework must be manually integrated with each hardware-specific kernel
library. Additionally, each integration is unique to the framework and its set
of deep learning operators, view on memory layout, feature set, etc. Each
connection that needs to be made increases the amount of work, resulting in a
fragile setup that is costly to maintain.
nGraph solves this problem with bridges. A bridge takes a computational
graph or similar structure and reconstructs it in the nGraph IR along with a
few primitive nGraph operations. With a unified computational graph, kernel
libraries no longer need to be separately integrated into each deep learning
framework. Instead, the libraries only need to support nGraph primitive
operations, and this approach streamlines the integration process for the
backend.
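In spirit, a bridge is a translation table plus a graph rewriter. The toy
sketch below uses hypothetical op names on both sides purely for illustration:

.. code-block:: python

   # Hypothetical decompositions of framework ops into primitive ops
   TO_PRIMITIVES = {'BiasAdd': ['Broadcast', 'Add'],
                    'Softmax': ['Exp', 'Sum', 'Divide']}

   def bridge(framework_ops):
       """Rewrite a framework op sequence in terms of primitive ops."""
       return [p for op in framework_ops for p in TO_PRIMITIVES.get(op, [op])]

   print(bridge(['MatMul', 'BiasAdd', 'Softmax']))
   # ['MatMul', 'Broadcast', 'Add', 'Exp', 'Sum', 'Divide']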
Problem 3: The number of required kernels keeps growing
-------------------------------------------------------
Integrating kernel libraries with multiple deep learning frameworks is a
difficult task that becomes more complex with the growing number of
kernels needed to achieve optimal performance. Past deep learning research has
been built on a small set of standard computational primitives (convolution,
GEMM, etc.). But as AI research advances and industrial deep learning
applications continue to develop, the number of required kernels continues to
increase exponentially. The number of required kernels is based on the number
of chip designs, data types, operations, and the cardinality of each parameter
per operation. Each connection in the following diagram represents significant
work for what will ultimately be a fragile setup that is costly to maintain.
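A back-of-the-envelope calculation shows how quickly that product grows (the
counts below are invented for illustration):

.. code-block:: python

   # chip designs x data types x operations x parameter variants per op
   chips, dtypes, ops, variants = 10, 4, 120, 5
   print(chips * dtypes * ops * variants)  # 24000 kernels to hand-write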
.. _figure-C:
.. figure:: ../graphics/kernel-problem-3.png
:width: 555px
:alt:
Integrating PlaidML with nGraph provides flexibility to support the latest deep
learning models in the absence of hand-optimized kernels for new operations.
PlaidML works together with nGraph to address the exponential growth of
kernels.
PlaidML addresses the kernel explosion problem in a manner that lifts a heavy
burden off kernel developers. It automatically lowers networks from nGraph
into Tile, a :abbr:`Domain-Specific Language (DSL)` designed for deep learning
that allows developers to express how an operation should calculate tensors in
an intuitive, mathematical form via `Stripe`_. Integration of PlaidML with
nGraph means extra flexibility to support newer deep learning models in the
absence of by-hand optimized kernels for the new operations.
PlaidML takes two inputs: the operation defined by the user and the machine
description of the hardware target. It then automatically generates kernels
that are iteratively optimized through an IR known as `Stripe`_. Integration of
PlaidML with nGraph allows users to choose the hardware and framework that
suits their needs, resulting in freedom from kernel libraries.
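One concrete way to try PlaidML-generated kernels end to end is through its
Keras backend (assuming ``pip install plaidml-keras`` and a prior
``plaidml-setup`` run to choose a device):

.. code-block:: python

   import plaidml.keras
   plaidml.keras.install_backend()  # must run before importing keras

   from keras.applications.resnet50 import ResNet50
   model = ResNet50(weights=None)   # kernels are generated, not hand-written
   model.summary()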
Solution: nGraph and PlaidML
============================
Each of the problems above can be solved with nGraph and PlaidML. We developed
nGraph and integrated it with PlaidML so developers wanting to craft solutions
with :abbr:`AI (Artificial Intelligence)` won't have to face such a steep
learning curve in taking their concepts from design to production, and to scale.
The fundamental efficiencies behind Moore's Law are not dead; rather than fitting
`more transistors on denser and denser circuits`_, with nGraph and PlaidML,
we're enabling advances in compute with more transformers on denser and more
data-heavy :abbr:`Deep Learning Networks (DNNs)`, and making it easier to apply
:abbr:`Machine Learning (ML)` to different industries and problems.
For developers with a neural network already in place, executing workloads using
the nGraph Compiler provides further performance benefits and allows for quicker
adaptation of models. It also makes it much easier to upgrade hardware
infrastructure pieces as workloads grow.
This documentation provides technical details of nGraph's core functionality,
framework and backend integrations. Creating a compiler stack like nGraph and
PlaidML requires expert knowledge, and we're confident that nGraph and PlaidML
will make life easier for many kinds of developers:
We developed nGraph and integrated it with PlaidML to allow developers to
accelerate deep learning performance and address the scalability problems of
kernel libraries. To address the problem of scaling backends, nGraph applies
graph-level optimizations to deep learning computations and unifies
computational graphs from deep learning frameworks with the nGraph IR.
In conjunction with nGraph's graph-level optimizations, PlaidML automatically
applies low-level optimizations to improve deep learning performance.
Additionally, PlaidML offers extensive support for various hardware targets
due to its ability to generate code in LLVM, OpenCL, OpenGL, and Metal.
Given a backend with existing kernel libraries, nGraph can readily support the
target hardware because the backend only needs to support a few primitive
operations. If the hardware supports one of the languages PlaidML can generate
code in, developers only need to specify a machine description to support that
hardware. Together, nGraph and PlaidML provide the best of both worlds.
This documentation provides technical details of nGraph's core functionality
as well as framework and backend integrations. Creating a compiler stack like
nGraph and PlaidML requires expert knowledge, and we're confident that nGraph
and PlaidML will make life easier for many kinds of developers:
#. Framework owners looking to support new hardware and custom chips.
#. Data scientists and ML developers wishing to accelerate deep learning
   performance.
#. New DL accelerator developers creating an end-to-end software stack from a
   deep learning framework to their silicon.
.. _arXiv: https://arxiv.org/abs/1801.08058
.. _Stripe: https://arxiv.org/abs/1903.06498
.. _publication: https://arxiv.org/abs/1801.08058
.. _up to 45X: https://ai.intel.com/ngraph-compiler-stack-beta-release/
.. _more transistors on denser and denser circuits: https://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html
@@ -16,25 +16,14 @@ We are pleased to announce the release of version |version|-doc.
Core updates for |version|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ PlaidML support
+ More ONNX ops
+ Elementwise divide defaults to Python semantics
+ GenerateMask seed optional
+ Graph visualization improvements
+ **Known Issues**
- When using TensorFlow\* v1.14.0 with ``ngraph-bridge`` v0.16.0rc0 and the CPU
  backend, we saw notable to severe decreases in throughput in many models.
+ Better PlaidML support
Latest doc updates
~~~~~~~~~~~~~~~~~~
+ Document new debug tool
+ Note deprecation of MXNet's ``ngraph-mxnet`` PyPI
+ Note default change to ``svg`` files for graphs and visualization
+ Add more prominent tips for contributors who find the doc-contributor-README
+ Add instructions on how to build the ``NGRAPH_PLAIDML`` backend.
.. important:: Pre-releases (``-rc-0.*``) have newer features and are less stable.
@@ -43,6 +32,27 @@ Latest doc updates
Changelog on Previous Releases
==============================
0.23
----
+ PlaidML support
+ More ONNX ops
+ Elementwise divide defaults to Python semantics
+ GenerateMask seed optional
+ Document new debug tool
+ Graph visualization improvements
+ Note deprecation of MXNet's ``ngraph-mxnet`` PyPI
+ Note default change to ``svg`` files for graphs and visualization
+ Add more prominent tips for contributors who find the doc-contributor-README
+ Better GSG / Install Guide structure.
+ Added group edits and new illustrations from PR 2994 to ``introduction.rst``.
+ Ensure the ngraph-bridge link in README goes to the right place.
+ Make project ``extras`` their own subdirectory with an index to help organize them.
+ **Known Issues**
- When using TensorFlow\* v1.14.0 with ``ngraph-bridge`` v0.16.0rc0 and the CPU
  backend, we saw notable to severe decreases in throughput in many models.
0.22
----
@@ -67,7 +67,7 @@ namespace ngraph
/// \brief Broadcast shape of two nodes to make them compatible for a matrix multiplication.
///
/// \note This function reflects the broadcasting behaviour of NumPy's `matmul` operation
/// \link https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html
/// (https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html)
/// This means that only \"stack of matrices\" axes are bidirectionally broadcast.
/// The last two dimensions are left untouched.
///
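/// \par Example (NumPy semantics, for illustration)
///     A with shape {2, 1, 3, 4} matmul B with shape {5, 4, 6}:
///     the stack-of-matrices axes {2, 1} and {5} broadcast to {2, 5},
///     and the result has shape {2, 5, 3, 6}.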