Commit 51270c68 authored by harryskim, committed by Robert Kimball

Revamp doc intro (#2588)

* Delete old stack diagram

* Update first paragraph

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Introduction graph optimization diagram

* Update introduction.rst

* Introduction kernel to deep learning frameworks

* Update introduction.rst

* Update introduction.rst

* Delete intro_kernel_to_fw.png

* Introduction kernel to frameworks with accent

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Introduction multiplying kernels

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Delete intro_kernel_to_fw_accent.png

* Introduction kernel explosion image

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Edit introduction to be relevant to a wider audience

* Refine intro

* writing thru another draft

* Another batch of updates

* Tie the conclusion to the intro

* Clean up the doc

* First reviewer feedback done!

* fix typo

* fix typos

* Minor formatting fixes
parent 30c1c0c4
......@@ -48,6 +48,12 @@ nGraph Compiler stack
core/passes/passes.rst
buildlb.rst
.. toctree::
:maxdepth: 1
:caption: Python API
python_api/index.rst
.. toctree::
:maxdepth: 1
......
......@@ -20,7 +20,7 @@ stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently in use.
.. figure:: ../graphics/stackngrknl.png
.. figure:: ../graphics/stackngrknl-notice.png
:alt:
Bridge
......
......@@ -4,165 +4,192 @@
Introduction
############
Deep Learning (DL) computational performance is critical for scientists and engineers applying deep learning techniques to many challenges in healthcare, commerce, autonomous driving, and other domains. While they can leverage
Developers working to craft solutions with :abbr:`Artificial Intelligence (AI)`
face a steep learning curve in taking their concepts from design to
production. It can be challenging to create a :abbr:`Deep Learning (DL)` model
that maintains a minimum standard of consistency, as it must be continually
tweaked, adapted, or rewritten to use and optimize various parts of the stack
during its life cycle. For DL models that do reach production-ready status, an
entirely new set of problems emerges in how to scale and use larger and larger
datasets, data that must be encrypted, data-in-motion, and of course, in
finding the best compromises among speed, accuracy, and performance.
Two general approaches to advancing deep learning performance dominate the
industry today. The first is to design hardware dedicated exclusively to
handling compute for specialized kinds of :abbr:`Machine Learning (ML)` or
:abbr:`DL (Deep Learning)` operations; this approach essentially designs a
custom network infrastructure *around* specific problems AI is supposed to
solve. For example, many companies are actively developing specialized
:abbr:`Application-Specific Integrated Circuits (ASICs)` to speed-up
training (one kind of ASIC) or to reduce inference latency (another kind
of ASIC) in their cloud-based or local data centers. This approach works
great for :abbr:`Cloud Service Providers (CSPs)` and others that have
considerable budgets to invest in researching and building new hardware;
however, it creates a significant burden on the developer who needs to
invest in adapting the context of their model for training and then for
inference, to figure out at least two data-cycle pipelines or deployment
scenarios, and to decide what trade-offs to make when and where.
The second approach to making deep learning more efficient is to design a
software stack that lets the :abbr:`Neural Network (NN)` adapt to whatever
compute resources are available and deliver performance via software
optimization. The nGraph Compiler stack is our solution to this second
approach: it provides an inherently efficient graph-based compilation
infrastructure designed to be compatible with many upcoming DL ASICs while
also unlocking a massive performance boost on any existing hardware targets
in a network, whether they are CPUs, GPUs, or other custom silicon. nGraph
provides optimization opportunities at the graph level, where the
network-to-device compilation can be managed with a series of "subgraphs"
that can be handled in either a static or a dynamic manner. With our
:doc:`../ops/index` and graph-based infrastructure for neural networks,
it's also possible to extract context semantics that make it much easier to
work with many of the new and emerging problems in Deep Learning including
larger datasets, data that must be encrypted, and data-in-motion. Our solution
also addresses the scalability issue with kernel libraries, the current
popular solution to accelerating deep learning performance.
Motivations
===========
Kernel libraries do not support graph-level optimizations
---------------------------------------------------------
A framework designed for training using GPUs requires integration with a kernel
library unique to that vendor's hardware. For example, after integration, a
kernel library can optimally run operations that it is "familiar" with; however,
the graph itself within any larger :term:`NN` won't be optimal.
The current state-of-the-art software solution for speeding up deep learning
computation is to integrate kernel libraries like the Intel® Math Kernel Library
for Deep Neural Networks (Intel® MKL-DNN) and NVIDIA\*'s cuDNN into deep
learning frameworks. These kernel libraries offer a runtime performance boost
on specific hardware targets through highly optimized kernels and other
operator-level optimizations.
.. _figure-A:
However, kernel libraries have three main problems:
.. figure:: ../graphics/framework-to-kernel-lib.png
:width: 555px
:alt:
#. Kernel libraries do not support graph-level optimizations.
#. Framework integration of kernel libraries does not scale.
#. There are too many kernels to write, and they require expert knowledge.
Figure A: Lack of graph-level optimization makes framework-to-kernel library
integration enormously inefficient. The computation graph above represents
the computation ``(A+B)*C``.
The nGraph Compiler stack is designed to address the first two problems. nGraph
applies graph-level optimizations by taking the computational graph from a deep
learning framework like TensorFlow\* and reconstructing it with the nGraph
:abbr:`Intermediate Representation (IR)`. The nGraph IR centralizes computational
graphs from various frameworks and provides a unified way to connect backends
for targeted hardware. From here, PlaidML or one of the nGraph transformers can
generate code in various forms, including LLVM, OpenCL, OpenGL, CUDA, and Metal.
This generated code is where the low-level optimizations are automatically
applied. The result is a more efficient execution that does not require any
manual kernel integration work for most hardware targets.
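As a taste of what this enables, here is a minimal sketch that builds and runs
``(A+B)*C`` directly through the nGraph Python API, with no framework and no
manual kernel work. (Exact signatures may vary by release; see the Python API
section of these docs.)

.. code-block:: python

   import numpy as np
   import ngraph as ng

   # Build the computational graph (A + B) * C directly in the nGraph IR.
   A = ng.parameter(shape=[2, 2], dtype=np.float32, name='A')
   B = ng.parameter(shape=[2, 2], dtype=np.float32, name='B')
   C = ng.parameter(shape=[2, 2], dtype=np.float32, name='C')
   model = (A + B) * C

   # Compile for a backend and execute; no per-device kernel work needed.
   runtime = ng.runtime(backend_name='CPU')
   computation = runtime.computation(model, A, B, C)
   result = computation(np.ones((2, 2), np.float32),
                        np.zeros((2, 2), np.float32),
                        np.full((2, 2), 3.0, np.float32))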
What follows here is more detail about how our solution addresses these
problems.
.. _figure-B:
.. figure:: ../graphics/framework-to-graph-opt.png
:width: 555px
:alt:
Figure B: Notice that an operation on the constant B (in this case a ``Broadcast``)
can be done at compile time. This is an example of constant folding, and it
is not available to a device-based kernel library.
Problem: Absence of graph-level optimizations
---------------------------------------------
The diagram below illustrates a simple example of how a deep learning
framework, when integrated with a kernel library, is capable of running each
operation in a computational graph optimally, but the graph itself may not be
optimal:
.. _figure-C:
.. _figure-A:
.. figure:: ../graphics/ngraph-algebraic-simp.png
.. figure:: ../graphics/intro_graph_optimization.png
:width: 555px
:alt:
Figure C: Finally, notice that the constant has the value "zero", so the add is
an *identity* operation and can be eliminated. This is an example of **Algebraic
Simplification**, and it is not available to a device-based kernel library.
After the two graph-level optimizations above (**Algebraic Simplification** and
**Constant Folding**), we now have an optimal graph: A times C. Again, kernel
libraries do not support this type of optimization. Although each implementation
can be done individually, doing so eventually yields an "exploding" number of
kernels as an :abbr:`NN (Neural Network)` grows larger and more complex. For
some insight on why this happens, see the next section.
Too many kernels to write
-------------------------
A typical network is constructed using some kind of language-based API, which
translates the network or :abbr:`DL (Deep Learning)` model (statically or
dynamically) into serialized graphs. Those graphs can then be passed through a
compilation process (the *Graph optimization or compilation* step in
*Figure D* below), where various graph-level optimizations, like constant folding
or fusion, can happen. These processes require unique vendor-provided libraries
to communicate with a driver (possibly through OpenCL\*, CUDA\*, or SYCL\*), to
compile and execute an implementation (kernel) for a specific
:abbr:`Instruction Set Architecture (ISA)`, or :term:`ISA`.
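As a concrete example of that first translation step, here is a minimal sketch
using the TensorFlow\* 1.x graph-building API; any graph-constructing framework
would do. The serialized graph it writes out is the artifact that the *Graph
optimization or compilation* step consumes.

.. code-block:: python

   import tensorflow as tf  # TensorFlow 1.x style graph-building API

   # The language-based API: define (A + B) * C symbolically.
   A = tf.placeholder(tf.float32, shape=[2, 2], name="A")
   B = tf.constant(0.0, shape=[2, 2], name="B")
   C = tf.placeholder(tf.float32, shape=[2, 2], name="C")
   result = tf.multiply(tf.add(A, B), C, name="result")

   # The framework serializes the model into a graph; this artifact is
   # what the graph optimization / compilation step operates on.
   graph_def = tf.get_default_graph().as_graph_def()
   tf.train.write_graph(graph_def, "/tmp", "model.pbtxt")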
Illustrated below is a simplified DL stack, showing relative complexity of
each component. Note that optimizing for any one on its own usually requires
engineering expertise that can be highly specialized to that component, and that
the terms have been simplified for illustrative purposes.
The following computation is constructed to execute ``(A+B)*C``, but in the
context of nGraph, we can further optimize the graph to be represented as ``A*C``.
In the first graph, shown on the left, the operation on the constant ``B`` can
be computed at compile time (a transformation known as constant folding), and
the graph can be further simplified to the one on the right because the constant
has a value of zero (algebraic simplification). Without such graph-level
optimizations, a deep learning framework paired with a kernel library will
compute all operations, and the resulting execution will be suboptimal.
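To make these two transformations concrete, here is a minimal, self-contained
sketch of constant folding and algebraic simplification on a toy expression
graph. This is plain Python for illustration only; the ``Node`` class and pass
functions are hypothetical, not nGraph's actual pass infrastructure.

.. code-block:: python

   # A minimal expression-graph IR: each node is an op with child nodes.
   class Node:
       def __init__(self, op, children=(), value=None):
           self.op = op            # 'const', 'param', 'add', 'mul', 'broadcast'
           self.children = list(children)
           self.value = value      # populated for 'const' nodes

   def constant_fold(node):
       """Replace ops on constants with precomputed constants (Figure B)."""
       node.children = [constant_fold(c) for c in node.children]
       if node.op == 'broadcast' and node.children[0].op == 'const':
           # The broadcast of a constant is itself a constant.
           return Node('const', value=node.children[0].value)
       return node

   def algebraic_simplify(node):
       """Eliminate identity operations such as x + 0 (Figure C)."""
       node.children = [algebraic_simplify(c) for c in node.children]
       if node.op == 'add':
           for x, y in (node.children, reversed(node.children)):
               if y.op == 'const' and y.value == 0:
                   return x        # x + 0 == x
       return node

   # (A + broadcast(B)) * C with B == 0 reduces to A * C.
   A, B, C = Node('param'), Node('const', value=0), Node('param')
   graph = Node('mul', [Node('add', [A, Node('broadcast', [B])]), C])
   graph = algebraic_simplify(constant_fold(graph))
   assert graph.op == 'mul' and graph.children == [A, C]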
Problem: Reduced scalability
----------------------------
Integrating kernel libraries with frameworks is becoming increasingly
nontrivial due to the growing number of new deep learning accelerators.
For each new deep learning accelerator, a custom kernel library integration
must be implemented by a team of experts. This labor-intensive work is
further amplified if you want your DL accelerator to support a number of
different frameworks, and it must be revisited any time you upgrade or
expand your network's hardware.
.. _figure-D:
.. figure:: ../graphics/components-dl-stack.png
:width: 700px
:alt: A simplified DL stack
Figure D: Components of a DL stack, simplified for illustrative purposes.
There are many deep learning frameworks, each with its own strengths and user
base. A setup that is common to many DL practitioners is shown in the
illustration below.
.. _figure-E:
.. figure:: ../graphics/a-common-stack.png
:width: 700px
:alt: A common implementation
Figure E: A commonly-implemented stack uses TensorFlow\* as the frontend.
The input is either optimized via Grappler, or executed directly via TensorFlow.
In either case, when targeting an Nvidia\* GPU, cuDNN is called to select an
optimal kernel for the operation; cuDNN then relies on CUDA\* or direct access
to run code on the target; in this toy example, the target is a V100.
A natural result of this approach is that the framework-level integration of
kernel libraries does not scale. Rather, each individual framework must be
manually integrated with each hardware-specific kernel library. Each integration
is unique to the framework and its set of deep learning operators, its view on
memory layout, its feature set, etc. Each of these connections, then, represents
significant work for what will ultimately be a brittle setup that is enormously
expensive to maintain.
.. _figure-F:
.. figure:: ../graphics/dl-current-state.png
:width: 700px
:alt: Scalability matters
Figure F: The number of kernels necessary to achieve optimal performance is
bounded by the product of the number of chip designs one wishes to support,
the number of data types supported, the number of operations, and the
cardinality of each parameter for each operation.
In the past, this upper bound was quite limited; however, since the industry is
shifting toward a more diverse future in terms of deep learning hardware, the
number of distinct kernels is exploding and will continue to explode.
.. _figure-B:
.. figure:: ../graphics/intro_kernel_to_fw_accent.png
:width: 555px
:alt:
Get the best of both worlds
---------------------------
Integrating a framework with nGraph can be an attractive option for hardware
companies trying to design their own deep learning hardware or network architecture.
Framework integration is a non-trivial amount of work, and nGraph automatically
does much of the heavy lifting. Furthermore, PlaidML can provide a wide range of
hardware coverage and optimization automatically. Any hardware that supports
LLVM, OpenCL, OpenGL, CUDA, or Metal can be supported automatically with PlaidML
and nGraph.
nGraph solves this problem with nGraph bridges. A bridge takes a computational
graph and reconstructs it in the nGraph IR using a few primitive nGraph
operations. With the unified computational graph, kernel libraries no longer
need to be separately integrated into each deep learning framework. Instead, the
libraries only need to support nGraph primitive operations, and this approach
streamlines the integration process for the backend.
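Conceptually, a bridge is a translation layer from a framework's operator set
onto a small set of nGraph primitives. The sketch below is purely illustrative;
the operator names and the mapping table are hypothetical, not actual bridge code.

.. code-block:: python

   # Hypothetical sketch of what a bridge does: translate each
   # framework-specific operator into a few nGraph primitive operations.
   # A backend then only needs to implement the primitives on the right.
   TF_TO_NGRAPH = {
       "MatMul":  ["Dot"],
       "BiasAdd": ["Broadcast", "Add"],
       "Softmax": ["Exp", "Sum", "Divide"],
   }

   def bridge(framework_ops):
       """Reconstruct a framework graph in terms of nGraph primitives."""
       ngraph_ops = []
       for op in framework_ops:
           ngraph_ops.extend(TF_TO_NGRAPH[op])
       return ngraph_ops

   print(bridge(["MatMul", "BiasAdd", "Softmax"]))
   # ['Dot', 'Broadcast', 'Add', 'Exp', 'Sum', 'Divide']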
.. _figure-G:
.. figure:: ../graphics/graph-compilers-at-a-glance.png
:width: 700px
:alt: Overview of various graph and tensor compilers.
Problem: Increasing number of kernels
-------------------------------------
Figure G: Overview of various graph and tensor compilers.
Kernel libraries need to be integrated with multiple deep learning frameworks,
and this arduous task becomes even harder due to the increasing number of
kernels required to achieve optimal performance. The number of required kernels
is the product of the number of chip designs, data types, operations, and the
cardinality of each parameter for each operation. In the past, the number of
required kernels was limited, but as AI research and industry rapidly develop,
the number of required kernels is exploding.
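A back-of-the-envelope calculation shows why. The numbers below are made up
purely to illustrate scale:

.. code-block:: python

   # Hypothetical counts, chosen only to show how quickly the product grows.
   chip_designs   = 10    # CPUs, GPUs, and emerging DL accelerators
   data_types     = 4     # e.g. fp32, fp16, bf16, int8
   operations     = 100   # convolution, matmul, pooling, ...
   param_variants = 5     # average cardinality of each op's parameters

   kernels = chip_designs * data_types * operations * param_variants
   print(kernels)  # 20000 hand-tuned kernels for full coverage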
.. _figure-C:
.. _figure-H:
.. figure:: ../graphics/intro_kernel_explosion.png
:width: 555px
:alt:
.. figure:: ../graphics/tensor-compilers-at-a-glance.png
:width: 700px
:alt: A closer look at tensor compilers.
Figure H: A closer look at tensor compilers.
PlaidML addresses the kernel explosion problem in a manner that lifts a heavy
burden off kernel developers. It automatically lowers networks from nGraph
into Tile, a :abbr:`Domain-Specific Language (DSL)` designed for deep learning
that allows developers to express how an operation should calculate tensors in
an intuitive, mathematical form. Integration of PlaidML with nGraph means
extra flexibility to support newer deep learning models in the absence of
hand-optimized kernels for the new operations.
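Tile itself is documented with PlaidML, but as a rough analogy (NumPy's
``einsum`` here, not actual Tile syntax), the idea is to state *what* an
operation computes in index notation and let the compiler derive the kernels:

.. code-block:: python

   import numpy as np

   A = np.random.rand(4, 3)
   B = np.random.rand(3, 5)

   # Index-notation definition of matrix multiply:
   #   C[m, n] = sum over k of A[m, k] * B[k, n]
   # Tile expressions read much like this; PlaidML derives the kernels.
   C = np.einsum('mk,kn->mn', A, B)
   assert np.allclose(C, A @ B)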
Other notable efforts
----------------------
Solution: nGraph and PlaidML
============================
A few other notable efforts in compiler projects include:
Each of the problems above can be solved with nGraph and PlaidML. We developed
nGraph and integrated it with PlaidML so developers wanting to craft solutions
with :abbr:`AI (Artificial Intelligence)` won't have to face such a steep
learning curve in taking their concepts from design to production to scale. The
fundamental efficiencies behind Moore's Law are not dead; rather than fitting
`more transistors on denser and denser circuits`_, we're enabling advances
in compute with more transformers on denser and more data-heavy
:abbr:`Deep Learning Networks (DNNs)`, and making it easier to apply
:abbr:`Machine Learning (ML)` to different industries and problems.
* **TVM** https://github.com/dmlc/tvm
* **XLA** https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
* **Glow** https://arxiv.org/pdf/1805.00907.pdf
For developers with a neural network already in place, executing workloads using
the nGraph Compiler provides further performance benefits and allows for quicker
adaptation of models. It also makes it much easier to upgrade hardware
infrastructure pieces as workloads grow and require more careful balancing.
This documentation provides technical details of nGraph's core functionality,
as well as framework and backend integrations. Creating a compiler stack like nGraph and
PlaidML requires expert knowledge, and we're confident that nGraph and PlaidML
will make life easier for many kinds of developers:
#. Framework owners looking to support new hardware and custom chips.
#. Data scientists and ML developers wishing to accelerate deep learning
performance.
#. New DL accelerator developers creating an end-to-end software stack from
a deep learning framework to their silicon.
.. _more transistors on denser and denser circuits: https://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html
\ No newline at end of file
......@@ -3,6 +3,6 @@
.. _sitemap.rst:
.. toctree::
:includehidden:
:includehidden:
index.rst