.. project/introduction.rst:

#######
Summary
#######

nGraph is an open-source graph compiler for :abbr:`Artificial Neural Networks (ANNs)`. 
The nGraph Compiler stack provides an inherently efficient graph-based compilation 
infrastructure designed to be compatible with many upcoming 
:abbr:`Application-Specific Integrated Circuits (ASICs)`, like the Intel® Nervana™ 
Neural Network Processor (Intel® Nervana™ NNP), while also unlocking a massive 
performance boost on any existing hardware target for your neural network, 
whether GPUs or CPUs. Using its flexible infrastructure, you can create 
:abbr:`Deep Learning (DL)` models that adhere to the "write once, run anywhere" 
mantra, enabling your AI solutions to move easily from concept to production 
to scale.

Frameworks using nGraph to execute workloads have shown a performance boost of 
`up to 45X`_ compared to native implementations. 

For a detailed overview, see below; for a more historical perspective, see 
our `arXiv`_ paper.

Motivations
===========

Developers working to craft solutions with :abbr:`Artificial Intelligence (AI)` 
face a steep learning curve in taking their concepts from design to 
production. It can be challenging to create a :abbr:`Deep Learning (DL)` model 
that maintains a minimum standard of consistency, as it must be continually 
tweaked, adapted, or rewritten to use and optimize various parts of the stack 
during its life cycle. For DL models that do reach production-ready status, an 
entirely new set of problems emerges: how to scale to ever-larger datasets, how 
to handle data that must be encrypted and data-in-motion, and, of course, how 
to find the best compromises among speed, accuracy, and performance. 

Two general approaches to advancing deep learning performance dominate the 
industry today. The first is to design hardware dedicated exclusively to 
handling compute for specialized kinds of :abbr:`Machine Learning (ML)` or 
:abbr:`DL (Deep Learning)` operations; this approach essentially designs a 
custom network infrastructure *around* specific problems AI is supposed to 
solve. For example, many companies are actively developing specialized 
:abbr:`Application-Specific Integrated Circuits (ASICs)` to speed up 
training (one kind of ASIC) or to reduce inference latency (another kind 
of ASIC) in their cloud-based or local data centers. This approach works 
great for :abbr:`Cloud Service Providers (CSPs)` and others that have 
considerable budgets to invest in researching and building new hardware; 
however, it creates a significant burden on developers, who must adapt the 
context of their model for training and then for inference, work out at least 
two data-cycle pipelines or deployment scenarios, and decide what trade-offs 
to make when and where. 

The second approach to making deep learning more efficient is to design a  
software stack that lets the :abbr:`Neural Network (NN)` adapt to whatever 
compute resources are available and deliver performance via software 
optimization. The nGraph Compiler stack is our solution to this second 
approach: it provides an inherently efficient graph-based compilation 
infrastructure designed to be compatible with many upcoming DL ASICs while 
also unlocking a massive performance boost on any existing hardware targets 
in a network, whether they are CPUs, GPUs, or other custom silicon. nGraph 
provides optimization opportunities at the graph level, where the 
network-to-device compilation can be managed with a series of "subgraphs"
that can be handled in either a static or a dynamic manner. With our 
:doc:`../ops/index` and graph-based infrastructure for neural networks, 
it's also possible to extract context semantics that make it much easier to 
work with many of the new and emerging problems in Deep Learning including 
larger datasets, data that must be encrypted, and data-in-motion. Our solution 
also addresses the scalability issues inherent in kernel libraries, currently 
the most popular approach to accelerating deep learning performance. 

The current state-of-the-art software solution for speeding up deep learning 
computation is to integrate kernel libraries like the Intel® Math Kernel Library 
for Deep Neural Networks (Intel® MKL-DNN) and NVIDIA\*'s cuDNN into deep 
learning frameworks. These kernel libraries offer a runtime performance boost 
on specific hardware targets through highly optimized kernels and other 
operator-level optimizations.

However, kernel libraries have three main problems: 

#. Kernel libraries do not support graph-level optimizations.
#. Framework integration of kernel libraries does not scale.
#. There are too many kernels to write, and they require expert knowledge.

The nGraph Compiler stack is designed to address the first two problems. nGraph 
applies graph-level optimizations by taking the computational graph from a deep 
learning framework like TensorFlow\* and reconstructing it with the nGraph 
:abbr:`Intermediate Representation (IR)`. The nGraph IR centralizes computational 
graphs from various frameworks and provides a unified way to connect backends 
for targeted hardware. From here, PlaidML or one of the nGraph transformers can 
generate code in various forms, including LLVM, OpenCL, OpenGL, CUDA, and Metal. 
This generated code is where the low-level optimizations are automatically 
applied.  The result is a more efficient execution that does not require any 
manual kernel integration work for most hardware targets. 
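
For instance, with the TensorFlow bridge, routing an existing model through 
nGraph has typically required little more than importing the bridge package. 
Below is a minimal sketch, assuming the ``ngraph_bridge`` Python package and 
a TF 1.x session-style program; exact package and API names vary by release:

.. code-block:: python

   # Minimal sketch: importing the bridge registers nGraph with TensorFlow,
   # so an ordinary TF 1.x program runs through the nGraph Compiler stack.
   # Names follow the ngraph-bridge project; exact APIs vary by release.
   import tensorflow as tf
   import ngraph_bridge  # noqa: F401 -- the import's side effect does the work

   a = tf.placeholder(tf.float32, shape=(2, 2))
   b = tf.placeholder(tf.float32, shape=(2, 2))
   c = (a + b) * a

   with tf.Session() as sess:
       # The bridge captures this graph, reconstructs it in nGraph IR,
       # optimizes it, and executes it on the selected nGraph backend.
       print(sess.run(c, feed_dict={a: [[1.0, 2.0], [3.0, 4.0]],
                                    b: [[5.0, 6.0], [7.0, 8.0]]}))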

What follows is more detail about how our solution addresses these problems. 


Problem: Absence of graph-level optimizations
---------------------------------------------

The diagram below illustrates a simple example of how a deep learning 
framework, when integrated with a kernel library, is capable of running each 
operation in a computational graph optimally, but the graph itself may not be 
optimal: 

.. _figure-A:

.. figure:: ../graphics/intro_graph_optimization.png
   :width: 555px
   :alt:

The computation is constructed to execute ``(A+B)*C``, but in the context of 
nGraph, we can further optimize the graph to be represented as ``A*C``. In the 
first graph, shown on the left, the operation on the constant ``B`` can be 
computed at compile time (an optimization known as constant folding), and the 
graph can be further simplified to the one on the right because the constant 
has a value of zero. Without such graph-level optimizations, a deep learning 
framework with a kernel library will compute all operations, and the resulting 
execution will be suboptimal. 
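
To make the idea concrete, the sketch below applies the same kind of rewrite 
to a toy expression tree. This is illustrative Python only; nGraph implements 
constant folding as a compiler pass over its IR:

.. code-block:: python

   # Illustrative constant folding on a toy expression tree.
   def fold(node):
       if isinstance(node, tuple):               # ("op", lhs, rhs)
           op, lhs, rhs = node
           lhs, rhs = fold(lhs), fold(rhs)
           if op == "+" and rhs == 0:            # A + 0  ->  A
               return lhs
           if isinstance(lhs, (int, float)) and isinstance(rhs, (int, float)):
               return lhs + rhs if op == "+" else lhs * rhs
           return (op, lhs, rhs)
       return node                               # a variable or a constant

   # (A + B) * C with constant B = 0 simplifies to A * C at compile time.
   print(fold(("*", ("+", "A", 0), "C")))        # ('*', 'A', 'C')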


Problem: Reduced scalability 
----------------------------

Integrating kernel libraries with frameworks is becoming increasingly 
nontrivial due to the growing number of new deep learning accelerators. 
For each new deep learning accelerator, a custom kernel library integration 
must be implemented by a team of experts. This labor-intensive work is 
further amplified if you want your DL accelerator to support a number of 
different frameworks. The work must be revisited any time you upgrade or 
expand your network's hardware. Each integration is unique to the framework 
and its set of deep learning operators, its view on memory layout, its 
feature set, etc.


nGraph solves this problem with nGraph bridges. A bridge takes a computational 
graph and reconstructs it in the nGraph IR with a few primitive nGraph 
operations. With the unified computational graph, kernel libraries no longer 
need to be integrated separately into each deep learning framework. Instead, 
the libraries only need to support the nGraph primitive operations, an 
approach that streamlines the integration process for the backend. 
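
As an illustration, consider how a framework-level operation such as softmax 
reduces to a handful of primitives (exp, reduce-sum, divide) that a backend 
must support. This is a hypothetical sketch in plain Python; the real bridges 
perform the equivalent rewrite over the nGraph IR:

.. code-block:: python

   import numpy as np

   # Hypothetical sketch: a framework-level "softmax" op expressed with only
   # primitive operations (exp, reduce-sum, divide, plus a max subtraction
   # for numerical stability).
   def softmax_from_primitives(x, axis=-1):
       e = np.exp(x - np.max(x, axis=axis, keepdims=True))  # exp
       s = np.sum(e, axis=axis, keepdims=True)              # reduce-sum
       return e / s                                         # divide

   print(softmax_from_primitives(np.array([1.0, 2.0, 3.0])))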


Problem: Increasing number of kernels 
-------------------------------------

Kernel libraries need to be integrated with multiple deep learning frameworks, 
and this arduous task becomes even harder due to the increasing number of 
kernels required for optimal performance. The number of required kernels is 
the product of the number of chip designs, data types, operations, and the 
cardinality of each parameter for each operation. In the past, the number of 
required kernels was limited, but as AI research and industry rapidly develop, 
the number of required kernels is increasing exponentially. 
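
To see how quickly that product grows, consider some purely illustrative 
numbers (every count below is hypothetical):

.. code-block:: python

   # Back-of-the-envelope kernel count; all numbers are hypothetical.
   chip_designs = 5     # CPU/GPU/accelerator designs to support
   data_types   = 4     # e.g., fp32, fp16, bf16, int8
   operations   = 150   # framework-level operator set
   variants     = 3     # average parameter variants per op (layouts, strides)

   kernels = chip_designs * data_types * operations * variants
   print(kernels)       # 9000 hand-tuned kernels to write and maintain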

.. _figure-B:

.. figure:: ../graphics/intro_kernel_explosion.png
   :width: 555px
   :alt: 

   Each of these connections represents significant work for what will 
   ultimately be a brittle setup that is enormously expensive to maintain.



PlaidML addresses the kernel explosion problem in a manner that lifts a heavy 
burden off kernel developers. It automatically lowers networks from nGraph 
into Tile, a :abbr:`Domain-Specific Language (DSL)` designed for deep learning 
that allows developers to express how an operation should calculate tensors in 
an intuitive, mathematical form via `Stripe`_. Integrating PlaidML with 
nGraph provides extra flexibility to support newer deep learning models in the 
absence of hand-optimized kernels for the new operations.
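
For example, a Tile-style definition of matrix multiplication is essentially 
the underlying contraction written out directly (shown here in mathematical 
notation rather than exact Tile syntax):

.. math::

   C_{i,j} = \sum_{k} A_{i,k} \, B_{k,j}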


Solution: nGraph and PlaidML
============================

Each of the problems above can be solved with nGraph and PlaidML. We developed 
nGraph and integrated it with PlaidML so developers wanting to craft solutions 
with :abbr:`AI (Artificial Intelligence)` won't have to face such a steep 
learning curve in taking their concepts from design to production, and to scale. 
The fundamental efficiencies behind Moore's Law are not dead; rather than fitting 
`more transistors on denser and denser circuits`_, with nGraph and PlaidML, 
we're enabling advances in compute with more transformers on denser and more 
data-heavy :abbr:`Deep Learning Networks (DNNs)`, and making it easier to apply 
:abbr:`Machine Learning (ML)` to different industries and problems. 

For developers with a neural network already in place, executing workloads using 
the nGraph Compiler provides further performance benefits and allows for quicker 
adaptation of models. It also makes it much easier to upgrade hardware 
infrastructure pieces as workloads grow. 

This documentation provides technical details of nGraph's core functionality, 
as well as its framework and backend integrations. Creating a compiler stack 
like nGraph and PlaidML requires expert knowledge, and we're confident that 
nGraph and PlaidML will make life easier for many kinds of developers: 

#. Framework owners looking to support new hardware and custom chips.
#. Data scientists and ML developers wishing to accelerate deep learning 
   performance.
#. New DL accelerator developers creating an end-to-end software stack from 
   a deep learning framework to their silicon.  

.. _arXiv: https://arxiv.org/abs/1801.08058
.. _up to 45X: https://ai.intel.com/ngraph-compiler-stack-beta-release/
.. _more transistors on denser and denser circuits: https://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html
.. _Stripe: https://arxiv.org/abs/1903.06498