Commit 51270c68 authored by harryskim, committed by Robert Kimball

Revamp doc intro (#2588)

* Delete old stack diagram

* Update first paragraph

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Introduction graph optimization diagram

* Update introduction.rst

* Introduction kernel to deep learning frameworks

* Update introduction.rst

* Update introduction.rst

* Delete intro_kernel_to_fw.png

* Introduction kernel to frameworks with accent

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Introduction multiplying kernels

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Delete intro_kernel_to_fw_accent.png

* Introduction kernel explosion image

* Update introduction.rst

* Update introduction.rst

* Update introduction.rst

* Edit introduction to be relevant to a wider audience

* Refine intro

* writing thru another draft

* Another batch of updates

* Tie the conclusion to the intro

* Clean up the doc

* First reviewer feedback done!

* fix typo

* fix typos

* Minor formatting fixes
parent 30c1c0c4
......@@ -48,6 +48,12 @@ nGraph Compiler stack
core/passes/passes.rst
buildlb.rst
.. toctree::
:maxdepth: 1
:caption: Python API
python_api/index.rst
.. toctree::
:maxdepth: 1
......
......@@ -20,7 +20,7 @@ stack diagram is simplified to show how nGraph executes deep learning
workloads with two hardware backends; however, many other deep learning
frameworks and backends are currently in use.
.. figure:: ../graphics/stackngrknl.png
.. figure:: ../graphics/stackngrknl-notice.png
:alt:
Bridge
......
......@@ -4,165 +4,192 @@
Introduction
############
Deep Learning (DL) computational performance is critical for scientists and engineers applying deep learning techniques to many challenges in healthcare, commerce, autonomous driving, and other domains. While they can leverage
Developers working to craft solutions with :abbr:`Artificial Intelligence (AI)`
face a steep learning curve in taking their concepts from design to
production. It can be challenging to create a :abbr:`Deep Learning (DL)` model
that maintains a minimum standard of consistency, as it must be continually
tweaked, adapted, or rewritten to use and optimize various parts of the stack
during its life cycle. For DL models that do reach production-ready status, an
entirely new set of problems emerges in how to scale and use larger and larger
datasets, data that must be encrypted, data-in-motion, and of course, in
finding the best compromises among speed, accuracy, and performance.
Two general approaches to advancing deep learning performance dominate the
industry today. The first is to design hardware dedicated exclusively to
handling compute for specialized kinds of :abbr:`Machine Learning (ML)` or
:abbr:`DL (Deep Learning)` operations; this approach essentially designs a
custom network infrastructure *around* specific problems AI is supposed to
solve. For example, many companies are actively developing specialized
:abbr:`Application-Specific Integrated Circuits (ASICs)` to speed-up
training (one kind of ASIC) or to reduce inference latency (another kind
of ASIC) in their cloud-based or local data centers. This approach works
great for :abbr:`Cloud Service Providers (CSPs)` and others that have
considerable budgets to invest in researching and building new hardware;
however, it creates a significant burden on the developer who needs to
invest in adapting the context of their model for training and then for
inference, to figure out at least two data-cycle pipelines or deployment
scenarios, and to decide what trade-offs to make when and where.
The second approach to making deep learning more efficient is to design a
software stack that lets the :abbr:`Neural Network (NN)` adapt to whatever
compute resources are available and deliver performance via software
optimization. The nGraph Compiler stack is our solution to this second
approach: it provides an inherently efficient graph-based compilation
infrastructure designed to be compatible with many upcoming DL ASICs while
also unlocking a massive performance boost on any existing hardware targets
in a network, whether they are CPUs, GPUs, or other custom silicon. nGraph
provides optimization opportunities at the graph level, where the
network-to-device compilation can be managed with a series of "subgraphs"
that can be handled in either a static or a dynamic manner. With our
:doc:`../ops/index` and graph-based infrastructure for neural networks,
it's also possible to extract context semantics that make it much easier to
work with many of the new and emerging problems in Deep Learning including
larger datasets, data that must be encrypted, and data-in-motion. Our solution
also addresses the scalability issue with kernel libraries, the current
popular solution to accelerating deep learning performance.
Motivations
===========
Kernel libraries do not support graph-level optimizations
---------------------------------------------------------
A framework designed for training using GPUs requires integration with a kernel
library unique to that vendor's hardware. For example, after integration, a
kernel library can optimally run operations that it is "familiar" with; however,
the graph itself within any larger :term:`NN` won't be optimal.
The current state-of-the-art software solution for speeding up deep learning
computation is to integrate kernel libraries like the Intel® Math Kernel Library
for Deep Neural Networks (Intel® MKL-DNN) and NVIDIA\*'s cuDNN into deep
learning frameworks. These kernel libraries offer a runtime performance boost
on specific hardware targets through highly optimized kernels and other
operator-level optimizations.
.. _figure-A:
However, kernel libraries have three main problems:
.. figure:: ../graphics/framework-to-kernel-lib.png
:width: 555px
:alt:
#. Kernel libraries do not support graph-level optimizations.
#. Framework integration of kernel libraries does not scale.
#. There are too many kernels to write, and they require expert knowledge.
Figure A: Lack of graph-level optimization makes framework-to-kernel library
integration enormously inefficient. The computation graph above represents
the computation ``(A+B)*C``.
The nGraph Compiler stack is designed to address the first two problems. nGraph
applies graph-level optimizations by taking the computational graph from a deep
learning framework like TensorFlow\* and reconstructing it with the nGraph
:abbr:`Intermediate Representation (IR)`. The nGraph IR centralizes computational
graphs from various frameworks and provides a unified way to connect backends
for targeted hardware. From here, PlaidML or one of the nGraph transformers can
generate code in various forms, including LLVM, OpenCL, OpenGL, CUDA, and Metal.
This generated code is where the low-level optimizations are automatically
applied. The result is a more efficient execution that does not require any
manual kernel integration work for most hardware targets.
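As a taste of what this enables, here is a minimal sketch that builds and runs
``(A+B)*C`` directly through the nGraph Python API, with no framework and no
manual kernel work. (Exact signatures may vary by release; see the Python API
section of these docs.)

.. code-block:: python

   import numpy as np
   import ngraph as ng

   # Build the computational graph (A + B) * C directly in the nGraph IR.
   A = ng.parameter(shape=[2, 2], dtype=np.float32, name='A')
   B = ng.parameter(shape=[2, 2], dtype=np.float32, name='B')
   C = ng.parameter(shape=[2, 2], dtype=np.float32, name='C')
   model = (A + B) * C

   # Compile for a backend and execute; no per-device kernel work needed.
   runtime = ng.runtime(backend_name='CPU')
   computation = runtime.computation(model, A, B, C)
   result = computation(np.ones((2, 2), np.float32),
                        np.zeros((2, 2), np.float32),
                        np.full((2, 2), 3.0, np.float32))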
What follows here is more detail about how our solution addresses these
problems.
.. _figure-B:
.. figure:: ../graphics/framework-to-graph-opt.png
:width: 555px
:alt:
Figure B: Notice that an operation on the constant B (in this case a ``Broadcast``)
can be done at compile time. This is an example of constant folding, and it
is not available to a device-based kernel library.
Problem: Absence of graph-level optimizations
---------------------------------------------
The diagram below illustrates a simple example of how a deep learning
framework, when integrated with a kernel library, is capable of running each
operation in a computational graph optimally, but the graph itself may not be
optimal:
.. _figure-C:
.. _figure-A:
.. figure:: ../graphics/ngraph-algebraic-simp.png
.. figure:: ../graphics/intro_graph_optimization.png
:width: 555px
:alt:
Figure C: Finally, notice that the constant has the value "zero", so the add is
an *identity* operation and can be eliminated. This is an example of **Algebraic
Simplification**, and it is not available to a device-based kernel library.
After the two graph-level optimizations above (**Algebraic Simplification** and
**Constant Folding**), we now have an optimal graph: A times C. Again, kernel
libraries do not support this type of optimization. Although each implementation
can be done individually, doing so eventually yields an "exploding" number of
kernels as an :abbr:`NN (Neural Network)` grows larger and more complex. For
some insight on why this happens, see the next section.
Too many kernels to write
-------------------------
A typical network is constructed using some kind of language-based API, which
translates the network or :abbr:`DL (Deep Learning)` model (statically or
dynamically) into serialized graphs. Those graphs can then be passed through a
compilation process (the *Graph optimization or compilation* step in
*Figure D* below), where various graph-level optimizations, like constant folding
or fusion, can happen. These processes require unique vendor-provided libraries
to communicate with a driver (possibly through OpenCL\*, CUDA\*, or SYCL\*), to
compile and execute an implementation (kernel) for a specific
:abbr:`Instruction Set Architecture (ISA)`, or :term:`ISA`.
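As a concrete example of that first translation step, here is a minimal sketch
using the TensorFlow\* 1.x graph-building API; any graph-constructing framework
would do. The serialized graph it writes out is the artifact that the *Graph
optimization or compilation* step consumes.

.. code-block:: python

   import tensorflow as tf  # TensorFlow 1.x style graph-building API

   # The language-based API: define (A + B) * C symbolically.
   A = tf.placeholder(tf.float32, shape=[2, 2], name="A")
   B = tf.constant(0.0, shape=[2, 2], name="B")
   C = tf.placeholder(tf.float32, shape=[2, 2], name="C")
   result = tf.multiply(tf.add(A, B), C, name="result")

   # The framework serializes the model into a graph; this artifact is
   # what the graph optimization / compilation step operates on.
   graph_def = tf.get_default_graph().as_graph_def()
   tf.train.write_graph(graph_def, "/tmp", "model.pbtxt")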
Illustrated below is a simplified DL stack, showing relative complexity of
each component. Note that optimizing for any one on its own usually requires
engineering expertise that can be highly specialized to that component, and that
the terms have been simplified for illustrative purposes.
The following computation is constructed to execute ``(A+B)*C``, but in the
context of nGraph, we can further optimize the graph to be represented as ``A*C``.
In the first graph, shown on the left, the operation on the constant ``B`` can
be computed at compile time (a transformation known as constant folding), and
the graph can be further simplified to the one on the right because the constant
has a value of zero (algebraic simplification). Without such graph-level
optimizations, a deep learning framework paired with a kernel library will
compute all operations, and the resulting execution will be suboptimal.
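To make these two transformations concrete, here is a minimal, self-contained
sketch of constant folding and algebraic simplification on a toy expression
graph. This is plain Python for illustration only; the ``Node`` class and pass
functions are hypothetical, not nGraph's actual pass infrastructure.

.. code-block:: python

   # A minimal expression-graph IR: each node is an op with child nodes.
   class Node:
       def __init__(self, op, children=(), value=None):
           self.op = op            # 'const', 'param', 'add', 'mul', 'broadcast'
           self.children = list(children)
           self.value = value      # populated for 'const' nodes

   def constant_fold(node):
       """Replace ops on constants with precomputed constants (Figure B)."""
       node.children = [constant_fold(c) for c in node.children]
       if node.op == 'broadcast' and node.children[0].op == 'const':
           # The broadcast of a constant is itself a constant.
           return Node('const', value=node.children[0].value)
       return node

   def algebraic_simplify(node):
       """Eliminate identity operations such as x + 0 (Figure C)."""
       node.children = [algebraic_simplify(c) for c in node.children]
       if node.op == 'add':
           for x, y in (node.children, reversed(node.children)):
               if y.op == 'const' and y.value == 0:
                   return x        # x + 0 == x
       return node

   # (A + broadcast(B)) * C with B == 0 reduces to A * C.
   A, B, C = Node('param'), Node('const', value=0), Node('param')
   graph = Node('mul', [Node('add', [A, Node('broadcast', [B])]), C])
   graph = algebraic_simplify(constant_fold(graph))
   assert graph.op == 'mul' and graph.children == [A, C]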
Problem: Reduced scalability
----------------------------
Integrating kernel libraries with frameworks is becoming increasingly
nontrivial due to the growing number of new deep learning accelerators.
For each new deep learning accelerator, a custom kernel library integration
must be implemented by a team of experts. This labor-intensive work is
further amplified if you want your DL accelerator to support a number of
different frameworks, and it must be revisited any time you upgrade or
expand your network's hardware.
.. _figure-D:
.. figure:: ../graphics/components-dl-stack.png
:width: 700px
:alt: A simplified DL stack
Figure D: Components of a DL stack, simplified for illustrative purposes.
There are many deep learning frameworks, each with its own strengths and user
base. A setup that is common to many DL practitioners is shown in the
illustration below.
.. _figure-E:
.. figure:: ../graphics/a-common-stack.png
:width: 700px
:alt: A common implementation
Figure E: A commonly-implemented stack uses TensorFlow\* as the frontend.
The input is either optimized via Grappler, or executed directly via TensorFlow.
In either case, when targeting an Nvidia\* GPU, cuDNN is called to select an
optimal kernel for the operation; cuDNN then relies on CUDA\* or direct access
to run code on the target; in this toy example, the target is a V100.
A natural result of this approach is that the framework-level integration of
kernel libraries does not scale. Rather, each individual framework must be
manually integrated with each hardware-specific kernel library. Each integration
is unique to the framework and its set of deep learning operators, its view on
memory layout, its feature set, etc. Each of these connections, then, represents
significant work for what will ultimately be a brittle setup that is enormously
expensive to maintain.
.. _figure-F:
.. figure:: ../graphics/dl-current-state.png
:width: 700px
:alt: Scalability matters
Figure F: The number of kernels necessary to achieve optimal performance is
bounded by the product of the number of chip designs one wishes to support,
the number of data types supported, the number of operations, and the
cardinality of each parameter for each operation.
In the past, this upper bound was quite limited; however, since the industry is
shifting toward a more diverse future in terms of deep learning hardware, the
number of distinct kernels is exploding and will continue to explode.
.. _figure-B:
.. figure:: ../graphics/intro_kernel_to_fw_accent.png
:width: 555px
:alt:
Get the best of both worlds
---------------------------
Integrating a framework with nGraph can be an attractive option for hardware
companies trying to design their own deep learning hardware or network architecture.
Framework integration is a non-trivial amount of work, and nGraph automatically
does much of the heavy lifting. Furthermore, PlaidML can provide a wide range of
hardware coverage and optimization automatically. Any hardware that supports
LLVM, OpenCL, OpenGL, CUDA, or Metal can be supported automatically with PlaidML
and nGraph.
nGraph solves this problem with nGraph bridges. A bridge takes a computational
graph and reconstructs it in the nGraph IR using a few primitive nGraph
operations. With the unified computational graph, kernel libraries no longer
need to be separately integrated into each deep learning framework. Instead, the
libraries only need to support nGraph primitive operations, and this approach
streamlines the integration process for the backend.
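Conceptually, a bridge is a translation layer from a framework's operator set
onto a small set of nGraph primitives. The sketch below is purely illustrative;
the operator names and the mapping table are hypothetical, not actual bridge code.

.. code-block:: python

   # Hypothetical sketch of what a bridge does: translate each
   # framework-specific operator into a few nGraph primitive operations.
   # A backend then only needs to implement the primitives on the right.
   TF_TO_NGRAPH = {
       "MatMul":  ["Dot"],
       "BiasAdd": ["Broadcast", "Add"],
       "Softmax": ["Exp", "Sum", "Divide"],
   }

   def bridge(framework_ops):
       """Reconstruct a framework graph in terms of nGraph primitives."""
       ngraph_ops = []
       for op in framework_ops:
           ngraph_ops.extend(TF_TO_NGRAPH[op])
       return ngraph_ops

   print(bridge(["MatMul", "BiasAdd", "Softmax"]))
   # ['Dot', 'Broadcast', 'Add', 'Exp', 'Sum', 'Divide']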
.. _figure-G:
.. figure:: ../graphics/graph-compilers-at-a-glance.png
:width: 700px
:alt: Overview of various graph and tensor compilers.
Problem: Increasing number of kernels
-------------------------------------
Figure G: Overview of various graph and tensor compilers.
Kernel libraries need to be integrated with multiple deep learning frameworks,
and this arduous task becomes even harder due to the increasing number of
kernels required to achieve optimal performance. The number of required kernels
is the product of the number of chip designs, data types, operations, and the
cardinality of each parameter for each operation. In the past, the number of
required kernels was limited, but as AI research and industry rapidly develop,
the number of required kernels is exploding.
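A back-of-the-envelope calculation shows why. The numbers below are made up
purely to illustrate scale:

.. code-block:: python

   # Hypothetical counts, chosen only to show how quickly the product grows.
   chip_designs   = 10    # CPUs, GPUs, and emerging DL accelerators
   data_types     = 4     # e.g. fp32, fp16, bf16, int8
   operations     = 100   # convolution, matmul, pooling, ...
   param_variants = 5     # average cardinality of each op's parameters

   kernels = chip_designs * data_types * operations * param_variants
   print(kernels)  # 20000 hand-tuned kernels for full coverage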
.. _figure-C:
.. _figure-H:
.. figure:: ../graphics/intro_kernel_explosion.png
:width: 555px
:alt:
.. figure:: ../graphics/tensor-compilers-at-a-glance.png
:width: 700px
:alt: A closer look at tensor compilers.
Figure H: A closer look at tensor compilers.
PlaidML addresses the kernel explosion problem in a manner that lifts a heavy
burden off kernel developers. It automatically lowers networks from nGraph
into Tile, a :abbr:`Domain-Specific Language (DSL)` designed for deep learning
that allows developers to express how an operation should calculate tensors in
an intuitive, mathematical form. Integration of PlaidML with nGraph means
extra flexibility to support newer deep learning models in the absence of
hand-optimized kernels for the new operations.
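Tile itself is documented with PlaidML, but as a rough analogy (NumPy's
``einsum`` here, not actual Tile syntax), the idea is to state *what* an
operation computes in index notation and let the compiler derive the kernels:

.. code-block:: python

   import numpy as np

   A = np.random.rand(4, 3)
   B = np.random.rand(3, 5)

   # Index-notation definition of matrix multiply:
   #   C[m, n] = sum over k of A[m, k] * B[k, n]
   # Tile expressions read much like this; PlaidML derives the kernels.
   C = np.einsum('mk,kn->mn', A, B)
   assert np.allclose(C, A @ B)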
Other notable efforts
----------------------
Solution: nGraph and PlaidML
============================
A few other notable efforts in compiler projects include:
Each of the problems above can be solved with nGraph and PlaidML. We developed
nGraph and integrated it with PlaidML so developers wanting to craft solutions
with :abbr:`AI (Artificial Intelligence)` won't have to face such a steep
learning curve in taking their concepts from design to production to scale. The
fundamental efficiencies behind Moore's Law are not dead; rather than fitting
`more transistors on denser and denser circuits`_, we're enabling advances
in compute with more transformers on denser and more data-heavy
:abbr:`Deep Learning Networks (DNNs)`, and making it easier to apply
:abbr:`Machine Learning (ML)` to different industries and problems.
* **TVM** https://github.com/dmlc/tvm
* **XLA** https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
* **Glow** https://arxiv.org/pdf/1805.00907.pdf
For developers with a neural network already in place, executing workloads using
the nGraph Compiler provides further performance benefits and allows for quicker
adaptation of models. It also makes it much easier to upgrade hardware
infrastructure pieces as workloads grow and require more careful balancing.
This documentation provides technical details of nGraph's core functionality,
as well as framework and backend integrations. Creating a compiler stack like nGraph and
PlaidML requires expert knowledge, and we're confident that nGraph and PlaidML
will make life easier for many kinds of developers:
#. Framework owners looking to support new hardware and custom chips.
#. Data scientists and ML developers wishing to accelerate deep learning
performance.
#. New DL accelerator developers creating an end-to-end software stack from
a deep learning framework to their silicon.
.. _more transistors on denser and denser circuits: https://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html
\ No newline at end of file
......@@ -3,6 +3,6 @@
.. _sitemap.rst:
.. toctree::
:includehidden:
:includehidden:
index.rst