.. _core/passes:

Compiler passes
===============

.. toctree::
   :maxdepth: 1
   :caption: Compiler passes 

   list-of-passes.rst 
   passes-that-use-matcher.rst



Overview
--------

*Generic graph optimization passes*

This section discusses how to use nGraph to create a Pass Manager for your
backend, and provides both a simple and a complex example to follow. 

The pass manager infrastructure in nGraph makes it easy to reuse and mix
generic optimization passes. It also permits you to roll your own
device-specific optimizations; the same unified interface and APIs cover both.

Invoking these passes is fairly straightforward, as illustrated by the
following steps and the code below.

#. Create a "pass manager" object (line 1)
#. Populate it with the desired pass or passes (lines 2-4)
#. Invoke the pass manager with a pointer to your unoptimized graph, and 
   it will return a pointer to an optimized graph (lines 5-6)


.. literalinclude:: ../../../../../test/cpu_fusion.cpp
   :language: cpp
   :lines: 2085-2092
   :linenos: 
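
If you do not have the nGraph source tree at hand, the snippet included above
has roughly the following shape (a minimal sketch; the concrete passes and the
surrounding test scaffolding in ``cpu_fusion.cpp`` differ):

.. code-block:: cpp

   #include "ngraph/pass/core_fusion.hpp"
   #include "ngraph/pass/manager.hpp"
   #include "ngraph/pass/reshape_elimination.hpp"

   using namespace ngraph;

   void optimize(std::shared_ptr<Function> func)
   {
       pass::Manager pass_manager;                             // step 1: create a pass manager
       pass_manager.register_pass<pass::CoreFusion>();         // step 2: register the desired passes;
       pass_manager.register_pass<pass::ReshapeElimination>(); //   device-specific passes use the same call
       pass_manager.run_passes(func);                          // step 3: run them; the graph ``func``
                                                               //   points to has now been optimized
   }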

nGraph Core includes a large library of hardware-agnostic passes that are
useful for almost any kind of backend. Some of these passes will be familiar
to anyone with a background in classical compiler design. Others, like the
reshape/transpose elimination and sinking passes, are specific to deep
learning.
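
As a concrete instance, reshape/transpose elimination can cancel a pair of
mutually inverse transposes outright. The sketch below is illustrative only:
it assumes the v0-era ``op::Reshape`` API and made-up shapes.

.. code-block:: cpp

   #include "ngraph/ngraph.hpp"
   #include "ngraph/pass/manager.hpp"
   #include "ngraph/pass/reshape_elimination.hpp"

   using namespace ngraph;

   std::shared_ptr<Function> make_and_simplify()
   {
       // Two back-to-back transposes that undo each other.
       auto a = std::make_shared<op::Parameter>(element::f32, Shape{2, 3});
       auto t1 = std::make_shared<op::Reshape>(a, AxisVector{1, 0}, Shape{3, 2});
       auto t2 = std::make_shared<op::Reshape>(t1, AxisVector{1, 0}, Shape{2, 3});
       auto f = std::make_shared<Function>(t2, ParameterVector{a});

       pass::Manager pm;
       pm.register_pass<pass::ReshapeElimination>();
       pm.run_passes(f); // the transpose pair cancels; f's result feeds directly from `a`
       return f;
   }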


A simple example
----------------

Here's a fairly straightforward function graph with four ops:
:doc:`../../ops/convolution`, :doc:`../../ops/broadcast`, :doc:`../../ops/add`,
and :doc:`../../ops/relu`. With nGraph, backends can rewrite the graph in ways
that are specific to the capabilities of the underlying device or hardware.

When, for example, the device is an Intel® Architecture :abbr:`IA (Intel® Architecture)`
CPU, it can support a fused ``ConvolutionBiasReLU`` kernel. The backend can
rewrite the graph into its own custom ops that map more closely onto the
hardware-specific primitives; here, the match is made via Intel® MKL-DNN.
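
Constructed by hand with the nGraph C++ API, the pre-fusion graph of *Figure A*
looks roughly as follows (a sketch with illustrative shapes; the constructor
arguments follow the v0 op API):

.. code-block:: cpp

   // Convolution -> Broadcast (bias) -> Add -> Relu, as in Figure A.
   auto data = std::make_shared<op::Parameter>(element::f32, Shape{1, 3, 224, 224});
   auto weights = std::make_shared<op::Parameter>(element::f32, Shape{64, 3, 7, 7});
   auto bias = std::make_shared<op::Parameter>(element::f32, Shape{64});

   auto conv = std::make_shared<op::Convolution>(data, weights,
                                                 Strides{2, 2},        // window movement
                                                 Strides{1, 1},        // window dilation
                                                 CoordinateDiff{3, 3}, // padding below
                                                 CoordinateDiff{3, 3}  // padding above
                                                 );
   auto bcast = std::make_shared<op::Broadcast>(bias, conv->get_shape(), AxisSet{0, 2, 3});
   auto sum = std::make_shared<op::Add>(conv, bcast);
   auto relu = std::make_shared<op::Relu>(sum);

A backend pass can then match this chain and collapse it into a single fused
kernel such as ``ConvolutionBiasReLU``.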

.. _figure-simple-compiler:

.. figure:: ../../graphics/simple-compiler-passes.png
   :width: 750px
   :alt: Simple kernel fusion

   Figure A: The left side shows the fully formed function graph prior to
   fusion. After the graph rewrite, the CPU backend implements a number of
   custom fusions.


A complex example
-----------------

The effectiveness of graph-level optimization with nGraph is more striking when
viewed against an actual input graph, such as one arriving from a framework
bridge. Here is a slightly more complicated example, drawn from a topology
called MobileNet that makes heavy use of group convolution.

In group convolution, a batch's feature channels are divided into groups that
are processed independently, rather than every convolution kernel seeing all
of the input feature channels; depthwise convolution, in which each group
holds a single channel, is the limiting case.

With "Group Convolution Fusion", it is possible to optimize a subgraph that has
implemented group convolution by many instances of "ordinary" convolution.

*Figure B* shows an excerpt from ``MobileNet v1``. Here, an image batch and a
filter batch first undergo a "preprocessing" phase in which segments along the
channel axis are sliced out, one per channel group. Separate convolutions then
run on each channel group before the results are finally concatenated back
together.


.. _figure-mobilenet-gc:

.. figure:: ../../graphics/mobilenet-group-conv.png
   :width: 700px
   :alt: MobileNet example

   Figure B: Each of the grouped convolution complexes -- the operations
   within the rectangles on the left -- is too wide to fit legibly in the
   illustration.
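
Schematically, each rectangle expands into a slice/convolve/concatenate pattern
along the following lines (a hedged sketch with only two groups and
illustrative shapes; strides and padding are left at the constructor defaults,
and the real MobileNet subgraphs are far wider):

.. code-block:: cpp

   // Slice data and filters per channel group, convolve each group, concatenate.
   size_t groups = 2, channels_per_group = 16;
   auto data = std::make_shared<op::Parameter>(element::f32, Shape{1, 32, 56, 56});
   auto filters = std::make_shared<op::Parameter>(element::f32, Shape{32, 16, 3, 3});

   NodeVector group_results;
   for (size_t g = 0; g < groups; ++g)
   {
       size_t lo = g * channels_per_group;
       size_t hi = (g + 1) * channels_per_group;
       auto data_slice = std::make_shared<op::Slice>(
           data, Coordinate{0, lo, 0, 0}, Coordinate{1, hi, 56, 56});
       auto filter_slice = std::make_shared<op::Slice>(
           filters, Coordinate{lo, 0, 0, 0}, Coordinate{hi, 16, 3, 3});
       group_results.push_back(std::make_shared<op::Convolution>(data_slice, filter_slice));
   }
   auto concat = std::make_shared<op::Concat>(group_results, 1); // rejoin the channel axis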

The group convolution fusion pass is able to replace each of those giant
subgraphs with a single CPU group convolution node. This is beneficial in
several ways:

* it reduces the sheer node count,
* it provides a direct mapping to MKL-DNN, which has an accelerated group
  convolution implementation, and
* it eliminates unnecessary temporary nodes.