.. _post_quant-label:

Post-training quantization
==========================

Principle
---------

The post-training quantization algorithm is performed in 3 steps:

1) Weights normalization
~~~~~~~~~~~~~~~~~~~~~~~~

All weights are rescaled in the range :math:`[-1.0, 1.0]`.

Per layer normalization
    There is a single weights scaling factor, global to the layer.

Per layer and per output channel normalization
    There is a different weights scaling factor for each output channel. This
    allows a finer grain quantization, with a better usage of the quantized
    range for some output channels, at the expense of more factors to be
    saved in memory.

2) Activations normalization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Activations at each layer are rescaled in the range :math:`[-1.0, 1.0]` for
signed outputs and :math:`[0.0, 1.0]` for unsigned outputs.

The **optimal quantization threshold value** of the activation output of each
layer is determined using the validation dataset (or the test dataset if no
validation dataset is available). This is an iterative process, as the
normalizing factors of the previous layers must be taken into account.

Finding the optimal quantization threshold value of the activation output of
each layer is done as follows:

1) Compute the histogram of the activation values;
2) Find the threshold that minimizes the distance between the original
   distribution and the clipped, quantized distribution. Two distance
   algorithms can be used:

   - Mean Squared Error (MSE);
   - Kullback–Leibler divergence metric (KL-divergence).

Another, simpler, method is to clip the values above a fixed quantile.

.. figure:: /_static/activations_histogram.png
   :alt: Activation values histogram and corresponding thresholds.

The obtained threshold value is therefore the activation scaling factor to be
taken into account during quantization.

3) Quantization
~~~~~~~~~~~~~~~

Inputs, weights, biases and activations are quantized to the desired
:math:`nbbits` precision. The ranges are converted from :math:`[-1.0, 1.0]`
and :math:`[0.0, 1.0]` to :math:`[-2^{nbbits-1}+1, 2^{nbbits-1}-1]` and
:math:`[0, 2^{nbbits}-1]` respectively, taking into account all dependencies.
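The three steps above can be summarized with the following minimal numpy
sketch. It is only an illustration of the principle, not N2D2's actual
implementation: the MSE-based threshold search is reduced to a direct grid
search over candidate thresholds (the intermediate histogram is omitted), and
all tensor shapes, values and helper names are made up::

    # Illustration only: not the N2D2 implementation. All names are made up.
    import numpy as np

    nbbits = 8
    signed_max = 2 ** (nbbits - 1) - 1        # 127 for 8 bits
    unsigned_max = 2 ** nbbits - 1            # 255 for 8 bits

    # 1) Weights normalization (per layer): a single scaling factor per layer.
    weights = np.random.randn(16, 3, 3, 3).astype(np.float32)
    w_scaling = np.abs(weights).max()
    weights_norm = weights / w_scaling        # now in [-1.0, 1.0]

    # 2) Activations normalization: pick the clipping threshold minimizing the
    #    MSE between the original values and their clipped, quantized version
    #    (unsigned outputs assumed, e.g. after a ReLU).
    activations = np.abs(np.random.randn(100000)).astype(np.float32)

    def clipping_mse(values, threshold):
        clipped = np.clip(values, 0.0, threshold)
        quantized = np.round(clipped / threshold * unsigned_max)
        dequantized = quantized / unsigned_max * threshold
        return float(np.mean((values - dequantized) ** 2))

    candidates = np.linspace(activations.max() / 100.0, activations.max(), 100)
    a_scaling = min(candidates, key=lambda t: clipping_mse(activations, t))

    # 3) Quantization: map the normalized ranges to integer ranges.
    weights_q = np.round(weights_norm * signed_max).astype(np.int8)
    activations_q = np.round(
        np.clip(activations / a_scaling, 0.0, 1.0) * unsigned_max).astype(np.uint8)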
Additional optimization strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Weights clipping (optional)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Weights can be clipped using the same strategy as for the activations
(finding the optimal quantization threshold using the weights histogram).
However, this usually leads to worse results than no clipping at all.

Activation scaling factor approximation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The activation scaling factor :math:`\alpha` can be approximated in the
following ways:

- Fixed-point: :math:`\alpha` is approximated by :math:`x 2^{-p}`;
- Single-shift: :math:`\alpha` is approximated by :math:`2^{x}`;
- Double-shift: :math:`\alpha` is approximated by :math:`2^{n} + 2^{m}`.

Usage in N2D2
-------------

All the post-training quantization strategies described above are available
in N2D2 for any export type. To apply post-training quantization during
export, simply use the ``-calib`` command line argument.

The following parameters are available on the command line:

+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| Argument [default value]                   | Description                                                                                                              |
+============================================+==========================================================================================================================+
| ``-calib``                                 | Number of stimuli used for the calibration (``-1`` = use the full validation dataset)                                   |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-calib-reload``                          | Reload and reuse the data of a previous calibration                                                                      |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-wt-clipping-mode`` [``None``]           | Weights clipping mode on export, can be ``None``, ``MSE`` or ``KL-Divergence``                                           |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-act-clipping-mode`` [``MSE``]           | Activations clipping mode on export, can be ``None``, ``MSE``, ``KL-Divergence`` or ``Quantile``                         |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-act-rescaling-mode`` [``Single-shift``] | Activations scaling mode on export, can be ``Floating-point``, ``Fixed-point16``, ``Fixed-point32``, ``Single-shift``    |
|                                            | or ``Double-shift``                                                                                                      |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-act-rescale-per-output`` [0]            | If true (1), rescale activation per output instead of per layer                                                          |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+
| ``-act-quantile-value`` [0.9999]           | If activation clipping mode is ``Quantile``, fraction of the values to keep without clipping                             |
+--------------------------------------------+--------------------------------------------------------------------------------------------------------------------------+

``-act-rescaling-mode``
~~~~~~~~~~~~~~~~~~~~~~~

The ``-act-rescaling-mode`` argument specifies how the activation scaling
must be approximated, for values other than ``Floating-point``. This makes it
possible to avoid floating-point operations altogether in the generated code,
even for complex, multi-branch networks. This is particularly useful on
architectures without FPU or on FPGA.

For the fixed-point scaling approximation (:math:`x 2^{-p}`), two modes are
available: ``Fixed-point16`` and ``Fixed-point32``. ``Fixed-point16``
specifies that :math:`x` must fit in at most 16 bits, whereas
``Fixed-point32`` allows a 32-bit :math:`x`. In the latter case, beware that
overflow can occur on 32-bit-only architectures when computing the scaling
multiplication before the right shift (by :math:`p`).

For the ``Single-shift`` and ``Double-shift`` modes, only right shifts are
allowed (scaling factor < 1.0). For layers with a scaling factor above 1.0,
``Fixed-point16`` is used as a fallback.
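As an illustration of these approximation modes, the following sketch shows
how a given floating-point scaling factor could be approximated in each of
the three integer-friendly forms. This is a toy example written for this
documentation, not the code used by the export generators; the search
heuristics (largest :math:`p` such that :math:`x` still fits in 16 bits,
nearest power of two, greedy residual for the double shift) and the example
value of ``alpha`` are assumptions::

    # Illustration only: not the code used by the N2D2 export generators.
    import math

    alpha = 0.00784   # example activation scaling factor (< 1.0), made up

    # Fixed-point16: alpha ~ x * 2^-p, with x fitting in at most 16 bits (signed).
    p = 0
    while round(alpha * 2 ** (p + 1)) <= 2 ** 15 - 1:
        p += 1
    x = round(alpha * 2 ** p)
    print(f"Fixed-point16: {x} * 2^-{p} = {x * 2.0 ** -p:.6f}")

    # Single-shift: alpha ~ 2^x, here a right shift since alpha < 1.0.
    s = round(math.log2(alpha))
    print(f"Single-shift:  2^{s} = {2.0 ** s:.6f}")

    # Double-shift: alpha ~ 2^n + 2^m, the second shift refining the first one.
    n = math.floor(math.log2(alpha))
    residual = alpha - 2.0 ** n
    m = round(math.log2(residual)) if residual > 0.0 else float("-inf")
    print(f"Double-shift:  2^{n} + 2^{m} = {2.0 ** n + 2.0 ** m:.6f}")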
Command line example
~~~~~~~~~~~~~~~~~~~~

Command line example to run the C++ export on an INI file containing an ONNX
model::

    n2d2 MobileNet_ONNX.ini -seed 1 -w /dev/null -export CPP -fuse -calib -1 -act-clipping-mode KL-Divergence

With the Python API
~~~~~~~~~~~~~~~~~~~

.. autofunction:: n2d2.quantizer.PTQ

Examples and results
--------------------

The post-training quantization accuracies obtained with some models from the
ONNX Model Zoo are reported in the table below, using ``-calib 1000``:

+-------------------------------------------------------+-----------+-------------------+-------------+
| *ONNX Model Zoo* model (specificities)                 | FP acc.   | Fake 8 bits acc.  | 8 bits acc. |
+=======================================================+===========+===================+=============+
| resnet18v1.onnx                                        | 69.83%    | 68.82%            | 68.78%      |
| (``-no-unsigned -act-rescaling-mode Fixed-point``)     |           |                   |             |
+-------------------------------------------------------+-----------+-------------------+-------------+
| mobilenetv2-1.0.onnx                                   | 70.95%    | 65.40%            | 65.40%      |
| (``mobilenetv20_output_flatten0_reshape0`` ignored)    |           |                   |             |
+-------------------------------------------------------+-----------+-------------------+-------------+
| mobilenetv2-1.0.onnx                                   |           | 66.67%            | 66.70%      |
| (``mobilenetv20_output_flatten0_reshape0`` ignored     |           |                   |             |
| ``-act-rescaling-mode Fixed-point``)                   |           |                   |             |
+-------------------------------------------------------+-----------+-------------------+-------------+
| squeezenet/model.onnx                                  | 57.58%    | 57.11%            | 54.98%      |
| (``-no-unsigned -act-rescaling-mode Floating-point``)  |           |                   |             |
+-------------------------------------------------------+-----------+-------------------+-------------+

- *FP acc.* is the floating-point accuracy obtained, before post-training
  quantization, on the model imported from ONNX;
- *Fake 8 bits acc.* is the accuracy obtained after post-training
  quantization in N2D2, in fake-quantized mode (the numbers are quantized but
  the representation is still floating point), as illustrated by the sketch
  below;
- *8 bits acc.* is the accuracy obtained after post-training quantization in
  the N2D2 reference C++ export, in actual 8 bits representation.
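To make the distinction between the last two columns concrete, here is a
minimal numpy sketch of what fake quantization means for a single weight
tensor. It only illustrates the concept; it is not N2D2 code, and the tensor
shape, scale and variable names are made up::

    # Illustration only: not N2D2 code.
    import numpy as np

    nbbits = 8
    qmax = 2 ** (nbbits - 1) - 1              # 127

    weights = np.random.randn(16, 16).astype(np.float32)
    scale = np.abs(weights).max()

    # Fake-quantized mode: the values are quantized, but the representation
    # (and the arithmetic) remains floating point.
    weights_fake = np.round(weights / scale * qmax) / qmax * scale

    # Actual 8 bits representation: integer storage, as in the C++ export.
    weights_int8 = np.round(weights / scale * qmax).astype(np.int8)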