
Commit fdff16e

airMengftian1 authored and committed
update docs of dynamic quantization and onnxrt adaptor as adaptor extension
1 parent 2550315 commit fdff16e

File tree

4 files changed  (+134, -13 lines changed)


docs/Quantization.md

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ Quantization methods include the following three classes:
Intel® Low Precision Optimization Tool currently supports PTQ and QAT. Using MobileNetV2 as an example, this document provides tutorials for both. It also provides helper functions for evaluation.

Dynamic quantization is currently supported only with the onnxruntime backend; please refer to [dynamic quantization](./dynamic_quantization.md) for details.

>Note: These quantization tutorials use [PyTorch examples](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html#model-architecture) as allowed by PyTorch's [License](https://github.com/pytorch/pytorch/blob/master/LICENSE). Refer to [PyTorch](https://github.com/pytorch/tutorials/blob/master/advanced_source/static_quantization_tutorial.py) for updates.

docs/adaptor.md

Lines changed: 94 additions & 0 deletions
@@ -1,2 +1,96 @@
Adaptor
=================
1. query framework capability
2. parse tune config (LPOT config -> framework capability)
3. (optional) pre-optimize
4. do the quantization

A minimal skeleton illustrating these four steps is sketched below.
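
The sketch below is illustrative only, not LPOT's real adaptor base class; the method names follow the four steps above, but the exact signatures are assumptions.

```python
# Hypothetical adaptor skeleton; names and signatures are assumptions, not LPOT's actual API.
from abc import ABC, abstractmethod

class Adaptor(ABC):
    @abstractmethod
    def query_fw_capability(self, model):
        """Step 1: report the framework's quantization capability
        (supported dtypes, schemes, granularities per op type and per op)."""

    @abstractmethod
    def quantize(self, tune_cfg, model, dataloader):
        """Steps 2-4: parse the tune config chosen by LPOT, optionally reuse a
        pre-optimized FP32 graph, then call the backend's quantization tools."""
```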

Extension
=================
Let us take onnxruntime as an example. Onnxruntime is a backend proposed by Microsoft, and by default it is based on the MLAS kernels.
Onnxruntime already has [quantization tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization), so the question becomes how to integrate the onnxruntime quantization tools into LPOT.

1. capability

We should explore the quantization capability first. According to [onnx_quantizer](https://github.com/microsoft/onnxruntime/blob/503b61d897074a494f5798069308ee67d8fb9ace/onnxruntime/python/tools/quantization/onnx_quantizer.py#L77), the quantization tools support the following attributes:
1.1 whether per_channel
1.2 whether reduce_range
1.3 QLinear mode or Integer mode (which is only seen in onnxruntime)
1.4 whether static (static quantization or dynamic quantization)
1.5 weight_qtype (choices are float32, int8 and uint8)
1.6 input_qtype (choices are float32, int8 and uint8)
1.7 quantization_params (None if dynamic quantization)
1.8 & 1.9 nodes_to_quantize, nodes_to_exclude
1.10 op_types_to_quantize

So we can pass a tune capability to LPOT like
```python
{'optypewise': {'conv': {
                    'activation': {'dtype': ['uint8', 'fp32']},
                    'weight': {'dtype': ['int8', 'fp32']},
                    'algorithm': ['minmax'],
                    'granularity': ['per_channel']},
                'matmul': {
                    'activation': {'dtype': ['uint8', 'fp32']},
                    'weight': {'dtype': ['int8', 'fp32']},
                    'algorithm': ['minmax'],
                    'granularity': ['per_channel']}},
 'opwise': {('conv1', 'conv'): {
                'activation': {'dtype': ['uint8', 'fp32']},
                'weight': {'dtype': ['int8', 'fp32']}}}}
```

2. parse tune config

LPOT will generate a tune config from your tune capability like
```python
{'fuse': {'int8': [['CONV2D', 'RELU', 'BN'], ['CONV2D', 'RELU']],
          'fp32': [['CONV2D', 'RELU', 'BN']]},
 'calib_iteration': 10,
 'op': {
    ('op1', 'CONV2D'): {
        'activation': {'dtype': 'uint8',
                       'algorithm': 'minmax',
                       'scheme': 'sym',
                       'granularity': 'per_tensor'},
        'weight': {'dtype': 'int8',
                   'algorithm': 'kl',
                   'scheme': 'asym',
                   'granularity': 'per_channel'}
    },
    ('op2', 'RELU'): {
        'activation': {'dtype': 'int8',
                       'scheme': 'asym',
                       'granularity': 'per_tensor',
                       'algorithm': 'minmax'}
    },
    ('op3', 'CONV2D'): {
        'activation': {'dtype': 'fp32'},
        'weight': {'dtype': 'fp32'}
    },
    ...
 }
}
```

Then you can parse this config into a format that the ONNXQuantizer can accept.
Please check whether your quantization API supports model-wise or op-wise quantization: for example, whether node "conv1" can use the "minmax" algorithm while node "conv2" uses the "KL" algorithm, or whether the whole model must use "minmax" or "KL" in general.
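
For illustration only (this is not the code in lpot/adaptor/onnxrt.py), a rough sketch of splitting the op-wise tune config into the node lists that onnxruntime's quantizer accepts could look like:

```python
# Rough sketch, not LPOT's actual implementation: split the op-wise tune config
# into the nodes_to_quantize / nodes_to_exclude lists used by onnxruntime.
def parse_tune_cfg(tune_cfg):
    nodes_to_quantize, nodes_to_exclude = [], []
    for (op_name, op_type), op_cfg in tune_cfg['op'].items():
        if op_cfg['activation']['dtype'] == 'fp32':
            nodes_to_exclude.append(op_name)   # keep this op in fp32
        else:
            nodes_to_quantize.append(op_name)  # quantize this op
    return nodes_to_quantize, nodes_to_exclude
```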

3. pre-optimize

If your backend supports FP32 graph optimization, you can apply it in **query_fw_capability** and quantize the optimized fp32 model instead of the original model:
>model = self.pre_optimized_model if self.pre_optimized_model else model
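
As a hedged illustration (the helper names here are hypothetical, not LPOT functions), caching the optimized graph inside query_fw_capability might look like:

```python
# Illustrative only; optimize_fp32_graph and _collect_capability are hypothetical helpers.
def query_fw_capability(self, model):
    self.pre_optimized_model = optimize_fp32_graph(model)  # e.g. constant folding, Conv+BN fusion
    model = self.pre_optimized_model if self.pre_optimized_model else model
    return self._collect_capability(model)
```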

4. do quantization

This part depends on your backend implementation; you may refer to [onnxruntime](../lpot/adaptor/onnxrt.py) as an example.
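
As a rough sketch (not the code in onnxrt.py), a dynamic-quantization path built on onnxruntime's public quantization API could look like the following; the argument names follow onnxruntime.quantization but may differ across versions:

```python
# Sketch only: drive onnxruntime's quantization tooling with settings derived
# from the LPOT tune config; parse_tune_cfg is the helper sketched above.
from onnxruntime.quantization import quantize_dynamic, QuantType

def quantize(self, tune_cfg, model_path, output_path):
    nodes_to_quantize, nodes_to_exclude = parse_tune_cfg(tune_cfg)
    quantize_dynamic(model_path, output_path,
                     weight_type=QuantType.QInt8,
                     nodes_to_quantize=nodes_to_quantize,
                     nodes_to_exclude=nodes_to_exclude)
```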

docs/dynamic_quantization.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# Dynamic Quantization

### Currently only the onnxruntime backend supports dynamic quantization.

The key idea with dynamic quantization, as described here[^1], is that we determine the scale factor for activations dynamically, based on the data range observed at runtime. This ensures that the scale factor is “tuned” so that as much signal as possible about each observed dataset is preserved.
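
As a toy illustration of this idea (not LPOT or onnxruntime internals), asymmetric uint8 quantization parameters computed from the runtime-observed range could be:

```python
import numpy as np

# Toy illustration of dynamic quantization parameters: scale and zero point are
# derived from the activation range observed at runtime, not from calibration data.
def dynamic_qparams(activation: np.ndarray):
    lo, hi = float(activation.min()), float(activation.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # make sure zero is exactly representable
    scale = (hi - lo) / 255.0 or 1.0         # guard against constant tensors
    zero_point = int(round(-lo / scale))
    return scale, zero_point
```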

Dynamic quantization is relatively free of tuning parameters, which makes it well suited to being added into production pipelines as a standard part of NLP models.

Take the onnxruntime bert_base model as an example; users can specify the quantization method in a yaml like the following:

```yaml
model:                                     # mandatory. lpot uses this model name and framework name to decide where to save snapshot if tuning.snapshot field is empty.
  name: bert
  framework: onnxrt_integerops             # possible values are tensorflow, mxnet, pytorch or onnxrt

quantization:
  approach: post_training_dynamic_quant    # optional. default value is post_training_static_quant.
                                           # possible values are post_training_static_quant,
                                           # post_training_dynamic_quant,
                                           # quant_aware_training
  calibration:
    sampling_size: 8, 16, 32

tuning:
  accuracy_criterion:
    relative: 0.01                         # optional. default value is relative, other value is absolute. this example allows relative accuracy loss: 1%.
  exit_policy:
    timeout: 0                             # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
  random_seed: 9527                        # optional. random seed for deterministic tuning.
```
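
With such a yaml, a minimal launch sketch (assuming the LPOT 1.x `Quantization` API; the file names, dataloader, and eval function below are placeholders) is:

```python
# Minimal sketch, assuming the LPOT 1.x Quantization API; names below are placeholders.
from lpot import Quantization

quantizer = Quantization('bert_dynamic.yaml')      # the yaml shown above
q_model = quantizer('bert.onnx',                   # fp32 onnx model to quantize
                    q_dataloader=eval_dataloader,  # user-provided dataloader
                    eval_func=eval_func)           # user-provided accuracy function
```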
[^1]: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

examples/onnxrt/image_recognition/resnet50/readme.md

Lines changed: 6 additions & 13 deletions
@@ -1,10 +1,11 @@
# Evaluate performance of ONNX Runtime (ResNet 50)
>ONNX Runtime quantization is under active development. Please use 1.6.0+ to get more quantization support.

This example loads an image classification model exported from PyTorch and confirms its accuracy and speed based on the [ILSVR2012 validation Imagenet dataset](http://www.image-net.org/challenges/LSVRC/2012/downloads). You need to download this dataset yourself.

### Environment
onnx: 1.7.0
-onnxruntime: 1.5.2
onnxruntime: 1.6.0+

### Prepare model
Please refer to [pytorch official guide](https://pytorch.org/docs/stable/onnx.html) for detailed model export. The following is a simple example:
@@ -32,11 +33,10 @@ torch.onnx.export(model, # model being run
### Evaluating
To evaluate the model, run the tuning script with the path to the model:

-```cmd
-python main.py --model_path path/to/model # model pat as *.onnx
-               --benchmark # (Optional) whether to get benchmark results
-               --tune # (Optional) whether to tune a model meeting requirements
-               --config resnet50_v1_5.yaml # (Needed if tune or benchmark)
```bash
# --input_model: model path as *.onnx
bash run_tuning.sh --input_model path/to/model \
                   --config resnet50_v1_5.yaml \
                   --output_model path/to/save
```
### Advanced
Usually we need to bind the program to specific cores (e.g., 4 cores) to get performance numbers representative of real production environments.
@@ -48,10 +48,3 @@ numactl --physcpubind=0-3 --membind=0 python main.py --model_path path/to/model
                   --tune --config resnet50_v1_5.yaml
```

-**for windows**
-```cmd
-start /wait /b /node /affinity f python main.py --model_path path/to/model --benchmark
-               --tune --config resnet50_v1_5.yaml
-```
-You can refer to [windows doc](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/start) for detailed instruction.