
Commit fdff16e

airMengftian1 authored and committed
update docs of dynamic quantization and onnxrt adaptor as adaptor extension
1 parent 2550315 commit fdff16e

File tree

4 files changed  (+134, -13 lines changed)


docs/Quantization.md

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ Quantization methods include the following three classes:
Intel® Low Precision Optimization Tool currently supports PTQ and QAT. Using MobileNetV2 as an example, this document provides tutorials for both. It also provides helper functions for evaluation.

Dynamic quantization is currently supported only with the onnxruntime backend; please refer to [dynamic quantization](./dynamic_quantization.md) for details.

>Note: These quantization tutorials use [PyTorch examples](https://pytorch.org/tutorials/advanced/static_quantization_tutorial.html#model-architecture) as allowed by PyTorch's [License](https://github.com/pytorch/pytorch/blob/master/LICENSE). Refer to [PyTorch](https://github.com/pytorch/tutorials/blob/master/advanced_source/static_quantization_tutorial.py) for updates.

docs/adaptor.md

Lines changed: 94 additions & 0 deletions
@@ -1,2 +1,96 @@
Adaptor
=================
1. query framework capability
2. parse tune config (LPOT config -> framework capability)
3. (optional) pre-optimize
4. do the quantization

A minimal skeleton illustrating these four steps is sketched below.
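
The sketch below is illustrative only, not LPOT's real adaptor base class; the method names follow the four steps above, but the exact signatures are assumptions.

```python
# Hypothetical adaptor skeleton; names and signatures are assumptions, not LPOT's actual API.
from abc import ABC, abstractmethod

class Adaptor(ABC):
    @abstractmethod
    def query_fw_capability(self, model):
        """Step 1: report the framework's quantization capability
        (supported dtypes, schemes, granularities per op type and per op)."""

    @abstractmethod
    def quantize(self, tune_cfg, model, dataloader):
        """Steps 2-4: parse the tune config chosen by LPOT, optionally reuse a
        pre-optimized FP32 graph, then call the backend's quantization tools."""
```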

Extension
=================
Let us take onnxruntime as an example. Onnxruntime is a backend proposed by Microsoft, and by default it is based on the MLAS kernels.
Onnxruntime already has [quantization tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization), so the question becomes how to integrate the onnxruntime quantization tools into LPOT.

1. capability

We should explore the quantization capability first. According to [onnx_quantizer](https://github.com/microsoft/onnxruntime/blob/503b61d897074a494f5798069308ee67d8fb9ace/onnxruntime/python/tools/quantization/onnx_quantizer.py#L77), the quantization tools support the following attributes:
1.1 whether per_channel
1.2 whether reduce_range
1.3 QLinear mode or Integer mode (which is only seen in onnxruntime)
1.4 whether static (static quantization or dynamic quantization)
1.5 weight_qtype (choices are float32, int8 and uint8)
1.6 input_qtype (choices are float32, int8 and uint8)
1.7 quantization_params (None if dynamic quantization)
1.8 & 1.9 nodes_to_quantize, nodes_to_exclude
1.10 op_types_to_quantize

So we can pass a tune capability to LPOT like
```python
{'optypewise': {'conv': {
                    'activation': {'dtype': ['uint8', 'fp32']},
                    'weight': {'dtype': ['int8', 'fp32']},
                    'algorithm': ['minmax'],
                    'granularity': ['per_channel']},
                'matmul': {
                    'activation': {'dtype': ['uint8', 'fp32']},
                    'weight': {'dtype': ['int8', 'fp32']},
                    'algorithm': ['minmax'],
                    'granularity': ['per_channel']}},
 'opwise': {('conv1', 'conv'): {
                'activation': {'dtype': ['uint8', 'fp32']},
                'weight': {'dtype': ['int8', 'fp32']}}}}
```

2. parse tune config

LPOT will generate a tune config from your tune capability like
```python
{'fuse': {'int8': [['CONV2D', 'RELU', 'BN'], ['CONV2D', 'RELU']],
          'fp32': [['CONV2D', 'RELU', 'BN']]},
 'calib_iteration': 10,
 'op': {
    ('op1', 'CONV2D'): {
        'activation': {'dtype': 'uint8',
                       'algorithm': 'minmax',
                       'scheme': 'sym',
                       'granularity': 'per_tensor'},
        'weight': {'dtype': 'int8',
                   'algorithm': 'kl',
                   'scheme': 'asym',
                   'granularity': 'per_channel'}
    },
    ('op2', 'RELU'): {
        'activation': {'dtype': 'int8',
                       'scheme': 'asym',
                       'granularity': 'per_tensor',
                       'algorithm': 'minmax'}
    },
    ('op3', 'CONV2D'): {
        'activation': {'dtype': 'fp32'},
        'weight': {'dtype': 'fp32'}
    },
    ...
 }
}
```

Then you can parse this config into a format that the ONNXQuantizer can accept.
Please check whether your quantization API supports model-wise or op-wise quantization: for example, whether node "conv1" can use the "minmax" algorithm while node "conv2" uses the "KL" algorithm, or whether the whole model must use "minmax" or "KL" in general.
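
For illustration only (this is not the code in lpot/adaptor/onnxrt.py), a rough sketch of splitting the op-wise tune config into the node lists that onnxruntime's quantizer accepts could look like:

```python
# Rough sketch, not LPOT's actual implementation: split the op-wise tune config
# into the nodes_to_quantize / nodes_to_exclude lists used by onnxruntime.
def parse_tune_cfg(tune_cfg):
    nodes_to_quantize, nodes_to_exclude = [], []
    for (op_name, op_type), op_cfg in tune_cfg['op'].items():
        if op_cfg['activation']['dtype'] == 'fp32':
            nodes_to_exclude.append(op_name)   # keep this op in fp32
        else:
            nodes_to_quantize.append(op_name)  # quantize this op
    return nodes_to_quantize, nodes_to_exclude
```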

3. pre-optimize

If your backend supports FP32 graph optimization, you can apply it in **query_fw_capability** and quantize the optimized fp32 model instead of the original model:
>model = self.pre_optimized_model if self.pre_optimized_model else model
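
As a hedged illustration (the helper names here are hypothetical, not LPOT functions), caching the optimized graph inside query_fw_capability might look like:

```python
# Illustrative only; optimize_fp32_graph and _collect_capability are hypothetical helpers.
def query_fw_capability(self, model):
    self.pre_optimized_model = optimize_fp32_graph(model)  # e.g. constant folding, Conv+BN fusion
    model = self.pre_optimized_model if self.pre_optimized_model else model
    return self._collect_capability(model)
```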

4. do quantization

This part depends on your backend implementation; you may refer to [onnxruntime](../lpot/adaptor/onnxrt.py) as an example.
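
As a rough sketch (not the code in onnxrt.py), a dynamic-quantization path built on onnxruntime's public quantization API could look like the following; the argument names follow onnxruntime.quantization but may differ across versions:

```python
# Sketch only: drive onnxruntime's quantization tooling with settings derived
# from the LPOT tune config; parse_tune_cfg is the helper sketched above.
from onnxruntime.quantization import quantize_dynamic, QuantType

def quantize(self, tune_cfg, model_path, output_path):
    nodes_to_quantize, nodes_to_exclude = parse_tune_cfg(tune_cfg)
    quantize_dynamic(model_path, output_path,
                     weight_type=QuantType.QInt8,
                     nodes_to_quantize=nodes_to_quantize,
                     nodes_to_exclude=nodes_to_exclude)
```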

docs/dynamic_quantization.md

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# Dynamic Quantization

### Currently only the onnxruntime backend supports dynamic quantization.

The key idea with dynamic quantization, as described here[^1], is that we determine the scale factor for activations dynamically, based on the data range observed at runtime. This ensures that the scale factor is “tuned” so that as much signal as possible about each observed dataset is preserved.
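
As a toy illustration of this idea (not LPOT or onnxruntime internals), asymmetric uint8 quantization parameters computed from the runtime-observed range could be:

```python
import numpy as np

# Toy illustration of dynamic quantization parameters: scale and zero point are
# derived from the activation range observed at runtime, not from calibration data.
def dynamic_qparams(activation: np.ndarray):
    lo, hi = float(activation.min()), float(activation.max())
    lo, hi = min(lo, 0.0), max(hi, 0.0)      # make sure zero is exactly representable
    scale = (hi - lo) / 255.0 or 1.0         # guard against constant tensors
    zero_point = int(round(-lo / scale))
    return scale, zero_point
```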

Dynamic quantization is relatively free of tuning parameters, which makes it well suited to being added into production pipelines as a standard part of NLP models.

Take the onnxruntime bert_base model as an example; users can specify the quantization method in a yaml like the following:

```yaml
model:                                     # mandatory. lpot uses this model name and framework name to decide where to save snapshot if tuning.snapshot field is empty.
  name: bert
  framework: onnxrt_integerops             # possible values are tensorflow, mxnet, pytorch or onnxrt

quantization:
  approach: post_training_dynamic_quant    # optional. default value is post_training_static_quant.
                                           # possible values are post_training_static_quant,
                                           # post_training_dynamic_quant,
                                           # quant_aware_training
  calibration:
    sampling_size: 8, 16, 32

tuning:
  accuracy_criterion:
    relative: 0.01                         # optional. default value is relative, other value is absolute. this example allows relative accuracy loss: 1%.
  exit_policy:
    timeout: 0                             # optional. tuning timeout (seconds). default value is 0 which means early stop. combine with max_trials field to decide when to exit.
  random_seed: 9527                        # optional. random seed for deterministic tuning.
```
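
With such a yaml, a minimal launch sketch (assuming the LPOT 1.x `Quantization` API; the file names, dataloader, and eval function below are placeholders) is:

```python
# Minimal sketch, assuming the LPOT 1.x Quantization API; names below are placeholders.
from lpot import Quantization

quantizer = Quantization('bert_dynamic.yaml')      # the yaml shown above
q_model = quantizer('bert.onnx',                   # fp32 onnx model to quantize
                    q_dataloader=eval_dataloader,  # user-provided dataloader
                    eval_func=eval_func)           # user-provided accuracy function
```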
[^1]: https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html

examples/onnxrt/image_recognition/resnet50/readme.md

Lines changed: 6 additions & 13 deletions
@@ -1,10 +1,11 @@
# Evaluate performance of ONNX Runtime (ResNet 50)
>ONNX Runtime quantization is under active development. Please use 1.6.0+ to get more quantization support.

This example loads an image classification model exported from PyTorch and confirms its accuracy and speed based on the [ILSVR2012 validation Imagenet dataset](http://www.image-net.org/challenges/LSVRC/2012/downloads). You need to download this dataset yourself.

### Environment
onnx: 1.7.0
-onnxruntime: 1.5.2
onnxruntime: 1.6.0+

### Prepare model
Please refer to [pytorch official guide](https://pytorch.org/docs/stable/onnx.html) for detailed model export. The following is a simple example:
@@ -32,11 +33,10 @@ torch.onnx.export(model, # model being run
### Evaluating
To evaluate the model, run the tuning script with the path to the model:

-```cmd
-python main.py --model_path path/to/model # model pat as *.onnx
-               --benchmark # (Optional) whether to get benchmark results
-               --tune # (Optional) whether to tune a model meeting requirements
-               --config resnet50_v1_5.yaml # (Needed if tune or benchmark)
```bash
# --input_model: model path as *.onnx
bash run_tuning.sh --input_model path/to/model \
                   --config resnet50_v1_5.yaml \
                   --output_model path/to/save
```
### Advanced
Usually we need to bind the program to specific cores (e.g., 4 cores) to get performance numbers representative of real production environments.
@@ -48,10 +48,3 @@ numactl --physcpubind=0-3 --membind=0 python main.py --model_path path/to/model
                   --tune --config resnet50_v1_5.yaml
```

-**for windows**
-```cmd
-start /wait /b /node /affinity f python main.py --model_path path/to/model --benchmark
-               --tune --config resnet50_v1_5.yaml
-```
-You can refer to [windows doc](https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/start) for detailed instruction.