GPT-J 6B inference best known configurations with Intel® Extension for PyTorch.
Use Case | Framework | Model Repo | Branch/Commit/Tag | Optional Patch |
---|---|---|---|---|
Inference | PyTorch | https://huggingface.co/EleutherAI/gpt-j-6 | - | - |
Follow the link to build PyTorch, IPEX, TorchVision, and TCMalloc.
- Install Intel OpenMP
pip install packaging intel-openmp accelerate
- Set the IOMP and TCMalloc preload for better performance
export LD_PRELOAD="<path_to>/tcmalloc/lib/libtcmalloc.so":"<path_to_iomp>/lib/libiomp5.so":$LD_PRELOAD
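A typo in either preload path is easy to miss, because the dynamic loader only reports a warning for a library it cannot find. The following is a minimal sketch, not part of the model scripts: `build_preload` is a hypothetical helper that keeps only the library paths that exist on disk, and the paths in the usage comment are the same placeholders as in the command above.

```shell
# Hypothetical helper: join the existing library paths with ':' so the result
# can be used as an LD_PRELOAD value. Missing paths are reported on stderr.
build_preload() {
    local result=""
    local lib
    for lib in "$@"; do
        if [ -f "$lib" ]; then
            # Add a ':' separator only when result already holds a path.
            result="${result:+$result:}$lib"
        else
            echo "warning: $lib not found, skipping" >&2
        fi
    done
    printf '%s\n' "$result"
}

# Usage (replace the placeholders with your install locations):
# preload="$(build_preload "<path_to>/tcmalloc/lib/libtcmalloc.so" "<path_to_iomp>/lib/libiomp5.so")"
# export LD_PRELOAD="${preload}${LD_PRELOAD:+:$LD_PRELOAD}"
```

The usage line appends any existing `LD_PRELOAD` value only when it is non-empty, which avoids a trailing `:` in the final value.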
- Install datasets
pip install datasets
- Set INPUT_TOKEN before running the model
export INPUT_TOKEN=32  # one of 32, 64, 128, 256, 512, 1024, 2016; 32 and 2016 are preferred for benchmarking
- Set OUTPUT_TOKEN before running the model
export OUTPUT_TOKEN=32  # 32 is preferred, but any other length can be set
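Because INPUT_TOKEN accepts only a fixed set of lengths, it can help to check the value before starting a long run. This is a minimal sketch under that assumption: `is_valid_input_token` is a hypothetical helper that is not part of the model scripts, and the fallback value of 32 simply mirrors the examples above.

```shell
# Hypothetical helper: succeed only if the argument is one of the input
# lengths listed for INPUT_TOKEN.
is_valid_input_token() {
    case "$1" in
        32|64|128|256|512|1024|2016) return 0 ;;
        *) return 1 ;;
    esac
}

# Fall back to 32 when a variable is not already set.
INPUT_TOKEN="${INPUT_TOKEN:-32}"
OUTPUT_TOKEN="${OUTPUT_TOKEN:-32}"

if is_valid_input_token "$INPUT_TOKEN"; then
    export INPUT_TOKEN OUTPUT_TOKEN
    echo "INPUT_TOKEN=$INPUT_TOKEN OUTPUT_TOKEN=$OUTPUT_TOKEN"
else
    echo "unsupported INPUT_TOKEN: $INPUT_TOKEN" >&2
fi
```

OUTPUT_TOKEN is not validated here because any length is allowed for it.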
- About the BATCH_SIZE in scripts
Use BATCH_SIZE=1 for realtime mode and BATCH_SIZE=N for throughput mode. N can be tuned further for the testing host; the default is 1.
- About the BEAM_SIZE in scripts
BEAM_SIZE=4 is used by default.
- Run calibration to generate "qconfig.json" before running INT8.
# Generate "qconfig.json" through calibration (SmoothQuant is the default)
bash do_quantization.sh calibration sq
- Set this environment variable to use FP16 AMX if you are on a supported platform
export DNNL_MAX_CPU_ISA=AVX512_CORE_AMX_FP16
- git clone https://github.com/IntelAI/models.git
- cd models/models_v2/pytorch/gptj/inference/cpu
- Create a virtual environment venv and activate it:
python3 -m venv venv
. ./venv/bin/activate
- Run setup.sh
./setup.sh
- Install the latest CPU versions of torch, torchvision, and intel_extension_for_pytorch
- Set up the required environment parameters
Parameter | export command |
---|---|
TEST_MODE (THROUGHPUT, ACCURACY, REALTIME) | export TEST_MODE=THROUGHPUT |
OUTPUT_DIR | export OUTPUT_DIR=$(pwd) |
PRECISION | export PRECISION=bf16 (fp32, bf32, bf16, fp16, int8-fp32, int8-bf16) |
MODEL_DIR | export MODEL_DIR=$(pwd) |
BATCH_SIZE (optional) | export BATCH_SIZE=256 |
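Before launching the run, it may be worth confirming that the required parameters from the table are set and hold one of the listed values. The sketch below assumes nothing about how the model scripts behave internally: `missing_vars`, `is_valid_test_mode`, and `is_valid_precision` are hypothetical helpers, and the accepted values are copied from the table.

```shell
# Hypothetical helper: print the names of unset or empty variables, one per line.
missing_vars() {
    local name value
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
        fi
    done
}

# Hypothetical helpers: accept only the values listed in the table.
is_valid_test_mode() {
    case "$1" in
        THROUGHPUT|ACCURACY|REALTIME) return 0 ;;
        *) return 1 ;;
    esac
}

is_valid_precision() {
    case "$1" in
        fp32|bf32|bf16|fp16|int8-fp32|int8-bf16) return 0 ;;
        *) return 1 ;;
    esac
}

missing="$(missing_vars TEST_MODE OUTPUT_DIR PRECISION MODEL_DIR)"
if [ -n "$missing" ]; then
    echo "Set these variables first: $(echo $missing)" >&2
elif ! is_valid_test_mode "$TEST_MODE"; then
    echo "unsupported TEST_MODE: $TEST_MODE" >&2
elif ! is_valid_precision "$PRECISION"; then
    echo "unsupported PRECISION: $PRECISION" >&2
else
    echo "environment looks complete"
fi
```

BATCH_SIZE is left out of the check because the table marks it as optional.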
- Run run_model.sh
Single-tile output will typically look like:
---------- Summary: ----------
inference-latency: 246.340 sec.
first-token-latency: 38.192 sec.
rest-token-latency: 6.681 sec.
P90-rest-token-latency: 6.857 sec.
Final results of the inference run can be found in the results.yaml file.
results:
 - key: throughput
   value: N/A
   unit: N/A
 - key: latency
   value: 246.340
   unit: s
 - key: accuracy
   value: N/A
   unit: AP
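To pick a single metric out of the results file from a script, a small awk filter is usually enough. This is a sketch that assumes the file keeps the simple key/value/unit layout shown above: `result_value` is a hypothetical helper, `RESULTS_FILE` is a hypothetical variable, and the default path is an assumption about where the file is written.

```shell
# Hypothetical helper: print the value stored for a given key.
#   $1 = key name (for example "latency"), $2 = path to the results file
result_value() {
    awk -v key="$1" '
        # A line such as "- key: latency" starts a new entry.
        $1 == "-" && $2 == "key:" { current = $3; next }
        # Print the value that belongs to the requested entry, then stop.
        $1 == "value:" && current == key { print $2; exit }
    ' "$2"
}

# Assumption: results.yaml is in the current directory unless RESULTS_FILE says otherwise.
RESULTS_FILE="${RESULTS_FILE:-results.yaml}"
if [ -f "$RESULTS_FILE" ]; then
    echo "latency: $(result_value latency "$RESULTS_FILE")"
fi
```

A full YAML parser would be more robust if the file layout ever changes; the awk filter only relies on the key and value lines being whitespace-separated.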