This guide provides step-by-step instructions to deploy Intel® AI for Enterprise Inference on a single node.
Before running the automation, it is recommended to complete all prerequisites. For a quicker setup, the minimum steps are:
Clone the Enterprise Inference repository, then copy the single-node preset inference config file to the working directory:

```bash
cd ~
git clone https://github.com/opea-project/Enterprise-Inference.git
cd Enterprise-Inference
cp -f docs/examples/single-node/inference-config.cfg core/inventory/inference-config.cfg
```
Modify `inference-config.cfg` as needed. Ensure the `cluster_url` field is set to the DNS name used, and that the paths to the certificate and key files are valid. The Keycloak fields and deployment options can be left unchanged. For systems behind a proxy, refer to the proxy guide.
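For orientation, the relevant part of the config might look like the sketch below. `cluster_url` and `hugging-face-token` are fields named in this guide; the certificate and key field names shown are hypothetical placeholders, so check the preset file for the exact keys.

```ini
# Illustrative sketch only -- field names other than cluster_url and
# hugging-face-token are hypothetical; see the preset file for the exact keys.
cluster_url=api.example.com        ; DNS name the cluster is reached at
cert_file=/path/to/cert.pem        ; hypothetical key: path to TLS certificate
key_file=/path/to/key.pem          ; hypothetical key: path to TLS private key
hugging-face-token=                ; optional: set here instead of exporting
```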
Copy the single-node preset hosts config file to the working directory:

```bash
cp -f docs/examples/single-node/hosts.yaml core/inventory/hosts.yaml
```

📝 Note: The `ansible_user` field is set to `ubuntu` by default. Change it to the actual username used.
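As a rough illustration of where that field lives (the preset `hosts.yaml` defines the real group structure; the host name and IP below are placeholders), an Ansible inventory entry with a custom `ansible_user` looks like:

```yaml
# Illustrative shape only -- the preset hosts.yaml defines the actual groups.
all:
  hosts:
    node1:
      ansible_host: 192.168.1.10   # placeholder: IP of the target node
      ansible_user: myuser         # replace the 'ubuntu' default with your SSH user
```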
Now run the automation using the configured files.

```bash
cd core
chmod +x inference-stack-deploy.sh
```

Export the Hugging Face token as an environment variable, replacing `<<Your_Hugging_Face_Token_ID>>` with the actual Hugging Face token. Alternatively, set `hugging-face-token` to the token value inside `inference-config.cfg`.
```bash
export HUGGINGFACE_TOKEN=<<Your_Hugging_Face_Token_ID>>
```

Follow the steps below depending on the hardware platform. The `--models` argument can be excluded, in which case a prompt will offer a list of models to select from.
Run the command below to deploy the Llama 3.1 8B parameter model on CPU.
```bash
./inference-stack-deploy.sh --models "21" --cpu-or-gpu "cpu" --hugging-face-token $HUGGINGFACE_TOKEN
```

📝 Note: If running on Intel® Gaudi® AI Accelerators, ensure firmware and drivers are up to date using the automated setup scripts before deployment.
Run the command below to deploy the Llama 3.1 8B parameter model on Intel® Gaudi®. For Gaudi 3, set `--cpu-or-gpu` to `gaudi3` instead.
```bash
./inference-stack-deploy.sh --models "1" --cpu-or-gpu "gpu" --hugging-face-token $HUGGINGFACE_TOKEN
```

Select Option 1 and confirm the Yes/No prompt.
This will deploy the setup automatically. If any issues are encountered, double-check the prerequisites and configuration files.
On the node, run the following commands to verify that Intel® AI for Enterprise Inference was deployed successfully:
If using Keycloak, generate a token with the `generate-token.sh` script. Ensure the values of the variables match what is set in `inference-config.cfg`. The script also sets the environment variables `BASE_URL` and `TOKEN` used in the next step.

```bash
source scripts/generate-token.sh
```

If not using Keycloak, set the environment variable `BASE_URL` to the DNS name used in the setup, e.g. `api.example.com`.
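For the non-Keycloak case, the variable can be exported directly; `api.example.com` here is a placeholder for the actual DNS name from the config:

```shell
# Placeholder domain -- replace with the DNS name set in inference-config.cfg
export BASE_URL=api.example.com
echo "Requests will target https://${BASE_URL}"
```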
See the example commands below to test inference with Llama 3.1 8B Instruct. To list the deployed models (if using Keycloak), run:

```bash
kubectl get apisixroutes
```

To test on CPU only (note that `vllmcpu` is appended to the URL):
```bash
curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct-vllmcpu/v1/completions -X POST -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 50, "temperature": 0}' -H 'Content-Type: application/json' -H "Authorization: Bearer $TOKEN"
```

To test on Intel® Gaudi® AI Accelerators:
```bash
curl -k https://${BASE_URL}/Llama-3.1-8B-Instruct/v1/completions -X POST -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "What is Deep Learning?", "max_tokens": 50, "temperature": 0}' -H 'Content-Type: application/json' -H "Authorization: Bearer $TOKEN"
```

With the model deployed on the server, refer to the post-deployment instructions for further options.
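The same completions request can also be issued from Python. The sketch below is not part of the official tooling: it assumes the `BASE_URL` and `TOKEN` environment variables set earlier, targets the Gaudi route (append `-vllmcpu` to the model path for the CPU route), and disables TLS verification only to mirror curl's `-k` flag.

```python
import json
import os
import ssl
import urllib.request


def build_completion_request(base_url: str, token: str, prompt: str):
    """Build the same OpenAI-compatible completions request as the curl example."""
    url = f"https://{base_url}/Llama-3.1-8B-Instruct/v1/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 50,
        "temperature": 0,
    }).encode("utf-8")
    return url, headers, body


if __name__ == "__main__" and "BASE_URL" in os.environ:
    url, headers, body = build_completion_request(
        os.environ["BASE_URL"], os.environ.get("TOKEN", ""), "What is Deep Learning?"
    )
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    # Unverified context mirrors curl -k; use proper certificates in production.
    ctx = ssl._create_unverified_context()
    with urllib.request.urlopen(req, context=ctx) as resp:
        print(json.load(resp)["choices"][0]["text"])
```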