Commit 1bd56af

Update TGI image versions (#1625)
Signed-off-by: xiaotia3 <[email protected]>
1 parent 583428c · commit 1bd56af

36 files changed: +54 -52 lines changed


AgentQnA/docker_compose/amd/gpu/rocm/launch_agent_service_tgi_rocm.sh

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 WORKPATH=$(dirname "$PWD")/..
 export ip_address=${host_ip}
 export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-export AGENTQNA_TGI_IMAGE=ghcr.io/huggingface/text-generation-inference:2.3.1-rocm
+export AGENTQNA_TGI_IMAGE=ghcr.io/huggingface/text-generation-inference:2.4.1-rocm
 export AGENTQNA_TGI_SERVICE_PORT="8085"

 # LLM related environment variables
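
For anyone verifying this bump locally, the exported tag can be pulled and inspected on its own before running the launch script; a minimal sketch (the image tag comes from the hunk above, the commands are plain Docker CLI and not part of the script):

  # Sketch: confirm the new ROCm image resolves and is present locally.
  export AGENTQNA_TGI_IMAGE=ghcr.io/huggingface/text-generation-inference:2.4.1-rocm
  docker pull "${AGENTQNA_TGI_IMAGE}"
  docker image inspect "${AGENTQNA_TGI_IMAGE}" --format '{{.Id}}'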

AgentQnA/docker_compose/amd/gpu/rocm/set_env.sh

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 WORKPATH=$(dirname "$PWD")/..
 export ip_address=${host_ip}
 export HUGGINGFACEHUB_API_TOKEN=${your_hf_api_token}
-export AGENTQNA_TGI_IMAGE=ghcr.io/huggingface/text-generation-inference:2.3.1-rocm
+export AGENTQNA_TGI_IMAGE=ghcr.io/huggingface/text-generation-inference:2.4.1-rocm
 export AGENTQNA_TGI_SERVICE_PORT="19001"

 # LLM related environment variables

AgentQnA/docker_compose/intel/hpu/gaudi/tgi_gaudi.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@

 services:
   tgi-server:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
+    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
     container_name: tgi-server
     ports:
       - "8085:80"

AudioQnA/docker_compose/amd/gpu/rocm/compose.yaml

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ services:
       https_proxy: ${https_proxy}
     restart: unless-stopped
   tgi-service:
-    image: ghcr.io/huggingface/text-generation-inference:2.3.1-rocm
+    image: ghcr.io/huggingface/text-generation-inference:2.4.1-rocm
     container_name: tgi-service
     ports:
       - "3006:80"

AudioQnA/kubernetes/gmc/README.md

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ The AudioQnA application is defined as a Custom Resource (CR) file that the abov

 The AudioQnA uses the below prebuilt images if you choose a Xeon deployment

-- tgi-service: ghcr.io/huggingface/text-generation-inference:1.4
+- tgi-service: ghcr.io/huggingface/text-generation-inference:2.4.1
 - llm: opea/llm-textgen:latest
 - asr: opea/asr:latest
 - whisper: opea/whisper:latest
@@ -25,7 +25,7 @@ The AudioQnA uses the below prebuilt images if you choose a Xeon deployment
 Should you desire to use the Gaudi accelerator, two alternate images are used for the embedding and llm services.
 For Gaudi:

-- tgi-service: ghcr.io/huggingface/tgi-gaudi:2.0.6
+- tgi-service: ghcr.io/huggingface/tgi-gaudi:2.3.1
 - whisper-gaudi: opea/whisper-gaudi:latest
 - speecht5-gaudi: opea/speecht5-gaudi:latest

AudioQnA/tests/test_compose_on_rocm.sh

Lines changed: 2 additions & 2 deletions
@@ -34,8 +34,8 @@ function build_docker_images() {
 echo "Build all the images with --no-cache, check docker_image_build.log for details..."
 service_list="audioqna audioqna-ui whisper speecht5"
 docker compose -f build.yaml build ${service_list} --no-cache > ${LOG_PATH}/docker_image_build.log
-echo "docker pull ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
-docker pull ghcr.io/huggingface/text-generation-inference:2.3.1-rocm
+echo "docker pull ghcr.io/huggingface/text-generation-inference:2.4.1-rocm"
+docker pull ghcr.io/huggingface/text-generation-inference:2.4.1-rocm
 docker images && sleep 1s
 }

ChatQnA/benchmark/accuracy/README.md

Lines changed: 2 additions & 2 deletions
@@ -45,10 +45,10 @@ To setup a LLM model, we can use [tgi-gaudi](https://github.com/huggingface/tgi-
 ```
 # please set your llm_port and hf_token

-docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2
+docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2

 # for better performance, set `PREFILL_BATCH_BUCKET_SIZE`, `BATCH_BUCKET_SIZE`, `max-batch-total-tokens`, `max-batch-prefill-tokens`
-docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
+docker run -p {your_llm_port}:80 --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e PT_HPU_ENABLE_LAZY_COLLECTIVES=true -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HF_TOKEN={your_hf_token} -e PREFILL_BATCH_BUCKET_SIZE=1 -e BATCH_BUCKET_SIZE=8 --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.3.1 --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --max-input-tokens 2048 --max-total-tokens 4096 --sharded true --num-shard 2 --max-batch-total-tokens 65536 --max-batch-prefill-tokens 2048
 ```

 ### Prepare Dataset
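
Once either container from the README snippet above is running, a quick request against TGI's standard /generate route confirms the upgraded tgi-gaudi:2.3.1 image serves traffic; a minimal sketch (substitute {your_llm_port} as in the commands above):

  # Sketch: smoke-test the upgraded TGI endpoint.
  curl http://localhost:{your_llm_port}/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 64}}'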

ChatQnA/benchmark/accuracy_faqgen/launch_tgi.sh

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ docker run -it --rm \
 --ipc=host \
 -e HTTPS_PROXY=$https_proxy \
 -e HTTP_PROXY=$https_proxy \
-ghcr.io/huggingface/tgi-gaudi:2.0.6 \
+ghcr.io/huggingface/tgi-gaudi:2.3.1 \
 --model-id $model_name \
 --max-input-tokens $max_input_tokens \
 --max-total-tokens $max_total_tokens \

ChatQnA/docker_compose/amd/gpu/rocm/README.md

Lines changed: 1 addition & 1 deletion
@@ -190,7 +190,7 @@ Change the `xxx_MODEL_ID` below for your needs.
 # Example: NGINX_PORT=80
 export HOST_IP=${host_ip}
 export NGINX_PORT=${your_nginx_port}
-export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
+export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.4.1-rocm"
 export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
 export CHATQNA_LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"

ChatQnA/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 8 additions & 8 deletions
@@ -158,7 +158,7 @@ The default deployment utilizes Gaudi devices primarily for the `vllm-service`,

 ### compose_tgi.yaml - TGI Deployment

-The TGI (Text Generation Inference) deployment and the default deployment differ primarily in their service configurations and specific focus on handling large language models (LLMs). The TGI deployment includes a unique `tgi-service`, which utilizes the `ghcr.io/huggingface/tgi-gaudi:2.0.6` image and is specifically configured to run on Gaudi hardware. This service is designed to handle LLM tasks with optimizations such as `ENABLE_HPU_GRAPH` and `USE_FLASH_ATTENTION`. The `chatqna-gaudi-backend-server` in the TGI deployment depends on the `tgi-service`, whereas in the default deployment, it relies on the `vllm-service`.
+The TGI (Text Generation Inference) deployment and the default deployment differ primarily in their service configurations and specific focus on handling large language models (LLMs). The TGI deployment includes a unique `tgi-service`, which utilizes the `ghcr.io/huggingface/tgi-gaudi:2.3.1` image and is specifically configured to run on Gaudi hardware. This service is designed to handle LLM tasks with optimizations such as `ENABLE_HPU_GRAPH` and `USE_FLASH_ATTENTION`. The `chatqna-gaudi-backend-server` in the TGI deployment depends on the `tgi-service`, whereas in the default deployment, it relies on the `vllm-service`.

 | Service Name | Image Name | Gaudi Specific |
 | ---------------------------- | ----------------------------------------------------- | -------------- |
@@ -167,7 +167,7 @@ The TGI (Text Generation Inference) deployment and the default deployment differ
 | tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 | No |
 | retriever | opea/retriever:latest | No |
 | tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | 1 card |
-| **tgi-service** | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Configurable |
+| **tgi-service** | ghcr.io/huggingface/tgi-gaudi:2.3.1 | Configurable |
 | chatqna-gaudi-backend-server | opea/chatqna:latest | No |
 | chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No |
 | chatqna-gaudi-nginx-server | opea/nginx:latest | No |
@@ -178,7 +178,7 @@ This deployment may allocate more Gaudi resources to the tgi-service to optimize

 The FAQs(frequently asked questions and answers) generation Deployment will generate FAQs instead of normally text generation. It add a new microservice called `llm-faqgen`, which is a microservice that interacts with the TGI/vLLM LLM server to generate FAQs from input text.

-The TGI (Text Generation Inference) deployment and the default deployment differ primarily in their service configurations and specific focus on handling large language models (LLMs). The TGI deployment includes a unique `tgi-service`, which utilizes the `ghcr.io/huggingface/tgi-gaudi:2.0.6` image and is specifically configured to run on Gaudi hardware. This service is designed to handle LLM tasks with optimizations such as `ENABLE_HPU_GRAPH` and `USE_FLASH_ATTENTION`. The `chatqna-gaudi-backend-server` in the TGI deployment depends on the `tgi-service`, whereas in the default deployment, it relies on the `vllm-service`.
+The TGI (Text Generation Inference) deployment and the default deployment differ primarily in their service configurations and specific focus on handling large language models (LLMs). The TGI deployment includes a unique `tgi-service`, which utilizes the `ghcr.io/huggingface/tgi-gaudi:2.3.1` image and is specifically configured to run on Gaudi hardware. This service is designed to handle LLM tasks with optimizations such as `ENABLE_HPU_GRAPH` and `USE_FLASH_ATTENTION`. The `chatqna-gaudi-backend-server` in the TGI deployment depends on the `tgi-service`, whereas in the default deployment, it relies on the `vllm-service`.

 | Service Name | Image Name | Gaudi Use |
 | ---------------------------- | ----------------------------------------------------- | ------------ |
@@ -214,13 +214,13 @@ This setup might allow for more Gaudi devices to be dedicated to the `vllm-servi

 ### compose_guardrails.yaml - Guardrails Deployment

-The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over the default deployment by incorporating additional services focused on safety and ChatQnA response control. Notably, it includes the `tgi-guardrails-service` and `guardrails` services. The `tgi-guardrails-service` uses the `ghcr.io/huggingface/tgi-gaudi:2.0.6` image and is configured to run on Gaudi hardware, providing functionality to manage input constraints and ensure safe operations within defined limits. The guardrails service, using the `opea/guardrails:latest` image, acts as a safety layer that interfaces with the `tgi-guardrails-service` to enforce safety protocols and manage interactions with the large language model (LLM). This backend server now depends on the `tgi-guardrails-service` and `guardrails`, alongside existing dependencies like `redis-vector-db`, `tei-embedding-service`, `retriever`, `tei-reranking-service`, and `vllm-service`. The environment configurations for the backend are also updated to include settings for the guardrail services.
+The _compose_guardrails.yaml_ Docker Compose file introduces enhancements over the default deployment by incorporating additional services focused on safety and ChatQnA response control. Notably, it includes the `tgi-guardrails-service` and `guardrails` services. The `tgi-guardrails-service` uses the `ghcr.io/huggingface/tgi-gaudi:2.3.1` image and is configured to run on Gaudi hardware, providing functionality to manage input constraints and ensure safe operations within defined limits. The guardrails service, using the `opea/guardrails:latest` image, acts as a safety layer that interfaces with the `tgi-guardrails-service` to enforce safety protocols and manage interactions with the large language model (LLM). This backend server now depends on the `tgi-guardrails-service` and `guardrails`, alongside existing dependencies like `redis-vector-db`, `tei-embedding-service`, `retriever`, `tei-reranking-service`, and `vllm-service`. The environment configurations for the backend are also updated to include settings for the guardrail services.

 | Service Name | Image Name | Gaudi Specific | Uses LLM |
 | ---------------------------- | ----------------------------------------------------- | -------------- | -------- |
 | redis-vector-db | redis/redis-stack:7.2.0-v9 | No | No |
 | dataprep-redis-service | opea/dataprep:latest | No | No |
-| _tgi-guardrails-service_ | ghcr.io/huggingface/tgi-gaudi:2.0.6 | 1 card | Yes |
+| _tgi-guardrails-service_ | ghcr.io/huggingface/tgi-gaudi:2.3.1 | 1 card | Yes |
 | _guardrails_ | opea/guardrails:latest | No | No |
 | tei-embedding-service | ghcr.io/huggingface/text-embeddings-inference:cpu-1.6 | No | No |
 | retriever | opea/retriever:latest | No | No |
@@ -262,8 +262,8 @@ The table provides a comprehensive overview of the ChatQnA services utilized acr
 | retriever | opea/retriever:latest | No | Retrieves data from the Redis database and interacts with embedding services. |
 | tei-reranking-service | ghcr.io/huggingface/tei-gaudi:1.5.0 | Yes | Reranks text embeddings, typically using Gaudi hardware for enhanced performance. |
 | vllm-service | opea/vllm-gaudi:latest | No | Handles large language model (LLM) tasks, utilizing Gaudi hardware. |
-| tgi-service | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Yes | Specific to the TGI deployment, focuses on text generation inference using Gaudi hardware. |
-| tgi-guardrails-service | ghcr.io/huggingface/tgi-gaudi:2.0.6 | Yes | Provides guardrails functionality, ensuring safe operations within defined limits. |
+| tgi-service | ghcr.io/huggingface/tgi-gaudi:2.3.1 | Yes | Specific to the TGI deployment, focuses on text generation inference using Gaudi hardware. |
+| tgi-guardrails-service | ghcr.io/huggingface/tgi-gaudi:2.3.1 | Yes | Provides guardrails functionality, ensuring safe operations within defined limits. |
 | guardrails | opea/guardrails:latest | Yes | Acts as a safety layer, interfacing with the `tgi-guardrails-service` to enforce safety protocols. |
 | chatqna-gaudi-backend-server | opea/chatqna:latest | No | Serves as the backend for the ChatQnA application, with variations depending on the deployment. |
 | chatqna-gaudi-ui-server | opea/chatqna-ui:latest | No | Provides the user interface for the ChatQnA application. |
@@ -288,7 +288,7 @@ The `ghcr.io/huggingface/text-embeddings-inference:cpu-1.6` image supporting `te

 ### tgi-gaurdrails-service

-The `tgi-guardrails-service` uses the `GUARDRAILS_MODEL_ID` parameter to select a [supported model](https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#tested-models-and-configurations) for the associated `ghcr.io/huggingface/tgi-gaudi:2.0.6` image. Like the `tei-embedding-service` and `tei-reranking-service` services, it doesn't use the `NUM_CARDS` parameter.
+The `tgi-guardrails-service` uses the `GUARDRAILS_MODEL_ID` parameter to select a [supported model](https://github.com/huggingface/tgi-gaudi?tab=readme-ov-file#tested-models-and-configurations) for the associated `ghcr.io/huggingface/tgi-gaudi:2.3.1` image. Like the `tei-embedding-service` and `tei-reranking-service` services, it doesn't use the `NUM_CARDS` parameter.

 ## Conclusion
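
Since the README text above references `GUARDRAILS_MODEL_ID` for the guardrails image, a minimal sketch of exercising the compose_guardrails.yaml change shown below (the model ID here is illustrative, not part of this commit):

  # Sketch: export an illustrative guard model and start only the guardrails pair.
  export GUARDRAILS_MODEL_ID="meta-llama/Meta-Llama-Guard-2-8B"   # illustrative
  docker compose -f compose_guardrails.yaml up -d tgi-guardrails-service guardrails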

ChatQnA/docker_compose/intel/hpu/gaudi/compose_guardrails.yaml

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ services:
       TEI_ENDPOINT: http://tei-embedding-service:80
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
   tgi-guardrails-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
+    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
     container_name: tgi-guardrails-server
     ports:
       - "8088:80"

ChatQnA/docker_compose/intel/hpu/gaudi/compose_tgi.yaml

Lines changed: 1 addition & 1 deletion
@@ -80,7 +80,7 @@ services:
       MAX_WARMUP_SEQUENCE_LENGTH: 512
     command: --model-id ${RERANK_MODEL_ID} --auto-truncate
   tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
+    image: ghcr.io/huggingface/tgi-gaudi:2.3.1
     container_name: tgi-gaudi-server
     ports:
       - "8005:80"
