Reverting the TGI image version for LLAMA multiple GPUs in GKE samples #1591
Conversation
… changing the mountPath to /tmp there is some regression in the newer TGI image which results in running out of GPU memory on L4 and requires at least an A2 node. Rolling back the image version to get the sample working while the investigation happens in the background.
Hi team,
I'm currently reviewing this pull request and will provide my feedback in a few minutes. In the meantime, here's a summary of the changes for you and other reviewers to quickly get up to speed:
This pull request, authored by @raushan2016, reverts the Text Generation Inference (TGI) image version used in several Google Kubernetes Engine (GKE) samples for large language models (LLMs) running on multiple GPUs. The intent is to resolve a regression introduced by a newer TGI image version that caused out-of-GPU memory errors on L4 nodes, requiring the use of more powerful A2 nodes.
Here's a breakdown of the changes:
- Image Version Reversion: The PR changes the `image` specification in several YAML files for the LLM samples. The current image, `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311`, is replaced with older, compatible versions (a sketch of the resulting container spec follows this list):
  - `ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
  - `ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
  - `ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:2.0.4` (line 32)
  - `ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
- Clarifying Comments: Comments were added to the YAML files to explain the `/data` mount path's relation to the `HF_HOME` environment variable (lines 54-56, 59-61, 61-63, 60-62). This improves readability and understanding.
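For reviewers skimming the diff, here is a minimal sketch of what the relevant part of each manifest looks like after the revert. Only the image tag and the `/data` mount path come from this PR; every other field, name, and value is an illustrative assumption, not copied from the samples:

```yaml
# Sketch of the container spec in text-generation-inference.yaml after the
# revert. The image tag and /data mount path come from the PR; the other
# fields and values are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
      - name: text-generation-inference
        image: ghcr.io/huggingface/text-generation-inference:1.4.3  # reverted tag
        resources:
          limits:
            nvidia.com/gpu: 2  # illustrative; the samples shard across multiple L4 GPUs
        volumeMounts:
        # TGI caches model weights under HF_HOME, which in the reverted image
        # points at /data, so scratch space is mounted there.
        - mountPath: /data
          name: data
      volumes:
      - name: data
        emptyDir: {}
```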
The PR addresses issue #1581. A deeper investigation into the regression in the newer TGI images will happen separately.
A quick haiku to lighten the mood:
Code flows like a stream,
Bugs fixed, a smoother journey,
Joy in the review.
This pull request addresses the OOM issues on L4 and A2 nodes by reverting the TGI image versions for falcon-40b, llama2-70b, and mixtral-8x7b to 1.4.3. The llama3-70b image is reverted to 2.0.4 for consistency. The approach seems appropriate given the circumstances, and the added comments explaining the mountPath are helpful. It's crucial to continue investigating the root cause of the OOM issue in the newer TGI images.
@kenthua I am ok with this change.
/lgtm
In syncing with @raushan2016, we have a deeplearning container image that does work for this example.
can we add the …
Thanks for the PR, added some nits! Also I'll check the issues with both Falcon 40B and Llama 2 70B and come back with another PR on top once those are solved, thanks for your time again! 🤗
ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml (outdated; resolved)
ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml (outdated; resolved)
Co-authored-by: Alvaro Bartolome <[email protected]>
Description
The current image overrides HF_HOME to /tmp from /data. Even after changing the mountPath to /tmp, there is some regression in the newer TGI image which results in running out of GPU memory on L4 and requires at least an A2 node. Rolling back the image version gets the sample working while the investigation happens in the background.
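The mountPath change mentioned above refers to following the newer image's HF_HOME override. A minimal sketch of that attempted (and insufficient) workaround, with the volume name and surrounding structure as illustrative assumptions:

```yaml
# Sketch of the workaround tried before the revert: the newer image overrides
# HF_HOME to /tmp, so the scratch volume was remounted there to match.
# Volume name and structure are illustrative, not copied from the samples.
containers:
- name: text-generation-inference
  volumeMounts:
  - mountPath: /tmp   # moved from /data to follow the image's HF_HOME override
    name: data
```

Even with the cache location and the mount aligned this way, the newer image still ran out of GPU memory on L4 nodes, which is why reverting the image version was chosen instead.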
Issue: #1581
Tasks