Reverting the TGI image version for LLAMA multiple GPUs in GKE samples #1591
Conversation
… changing the mountPath to /tmp there is some regression in the newer TGI image which results in running out of GPU memory on L4 and requires at least an A2 node. Rolling back the image version to get the sample working while the investigation happens in the background.
Hi team,
I'm currently reviewing this pull request and will provide my feedback in a few minutes. In the meantime, here's a summary of the changes for you and other reviewers to quickly get up to speed:
This pull request, authored by @raushan2016, reverts the Text Generation Inference (TGI) image version used in several Google Kubernetes Engine (GKE) samples for large language models (LLMs) running on multiple GPUs. The intent is to resolve a regression introduced by a newer TGI image version that caused out-of-GPU memory errors on L4 nodes, requiring the use of more powerful A2 nodes.
Here's a breakdown of the changes:
- Image Version Reversion: The PR changes the `image` specification in several YAML files for the LLM samples. The current image, `us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311`, is replaced with older, compatible versions (a sketch of the resulting container spec follows this list):
  - `ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
  - `ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
  - `ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:2.0.4` (line 32)
  - `ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml`: changed to `ghcr.io/huggingface/text-generation-inference:1.4.3` (line 32)
- Clarifying Comments: Comments were added to the YAML files to explain the `/data` mount path's relation to the `HF_HOME` environment variable (lines 54-56, 59-61, 61-63, 60-62). This improves readability and understanding.
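For reviewers skimming the diff, here is a minimal sketch of what the relevant part of each manifest looks like after the revert. Only the image tag and the `/data` mount path come from this PR; every other field, name, and value is an illustrative assumption, not copied from the samples:

```yaml
# Sketch of the container spec in text-generation-inference.yaml after the
# revert. The image tag and /data mount path come from the PR; the other
# fields and values are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi
  template:
    metadata:
      labels:
        app: tgi
    spec:
      containers:
      - name: text-generation-inference
        image: ghcr.io/huggingface/text-generation-inference:1.4.3  # reverted tag
        resources:
          limits:
            nvidia.com/gpu: 2  # illustrative; the samples shard across multiple L4 GPUs
        volumeMounts:
        # TGI caches model weights under HF_HOME, which in the reverted image
        # points at /data, so scratch space is mounted there.
        - mountPath: /data
          name: data
      volumes:
      - name: data
        emptyDir: {}
```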
The PR addresses issue #1581. A deeper investigation into the regression in the newer TGI images will happen separately.
A quick haiku to lighten the mood:
Code flows like a stream,
Bugs fixed, a smoother journey,
Joy in the review.
This pull request addresses the OOM issues on L4 and A2 nodes by reverting the TGI image versions for falcon-40b, llama2-70b, and mixtral-8x7b to 1.4.3. The llama3-70b image is reverted to 2.0.4 for consistency. The approach seems appropriate given the circumstances, and the added comments explaining the mountPath are helpful. It's crucial to continue investigating the root cause of the OOM issue in the newer TGI images.
@kenthua I am ok with this change.
/lgtm
In syncing with @raushan2016, we have a deeplearning container image that does work for this example.
can we add the …
Thanks for the PR, added some nits! Also I'll check the issues with both Falcon 40B and Llama 2 70B and come back with another PR on top once those are solved, thanks for your time again! 🤗
ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml (outdated; resolved)
ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml (outdated; resolved)
Co-authored-by: Alvaro Bartolome <[email protected]>
Description
The current image overrides HF_HOME to /tmp from /data. Even after changing the mountPath to /tmp, there is some regression in the newer TGI image which results in running out of GPU memory on L4 and requires at least an A2 node. Rolling back the image version gets the sample working while the investigation happens in the background.
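The mountPath change mentioned above refers to following the newer image's HF_HOME override. A minimal sketch of that attempted (and insufficient) workaround, with the volume name and surrounding structure as illustrative assumptions:

```yaml
# Sketch of the workaround tried before the revert: the newer image overrides
# HF_HOME to /tmp, so the scratch volume was remounted there to match.
# Volume name and structure are illustrative, not copied from the samples.
containers:
- name: text-generation-inference
  volumeMounts:
  - mountPath: /tmp   # moved from /data to follow the image's HF_HOME override
    name: data
```

Even with the cache location and the mount aligned this way, the newer image still ran out of GPU memory on L4 nodes, which is why reverting the image version was chosen instead.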
Issue: #1581
Tasks