
Reverting the TGI image version for LLAMA multiple GPUs in GKE samples #1591

Conversation

raushan2016
Member

@raushan2016 raushan2016 commented Jan 15, 2025

Description

The current image overrides HF_HOME from /data to /tmp. Even after changing the mountPath to /tmp, there is a regression in the newer TGI image that results in out-of-GPU-memory errors on L4 and requires at least an A2 node. Rolling back the image version gets the sample working while the investigation happens in the background.
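For context, the interaction between the volume mount and HF_HOME can be sketched as follows. This is a minimal illustration, not an excerpt from the actual manifests; the container and volume names are assumptions:

```yaml
containers:
- name: text-generation-inference
  volumeMounts:
  - name: model-cache
    mountPath: /data   # older TGI images cache model weights under HF_HOME=/data
  # The newer image sets HF_HOME=/tmp, so a mount at /data no longer backs
  # the model cache unless the mountPath is changed to match.
```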

Issue: #1581

Tasks

  • The contributing guide has been read and followed.
  • The samples added / modified have been fully tested.
  • Workflow files have been added / modified, if applicable.
  • Region tags have been properly added, if new samples.
  • Editable variables have been used, where applicable.
  • All dependencies are set to up-to-date versions, as applicable.
  • Merge this pull-request for me once it is approved.


@code-review-assist code-review-assist bot left a comment


Hi team,

I'm currently reviewing this pull request and will provide my feedback in a few minutes. In the meantime, here's a summary of the changes for you and other reviewers to quickly get up to speed:

This pull request, authored by @raushan2016, reverts the Text Generation Inference (TGI) image version used in several Google Kubernetes Engine (GKE) samples for large language models (LLMs) running on multiple GPUs. The intent is to resolve a regression introduced by a newer TGI image version that caused out-of-GPU memory errors on L4 nodes, requiring the use of more powerful A2 nodes.

Here's a breakdown of the changes:

  • Image Version Reversion: The PR changes the image specification in several YAML files related to LLM samples. The newer image (us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311) is replaced with older, more compatible versions:
    • ai-ml/llm-multiple-gpus/falcon-40b/text-generation-inference.yaml: Changed from us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 to ghcr.io/huggingface/text-generation-inference:1.4.3 (line 32)
    • ai-ml/llm-multiple-gpus/llama2-70b/text-generation-inference.yaml: Changed from us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 to ghcr.io/huggingface/text-generation-inference:1.4.3 (line 32)
    • ai-ml/llm-multiple-gpus/llama3-70b/text-generation-inference.yaml: Changed from us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 to ghcr.io/huggingface/text-generation-inference:2.0.4 (line 32)
    • ai-ml/llm-multiple-gpus/mixtral-8x7b/text-generation-inference.yaml: Changed from us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311 to ghcr.io/huggingface/text-generation-inference:1.4.3 (line 32)
  • Clarifying Comments: Comments were added to the YAML files to explain the /data mount path's relation to the HF_HOME environment variable (lines 54-56, 59-61, 61-63, 60-62). This improves readability and understanding.
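The shape of the change in each manifest can be sketched roughly as below. The surrounding fields are illustrative assumptions; only the image references are taken from the PR:

```yaml
# Sketch of the revert in each text-generation-inference.yaml
spec:
  containers:
  - name: text-generation-inference
    # Before (newer image, OOMs on L4 GPUs):
    # image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
    # After (reverted; 2.0.4 for llama3-70b, 1.4.3 for the others):
    image: ghcr.io/huggingface/text-generation-inference:1.4.3
```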

The PR addresses issue #1581. A deeper investigation into the regression in the newer TGI images will happen separately.


A quick haiku to lighten the mood:

Code flows like a stream,
Bugs fixed, a smoother journey,
Joy in the review.


@code-review-assist code-review-assist bot left a comment


This pull request addresses the OOM issues on L4 and A2 nodes by reverting the TGI image versions for falcon-40b, llama2-70b, and mixtral-8x7b to 1.4.3. The llama3-70b image is reverted to 2.0.4 for consistency. The approach seems appropriate given the circumstances, and the added comments explaining the mountPath are helpful. It's crucial to continue investigating the root cause of the OOM issue in the newer TGI images.

@moficodes
Collaborator

@kenthua I am ok with this change.

@moficodes
Collaborator

/lgtm

@moficodes moficodes self-requested a review January 16, 2025 00:19
@kenthua
Contributor

kenthua commented Jan 16, 2025

In syncing with @raushan2016, we have a deeplearning container image that does work for this example.

@kenthua
Contributor

kenthua commented Jan 17, 2025

Can we add the 20241210-1804-rc0 image tag?

Contributor

@alvarobartt alvarobartt left a comment


Thanks for the PR, I added some nits! Also, I'll check the issues with both Falcon 40B and Llama 2 70B and come back with another PR on top once those are solved. Thanks for your time again! 🤗

@raushan2016 raushan2016 merged commit 7683cb2 into GoogleCloudPlatform:main Jan 17, 2025
2 checks passed