feat: Serving Gemma 2 with multiple LoRA adapters with Text Generation Inference (TGI) on Vertex AI notebook #1586
base: main
Conversation
Hi @inardini and reviewers,
I'm currently reviewing this pull request and will post my detailed review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers to get up to speed:
This pull request adds a new notebook demonstrating how to deploy the Gemma 2 2B model from Hugging Face Hub to Vertex AI, using multiple LoRA adapters for different tasks (like coding and SQL). It leverages Hugging Face's Text Generation Inference (TGI) Deep Learning Container (DLC) along with a custom handler for enhanced flexibility.
Here's a breakdown of the changes:
- open-models/README.md: A new entry was added (lines 9-12) to the README file, pointing to the new vertex_ai_tgi_gemma_multi_lora_adapters_deployment.ipynb notebook.
- open-models/serving/vertex_ai_tgi_gemma_multi_lora_adapters_deployment.ipynb: This is a completely new Jupyter notebook (1569 lines of code) that details the entire process, from setting up the environment and authentication to deploying the model and making predictions using different LoRA adapters. The notebook covers:
  - Setting up the environment and authentication (Hugging Face and Google Cloud).
  - Creating and testing a custom handler for managing multiple LoRA adapters (a minimal handler sketch follows this list).
  - Downloading the base Gemma 2 model and LoRA adapters from Hugging Face Hub.
  - Implementing an LLM-based router to select the appropriate adapter based on the user prompt (see the router sketch below).
  - Registering the model on Vertex AI.
  - Deploying the model to a Vertex AI endpoint.
  - Making online predictions via the Vertex AI SDK, gcloud CLI, and curl.
  - Cleaning up resources.
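To make the custom-handler step concrete, here is a minimal sketch of the shape such a handler typically takes, assuming the Hugging Face custom-handler convention (a handler.py exposing an EndpointHandler class). The adapter names, paths, and payload fields are illustrative assumptions, not the notebook's actual code:

```python
# handler.py -- minimal sketch of a multi-LoRA custom handler.
# Adapter ids and payload fields below are hypothetical.
from typing import Any, Dict

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # Load the base Gemma 2 model once, then attach each LoRA adapter
        # under its own name so requests can switch between them cheaply.
        self.tokenizer = AutoTokenizer.from_pretrained(path)
        base = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
        self.model = PeftModel.from_pretrained(
            base, f"{path}/adapters/coding", adapter_name="coding"
        )
        self.model.load_adapter(f"{path}/adapters/sql", adapter_name="sql")

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Pre-process: activate the adapter named in the request.
        prompt = data["inputs"]
        adapter = data.get("parameters", {}).get("adapter_id", "coding")
        self.model.set_adapter(adapter)
        # Inference + post-process: generate and decode the completion.
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output = self.model.generate(**inputs, max_new_tokens=256)
        return {"generated_text": self.tokenizer.decode(output[0], skip_special_tokens=True)}
```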
The overall intent is to provide a comprehensive and practical guide for deploying Gemma 2 with LoRA adapters on Vertex AI using the TGI DLC and a custom handler. This allows users to easily leverage the power of fine-tuned models for various tasks within a production-ready environment.
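The LLM-based router mentioned above can be as small as a single classification call against the base model. A minimal sketch, assuming a TGI endpoint reachable through huggingface_hub's InferenceClient; the category labels and adapter ids are made up for illustration:

```python
from huggingface_hub import InferenceClient

# Assumed local TGI endpoint; adapter ids below are hypothetical.
client = InferenceClient("http://localhost:8080")

ROUTER_PROMPT = (
    "Classify the request into exactly one word: coding, sql, or general.\n"
    "Request: {prompt}\nCategory:"
)


def route(prompt: str):
    """Ask the base model which adapter fits the prompt; None means base model."""
    category = client.text_generation(
        ROUTER_PROMPT.format(prompt=prompt), max_new_tokens=3
    ).strip().lower()
    # Unknown categories fall back to the base model (no adapter).
    return {"coding": "coding", "sql": "sql"}.get(category)
```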
And now, a little haiku to lighten the mood:
Code flows like a stream,
LoRA adapts, sharp and keen,
Vertex AI serves well.
Hi team,
I'm currently reviewing this pull request and will post my full review in a few minutes. In the meantime, here's a quick summary of the changes for you and other reviewers:
This pull request adds a new Jupyter notebook demonstrating how to deploy the Gemma 2 2B model from Hugging Face, along with multiple LoRA adapters, using Hugging Face's Text Generation Inference (TGI) on Google Vertex AI. The key improvements are:
- New Notebook: open-models/serving/vertex_ai_tgi_gemma_multi_lora_adapters_deployment.ipynb has been added. This notebook provides a comprehensive guide and code examples for deploying Gemma 2 with LoRA adapters for tasks like coding and SQL query generation.
- Custom Handler: The notebook implements a custom handler to manage the pre-processing, inference, and post-processing steps for efficient model serving on Vertex AI. This allows for flexible handling of multiple LoRA adapters based on the user's prompt.
- README Update: The open-models/README.md file has been updated to include a link to the new notebook, improving discoverability.
- Spelling Updates: The .github/actions/spelling/allow.txt file has been updated to include some additional words, likely to address false positives in the spelling checks.
The notebook covers the entire process, from setting up the environment and authentication to deploying the model and making online predictions using both the Vertex AI SDK and the gcloud CLI. It also includes a section on cleaning up resources after the deployment.
Overall, this PR significantly enhances the existing examples by providing a practical, detailed guide to a more sophisticated, adapter-based LLM deployment on Vertex AI.
Here's a little haiku to lighten the mood:
Code flows like a stream,
LoRA adapts, sharp and keen,
Vertex AI awaits.
},
"outputs": [],
"source": [
"handler_module = '''\n",
Would it be possible to move this to a separate .py file and upload it directly? This would allow proper syntax highlighting and linting.
I understand the formatting requirement. For learning purposes, though, don't you think it is better to have the code embedded in the notebook? This approach lets learners review the code alongside the explanations and possibly gain a deeper understanding of the code's logic.
@holtskinner let me know what you think.
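One possible middle ground, offered here as an assumption rather than something either reviewer proposed: keep the handler source visible in a notebook cell but write it out with IPython's %%writefile cell magic, so the same code also exists as a plain .py file that linters and syntax highlighters can process:

```python
%%writefile handler.py
# The cell body is saved to handler.py: the notebook still shows the source
# inline next to the explanations, while tooling works on the .py file.
from typing import Any, Dict


class EndpointHandler:  # hypothetical stub; the real handler lives in the notebook
    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        return data
```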
Description
This notebook showcases how to deploy Gemma 2 2B from the Hugging Face Hub with multiple LoRA adapters fine-tuned for different purposes, such as coding or SQL, using Hugging Face's Text Generation Inference (TGI) Deep Learning Container (DLC) in combination with a custom handler on Vertex AI.
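For readers skimming the PR, a minimal sketch of the registration-and-deployment step with the Vertex AI SDK follows. The container URI is a placeholder, the LORA_ADAPTERS environment variable reflects TGI's multi-LoRA startup loading (the adapter ids are hypothetical), and the notebook's custom-handler setup may differ:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Placeholder image URI -- see the notebook for the exact TGI DLC image.
TGI_DLC_URI = "us-docker.pkg.dev/example/huggingface-text-generation-inference:latest"

# Register the model with the TGI serving container on Vertex AI.
model = aiplatform.Model.upload(
    display_name="gemma-2-2b-multi-lora",
    serving_container_image_uri=TGI_DLC_URI,
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-2-2b-it",
        # Hypothetical adapter ids, loaded by TGI at startup.
        "LORA_ADAPTERS": "my-org/gemma-2-coding-lora,my-org/gemma-2-sql-lora",
    },
)

# Deploy to an endpoint; machine and accelerator choices are assumptions.
endpoint = model.deploy(
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)
```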
Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- Follow the CONTRIBUTING Guide.
- You are listed in CODEOWNERS for the file(s).
- Ensure the tests and linter pass (run nox -s format from the repository root to format).