Adding Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #72
Conversation
Hi @ariG23498 ... would love your feedback on how to improve this workflow, or any other suggestions on our work here.
Hey folks! I did a quick skim of the notebook. Let me first restate the workflow, so that you all can point out whether I have understood it correctly:
My suggestions (very experimental):
My suggestions come from the fact that the current notebook pulls in another model (CLIP), which I think adds to the cognitive burden for a beginner. Let's steer away from that. Also keep in mind that this is a recipe, so fewer lines of code would be great. We do not want an end-to-end, production-ready notebook -- we want something that does the job and is a good proof of concept. What do you all feel about the suggestions?
Hi @ariG23498, thanks a lot for the feedback; it would indeed make the implementation a lot simpler. What do you think about creating two separate notebooks?
My suggestion comes from the apprehension that the user might run into OOM errors if we load both 11B models in the same notebook environment. I'm not sure what would be a good way to organise a workflow involving two large models in a single notebook on a 24 GB consumer GPU.
If we run the experiment based on this reply, aren't we loading just one model? The image encoder and the generation model come from the same model. Am I missing something?
@ariG23498 my question comes from points 1 and 4. I was under the impression that we generate and store embeddings from
What I meant was: if you load the multimodal model once, you get both the image encoder and the generator from the same checkpoint. Here is a small snippet.
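(The original snippet is not preserved in this thread; below is a minimal sketch of what it might have looked like, assuming the `meta-llama/Llama-3.2-11B-Vision-Instruct` checkpoint and the Transformers `MllamaForConditionalGeneration` / `AutoProcessor` classes.)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Hypothetical checkpoint and image path, used only for illustration.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the multimodal model once: it carries both the vision encoder and the
# text/generation stack, so no second model (e.g. CLIP) needs to be loaded.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

# Move the vision inputs to the model's device; pixel values also need the model dtype.
pixel_values = inputs["pixel_values"].to(model.device, model.dtype)
aspect_ratio_ids = inputs["aspect_ratio_ids"].to(model.device)
aspect_ratio_mask = inputs["aspect_ratio_mask"].to(model.device)

# Run only the vision tower to get image features (no text generation involved).
# Note: depending on the transformers version, the vision tower may instead be
# reachable as `model.model.vision_model`.
with torch.no_grad():
    vision_out = model.vision_model(
        pixel_values=pixel_values,
        aspect_ratio_ids=aspect_ratio_ids,
        aspect_ratio_mask=aspect_ratio_mask,
    )

# The last hidden state of the vision tower is one candidate image embedding.
print(vision_out.last_hidden_state.shape)
```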
Note how we reuse the vision encoder from the already loaded model rather than loading anything extra. Does this help?
Ah I see @ariG23498, I was not aware that the pipeline was set up like that. Can you please point me to the docs on its individual components, such as where the vision encoder comes from?
You are on the right docs. Now that you can access the individual components, can you give the experiment a try and let me know?
Hey @ariG23498, I experimented with the approach above.
I am assuming the last hidden state is the embedding that we want; please correct me if that's not the case. What could be a better strategy for working with embeddings this large? My kernel keeps dying.
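For illustration only (this is an assumption on my side, not something the thread settled on): one reason the raw last hidden state is so heavy is that it keeps a vector for every visual token of every tile, and mean-pooling over the token dimension shrinks it to a single vector per image before storing or indexing. The shape below is a rough ballpark for the 11B vision tower, not an exact figure.

```python
import torch

# Ballpark shape for one image's vision-tower output:
# roughly (batch, tiles * tokens_per_tile, feature_dim). Treat these numbers
# as illustrative, not as the exact dimensions of the 11B model.
last_hidden_state = torch.randn(1, 4 * 1601, 7680)

# Keeping the full last hidden state per image adds up quickly across a corpus.
full_mb = last_hidden_state.numel() * 4 / 1e6  # float32 bytes
print(f"full embedding: ~{full_mb:.0f} MB per image")

# Mean-pooling over the token dimension yields one vector per image, which is
# far friendlier to hold in memory or to index with FAISS.
pooled = last_hidden_state.mean(dim=1)  # shape (1, 7680)
print(f"pooled embedding: ~{pooled.numel() * 4 / 1e3:.0f} KB per image")
```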
Does the kernel still die if you iterate over each image individually (this is going to take more time)?
Unfortunately, yes. Keeping the embeddings in memory, whether batched or one at a time, seems to be what kills the kernel. Should we try some kind of compression, or project the embeddings down to a CLIP-like shape? What about using the earlier layers of the image encoder? Maybe those are a bit smaller.
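To make the "earlier layers" idea concrete, here is a hedged sketch that builds on the loading snippet above (it assumes the vision tower returns per-layer hidden states when `output_hidden_states=True` is passed, as Transformers encoders generally do). Since the final output concatenates several intermediate layers, a single layer's hidden state can be noticeably narrower, but whether it is small enough, and still useful for retrieval, would need to be checked.

```python
import torch

# Reuses `model` and the preprocessed vision inputs from the snippet above (assumption).
with torch.no_grad():
    vision_out = model.vision_model(
        pixel_values=pixel_values,
        aspect_ratio_ids=aspect_ratio_ids,
        aspect_ratio_mask=aspect_ratio_mask,
        output_hidden_states=True,
    )

# Compare the width of the final (concatenated) output against individual layers.
print("final output:", vision_out.last_hidden_state.shape)
for i, hidden in enumerate(vision_out.hidden_states):
    print(f"layer {i}:", hidden.shape)
```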
I am not sure how much that would help here.
This is in fact a good point to explore. Let me know how that goes, and maybe we could brainstorm on it?
Did you figure out a solution yet?
I think this CLIP adapter for Llama will do the trick. The adapter adds extra parameters on top of a frozen LLaMA 7B model, in the form of a projection gate and a bias norm. I think it should work with Llama 3.2 as well, since 3.2's multimodal model uses the same frozen weights in its text layers.
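Not the referenced adapter's actual code, just a toy sketch of the idea as I understand it: a small projection with a zero-initialised gate that injects CLIP image features into the embedding space of a frozen LLaMA, so training starts from the unchanged language model. Dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ClipToLlamaAdapter(nn.Module):
    """Toy projection adapter: maps CLIP image embeddings into the (frozen)
    LLaMA embedding space. Dimensions are hypothetical placeholders."""

    def __init__(self, clip_dim: int = 512, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llama_dim)
        self.norm = nn.LayerNorm(llama_dim)
        # Zero-initialised gate: at the start of training the visual tokens
        # contribute nothing, so the frozen LLaMA behaviour is preserved.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, clip_embeds: torch.Tensor) -> torch.Tensor:
        # clip_embeds: (batch, clip_dim) -> visual prefix token (batch, 1, llama_dim)
        visual = self.norm(self.proj(clip_embeds)).unsqueeze(1)
        return torch.tanh(self.gate) * visual

# Quick shape check with dummy CLIP embeddings.
adapter = ClipToLlamaAdapter()
print(adapter(torch.randn(2, 512)).shape)  # torch.Size([2, 1, 4096])
```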
This pull request introduces a new notebook demonstrating a Multimodal Retrieval-Augmented Generation (RAG) pipeline. The pipeline leverages the following technologies:
CLIP: Encodes both text and images into a shared embedding space for cross-modal retrieval.
FAISS: Efficiently stores and searches the vector embeddings generated by CLIP.
LLaMA Multimodal: Generates comprehensive, informative responses from the retrieved context, incorporating both text and image information.
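For reference, a minimal sketch of the CLIP + FAISS retrieval half described above (checkpoint names, file paths, and the example query are placeholders; the notebook's actual code may differ):

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder CLIP checkpoint; any CLIP variant with a matching processor works.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p) for p in paths]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise for cosine search
    return feats.numpy().astype("float32")

def embed_text(query):
    inputs = clip_processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = clip.get_text_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.numpy().astype("float32")

# Index the image embeddings with FAISS (inner product == cosine on unit vectors).
image_paths = ["doc_page_1.png", "doc_page_2.png"]  # placeholder corpus
image_vecs = embed_images(image_paths)
index = faiss.IndexFlatIP(image_vecs.shape[1])
index.add(image_vecs)

# Retrieve the image most relevant to a text query; it can then be passed,
# together with the question, to the LLaMA multimodal model for generation.
scores, ids = index.search(embed_text("a chart of quarterly revenue"), 1)
print(image_paths[ids[0][0]], float(scores[0][0]))
```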
This is a collaboration between @silvererudite, @atharv-jiwane, and me. This PR fixes #64.