
Adding Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #72

Open · wants to merge 3 commits into main
Conversation

@MayankChaturvedi commented Oct 3, 2024

This pull request introduces a new notebook demonstrating a Multimodal Retrieval Augmented Generation (RAG) pipeline. The pipeline leverages the following technologies:

CLIP: Encodes both text and images into a shared embedding space for cross-modal retrieval.
FAISS: Efficiently stores and searches the vector embeddings generated by CLIP.
LLaMa Multimodal: Uses the retrieved context to generate comprehensive and informative responses, incorporating both text and image information.
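
For reference, the retrieval half of this pipeline looks roughly like the sketch below (the CLIP checkpoint, the flat index, and the dataset/query variables are illustrative placeholders, not the exact notebook code):

# Minimal sketch of the CLIP + FAISS retrieval step (illustrative, not the exact notebook code).
import faiss
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip_id = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
clip = CLIPModel.from_pretrained(clip_id)
clip_processor = CLIPProcessor.from_pretrained(clip_id)

def embed_images(images) -> np.ndarray:
    """Encode a list of PIL images into L2-normalised CLIP embeddings."""
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(p=2, dim=-1, keepdim=True)
    return feats.cpu().numpy()

# Index the dataset images; inner product equals cosine similarity after normalisation.
dataset_vectors = embed_images(dataset_images)   # dataset_images: list of PIL images (placeholder)
index = faiss.IndexFlatIP(dataset_vectors.shape[1])
index.add(dataset_vectors)

# Retrieve the closest product for a query image; its text and image are then passed to the Llama model.
scores, ids = index.search(embed_images([query_image]), 1)
retrieved_description = descriptions[ids[0][0]]  # descriptions: list of product texts (placeholder)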

This is a collaboration between @silvererudite, @atharv-jiwane, and me. This PR fixes #64.

@silvererudite

Hi @ariG23498 ... would love your feedback on how to improve this workflow or any other suggestions on our work here.
Note: we will also add HF-style docs to the notebook, like all the other notebooks in this repo, in the next iteration after your feedback.

@ariG23498 (Collaborator)

Hey folks!

I did a quick skim of the notebook.

Let me first restate the workflow, so that you all can point out whether or not I fully understand it:

  1. Have a dataset of product images and descriptions
  2. Use CLIP models to embed the images and descriptions
  3. Have a query image (here, the Oreo image)
  4. Retrieve from the dataset a sample that closely resembles the query image
  5. Use MLlama to generate with the retrieved image and text

My suggestions (very experimental):

  1. use the mllama vision model to embed the images of the dataset
  2. save the embeddings in the dataset itself (100 rows would be easy to do)
  3. use the embeddings of the images and the query image to retrieve a sample from the dataset
  4. use the image (query image) and the text from the dataset, and do generation (with the instruct model)

My suggestions are based on the fact that the current notebook uses other models (CLIP), which I think would add to the cognitive burden of a beginner. Let's steer away from that; also keep in mind that this is a recipe, so fewer lines of code would be great. We do not want an end-to-end production-ready notebook -- we want something that does the job and is a good proof of concept.
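
For illustration, a rough sketch of this simplified flow could look like the following (the mean-pooling of the vision-tower output and the dataset/column names are assumptions on my part, not tested code):

# Rough sketch of the suggested single-model flow (pooling choice and column names are assumptions).
import numpy as np
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

def embed_image(image) -> np.ndarray:
    """One vector per image, taken from the model's own vision tower."""
    inputs = processor(images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model.vision_model(**inputs).last_hidden_state  # [1, 1, num_tiles, num_patches, hidden]
    vec = hidden.mean(dim=(1, 2, 3))                             # pool tiles/patches into a single vector
    return (vec / vec.norm(p=2, dim=-1, keepdim=True)).to(torch.float32).cpu().numpy()[0]

# 1-2. Embed the dataset images and store the vectors in the dataset itself.
#      `dataset` is a datasets.Dataset with "image" and "description" columns (placeholder names).
dataset = dataset.map(lambda row: {"image_embedding": embed_image(row["image"])})

# 3. Retrieve the row whose embedding is closest to the query image's embedding.
query_vec = embed_image(query_image)                             # query_image: a PIL image (placeholder)
best = dataset[int(np.argmax(np.stack(dataset["image_embedding"]) @ query_vec))]

# 4. Generate with the instruct model, using the query image plus the retrieved text.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Context: {best['description']}\nDescribe the product in the image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=query_image, text=prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs, max_new_tokens=128)[0]))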

What do you all feel about the suggestions?

@silvererudite

Hi @ariG23498, thanks a lot for the feedback; it would indeed make the implementation a lot simpler. What do you think about creating two separate notebooks:

  1. One for creating and storing the embeddings from mllama
  2. Another for loading the embeddings and using them with the vision model

My suggestion comes from the apprehension that the user might run into OOM errors if we load both 11B models in the same notebook environment? I am not sure what would be a good way to organise a workflow involving two large models on a 24 GB consumer GPU in a single notebook.

@ariG23498 (Collaborator)

> My suggestion comes from the apprehension that the user might run into OOM errors if we load both 11B models in the same notebook environment? I am not sure what would be a good way to organise a workflow involving two large models on a 24 GB consumer GPU in a single notebook.

If we are to experiment based on this reply, are we not loading just one model? The image encoder and the generation model come from the same model.

Am I missing something?

@silvererudite commented Oct 7, 2024

> My suggestions (very experimental):
>
>   1. use the mllama vision model to embed the images of the dataset
>   2. save the embeddings in the dataset itself (100 rows would be easy to do)
>   3. use the embeddings of the images and the query image to retrieve a sample from the dataset
>   4. use the image (query image) and the text from the dataset, and do generation (with the instruct model)

@ariG23498, my question comes from points 1 and 4. I was under the impression that we generate and store embeddings from the meta-llama/Llama-3.2-11B-Vision model and then use the instruct version, meta-llama/Llama-3.2-11B-Vision-Instruct, to perform RAG in step 4.
Wouldn't that mean that we need to load two models during inference as well, one for the query embedding and the other for generating the reply with the instruct model?

@ariG23498 (Collaborator)

What I meant was: if you load MllamaForConditionalGeneration, you already get MllamaVisionModel loaded.

Here is a small snippet:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# Loading the conditional-generation model also loads its vision tower.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# The vision encoder (MllamaVisionModel) is available as an attribute of the loaded model.
model.vision_model

Note how we use model.vision_model to access the vision model.

Does this help?

@silvererudite commented Oct 7, 2024

Ah, I see @ariG23498, I was not aware that the pipeline was set up like that. Can you please point me to the docs about its individual components, e.g. where you got model.vision_model from? I was following the mllama docs, but it wasn't immediately obvious to me that we could do things like model.vision_model. Perhaps I am navigating the docs wrong?

@ariG23498 (Collaborator)

You are on the right docs. Now that you can do model.vision_model, the parameters are the same, I believe (the ones that are listed in the docs).

Can you now give the experiment a try and let me know?

@silvererudite

> You are on the right docs. Now that you can do model.vision_model, the parameters are the same, I believe (the ones that are listed in the docs).
>
> Can you now give the experiment a try and let me know?

Hey @ariG23498, I experimented with the model.vision_model outputs as you said, and the last-layer embedding is really huge: torch.Size([1, 1, 4, 1025, 7680]). So I was having issues with the kernel dying when I tried to generate embeddings of the images batch-wise like this:

import numpy as np
import torch
from tqdm import tqdm

def generate_image_embeddings(images: list, device: str, batch_size: int = 16) -> np.ndarray:
    embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(images), batch_size)):
            batch_images = images[i:i + batch_size]
            for img in batch_images:
                pil_image = img.convert("RGB")  # the images are already PIL objects
                inputs = processor(images=pil_image, return_tensors="pt").to(device)

                # last_hidden_state has shape [1, 1, 4, 1025, 7680]
                embedding = model.vision_model(**inputs).last_hidden_state
                embedding = embedding / embedding.norm(p=2, dim=-1, keepdim=True)  # normalise along the hidden dim
                embeddings.append(embedding.to(torch.float).cpu().numpy())

    return np.vstack(embeddings)

images = image_text_pairs["image"]  # the images column from the dataset
image_embeddings = generate_image_embeddings(images, device)
# image_text_pairs is a datasets.Dataset, so attach the embeddings as a new column
image_text_pairs = image_text_pairs.add_column("image_embedding", list(image_embeddings))
       

I am assuming the last hidden state is the embedding that we want, please correct me if that's not the case. What could be a better strategy for working with embeddings this huge, as my kernel keeps dying?

@ariG23498 (Collaborator)

Does the kernel still die if you iterate over each image individually (this is going to take more time)?

@silvererudite commented Oct 14, 2024

> Does the kernel still die if you iterate over each image individually (this is going to take more time)?

Unfortunately yes; keeping them all in memory, whether I iterate batch-wise or one image at a time, seems to be what kills the kernel. Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape? What about using the earlier layers of the image encoder, maybe those are a bit smaller?
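
For concreteness, the kind of compression I have in mind would be to pool the vision-tower output down to a single vector per image before it ever leaves the GPU, something like the sketch below (reusing the model and processor loaded earlier in this thread; untested):

# Sketch of the "compression" idea: pool the huge vision-tower output to one vector per image.
import numpy as np
import torch

def embed_image_pooled(pil_image) -> np.ndarray:
    inputs = processor(images=pil_image.convert("RGB"), return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model.vision_model(**inputs).last_hidden_state  # [1, 1, 4, 1025, 7680]
    pooled = hidden.mean(dim=(1, 2, 3))                          # -> [1, 7680]
    pooled = pooled / pooled.norm(p=2, dim=-1, keepdim=True)     # unit-normalise for cosine retrieval
    return pooled.to(torch.float32).cpu().numpy()                # ~30 KB per image instead of ~120 MB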

@ariG23498 (Collaborator)

> Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape?

I am not sure how much that would help here.

> What about using the earlier layers of the image encoder, maybe those are a bit smaller?

This is in fact a good point to explore.

Let me know how that goes, and maybe we could brainstorm on it?

@00hello commented Nov 3, 2024

Did you figure out a solution yet?

@00hello commented Nov 3, 2024

> Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape?

I think this CLIP adapter for Llama will do the trick:
https://github.com/OpenGVLab/LLaMA-Adapter?tab=readme-ov-file#released-models

The adapter adds extra parameters on top of a frozen LLaMA 7B model, in the form of a projection gate and a bias norm. I think it should work with Llama 3.2 as well, since 3.2's multimodal model uses the same frozen weights in its text layers.
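
To illustrate the idea (a toy sketch of a zero-init gated projection, not the actual LLaMA-Adapter code), the adapter essentially learns to project visual features into the language model's hidden size and gates how much they contribute, while the base LLM stays frozen:

# Toy sketch of a zero-init gated visual projection (illustrative only).
import torch
import torch.nn as nn

class GatedVisualProjection(nn.Module):
    def __init__(self, visual_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(visual_dim, llm_dim)  # project CLIP-sized features to the LLM width
        self.gate = nn.Parameter(torch.zeros(1))    # zero-init gate: the adapter starts as a no-op

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # Only these adapter parameters are trained; the base language model stays frozen.
        return torch.tanh(self.gate) * self.proj(visual_feats)

adapter = GatedVisualProjection(visual_dim=768, llm_dim=4096)
clip_embedding = torch.randn(1, 768)     # placeholder CLIP image feature
llm_hidden = adapter(clip_embedding)     # shape [1, 4096], ready to be injected into the frozen LLM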
