
Adding Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #72

Open · wants to merge 3 commits into main
Conversation

@MayankChaturvedi commented Oct 3, 2024

This pull request introduces a new notebook demonstrating a Multimodal Retrieval Augmented Generation (RAG) pipeline. The pipeline leverages the following technologies:

CLIP: Encodes both text and images into a shared embedding space for cross-modal retrieval.
FAISS: Efficiently stores and searches the vector embeddings generated by CLIP.
LLaMa Multimodal: Uses the retrieved context to generate comprehensive and informative responses, incorporating both text and image information.
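
For reference, the retrieval half of this pipeline looks roughly like the sketch below (the CLIP checkpoint, the flat index, and the dataset/query variables are illustrative placeholders, not the exact notebook code):

# Minimal sketch of the CLIP + FAISS retrieval step (illustrative, not the exact notebook code).
import faiss
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

clip_id = "openai/clip-vit-base-patch32"  # assumed CLIP checkpoint
clip = CLIPModel.from_pretrained(clip_id)
clip_processor = CLIPProcessor.from_pretrained(clip_id)

def embed_images(images) -> np.ndarray:
    """Encode a list of PIL images into L2-normalised CLIP embeddings."""
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(p=2, dim=-1, keepdim=True)
    return feats.cpu().numpy()

# Index the dataset images; inner product equals cosine similarity after normalisation.
dataset_vectors = embed_images(dataset_images)   # dataset_images: list of PIL images (placeholder)
index = faiss.IndexFlatIP(dataset_vectors.shape[1])
index.add(dataset_vectors)

# Retrieve the closest product for a query image; its text and image are then passed to the Llama model.
scores, ids = index.search(embed_images([query_image]), 1)
retrieved_description = descriptions[ids[0][0]]  # descriptions: list of product texts (placeholder)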

This is a collaboration between @silvererudite, @atharv-jiwane, and me. This PR fixes #64.

@silvererudite

Hi @ariG23498 ... would love your feedback on how to improve this workflow or any other suggestions on our work here.
Note: we will also add HF-style docs to the notebook, like all the other notebooks in this repo, in the next iteration after your feedback.

@ariG23498 (Collaborator)

Hey folks!

I did a quick skim of the notebook.

Let me first restate the workflow, so that you all can point out whether or not I fully understand it:

  1. Have a dataset of product images and descriptions
  2. Use CLIP models to embed the images and descriptions
  3. Have a query image (here, the Oreo image)
  4. Retrieve from the dataset a sample that closely resembles the query image
  5. Use MLlama to generate with the retrieved image and text

My suggestions (very experimental):

  1. use the mllama vision model to embed the images of the dataset
  2. save the embeddings in the dataset itself (100 rows would be easy to do)
  3. use the embeddings of the images and the query image to retrieve a sample from the dataset
  4. use the image (query image) and the text from the dataset, and do generation (with the instruct model)

My suggestions are based on the fact that the current notebook uses other models (CLIP), which I think would add to the cognitive burden of a beginner. Let's steer away from that; also keep in mind that this is a recipe, so fewer lines of code would be great. We do not want an end-to-end production-ready notebook -- we want something that does the job and is a good proof of concept.
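
For illustration, a rough sketch of this simplified flow could look like the following (the mean-pooling of the vision-tower output and the dataset/column names are assumptions on my part, not tested code):

# Rough sketch of the suggested single-model flow (pooling choice and column names are assumptions).
import numpy as np
import torch
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

def embed_image(image) -> np.ndarray:
    """One vector per image, taken from the model's own vision tower."""
    inputs = processor(images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model.vision_model(**inputs).last_hidden_state  # [1, 1, num_tiles, num_patches, hidden]
    vec = hidden.mean(dim=(1, 2, 3))                             # pool tiles/patches into a single vector
    return (vec / vec.norm(p=2, dim=-1, keepdim=True)).to(torch.float32).cpu().numpy()[0]

# 1-2. Embed the dataset images and store the vectors in the dataset itself.
#      `dataset` is a datasets.Dataset with "image" and "description" columns (placeholder names).
dataset = dataset.map(lambda row: {"image_embedding": embed_image(row["image"])})

# 3. Retrieve the row whose embedding is closest to the query image's embedding.
query_vec = embed_image(query_image)                             # query_image: a PIL image (placeholder)
best = dataset[int(np.argmax(np.stack(dataset["image_embedding"]) @ query_vec))]

# 4. Generate with the instruct model, using the query image plus the retrieved text.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": f"Context: {best['description']}\nDescribe the product in the image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=query_image, text=prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
print(processor.decode(model.generate(**inputs, max_new_tokens=128)[0]))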

What do you all feel about the suggestions?

@silvererudite

Hi @ariG23498, thanks a lot for the feedback; it would indeed make the implementation a lot simpler. What do you think about creating two separate notebooks:

  1. One for creating and storing the embeddings from mllama
  2. Another for loading the embeddings and using them with the vision model

My suggestion comes from the apprehension that the user might run into OOM errors if we load both 11B models in the same notebook environment? I am not sure what would be a good way to organise a workflow involving two large models on a 24 GB consumer GPU in a single notebook.

@ariG23498 (Collaborator)

> My suggestion comes from the apprehension that the user might run into OOM errors if we load both 11B models in the same notebook environment? I am not sure what would be a good way to organise a workflow involving two large models on a 24 GB consumer GPU in a single notebook.

If we are to experiment based on this reply, are we not loading just one model? The image encoder and the generation model come from the same model.

Am I missing something?

@silvererudite commented Oct 7, 2024

> My suggestions (very experimental):
>
>   1. use the mllama vision model to embed the images of the dataset
>   2. save the embeddings in the dataset itself (100 rows would be easy to do)
>   3. use the embeddings of the images and the query image to retrieve a sample from the dataset
>   4. use the image (query image) and the text from the dataset, and do generation (with the instruct model)

@ariG23498, my question comes from points 1 and 4. I was under the impression that we generate and store embeddings from the meta-llama/Llama-3.2-11B-Vision model and then use the instruct version, meta-llama/Llama-3.2-11B-Vision-Instruct, to perform RAG in step 4.
Wouldn't that mean that we need to load two models during inference as well, one for the query embedding and the other for generating the reply with the instruct model?

@ariG23498 (Collaborator)

What I meant was: if you load MllamaForConditionalGeneration, you already get MllamaVisionModel loaded.

Here is a small snippet:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
# Loading the conditional-generation model also loads its vision tower.
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# The vision encoder (MllamaVisionModel) is available as an attribute of the loaded model.
model.vision_model

Note how we use model.vision_model to access the vision model.

Does this help?

@silvererudite commented Oct 7, 2024

Ah, I see @ariG23498, I was not aware that the pipeline was set up like that. Can you please point me to the docs about its individual components, e.g. where you got model.vision_model from? I was following the mllama docs, but it wasn't immediately obvious to me that we could do things like model.vision_model. Perhaps I am navigating the docs wrong?

@ariG23498 (Collaborator)

You are on the right docs. Now that you can do model.vision_model, the parameters are the same, I believe (the ones that are listed in the docs).

Can you now give the experiment a try and let me know?

@silvererudite

> You are on the right docs. Now that you can do model.vision_model, the parameters are the same, I believe (the ones that are listed in the docs).
>
> Can you now give the experiment a try and let me know?

Hey @ariG23498, I experimented with the model.vision_model outputs as you said, and the last-layer embedding is really huge: torch.Size([1, 1, 4, 1025, 7680]). So I was having issues with the kernel dying when I tried to generate embeddings of the images batch-wise like this:

import numpy as np
import torch
from tqdm import tqdm

def generate_image_embeddings(images: list, device: str, batch_size: int = 16) -> np.ndarray:
    embeddings = []
    with torch.no_grad():
        for i in tqdm(range(0, len(images), batch_size)):
            batch_images = images[i:i + batch_size]
            for img in batch_images:
                pil_image = img.convert("RGB")  # the images are already PIL objects
                inputs = processor(images=pil_image, return_tensors="pt").to(device)

                # last_hidden_state has shape [1, 1, 4, 1025, 7680]
                embedding = model.vision_model(**inputs).last_hidden_state
                embedding = embedding / embedding.norm(p=2, dim=-1, keepdim=True)  # normalise along the hidden dim
                embeddings.append(embedding.to(torch.float).cpu().numpy())

    return np.vstack(embeddings)

images = image_text_pairs["image"]  # the images column from the dataset
image_embeddings = generate_image_embeddings(images, device)
# image_text_pairs is a datasets.Dataset, so attach the embeddings as a new column
image_text_pairs = image_text_pairs.add_column("image_embedding", list(image_embeddings))
       

I am assuming the last hidden state is the embedding that we want, please correct me if that's not the case. What could be a better strategy for working with embeddings this huge, as my kernel keeps dying?

@ariG23498 (Collaborator)

Does the kernel still die if you iterate over each image individually (this is going to take more time)?

@silvererudite commented Oct 14, 2024

> Does the kernel still die if you iterate over each image individually (this is going to take more time)?

Unfortunately yes; keeping them all in memory, whether I iterate batch-wise or one image at a time, seems to be what kills the kernel. Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape? What about using the earlier layers of the image encoder, maybe those are a bit smaller?
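
For concreteness, the kind of compression I have in mind would be to pool the vision-tower output down to a single vector per image before it ever leaves the GPU, something like the sketch below (reusing the model and processor loaded earlier in this thread; untested):

# Sketch of the "compression" idea: pool the huge vision-tower output to one vector per image.
import numpy as np
import torch

def embed_image_pooled(pil_image) -> np.ndarray:
    inputs = processor(images=pil_image.convert("RGB"), return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model.vision_model(**inputs).last_hidden_state  # [1, 1, 4, 1025, 7680]
    pooled = hidden.mean(dim=(1, 2, 3))                          # -> [1, 7680]
    pooled = pooled / pooled.norm(p=2, dim=-1, keepdim=True)     # unit-normalise for cosine retrieval
    return pooled.to(torch.float32).cpu().numpy()                # ~30 KB per image instead of ~120 MB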

@ariG23498 (Collaborator)

> Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape?

I am not sure how much that would help here.

> What about using the earlier layers of the image encoder, maybe those are a bit smaller?

This is in fact a good point to explore.

Let me know how that goes, and maybe we could brainstorm on it?

@00hello commented Nov 3, 2024

Did you figure out a solution yet?

@00hello commented Nov 3, 2024

> Should we try to apply some kind of compression technique, or project the embeddings down to a CLIP-like shape?

I think this CLIP adapter for Llama will do the trick:
https://github.com/OpenGVLab/LLaMA-Adapter?tab=readme-ov-file#released-models

The adapter adds extra parameters on top of a frozen LLaMA 7B model, in the form of a projection gate and a bias norm. I think it should work with Llama 3.2 as well, since 3.2's multimodal model uses the same frozen weights in its text layers.
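
To illustrate the idea (a toy sketch of a zero-init gated projection, not the actual LLaMA-Adapter code), the adapter essentially learns to project visual features into the language model's hidden size and gates how much they contribute, while the base LLM stays frozen:

# Toy sketch of a zero-init gated visual projection (illustrative only).
import torch
import torch.nn as nn

class GatedVisualProjection(nn.Module):
    def __init__(self, visual_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(visual_dim, llm_dim)  # project CLIP-sized features to the LLM width
        self.gate = nn.Parameter(torch.zeros(1))    # zero-init gate: the adapter starts as a no-op

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # Only these adapter parameters are trained; the base language model stays frozen.
        return torch.tanh(self.gate) * self.proj(visual_feats)

adapter = GatedVisualProjection(visual_dim=768, llm_dim=4096)
clip_embedding = torch.randn(1, 768)     # placeholder CLIP image feature
llm_hidden = adapter(clip_embedding)     # shape [1, 4096], ready to be injected into the frozen LLM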
