Add Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #64

MayankChaturvedi · 2024-09-30T18:04:22Z

A notebook that demonstrates how to use a multimodal RAG that combines two types of inputs, such as text and images, to retrieve relevant information from a dataset and generate new outputs based on the retrieved data.

Example
Input: Takes a text query along with an image (e.g., "Which fruit is this?")
Retrieval: Uses the image and the text to retrieve relevant documents or facts from a knowledge base or external dataset (e.g., Wikipedia articles on animals).
Generation: The system generates a coherent response based on the retrieved information (e.g., "This is a blueberry!").

neural-navigator · 2024-09-30T20:08:44Z

I would love to contribute to this issue @MayankChaturvedi

ariG23498 · 2024-10-01T05:05:25Z

I love the idea!

So to make it even more clear:

Use a multimodal (image and text pair) dataset from the Hugging Face Hub
Embed the dataset using an embedding model
RAG with multimodal Llama

If that is the workflow, I think we should go forward. Also, it would be great to not make this very complicated. We would like to see a very simple notebook that does what it needs to while not making it too complicated.

PS: I have added this issue to the main issue #43

silvererudite · 2024-10-01T06:16:11Z

hello @MayankChaturvedi as proposed by @ariG23498 in this issue #55 I would love to contribute to this work as well. Let me know if you want to divide/collab in any subtask for this.

atharv-jiwane · 2024-10-01T07:01:44Z

Hey @ariG23498 I was redirected to #47 , thank you for that! I would also love to join this team @MayankChaturvedi . Please let me know if there is space for collaboration here too!

MayankChaturvedi · 2024-10-01T08:51:44Z

Hi folks, thanks for your interest in the issue. We need a simple notebook. I will create a branch so that three of us can collaborate on it. Meanwhile I'll also come up with a distribution of tasks.
Let's collaborate on a discord group? - https://discord.gg/rhbqXsyX
@ariG23498 does this setup sound good?

ariG23498 · 2024-10-01T12:36:21Z

@MayankChaturvedi the collaboration sounds great!

Let me know if you folks need help -- the best way of reaching me is this issue. It would be open for others to view and learn 🤗

renuka010 · 2024-10-01T14:13:39Z

Hi @MayankChaturvedi I would love to collaborate on this issue. Let me know if I can contribute to this issue.

MayankChaturvedi mentioned this issue Sep 30, 2024

Call for contributions #43

Open

6 tasks

ariG23498 mentioned this issue Oct 1, 2024

Implement a simple multimodal RAG recipe with LLama Vision 11B model. #55

Open

MayankChaturvedi linked a pull request Oct 3, 2024 that will close this issue

Adding Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #72

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #64

Add Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #64

MayankChaturvedi commented Sep 30, 2024

neural-navigator commented Sep 30, 2024

ariG23498 commented Oct 1, 2024 •

edited

Loading

silvererudite commented Oct 1, 2024

atharv-jiwane commented Oct 1, 2024

MayankChaturvedi commented Oct 1, 2024

ariG23498 commented Oct 1, 2024

renuka010 commented Oct 1, 2024

Add Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #64

Add Multimodal RAG (Text + Image) for Retrieval-Augmented Generation Using Llama #64

Comments

MayankChaturvedi commented Sep 30, 2024

neural-navigator commented Sep 30, 2024

ariG23498 commented Oct 1, 2024 • edited Loading

silvererudite commented Oct 1, 2024

atharv-jiwane commented Oct 1, 2024

MayankChaturvedi commented Oct 1, 2024

ariG23498 commented Oct 1, 2024

renuka010 commented Oct 1, 2024

ariG23498 commented Oct 1, 2024 •

edited

Loading