Suppose you have a collection of videos, images, audio files, PDFs, or some combination thereof, and you wish to perform question-answering to extract insights from these documents. To answer your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. To do this, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as text, and those embeddings are stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user, which can be text or audio.
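
To make the ingestion and retrieval flow more concrete, here is a minimal, illustrative sketch of computing a fused image-text embedding with BridgeTower via Hugging Face `transformers` and scoring it against a query. This is not the MultimodalQnA embedding microservice: the file names, the blank-image pairing for the text query, and the pooling and similarity choices are assumptions made for demonstration only.

```python
import torch
from PIL import Image
from transformers import BridgeTowerModel, BridgeTowerProcessor

MODEL_ID = "BridgeTower/bridgetower-large-itm-mlm-gaudi"
processor = BridgeTowerProcessor.from_pretrained(MODEL_ID)
model = BridgeTowerModel.from_pretrained(MODEL_ID)
model.eval()

def embed_pair(image: Image.Image, text: str) -> torch.Tensor:
    """Return a single fused embedding for an (image, text) pair."""
    inputs = processor(image, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # pooler_output concatenates the pooled text and image representations.
    return outputs.pooler_output.squeeze(0)

# Ingestion: embed an extracted video frame together with its transcript snippet.
# "frame_0001.png" is a hypothetical path used for illustration.
frame = Image.open("frame_0001.png").convert("RGB")
doc_vec = embed_pair(frame, "transcript: the presenter explains the quarterly results")

# Query time: embed the question (paired with a blank image for simplicity here)
# and score it against stored document vectors, e.g. with cosine similarity.
query_vec = embed_pair(Image.new("RGB", (224, 224)), "What were the quarterly results?")
score = torch.nn.functional.cosine_similarity(doc_vec, query_vec, dim=0)
print(f"similarity: {score.item():.3f}")
```

In the deployed example, this embedding step runs inside the dataprep and embedding microservices, and the vectors are stored in and retrieved from a vector database rather than compared in memory.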
The MultimodalQnA architecture is shown below:
```mermaid
flowchart LR
    UI([UI server<br>]):::orchid
    ASR{{Whisper service <br>}}
    TEI_EM{{Embedding service <br>}}
    VDB{{Vector DB<br><br>}}
    R_RET{{Retriever service <br>}}
    DP([Data Preparation<br>]):::blue
    LVM_gen{{LVM Service <br>}}
    GW([MultimodalQnA GateWay<br>]):::orange
    TTS{{SpeechT5 service <br>}}

    %% Data Preparation flow
    %% Ingest data flow
    R_RET <-.->VDB
    DP <-.->VDB

    %% Audio speech recognition used for translating audio queries to text
    GW <-.-> ASR

    %% Generate spoken responses with text-to-speech using the SpeechT5 model
    GW <-.-> TTS
```
This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB and Text Generation Inference on [Intel Gaudi2](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html) and [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html), and we invite contributions from other hardware vendors to expand the example.
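
As a rough, text-only illustration of the retrieval layer named above, the following sketch stores a few captions in a Redis vector index through LangChain and runs a similarity search. The embedding model, index name, Redis URL, and metadata fields are assumptions; the deployed example uses its own dataprep and retriever microservices with BridgeTower embeddings rather than this simplified setup.

```python
# Minimal, text-only sketch of LangChain + Redis vector search (assumes a local
# Redis Stack instance at redis://localhost:6379 and that the packages
# langchain-community, redis, and sentence-transformers are installed).
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Redis

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Stand-ins for captions/transcripts produced during data preparation.
texts = [
    "Frame at 00:12: presenter shows the quarterly revenue chart.",
    "Frame at 03:45: slide listing next year's product roadmap.",
]
metadatas = [
    {"source": "earnings_call.mp4", "timestamp": 12},
    {"source": "earnings_call.mp4", "timestamp": 225},
]

rds = Redis.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    index_name="multimodalqna-demo",
    redis_url="redis://localhost:6379",
)

# Retrieve the most relevant chunk for a question; in MultimodalQnA the retrieved
# content is then passed to the LVM as context.
docs = rds.similarity_search("What were the quarterly results?", k=1)
print(docs[0].page_content, docs[0].metadata)
```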
The [Whisper Service](https://github.com/opea-project/GenAIComps/blob/main/comps/asr/src/README.md) is used by MultimodalQnA for converting audio queries to text. If a spoken response is requested, the [SpeechT5 Service](https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/README.md) translates the text response from the LVM to a speech audio file.
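
For intuition about what these two services do, the sketch below calls the underlying models directly through Hugging Face `transformers` rather than through the OPEA Whisper and SpeechT5 microservice endpoints (whose URLs and payload formats are not shown here). The model checkpoints, the input file name, and the speaker-embedding dataset are assumptions for illustration.

```python
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import (SpeechT5ForTextToSpeech, SpeechT5HifiGan,
                          SpeechT5Processor, pipeline)

# 1) ASR: turn a spoken question into text before it enters the RAG pipeline.
# "spoken_question.wav" is a hypothetical input file.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
question_text = asr("spoken_question.wav")["text"]

# 2) TTS: turn the LVM's text answer into a speech audio file.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# SpeechT5 needs a speaker embedding; this public x-vector set is a common choice.
speakers = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(speakers[7306]["xvector"]).unsqueeze(0)

answer_text = "The quarterly results improved compared to last year."
inputs = processor(text=answer_text, return_tensors="pt")
speech = tts_model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
sf.write("answer.wav", speech.numpy(), samplerate=16000)
```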
The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.
The table below describes, for each microservice component in the MultimodalQnA architecture, the default open source project configuration, hardware, port, and endpoint.