Commit bbd5344

mhbuehler, okhleif-10, HarshaRamayanam, pre-commit-ci[bot], and dmsuehir authored
MultimodalQnA audio features completion (#1698)
Signed-off-by: okhleif-IL <[email protected]>
Signed-off-by: Harsha Ramayanam <[email protected]>
Signed-off-by: Melanie Buehler <[email protected]>
Signed-off-by: dmsuehir <[email protected]>
Signed-off-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Omar Khleif <[email protected]>
Co-authored-by: Harsha Ramayanam <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Liang Lv <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
1 parent 2764a6d commit bbd5344

33 files changed (+698 −342 lines)

MultimodalQnA/README.md

Lines changed: 79 additions & 22 deletions
@@ -2,7 +2,7 @@
 
 Suppose you possess a set of videos, images, audio files, PDFs, or some combination thereof and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.
 
-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
+`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user, which can be text or audio.
 
 The MultimodalQnA architecture is shown below:
 
@@ -41,12 +41,14 @@ flowchart LR
 UI([UI server<br>]):::orchid
 end
 
+ASR{{Whisper service <br>}}
 TEI_EM{{Embedding service <br>}}
 VDB{{Vector DB<br><br>}}
 R_RET{{Retriever service <br>}}
 DP([Data Preparation<br>]):::blue
 LVM_gen{{LVM Service <br>}}
 GW([MultimodalQnA GateWay<br>]):::orange
+TTS{{SpeechT5 service <br>}}
 
 %% Data Preparation flow
 %% Ingest data flow
@@ -74,25 +76,42 @@ flowchart LR
 R_RET <-.->VDB
 DP <-.->VDB
 
+%% Audio speech recognition used for translating audio queries to text
+GW <-.-> ASR
 
+%% Generate spoken responses with text-to-speech using the SpeechT5 model
+GW <-.-> TTS
 
 ```
 
 This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB, and Text Generation Inference on [Intel Gaudi2](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html) and [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html), and we invite contributions from other hardware vendors to expand the example.
 
+The [Whisper Service](https://github.com/opea-project/GenAIComps/blob/main/comps/asr/src/README.md)
+is used by MultimodalQnA for converting audio queries to text. If a spoken response is requested, the
+[SpeechT5 Service](https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/README.md) translates the text
+response from the LVM to a speech audio file.
+
 The Intel Gaudi2 accelerator supports both training and inference for deep learning models, in particular LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.
 
 The table below describes, for each microservice component in the MultimodalQnA architecture, the default open source project, hardware, port, and endpoint.
 
 <details>
-  <summary><b>Gaudi default compose.yaml</b></summary>
-
-| MicroService | Open Source Project   | HW    | Port | Endpoint                                                    |
-| ------------ | --------------------- | ----- | ---- | ----------------------------------------------------------- |
-| Embedding    | Langchain             | Xeon  | 6000 | /v1/embeddings                                              |
-| Retriever    | Langchain, Redis      | Xeon  | 7000 | /v1/retrieval                                               |
-| LVM          | Langchain, TGI        | Gaudi | 9399 | /v1/lvm                                                     |
-| Dataprep     | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
+  <summary><b>Gaudi and Xeon default compose.yaml settings</b></summary>
+
+| MicroService | Open Source Project     | HW    | Port | Endpoint                                                    |
+| ------------ | ----------------------- | ----- | ---- | ----------------------------------------------------------- |
+| Dataprep     | Redis, Langchain, TGI   | Xeon  | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
+| Embedding    | Langchain               | Xeon  | 6000 | /v1/embeddings                                              |
+| LVM          | Langchain, Transformers | Xeon  | 9399 | /v1/lvm                                                     |
+| Retriever    | Langchain, Redis        | Xeon  | 7000 | /v1/retrieval                                               |
+| SpeechT5     | Transformers            | Xeon  | 7055 | /v1/tts                                                     |
+| Whisper      | Transformers            | Xeon  | 7066 | /v1/asr                                                     |
+| Dataprep     | Redis, Langchain, TGI   | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
+| Embedding    | Langchain               | Gaudi | 6000 | /v1/embeddings                                              |
+| LVM          | Langchain, TGI          | Gaudi | 9399 | /v1/lvm                                                     |
+| Retriever    | Langchain, Redis        | Gaudi | 7000 | /v1/retrieval                                               |
+| SpeechT5     | Transformers            | Gaudi | 7055 | /v1/tts                                                     |
+| Whisper      | Transformers            | Gaudi | 7066 | /v1/asr                                                     |
 
 </details>
 
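With the new Whisper and SpeechT5 rows in the table above, a pair of curl calls can exercise the audio endpoints once the stack is up. This is a hedged sketch: the JSON payload shapes (`byte_str` for ASR, `text` for TTS) and the `sample.wav` file are assumptions, so confirm the exact request schemas in the Whisper and SpeechT5 service READMEs linked in the diff.

```bash
# Hedged smoke tests for the new audio endpoints (ports from the table above).
# Payload shapes are assumptions; confirm them in the linked service READMEs.

# Whisper ASR on port 7066: send base64-encoded audio, expect a transcript.
# (base64 -w 0 is the GNU coreutils flag for unwrapped output.)
curl http://${host_ip}:7066/v1/asr \
  -H "Content-Type: application/json" \
  -d '{"byte_str": "'"$(base64 -w 0 sample.wav)"'"}'

# SpeechT5 TTS on port 7055: send text, expect synthesized speech back.
curl http://${host_ip}:7055/v1/tts \
  -H "Content-Type: application/json" \
  -d '{"text": "I would like a spoken answer."}'
```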
@@ -104,18 +123,41 @@ By default, the embedding and LVM models are set to a default value as listed below:
 | --------- | ----- | ----------------------------------------- |
 | embedding | Xeon  | BridgeTower/bridgetower-large-itm-mlm-itc |
 | LVM       | Xeon  | llava-hf/llava-1.5-7b-hf                  |
+| SpeechT5  | Xeon  | microsoft/speecht5_tts                    |
+| Whisper   | Xeon  | openai/whisper-small                      |
 | embedding | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
 | LVM       | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf         |
+| SpeechT5  | Gaudi | microsoft/speecht5_tts                    |
+| Whisper   | Gaudi | openai/whisper-small                      |
 
 You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf` and `llava-hf/llava-1.5-13b-hf`, as needed.
 
 ## Deploy MultimodalQnA Service
 
 The MultimodalQnA service can be effortlessly deployed on either Intel Gaudi2 or Intel Xeon Scalable Processors.
 
-Currently we support deploying MultimodalQnA services with docker compose.
+Currently we support deploying MultimodalQnA services with docker compose. The [`docker_compose`](docker_compose)
+directory has folders which include `compose.yaml` files for different hardware types:
+
+```
+📂 docker_compose
+├── 📂 amd
+│   └── 📂 gpu
+│       └── 📂 rocm
+│           ├── 📄 compose.yaml
+│           └── ...
+└── 📂 intel
+    ├── 📂 cpu
+    │   └── 📂 xeon
+    │       ├── 📄 compose.yaml
+    │       └── ...
+    └── 📂 hpu
+        └── 📂 gaudi
+            ├── 📄 compose.yaml
+            └── ...
+```
 
-### Setup Environment Variable
+### Setup Environment Variables
 
 To set up environment variables for deploying MultimodalQnA services, follow these steps:
 
@@ -124,8 +166,10 @@ To set up environment variables for deploying MultimodalQnA services, follow these steps:
 ```bash
 # Example: export host_ip=$(hostname -I | awk '{print $1}')
 export host_ip="External_Public_IP"
+
+# Append the host_ip to the no_proxy list to allow container communication
 # Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
-export no_proxy="Your_No_Proxy"
+export no_proxy="${no_proxy},${host_ip}"
 ```
 
 2. If you are in a proxy environment, also set the proxy-related environment variables:
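The hunk ends before step 2's commands appear. For illustration only, a typical setup, assuming the standard `http_proxy`/`https_proxy` variables used across OPEA compose files, would look like the following; the authoritative commands are in the README beyond this hunk.

```bash
# Illustrative only; placeholder values, not the README's exact text.
export http_proxy="http://your-proxy-host:port"
export https_proxy="http://your-proxy-host:port"
```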
@@ -137,36 +181,41 @@ To set up environment variables for deploying MultimodalQnA services, follow these steps:
 
 3. Set up other environment variables:
 
-   > Notice that you can only choose **one** command below to set up envs according to your hardware. Other that the port numbers may be set incorrectly.
+   > Choose **one** command below to set env vars according to your hardware. Otherwise, the port numbers may be set incorrectly.
 
 ```bash
 # on Gaudi
-source ./docker_compose/intel/hpu/gaudi/set_env.sh
+cd docker_compose/intel/hpu/gaudi
+source ./set_env.sh
+
 # on Xeon
-source ./docker_compose/intel/cpu/xeon/set_env.sh
+cd docker_compose/intel/cpu/xeon
+source ./set_env.sh
 ```
 
 ### Deploy MultimodalQnA on Gaudi
 
-Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) to build docker images from source.
+Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) if you would like to build docker images from
+source; otherwise, images will be pulled from Docker Hub.
 
 Find the corresponding [compose.yaml](./docker_compose/intel/hpu/gaudi/compose.yaml).
 
 ```bash
-cd GenAIExamples/MultimodalQnA/docker_compose/intel/hpu/gaudi/
+# While still in the docker_compose/intel/hpu/gaudi directory, use docker compose to bring up the services
 docker compose -f compose.yaml up -d
 ```
 
-> Notice: Currently only the **Habana Driver 1.17.x** is supported for Gaudi.
+> Notice: Currently only the **Habana Driver 1.18.x** is supported for Gaudi.
 
 ### Deploy MultimodalQnA on Xeon
 
-Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) for more instructions on building docker images from source.
+Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) if you would like to build docker images from
+source; otherwise, images will be pulled from Docker Hub.
 
 Find the corresponding [compose.yaml](./docker_compose/intel/cpu/xeon/compose.yaml).
 
 ```bash
-cd GenAIExamples/MultimodalQnA/docker_compose/intel/cpu/xeon/
+# While still in the docker_compose/intel/cpu/xeon directory, use docker compose to bring up the services
 docker compose -f compose.yaml up -d
 ```
 
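After either deploy command, it is worth confirming that the containers came up before moving on to the UI walkthrough below. The check here uses only standard `docker compose` subcommands; the service name passed to `logs` is a hypothetical example, so substitute a name from the `ps` output.

```bash
# From the same docker_compose directory used for `up -d`:
docker compose -f compose.yaml ps    # every service should report a running/Up state

# Inspect any service that is restarting or unhealthy, for example:
docker compose -f compose.yaml logs --tail=50 whisper-service   # hypothetical name; take it from `ps`
```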
@@ -190,7 +239,11 @@ docker compose -f compose.yaml up -d
 
 ### Text Query following the ingestion of an image
 
-![MultimodalQnA-video-query-screenshot](./assets/img/image-query.png)
+![MultimodalQnA-video-query-screenshot](./assets/img/image-query-text.png)
+
+### Text Query following the ingestion of an image using text-to-speech
+
+![MultimodalQnA-video-query-screenshot](./assets/img/image-query-tts.png)
 
 ### Audio Ingestion
 
@@ -202,8 +255,12 @@
 
 ### PDF Ingestion
 
-![MultimodalQnA-upload-pdf-screenshot](./assets/img/ingest_pdf.png)
+![MultimodalQnA-upload-pdf-screenshot](./assets/img/pdf-ingestion.png)
 
 ### Text query following the ingestion of a PDF
 
 ![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)
+
+### View, Refresh, and Delete ingested media in the Vector Store
+
+![MultimodalQnA-pdf-query-example-screenshot](./assets/img/vector-store.png)
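Since the headline change of this commit is that the user-facing answer can now be text or audio, a closing sketch of querying the gateway may help. The port (8888), route (`/v1/multimodalqna`), and `messages` field are assumptions based on other OPEA examples; verify them against this example's compose.yaml and README before use.

```bash
# Hedged sketch: ask the MultimodalQnA gateway a question after ingestion.
# Port 8888, the /v1/multimodalqna route, and the request shape are assumptions;
# check compose.yaml for the gateway's actual port and API.
curl http://${host_ip}:8888/v1/multimodalqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What happens in the video I uploaded?"}'
```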