<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.66.0" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,600" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="../css/normalize.css">
<link rel="stylesheet" href="../css/skeleton.css">
<link rel="stylesheet" href="../css/custom.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4.0.0/dist/css/bootstrap.min.css"
integrity="sha384-Gn5384xqQ1aoWXA+058RXPxPg6fy4IWvTNh0E263XmFcJlSAwiGgFAW/dAiS6JXm" crossorigin="anonymous">
<link rel="alternate" href="index.xml" type="application/rss+xml" title="Speech Research">
<link rel="shortcut icon" href="favicon.png" type="image/x-icon" />
<title>IMPROVING AUDIO GENERATION WITH VISUAL ENHANCED CAPTIONS</title>
<style>
audio {
width: 170px; /* set the width of the audio player */
height: 50px; /* set the height of the audio player */
}
</style>
</head>
<body rightmargin=10px leftmargin=10px topmargin="100" bottommargin="100" style="line-height:160%">
<font size="5">
<div class="container">
<header role="banner">
</header>
<main role="main">
<article itemscope itemtype="https://schema.org/BlogPosting">
<br></br>
<h1 itemprop="headline" align="center">
<font color="000093" size="6">Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
</font>
</h1>
<br></br>
<!-- <p style="line-height:1" align="center"><b>
<font color="061E61">Yi Yuan<sup>1</sup>, Dongya Jia<sup>2</sup>, Xiaobin Zhuang<sup>2</sup>,
Yuanzhe Chen<sup>2</sup>, Zhengxi Liu<sup>2</sup>, Zhuo Chen<sup>2</sup></font>
</b></p>
<p style="line-height:1" align="center"><b>
<font color="061E61">Yuping Wang<sup>2</sup>, Yuxuan Wang<sup>2</sup>, Xubo Liu<sup>1</sup>, Xiyuan Kang<sup>1</sup>
, Mark D. Plumbley<sup>1</sup>, Wenwu Wang<sup>1</sup></font>
</b></p>
<p style="line-height:0.6" align="center">
<font color="061E61"><sup>1</sup>University of Surrey</font>
</p>
<p style="line-height:0.6" align="center">
<font color="061E61"> <sup>2</sup>ByteDance</font>
</p> -->
<section itemprop="entry-text">
<br>
<div class="container">
<center>
<p><a href="https://zenodo.org/records/12606207">Dataset on Zenodo</a></p>
</center>
</div>
<h2 id="abstract">
<font color="000093">Abstract</font>
</h2>
<p style="text-align: justify;">
<font color="061E61"> Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation.
We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.
We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM).
The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information.
We then demonstrate that training the text-to-audio generation models with Sound-VECaps significantly improves the performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks,
showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online. </font>
</p>
<h2 id="note">
<font color="000093">Note</font>
</h2>
<p style="text-align: justify;">
<font color="061E61"> In this work, we present Sound-VECaps, a large-scale caption dataset generated with Large Language Models (LLMs). The prompt given to the
LLM to construct each caption combines three sources of information: visual captions from the video, audio captions from
the waveform, and the tag labels provided by the original dataset. </font>
</p>
<figure>
<p align="center"><img src="pipeline.png" width="100%" class="center" /></p>
<figcaption>
<p style="text-align: center;">
<font color="061E61"><b>Figure 1:</b> The caption generation pipeline of Sound-VECaps.</font>
</p>
</figcaption>
</figure>
<br></br>
<h2 id="Sound-VECaps Caption Demos">
<font color="000093">Sound-VECaps Caption Demos</font>
</h2>
<table class="table" align="center" style="table-layout: fixed;word-break:break-word; font-size: 14px;">
<thead>
<tr>
<td scope="col" width="18%">
<font color="061E61">Audio</font>
</td>
<td scope="col" width="18%">
<font color="061E61">WavCaps</font>
</td>
<td scope="col" width="18%">
<font color="061E61">Auto-ACD</font>
</td>
<td scope="col" width="22%">
<font color="061E61">Sound-VECaps_audio</font>
</td>
<td scope="col" width="24%">
<font color="061E61">Sound-VECaps_full</font>
</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/YCUtbzo2jqkQ.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">Dogs are barking with background noise.</font></td>
<td><font color="061E61">A dog snores loudly as it sleeps peacefully in a veterinarian's office, surrounded by other domestic animals.</font></td>
<td><font color="061E61">A dog is snoring softly while resting or sleeping, its eyes closed and tongue slightly sticking out, as the sound of domestic animals provides a gentle accompaniment.</font></td>
<td><font color="061E61">A dog, possibly a bulldog, is snoring softly while resting or sleeping on a wooden floor, its eyes closed and tongue slightly sticking out, as the sound of domestic animals in the background provides a gentle accompaniment.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/Y9eNBIVq6mNk.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">A power tool is in use.</font></td>
<td><font color="061E61">The sound of a ratchet and pawl can be heard as mechanisms are being operated in a workshop.</font></td>
<td><font color="061E61">A person is using a drill to tighten fasteners, holding a ratchet and mechanisms, in a well-lit workshop, with a toolbox nearby.</font></td>
<td><font color="061E61">A person is using a drill to tighten fasteners while holding a ratchet and mechanisms, on an orange surface, in a well-lit workshop, with a red toolbox nearby, and the camera remains constant throughout the recording.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/YpItdNzDM0_8.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">Music plays as a man sings, and there are skateboard sounds.</font></td>
<td><font color="061E61">The sound of a skateboard rolling can be heard, accompanied by background music, in a park setting.</font></td>
<td><font color="061E61">A skateboarder performs tricks on stairs and rails, accompanied by music and sounds, as people watch and take photos in a sunny outdoor setting.</font></td>
<td><font color="061E61">A skateboarder performs tricks on concrete stairs and rails while music plays in the background, accompanied by rustling and banging sounds, as people watch and take photos in a sunny outdoor setting with trees and a building.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/YKcgMyfsPYEA.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">Firecrackers pop as men converse in a noisy environment.</font></td>
<td><font color="061E61">Gunshots ring out followed by a man speaking in an urban setting, as indicated by the audio-visual label 'Firecracker; Speech; Outside, urban or manmade'.</font></td>
<td><font color="061E61">Fireworks are going off outside while a man is speaking, followed by a dark scene with bright lights illuminating from the top.</font></td>
<td><font color="061E61">Fireworks are going off outside while a man is speaking, followed by the sound of a dark, possibly nighttime scene with bright lights illuminating from the top.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/YOevrLlXH_pA.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">A group of men are speaking and making mechanical sounds.</font></td>
<td><font color="061E61">A man delivers a speech in a small room, with the audio-visual label indicating the presence of speech.</font></td>
<td><font color="061E61">An adult male is speaking in a room, gesturing with his hands and expressing himself.</font></td>
<td><font color="061E61">An adult male is speaking in a room with various items on shelves, including bottles and possibly art supplies, while gesturing with his hands and expressing himself, with a blurred effect suggesting movement or a low-quality camera.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/Y7vBIvetY4SI.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">Human sounds and music play.</font></td>
<td><font color="061E61">A cat meows while music plays in a dressing room.</font></td>
<td><font color="061E61">A man is singing along to music, accompanied by the sound of a cat meowing, as he moves around in a bathroom setting.</font></td>
<td><font color="061E61">A man with cat-like face paint and a playful expression is singing along to music, accompanied by the sound of a cat meowing, as he moves around in a bathroom or similar setting.</font></td>
</tr>
<tr>
<td scope="row"><audio controls="controls">
<source src="audio_samples/caption_sample/Yk0tIXL-c7mw.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
<td><font color="061E61">Typing, mechanisms, beeps, and ticking can be heard.</font></td>
<td><font color="061E61">The sound of a typewriter fills a small room as the person types on the keyboard.</font></td>
<td><font color="061E61">A person types away on a typewriter, feeding paper into the machine while sitting in a quiet indoor environment, possibly an office or study room, surrounded by blurred background sounds.</font></td>
<td><font color="061E61">A person types away on a vintage green typewriter with a red stripe, feeding paper into the machine while sitting in a quiet indoor environment, possibly an office or study room, surrounded by blurred background sounds.</font></td>
</tr>
</tbody>
</table>
<br></br>
<h2 id="tta-generation-demos">
<font color="000093">TTA Generation Demos (AudioLDM trained on Sound-VECaps)</font>
</h2>
<table class="table" align="center" style="table-layout: fixed;word-break:break-word; font-size: 14px;">
<thead>
<tr>
<td scope="col" width="32%">
<font color="061E61">Video</font>
</td>
<td scope="col" width="36%">
<font color="061E61">Caption</font>
</td>
<td scope="col" width="32%">
<font color="061E61">Result</font>
</td>
</tr>
</thead>
<tbody>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y_9mgOkzm-xg.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">A tattooed man is cooking in a kitchen with a white stove, using a wooden spoon to stir chopped green vegetables in a black skillet. The kitchen is filled with various containers and kitchen tools. Wood clanks on the metal pan, followed by gravel crunching as food and oil sizzle invitingly.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y_9mgOkzm-xg.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y_BSmz3SEW1w.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">In a dimly lit, rustic indoor setting, pigeons of various colors, including white, black, and brown, rustle and coo around wooden perches and feeding platforms on a rough concrete floor, creating an atmosphere reminiscent of a pigeon loft or shelter.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y_BSmz3SEW1w.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y-CcGuq0yoKo.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">A woman is speaking from a microphone at an outdoor event, likely a school function, on a stage with a green backdrop, banner with a shield-like emblem, and various plants. The weather appears clear, with several people seated on the stage and in the audience, attentively listening.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y-CcGuq0yoKo.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y-R69Fa-mCaY.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">A man uses a chainsaw to cut down a tree amid a grassy field with scattered debris. The surroundings include fallen branches, stumps, and logs. The sky is overcast with occasional sunlight filtering through, adding a peaceful yet industrious atmosphere.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y-R69Fa-mCaY.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y0a9wVat2PWk.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">A train sounds its horn while traveling on the tracks, passing through a lush, green forest with partly cloudy skies. Reflections of the dense evergreens and occasional clearings are visible in the train windows, enhancing the serene, natural ambiance. The train's motion blurs the vibrant landscape, giving a sense of considerable speed.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y0a9wVat2PWk.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y1OyEgzXCkYE.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">An adult male, likely a political figure, stands behind a podium adorned with the U.S. presidential seal, flanked by U.S. and Myanmar flags. He addresses a crowd under clear skies, discussing Myanmar's democratic progress and reconciliation, as captured in a live CNN broadcast with subtitles highlighting the ongoing peace process.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y1OyEgzXCkYE.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
<tr>
<td scope="row"><video width="300" controls>
<source src="audio_samples/generation_sample/video/Y0yxEvdnimGg.mp4" type="video/mp4">
Your browser does not support HTML video.
</video></td>
<td><font color="061E61">A dog barks as a man speaks amidst chirping birds and wind blowing into a microphone. The scene is an open grassy field with trees, scattered objects, tents, and vehicles, suggesting a park event. The dog, possibly a Border Collie or sheepdog, chases a yellow frisbee under a clear sky.</font></td>
<td><audio controls="controls" style="width: 300px;">
<source src="audio_samples/generation_sample/audio/Y0yxEvdnimGg.wav"
autoplay />Your browser does not support the audio element.
</audio></td>
</tr>
</tbody>
</table>
</p>
</section>
</article>
</main>
</div>
<script>
(function (i, s, o, g, r, a, m) {
i['GoogleAnalyticsObject'] = r; i[r] = i[r] || function () {
(i[r].q = i[r].q || []).push(arguments)
}, i[r].l = 1 * new Date(); a = s.createElement(o),
m = s.getElementsByTagName(o)[0]; a.async = 1; a.src = g; m.parentNode.insertBefore(a, m)
})(window, document, 'script', '//www.google-analytics.com/analytics.js', 'ga');
ga('create', 'UA-139981676-1', 'auto');
ga('send', 'pageview');
</script>
<script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
HTML: ["input/TeX","output/HTML-CSS"],
TeX: {
Macros: {
bm: ["\\boldsymbol{#1}", 1],
argmax: ["\\mathop{\\rm arg\\,max}\\limits"],
argmin: ["\\mathop{\\rm arg\\,min}\\limits"]},
extensions: ["AMSmath.js","AMSsymbols.js"],
equationNumbers: { autoNumber: "AMS" } },
extensions: ["tex2jax.js"],
jax: ["input/TeX","output/HTML-CSS"],
tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ],
processEscapes: true },
"HTML-CSS": { availableFonts: ["TeX"],
linebreaks: { automatic: true } }
});
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'code']
}
});
</script>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
</body>
</html>