
Commit

Update index.html
yyua8222 authored Aug 19, 2024
1 parent 6edada3 commit 4c2c2c2
Showing 1 changed file with 8 additions and 9 deletions.
index.html
@@ -36,7 +36,7 @@
<article itemscope itemtype="https://schema.org/BlogPosting">
<br></br>
<h1 itemprop="headline" align="center">
-<font color="000093" size="6">IMPROVING AUDIO GENERATION WITH VISUAL ENHANCED CAPTION
+<font color="000093" size="6">Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
</font>
</h1>
<br></br>
@@ -45,7 +45,7 @@ <h1 itemprop="headline" align="center">
Yuanzhe Chen<sup>2</sup>, Zhengxi Liu<sup>2</sup>, Zhuo Chen<sup>2</sup></font>
</b></p>
<p style="line-height:1" align="center"><b>
-<font color="061E61">Yuping Wang<sup>2</sup>, Yuxuan Wang<sup>2</sup>, Xubo Liu<sup>1</sup>
+<font color="061E61">Yuping Wang<sup>2</sup>, Yuxuan Wang<sup>2</sup>, Xubo Liu<sup>1</sup>, Xiyuan Kang<sup>1</sup>
, Mark D. Plumbley<sup>1</sup>, Wenwu Wang<sup>1</sup></font>
</b></p>
<p style="line-height:0.6" align="center">
@@ -65,13 +65,12 @@ <h2 id="abstract">
<font color="000093">Abstract</font>
</h2>
<p style="text-align: justify;">
-<font color="061E61"> Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts,
-leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data.
-In this work, we aim to create a large-scale audio dataset with rich captions for improving audio generation models. We develop an automated pipeline to generate detailed captions for audio-visual datasets by transforming predicted visual captions,
-audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). We introduce Sound-VECaps, a dataset comprising 1.66M high-quality audio-caption pairs with enriched details including audio event orders,
-occurred places and environment information. We demonstrate that training with Sound-VECaps significantly enhances the capability of text-to-audio generation models to comprehend and generate audio from complex input prompts,
-improving overall system performance. Furthermore, we conduct ablation studies of Sound-VECaps across several audio-language tasks, suggesting its potential in advancing audio-text representation learning.
-Our dataset and models are available online. </font>
+<font color="061E61"> Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation.
+We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.
+We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM).
+The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environment information.
+We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves their performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks,
+showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online. </font>
</p>
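The caption-fusion step described in the abstract (combining a predicted visual caption, an audio caption, and tagging labels into one LLM instruction) can be sketched as follows. This is a minimal illustration with an invented function name and prompt wording, not the prompt or pipeline actually used to build Sound-VECaps:

```python
def build_caption_prompt(visual_caption, audio_caption, tags):
    """Fuse the three predicted sources into a single LLM instruction.

    The prompt wording below is a hypothetical example; the Sound-VECaps
    paper's actual prompt is not reproduced here.
    """
    return (
        "Rewrite the inputs below into one comprehensive audio caption. "
        "Preserve the order of audio events, the place, and the environment.\n"
        f"Visual caption: {visual_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Tags: {', '.join(tags)}"
    )

# Invented example inputs, for illustration only:
prompt = build_caption_prompt(
    "A man stands on a busy street corner at night",
    "A siren passes, then people talk",
    ["siren", "speech", "traffic"],
)
print(prompt)
```

The fused prompt would then be sent to an LLM, whose response becomes the enriched caption paired with the audio clip.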


