From 4c2c2c21f94aea829d7d169fe61a49d13bc9a7ad Mon Sep 17 00:00:00 2001
From: yyua8222 <49046550+yyua8222@users.noreply.github.com>
Date: Mon, 19 Aug 2024 12:59:54 +0100
Subject: [PATCH] Update index.html
---
index.html | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/index.html b/index.html
index f9d33b6..b4059d7 100644
--- a/index.html
+++ b/index.html
@@ -36,7 +36,7 @@
- IMPROVING AUDIO GENERATION WITH VISUAL ENHANCED CAPTION
+ Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
@@ -45,7 +45,7 @@
Yuanzhe Chen2, Zhengxi Liu2, Zhuo Chen2
- Yuping Wang2, Yuxuan Wang2, Xubo Liu1
+ Yuping Wang2, Yuxuan Wang2, Xubo Liu1, Xiyuan Kang1
, Mark D. Plumbley1, Wenwu Wang1
@@ -65,13 +65,12 @@
- Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts,
- leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data.
- In this work, we aim to create a large-scale audio dataset with rich captions for improving audio generation models. We develop an automated pipeline to generate detailed captions for audio-visual datasets by transforming predicted visual captions,
- audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). We introduce Sound-VECaps, a dataset comprising 1.66M high-quality audio-caption pairs with enriched details including audio event orders,
- occurred places and environment information. We demonstrate that training with Sound-VECaps significantly enhances the capability of text-to-audio generation models to comprehend and generate audio from complex input prompts,
- improving overall system performance. Furthermore, we conduct ablation studies of Sound-VECaps across several audio-language tasks, suggesting its potential in advancing audio-text representation learning.
- Our dataset and models are available online.
+ Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation.
+ We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.
+ We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM).
+ The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details including audio event orders, places of occurrence and environment information.
+ We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks,
+ showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online.
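
Note (not part of the patch): the abstract above describes fusing a predicted visual caption, an audio caption, and tagging labels into one rich description via an LLM. Below is a minimal Python sketch of what one such fusion step might look like; the prompt wording, function names, and the injected llm callable are illustrative assumptions, not the authors' actual pipeline.

    from typing import Callable, Sequence

    def build_fusion_prompt(visual_caption: str, audio_caption: str,
                            tags: Sequence[str]) -> str:
        # Pack the three information sources into a single LLM instruction
        # (hypothetical wording; the paper's real prompt is not shown here).
        return (
            "Rewrite the following into one detailed audio description, "
            "keeping event order, place, and environment details:\n"
            f"Visual caption: {visual_caption}\n"
            f"Audio caption: {audio_caption}\n"
            f"Tags: {', '.join(tags)}\n"
            "Enhanced caption:"
        )

    def enhance_caption(llm: Callable[[str], str], visual_caption: str,
                        audio_caption: str, tags: Sequence[str]) -> str:
        # One pipeline step: prompt the LLM and clean up its reply.
        return llm(build_fusion_prompt(visual_caption, audio_caption, tags)).strip()

    if __name__ == "__main__":
        # Stub LLM so the sketch runs standalone; swap in a real model call.
        stub_llm = lambda prompt: "A dog barks twice in a park as wind rustles the trees."
        print(enhance_caption(stub_llm, "a dog in a park", "barking, wind", ["dog", "park"]))

Keeping the LLM as an injected callable avoids tying the sketch to any particular model API; only the prompt construction carries the pipeline's logic.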