From 4c2c2c21f94aea829d7d169fe61a49d13bc9a7ad Mon Sep 17 00:00:00 2001
From: yyua8222 <49046550+yyua8222@users.noreply.github.com>
Date: Mon, 19 Aug 2024 12:59:54 +0100
Subject: [PATCH] Update index.html
---
index.html | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/index.html b/index.html
index f9d33b6..b4059d7 100644
--- a/index.html
+++ b/index.html
@@ -36,7 +36,7 @@
- IMPROVING AUDIO GENERATION WITH VISUAL ENHANCED CAPTION
+ Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions
@@ -45,7 +45,7 @@
Yuanzhe Chen2, Zhengxi Liu2, Zhuo Chen2
- Yuping Wang2, Yuxuan Wang2, Xubo Liu1
+ Yuping Wang2, Yuxuan Wang2, Xubo Liu1, Xiyuan Kang1
, Mark D. Plumbley1, Wenwu Wang1
@@ -65,13 +65,12 @@
- Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts,
- leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data.
- In this work, we aim to create a large-scale audio dataset with rich captions for improving audio generation models. We develop an automated pipeline to generate detailed captions for audio-visual datasets by transforming predicted visual captions,
- audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). We introduce Sound-VECaps, a dataset comprising 1.66M high-quality audio-caption pairs with enriched details including audio event orders,
- occurred places and environment information. We demonstrate that training with Sound-VECaps significantly enhances the capability of text-to-audio generation models to comprehend and generate audio from complex input prompts,
- improving overall system performance. Furthermore, we conduct ablation studies of Sound-VECaps across several audio-language tasks, suggesting its potential in advancing audio-text representation learning.
- Our dataset and models are available online.
+ Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation.
+ We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.
+ We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM).
+ The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details including audio event orders, places of occurrence and environment information.
+ We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks,
+ showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online.
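
Note (not part of the patch): the abstract above describes fusing a predicted visual caption, an audio caption, and tagging labels into one rich description via an LLM. Below is a minimal Python sketch of what one such fusion step might look like; the prompt wording, function names, and the injected llm callable are illustrative assumptions, not the authors' actual pipeline.

    from typing import Callable, Sequence

    def build_fusion_prompt(visual_caption: str, audio_caption: str,
                            tags: Sequence[str]) -> str:
        # Pack the three information sources into a single LLM instruction
        # (hypothetical wording; the paper's real prompt is not shown here).
        return (
            "Rewrite the following into one detailed audio description, "
            "keeping event order, place, and environment details:\n"
            f"Visual caption: {visual_caption}\n"
            f"Audio caption: {audio_caption}\n"
            f"Tags: {', '.join(tags)}\n"
            "Enhanced caption:"
        )

    def enhance_caption(llm: Callable[[str], str], visual_caption: str,
                        audio_caption: str, tags: Sequence[str]) -> str:
        # One pipeline step: prompt the LLM and clean up its reply.
        return llm(build_fusion_prompt(visual_caption, audio_caption, tags)).strip()

    if __name__ == "__main__":
        # Stub LLM so the sketch runs standalone; swap in a real model call.
        stub_llm = lambda prompt: "A dog barks twice in a park as wind rustles the trees."
        print(enhance_caption(stub_llm, "a dog in a park", "barking, wind", ["dog", "park"]))

Keeping the LLM as an injected callable avoids tying the sketch to any particular model API; only the prompt construction carries the pipeline's logic.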