From 4c2c2c21f94aea829d7d169fe61a49d13bc9a7ad Mon Sep 17 00:00:00 2001
From: yyua8222 <49046550+yyua8222@users.noreply.github.com>
Date: Mon, 19 Aug 2024 12:59:54 +0100
Subject: [PATCH] Update index.html

---
 index.html | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/index.html b/index.html
index f9d33b6..b4059d7 100644
--- a/index.html
+++ b/index.html
@@ -36,7 +36,7 @@


-IMPROVING AUDIO GENERATION WITH VISUAL ENHANCED CAPTION
+Sound-VECaps: Improving Audio Generation With Visual Enhanced Captions



@@ -45,7 +45,7 @@

Yuanzhe Chen2, Zhengxi Liu2, Zhuo Chen2

-Yuping Wang2, Yuxuan Wang2, Xubo Liu1
+Yuping Wang2, Yuxuan Wang2, Xubo Liu1, Xiyuan Kang1

@@ -65,13 +65,12 @@

Abstract

-Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts,
-leading to potential performance degradation. We hypothesize that this problem stems from the low quality and relatively small quantity of training data.
-In this work, we aim to create a large-scale audio dataset with rich captions for improving audio generation models. We develop an automated pipeline to generate detailed captions for audio-visual datasets by transforming predicted visual captions,
-audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). We introduce Sound-VECaps, a dataset comprising 1.66M high-quality audio-caption pairs with enriched details including audio event orders,
-occurred places and environment information. We demonstrate that training with Sound-VECaps significantly enhances the capability of text-to-audio generation models to comprehend and generate audio from complex input prompts,
-improving overall system performance. Furthermore, we conduct ablation studies of Sound-VECaps across several audio-language tasks, suggesting its potential in advancing audio-text representation learning.
-Our dataset and models are available online.
+Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation.
+We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models.
+We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM).
+The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, places of occurrence, and environment information.
+We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves their performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks,
+showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online.
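
For readers of the raw patch, the caption-fusion step mentioned in the revised abstract (merging predicted visual captions, audio captions, and tagging labels into one description via an LLM) can be sketched as below. This is a minimal illustration, not the authors' released pipeline: the function names, prompt wording, and the llm callable are all assumptions.

# Illustrative sketch of an LLM-based caption-fusion step.
# All names here (build_prompt, enrich_caption, llm) are hypothetical;
# the actual Sound-VECaps pipeline and prompts may differ.

def build_prompt(visual_caption: str, audio_caption: str, tags: list[str]) -> str:
    """Merge per-clip predictions into a single instruction for an LLM."""
    return (
        "Combine the following information about a video clip into one "
        "detailed audio caption. Describe the audio events in order, the "
        "place where they occur, and the surrounding environment.\n"
        f"Visual caption: {visual_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Audio tags: {', '.join(tags)}"
    )

def enrich_caption(visual_caption: str, audio_caption: str, tags: list[str], llm) -> str:
    """`llm` is any callable mapping a prompt string to generated text,
    e.g. a thin wrapper around a chat-completion API."""
    return llm(build_prompt(visual_caption, audio_caption, tags))

if __name__ == "__main__":
    # Stand-in model for demonstration; replace with a real LLM client.
    stub_llm = lambda prompt: "A man speaks on a busy street as cars pass by."
    print(enrich_caption(
        "a man talking to a camera on a city sidewalk",
        "a man speaking with traffic noise",
        ["speech", "vehicle", "outdoor"],
        stub_llm,
    ))

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider; the same fusion logic would apply per clip across the 1.66M-pair dataset.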