Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

Zixuan Wang, Yu-Wing Tai, Chi-Keung Tang

arXiv.org Artificial Intelligence 

We introduce Audio-Agent, a multimodal framework for audio generation, editing, and composition based on text or video inputs. Conventional approaches to text-to-audio (TTA) tasks often perform single-pass inference from a text description. While straightforward, this design struggles to produce high-quality audio under complex text conditions. In our method, a pre-trained TTA diffusion network serves as the audio generation agent and works in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. Consequently, Audio-Agent produces high-quality audio that is closely aligned with the provided text or video while also supporting variable-length generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. We propose a simpler approach: fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modalities. Our framework thus provides a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.

Multimodal deep generative models have gained increasing attention in recent years. Essentially, these models are trained to perform tasks based on different kinds of input, called modalities, mimicking how humans make decisions from different senses such as vision and smell Suzuki & Matsuo (2022).
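As an illustration of the agent-style TTA pipeline described above, the following minimal sketch shows how a planner LLM could decompose a complex text condition into atomic instructions and call a pre-trained TTA backend for each one. All names and interfaces here (`run_audio_agent`, `decompose`, `tta_generate`, `Instruction`) are hypothetical stand-ins rather than the paper's actual implementation, and simple concatenation stands in for whatever composition step the full system performs.

```python
# Minimal sketch of an agent-style TTA loop: an LLM planner decomposes a
# complex text condition into atomic instructions, a pre-trained TTA model
# generates one clip per instruction, and the clips are composed (here,
# simply concatenated). All interfaces are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate for the sketch


@dataclass
class Instruction:
    """One atomic generation step produced by the planner LLM."""
    description: str   # simple text condition passed to the TTA model
    duration_s: float  # requested clip length in seconds


def run_audio_agent(
    complex_condition: str,
    decompose: Callable[[str], List[Instruction]],     # e.g. a GPT-4 call returning atomic steps
    tta_generate: Callable[[str, float], np.ndarray],  # pre-trained TTA diffusion backend
) -> np.ndarray:
    """Decompose the condition, generate each clip, and concatenate the results."""
    instructions = decompose(complex_condition)
    clips = [tta_generate(ins.description, ins.duration_s) for ins in instructions]
    return np.concatenate(clips) if clips else np.zeros(0, dtype=np.float32)


# Toy stand-ins so the sketch runs end to end without any external models.
def fake_decompose(condition: str) -> List[Instruction]:
    return [Instruction(part.strip(), 2.0) for part in condition.split(",")]


def fake_tta(description: str, duration_s: float) -> np.ndarray:
    n = int(duration_s * SAMPLE_RATE)
    rng = np.random.default_rng(abs(hash(description)) % 2**32)
    return rng.standard_normal(n).astype(np.float32)  # noise placeholder for generated audio


if __name__ == "__main__":
    audio = run_audio_agent(
        "rain on a tin roof, distant thunder, a dog barking",
        fake_decompose,
        fake_tta,
    )
    print(f"generated {audio.shape[0] / SAMPLE_RATE:.1f} s of audio")
```

Because each atomic instruction carries its own requested duration, this kind of decomposition also makes variable-length generation straightforward: the total output length is simply the sum of the per-instruction clip lengths.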