TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Hung, Chia-Yu, Majumder, Navonil, Kong, Zhifeng, Mehrish, Ambuj, Valle, Rafael, Catanzaro, Bryan, Poria, Soujanya

Dec-30-2024–arXiv.org Artificial Intelligence

A key challenge in aligning text-to-audio (TTA) models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms such as the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). We demonstrate that the audio preference dataset generated using CLAP-Ranked Preference Optimization (CRPO) outperforms existing alternatives. We open-source all code and models to support further research in TTA generation.

Audio plays a vital role in daily life and creative industries, from enhancing communication and storytelling to enriching experiences in music, sound effects, and podcasts. Recent advancements in TTA generation (Majumder et al., 2024; Ghosal et al., 2023; Liu et al., 2023; 2024b; Xue et al., 2024; Vyas et al., 2023; Huang et al., 2023b;a) offer a transformative approach, enabling the automatic creation of diverse and expressive audio content directly from textual descriptions. This technology holds immense potential to streamline audio production workflows and unlock new possibilities in multimedia content creation.

However, many existing models face challenges with controllability, occasionally struggling to fully capture the details in the input prompts, especially when the prompts are complex. This can result in generated audio that omits certain events or diverges from the user's intent. At times, the generated audio may even contain input-adjacent but unmentioned and unintended events that could be characterized as hallucinations.

In contrast, the recent advancements in LLMs (Ouyang et al., 2022) have been significantly driven by the alignment stage that follows pre-training and supervised fine-tuning. This alignment stage, often leveraging reinforcement learning from human feedback (RLHF) or other reward-based optimization methods, aligns the generated outputs with human preferences, ethical considerations, and task-specific requirements (Ouyang et al., 2022).
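To make the notion of a preference pair concrete, here is a minimal sketch of how ranked candidates could be turned into one. It assumes that several audio candidates have already been generated for a prompt and that each has a scalar text-audio similarity score (for example, from a CLAP-style model, as the title suggests). The names `PreferencePair` and `build_preference_pair` are hypothetical, and the best-versus-worst selection rule is an illustrative assumption, not necessarily the procedure used by the authors.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PreferencePair:
    """One training example for preference optimization (hypothetical type)."""
    prompt: str
    chosen: str    # identifier of the preferred audio candidate
    rejected: str  # identifier of the dispreferred audio candidate


def build_preference_pair(
    prompt: str,
    candidates: Sequence[str],
    scores: Sequence[float],
) -> PreferencePair:
    """Rank candidates by score and pair the best with the worst.

    Assumption: a higher score means the audio matches the prompt better.
    """
    if len(candidates) != len(scores):
        raise ValueError("candidates and scores must have the same length")
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")

    # Sort (score, candidate) tuples from highest to lowest score.
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return PreferencePair(prompt=prompt, chosen=ranked[0][1], rejected=ranked[-1][1])


# Usage with made-up scores for three hypothetical candidates.
pair = build_preference_pair(
    "a dog barks while rain falls",
    ["cand_a.wav", "cand_b.wav", "cand_c.wav"],
    [0.31, 0.72, 0.15],
)
```

Pairing the top- and bottom-ranked candidates maximizes the score gap within a pair; other selection rules (for example, adjacent ranks or a minimum-margin threshold) would fit the same interface.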

large language model, machine learning, natural language

Country:
- North America > United States > Minnesota (0.28)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Leisure & Entertainment (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)