
Collaborating Authors

 Valle, Rafael


Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

arXiv.org Artificial Intelligence

Understanding and reasoning over non-speech sounds and music are crucial for both humans and AI agents to interact effectively with their environments. In this paper, we introduce Audio Flamingo 2 (AF2), an Audio-Language Model (ALM) with advanced audio understanding and reasoning capabilities. AF2 leverages (i) a custom CLAP model, (ii) synthetic Audio QA data for fine-grained audio reasoning, and (iii) a multi-stage curriculum learning strategy. AF2 achieves state-of-the-art performance with only a 3B parameter small language model, surpassing large open-source and proprietary models across over 20 benchmarks. Next, for the first time, we extend audio understanding to long audio segments (30 secs to 5 mins) and propose LongAudio, a large and novel dataset for training ALMs on long audio captioning and question-answering tasks. Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert annotated benchmark for evaluating ALMs on long audio understanding capabilities. We conduct extensive ablation studies to confirm the efficacy of our approach. Project Website: https://research.nvidia.com/labs/adlr/AF2/.
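Multi-stage curriculum learning, one of the ingredients listed above, can be pictured as training the same model through successive data mixes of increasing difficulty. The sketch below is a minimal illustration of that pattern; the stage names, dataset mixes, and the `set_lm_trainable`, `make_loader`, and `train_one_epoch` helpers are assumptions for illustration, not AF2's actual training recipe.

```python
# Minimal sketch of a multi-stage curriculum (illustrative assumptions throughout).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Stage:
    name: str
    datasets: Sequence[str]   # dataset identifiers to mix in this stage
    epochs: int
    unfreeze_lm: bool         # whether the language-model backbone is trained

def run_curriculum(model, stages: Sequence[Stage],
                   make_loader: Callable, train_one_epoch: Callable):
    """Train `model` through successive stages of increasing difficulty."""
    for stage in stages:
        model.set_lm_trainable(stage.unfreeze_lm)   # assumed model API
        loader = make_loader(stage.datasets)        # mixes the listed datasets
        for _ in range(stage.epochs):
            train_one_epoch(model, loader)

# Example stage schedule (purely illustrative):
CURRICULUM = [
    Stage("align", ["audio_caption_pairs"],             epochs=1, unfreeze_lm=False),
    Stage("sft",   ["audio_qa", "audio_caption_pairs"], epochs=2, unfreeze_lm=True),
    Stage("long",  ["long_audio_qa"],                   epochs=1, unfreeze_lm=True),
]
```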


UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

arXiv.org Artificial Intelligence

Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
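One way to read "jointly learn a representation encoder and generative audio decoder" is a single pre-training objective that sums a discriminative masked-prediction term on the encoder with a generative reconstruction term on the decoder. The sketch below illustrates that idea with stand-in GRU modules, k-means-style pseudo-labels, and an L1 reconstruction loss; these module choices and the equal loss weighting are assumptions, not UniWav's actual design.

```python
# Hedged sketch: combining a masked-prediction loss (encoder) with a reconstruction
# loss (decoder) as a single unified pre-training objective.
import torch
import torch.nn as nn

class UnifiedPretrainer(nn.Module):
    def __init__(self, dim=256, codebook=512):
        super().__init__()
        self.encoder = nn.GRU(80, dim, batch_first=True)   # stand-in speech encoder
        self.rep_head = nn.Linear(dim, codebook)            # predicts pseudo-labels
        self.decoder = nn.GRU(dim, dim, batch_first=True)   # stand-in generative decoder
        self.recon_head = nn.Linear(dim, 80)                 # reconstructs input features

    def forward(self, feats, pseudo_labels, mask):
        h, _ = self.encoder(feats * (~mask).unsqueeze(-1))   # zero out masked frames
        rep_loss = nn.functional.cross_entropy(
            self.rep_head(h)[mask], pseudo_labels[mask])     # discriminative term
        d, _ = self.decoder(h)
        gen_loss = nn.functional.l1_loss(self.recon_head(d), feats)  # generative term
        return rep_loss + gen_loss

model = UnifiedPretrainer()
feats = torch.randn(2, 100, 80)                 # (batch, frames, mel bins)
labels = torch.randint(0, 512, (2, 100))        # e.g. k-means pseudo-labels
mask = torch.rand(2, 100) < 0.3                 # ~30% masked frames
loss = model(feats, labels, mask)
```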


Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance

arXiv.org Artificial Intelligence

While autoregressive speech token generation models produce speech with remarkable variety and naturalness, their inherent lack of controllability often results in issues such as hallucinations and undesired vocalizations that do not conform to conditioning inputs. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models that address these challenges by incorporating preference alignment techniques guided by automatic speech recognition and speaker verification models. Additionally, we incorporate classifier-free guidance to further improve synthesis adherence to the transcript and reference speaker audio. Our experiments demonstrate that these optimizations significantly enhance target speaker similarity, intelligibility, and naturalness of synthesized speech. Notably, Koel-TTS directly maps text and context audio to acoustic tokens, and on the aforementioned metrics, outperforms state-of-the-art TTS models, despite being trained on a significantly smaller dataset. Audio samples and demos are available on our website.
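Classifier-free guidance for an autoregressive token decoder typically blends the logits of a conditioned and an unconditioned forward pass before sampling. The sketch below shows that standard formulation; the `model` signature, the `null_cond` placeholder, and the guidance scale are illustrative assumptions rather than Koel-TTS's exact implementation.

```python
# Hedged sketch of classifier-free guidance over next-token logits.
import torch

@torch.no_grad()
def cfg_next_token(model, tokens, cond, null_cond, scale=2.0, temperature=1.0):
    """Sample the next acoustic token with classifier-free guidance.
    `model(tokens, cond)` is assumed to return logits of shape (batch, steps, vocab)."""
    logits_cond = model(tokens, cond)[:, -1]         # conditioned on text/speaker audio
    logits_uncond = model(tokens, null_cond)[:, -1]  # conditioning dropped
    logits = logits_uncond + scale * (logits_cond - logits_uncond)
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```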


A2SB: Audio-to-Audio Schrodinger Bridges

arXiv.org Artificial Intelligence

Audio in the real world may be perturbed due to numerous factors, causing the audio quality to be degraded. The following work presents an audio restoration model tailored for high-res music at 44.1kHz, Audio-to-Audio Schrodinger Bridge (A2SB), which is capable of both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). A2SB is end-to-end without need of a vocoder to predict waveform outputs, able to restore hour-long audio inputs, and trained on permissively licensed music data. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets. Our demo website is https://research.nvidia.com/labs/adlr/A2SB/.

Audio in the real world may be perturbed due to numerous factors such as recording devices, data compression, and online transfer. For instance, certain recording devices and compression methods may result in a low sampling rate, and online transfer may cause a short audio segment to be lost. These problems are usually ill-posed (Narayanaswamy et al., 2021; Moliner et al., 2023) and are usually solved with data-driven generative models. Many of these methods are task-specific, designed for the speech domain, or trained to only restore the degraded magnitude, which requires an additional vocoder to transform the restored magnitude into a waveform.
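The two restoration tasks named above, bandwidth extension and inpainting, correspond to two simple corruption patterns on a time-frequency representation: removing high-frequency bins and removing a contiguous block of frames. The sketch below constructs both corruptions on an STFT magnitude; the cutoff frequency, gap length, and STFT parameters are illustrative assumptions, and the sketch does not reproduce A2SB's Schrodinger-bridge model itself.

```python
# Hedged sketch: constructing bandwidth-extension and inpainting inputs on an STFT grid.
import numpy as np

def degrade(stft_mag: np.ndarray, sr=44100, n_fft=2048, hop=512,
            cutoff_hz=8000.0, gap=(2.0, 2.5)):
    """Return (bandwidth-limited copy, time-inpainting copy) of a magnitude STFT."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)       # bin center frequencies
    bwe = stft_mag.copy()
    bwe[freqs > cutoff_hz, :] = 0.0                  # drop high-frequency content
    inpaint = stft_mag.copy()
    f0, f1 = int(gap[0] * sr / hop), int(gap[1] * sr / hop)
    inpaint[:, f0:f1] = 0.0                          # drop a short time segment
    return bwe, inpaint

mag = np.abs(np.random.randn(1025, 400))             # (freq bins, frames) stand-in
bwe_input, inpaint_input = degrade(mag)
```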


TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

arXiv.org Artificial Intelligence

A key challenge in aligning TTA models lies in the difficulty of creating preference pairs, as TTA lacks structured mechanisms like the verifiable rewards or gold-standard answers available for Large Language Models (LLMs). We demonstrate that the audio preference dataset generated using CLAP-Ranked Preference Optimization (CRPO) outperforms existing alternatives. We open-source all code and models to support further research in TTA generation.

Audio plays a vital role in daily life and creative industries, from enhancing communication and storytelling to enriching experiences in music, sound effects, and podcasts. Recent advancements in text-to-audio (TTA) generation (Majumder et al., 2024; Ghosal et al., 2023; Liu et al., 2023; 2024b; Xue et al., 2024; Vyas et al., 2023; Huang et al., 2023b;a) offer a transformative approach, enabling the automatic creation of diverse and expressive audio content directly from textual descriptions. This technology holds immense potential to streamline audio production workflows and unlock new possibilities in multimedia content creation. However, many existing models face challenges with controllability, occasionally struggling to fully capture the details in the input prompts, especially when the prompts are complex. This can result in generated audio that omits certain events or diverges from the user intent. At times, the generated audio may even contain input-adjacent but unmentioned and unintended events that could be characterized as hallucinations. In contrast, recent advancements in Large Language Models (LLMs) (Ouyang et al., 2022) have been significantly driven by the alignment stage after pre-training and supervised fine-tuning. This alignment stage, often leveraging reinforcement learning from human feedback (RLHF) or other reward-based optimization methods, aligns the generated outputs with human preferences, ethical considerations, and task-specific requirements (Ouyang et al., 2022).
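At a high level, CLAP-ranked preference data can be built by sampling several candidate generations per prompt, scoring them with a CLAP text-audio similarity, and keeping the best and worst candidates as a (chosen, rejected) pair for preference optimization. The sketch below illustrates that loop; `generate_audio` and `clap_similarity` are assumed callables and the candidate count is arbitrary, so this is a reading of the abstract rather than TangoFlux's actual CRPO pipeline.

```python
# Hedged sketch of constructing CLAP-ranked preference pairs.
from typing import Callable, List, Tuple

def make_preference_pairs(prompts: List[str],
                          generate_audio: Callable[[str], object],
                          clap_similarity: Callable[[str, object], float],
                          n_candidates: int = 4) -> List[Tuple[str, object, object]]:
    pairs = []
    for prompt in prompts:
        candidates = [generate_audio(prompt) for _ in range(n_candidates)]
        ranked = sorted(candidates, key=lambda a: clap_similarity(prompt, a))
        chosen, rejected = ranked[-1], ranked[0]   # highest vs. lowest CLAP score
        pairs.append((prompt, chosen, rejected))   # feed to a preference-optimization loss
    return pairs
```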


ETTA: Elucidating the Design Space of Text-to-Audio Models

arXiv.org Artificial Intelligence

Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.
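For reference, the conditional flow-matching objective studied in this design space regresses a velocity field along a straight interpolation path between noise and data. The sketch below is the standard (rectified-flow-style) form of that loss; `velocity_net` and the conditioning are placeholders, not ETTA's architecture.

```python
# Hedged sketch of a standard conditional flow-matching training loss.
import torch

def flow_matching_loss(velocity_net, x1, cond):
    """x1: clean latents (B, ...); velocity_net(x_t, t, cond) predicts d x_t / d t."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                                # noise sample
    t = torch.rand(b, *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1                            # linear interpolation path
    target_v = x1 - x0                                       # constant velocity along path
    pred_v = velocity_net(x_t, t.flatten(), cond)
    return torch.nn.functional.mse_loss(pred_v, target_v)
```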


OMCAT: Omni Context Aware Transformer

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and a new model, called OCTAV and OMCAT respectively. OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is https://om-cat.github.io.

[Figure 1: Illustration of a video sequence from our proposed OCTAV dataset.]

Large language models (LLMs) (Achiam et al., 2023; Touvron et al., 2023) have achieved remarkable breakthroughs in both text generation and comprehension tasks (McKeown, 1992; Achiam et al., 2023). Since then, significant progress has been made to extend LLMs to multimodal LLMs (Cheng et al., 2024; Li et al., 2023b; Maaz et al., 2023; Li et al., 2024), which integrate visual and audio inputs with textual instructions to provide understanding in multimodal contexts (Yang et al., 2022b; Chen et al., 2023a;b). However, these models still face challenges in handling fine-grained, cross-modal temporal understanding when both audio and video are provided. In this paper, we address these limitations by proposing a new dataset, OCTAV, and a model called OMCAT. The Omni Context and Temporal Audio Video dataset, OCTAV, consists of question-answer pairs for a video. The Omni Context Aware Transformer, OMCAT, addresses the limitations of existing models (Maaz et al., 2023; Tang et al., 2024; Su et al., 2023; Cheng et al., 2024) through a unified audio and visual language model that effectively incorporates time representations to ground the modalities temporally.
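RoTE is described above as an extension of RoPE for temporal grounding; a natural reading is a rotary embedding whose rotation angles are driven by absolute timestamps (e.g., seconds into the video) rather than token positions. The sketch below implements that reading with the usual RoPE frequency schedule and interleaved-pair rotation; none of it is claimed to match OMCAT's exact formulation.

```python
# Hedged sketch: rotary embeddings driven by absolute timestamps instead of positions.
import torch

def rotary_time_embed(x: torch.Tensor, timestamps: torch.Tensor, base=10000.0):
    """x: (B, T, D) with even D; timestamps: (B, T) in seconds."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = timestamps.unsqueeze(-1) * inv_freq       # (B, T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                # pair up feature dims
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)                         # back to (B, T, D)
```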


Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

arXiv.org Artificial Intelligence

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audio. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging: the generated data must not only be acoustically consistent with the underlying small-scale dataset, but must also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.
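The two-part recipe above, an aligned T2A model plus LLM-generated and iteratively refined captions, can be sketched as a generate-filter-refine loop. In the sketch below, `llm_generate_captions`, `t2a_generate`, and `is_consistent` are assumed callables standing in for the LLM prompting, the preference-aligned T2A model, and the consistency check; the loop structure is an illustration, not Synthio's exact algorithm.

```python
# Hedged sketch of a generate-filter-refine augmentation loop.
def synthesize_augmentations(class_names, llm_generate_captions, t2a_generate,
                             is_consistent, per_class=50, max_rounds=3):
    """Return a list of (synthetic_audio, label) pairs to mix into the real dataset."""
    augmented = []
    for label in class_names:
        kept, rejected_captions = [], []
        for _ in range(max_rounds):
            # Ask the LLM for diverse captions, passing back rejected ones for refinement.
            captions = llm_generate_captions(label, rejected_captions)
            for cap in captions:
                audio = t2a_generate(cap)          # preference-aligned T2A model
                if is_consistent(audio, label):
                    kept.append((audio, label))
                else:
                    rejected_captions.append(cap)
            if len(kept) >= per_class:
                break
        augmented.extend(kept[:per_class])
    return augmented
```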


Improving Text-To-Audio Models with Synthetic Captions

arXiv.org Artificial Intelligence

It is an open challenge to obtain high-quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements in audio generation quality, achieving a new state-of-the-art.
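A minimal version of such a pipeline captions each clip with an audio language model, then keeps only captions that a text-audio similarity model judges faithful. The sketch below follows that pattern with CLAP-style scoring; `alm_caption`, `clap_similarity`, the candidate count, and the threshold are assumptions for illustration, not the paper's reported settings.

```python
# Hedged sketch: caption with an audio language model, filter by text-audio similarity.
def build_synthetic_captions(audio_clips, alm_caption, clap_similarity,
                             n_candidates=3, threshold=0.45):
    dataset = []
    for clip in audio_clips:
        candidates = [alm_caption(clip) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: clap_similarity(c, clip))
        if clap_similarity(best, clip) >= threshold:   # drop low-confidence captions
            dataset.append((clip, best))
    return dataset
```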


Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

arXiv.org Artificial Intelligence

Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain [...]

Despite their remarkable achievements, LLM-based TTS models suffer from attention errors resulting in mis-aligned speech, repeating and missing words, analogous to hallucinations [15, 16] exhibited by LLMs in the text domain. This issue becomes more prominent when the input text is challenging and contains repeating words. For certain inputs, the probabilistic autoregressive inference of LLM-based TTS models can result in looping or infinite silences [17]. This issue makes LLM-based TTS models unreliable for real-world applications.
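One common way to make cross-attention in an encoder-decoder TTS model behave monotonically is to bias the attention logits with a near-diagonal prior over text and speech positions. The sketch below shows that generic technique; it is illustrative only and is not claimed to be the method proposed in this paper.

```python
# Hedged sketch: a generic near-diagonal prior added to cross-attention logits.
import torch

def diagonal_attention_prior(n_text: int, n_speech: int, sigma: float = 0.1):
    """Log-prior (n_speech, n_text) peaking along the text-speech diagonal."""
    text_pos = torch.linspace(0, 1, n_text).unsqueeze(0)      # (1, n_text)
    speech_pos = torch.linspace(0, 1, n_speech).unsqueeze(1)  # (n_speech, 1)
    prior = torch.exp(-((speech_pos - text_pos) ** 2) / (2 * sigma ** 2))
    return torch.log(prior + 1e-8)

def biased_cross_attention(scores: torch.Tensor) -> torch.Tensor:
    """scores: (batch, heads, n_speech, n_text) raw attention logits."""
    prior = diagonal_attention_prior(scores.shape[-1], scores.shape[-2])
    return torch.softmax(scores + prior, dim=-1)              # broadcast over batch/heads
```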