audio prompt
Live Music Models
Lyria Team, Caillon, Antoine, McWilliams, Brian, Tarakajian, Cassie, Simon, Ian, Manco, Ilaria, Engel, Jesse, Constant, Noah, Li, Yunpeng, Denk, Timo I., Lalama, Alberto, Agostinelli, Andrea, Huang, Cheng-Zhi Anna, Manilow, Ethan, Brower, George, Erdogan, Hakan, Lei, Heidi, Rolnick, Itai, Grishchenko, Ivan, Orsini, Manu, Kastelic, Matej, Zuluaga, Mauricio, Verzetti, Mauro, Dooley, Michael, Skopek, Ondrej, Ferrer, Rafael, Petridis, Savvas, Borsos, Zalán, Oord, Aäron van den, Eck, Douglas, Collins, Eli, Baldridge, Jason, Hume, Tom, Donahue, Chris, Han, Kehang, Roberts, Adam
We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
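As a rough illustration of the "live" generation loop described in the abstract above, here is a toy chunked-streaming sketch: audio is produced chunk by chunk while the user can swap the steering prompt between chunks. The encoder, model, chunk size, and sample rate are placeholder assumptions, not the Magenta RealTime or Lyria RealTime API.

```python
# Hypothetical sketch of a live music generation loop: audio is produced chunk
# by chunk and the steering prompt can change mid-stream. All components here
# are stand-ins for illustration only.
import numpy as np

CHUNK_SECONDS = 2.0
SAMPLE_RATE = 48_000

def encode_prompt(text: str) -> np.ndarray:
    """Stand-in for a text/audio style encoder returning a style embedding."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(128)

def generate_chunk(style: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Stand-in for the model: returns the next chunk of audio samples.

    The style embedding and rolling audio context are unused by this dummy."""
    n = int(CHUNK_SECONDS * SAMPLE_RATE)
    return 0.1 * np.sin(2 * np.pi * 220 * np.arange(n) / SAMPLE_RATE)

def live_stream(prompt_queue, num_chunks: int = 8):
    """Yield audio chunks, re-reading the user's prompt before each chunk."""
    style = encode_prompt("ambient piano")
    context = np.zeros(0)
    for _ in range(num_chunks):
        if prompt_queue:                      # user changed the prompt mid-stream
            style = encode_prompt(prompt_queue.pop(0))
        chunk = generate_chunk(style, context)
        context = np.concatenate([context, chunk])[-SAMPLE_RATE * 10:]
        yield chunk

if __name__ == "__main__":
    prompts = ["lo-fi hip hop"]               # queued prompt change
    total = sum(len(c) for c in live_stream(prompts))
    print(f"streamed {total / SAMPLE_RATE:.1f} s of audio")
```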
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Lei, Shun, Xu, Yaoxun, Lin, Zhiwei, Zhang, Huaicheng, Tan, Wei, Chen, Hangting, Yu, Jianwei, Zhang, Yixuan, Yang, Chenyu, Zhu, Haina, Wang, Shuai, Wu, Zhiyong, Yu, Dong
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language model based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
- Leisure & Entertainment (1.00)
- Media > Music (0.89)
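LeVo's multi-preference alignment builds on Direct Preference Optimization; below is a minimal sketch of the standard DPO objective over preferred and rejected generations. The tensor names and the beta value are illustrative, and the paper's semi-automatic preference data construction and post-training loop are not shown.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss over summed
# token log-probabilities of preferred ("chosen") vs. rejected songs. Shapes
# and beta are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO: maximize the margin of the policy over a frozen reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

if __name__ == "__main__":
    b = 4  # batch of preference pairs
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(float(loss))
```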
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling
Tang, Jingjing, Wang, Xin, Zhang, Zhe, Yamagishi, Junichi, Wiggins, Geraint, Fazekas, George
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
- Asia > Japan (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.50)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)
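As a hedged sketch of how a codec language model like MIDI-VALLE might lay out its conditioning, the snippet below concatenates reference MIDI tokens, reference audio codec tokens, and target MIDI tokens into one prompt that an autoregressive decoder would continue with target audio tokens. The special tokens, dummy decoding rule, and function names are assumptions for illustration, not the released model.

```python
# Hypothetical conditioning layout for performance-MIDI-to-audio generation:
# [BOS] ref_midi [SEP] ref_audio [SEP] target_midi [SEP] -> target audio tokens.
from typing import List

BOS, SEP = 0, 1  # hypothetical special tokens

def build_prompt(ref_midi: List[int], ref_audio: List[int],
                 target_midi: List[int]) -> List[int]:
    """Concatenate reference MIDI, reference audio, and target MIDI tokens."""
    return [BOS] + ref_midi + [SEP] + ref_audio + [SEP] + target_midi + [SEP]

def generate_audio_tokens(prompt: List[int], steps: int = 16) -> List[int]:
    """Stand-in for autoregressive decoding of codec tokens for the target audio."""
    out = []
    for t in range(steps):
        out.append(2 + (sum(prompt) + t) % 1024)   # dummy next-token rule
    return out

if __name__ == "__main__":
    prompt = build_prompt(ref_midi=[10, 11, 12], ref_audio=[40, 41, 42, 43],
                          target_midi=[10, 13, 15])
    print(generate_audio_tokens(prompt)[:8])
```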
TAIL: Text-Audio Incremental Learning
Sun, Yingfei, Gu, Xu, Ji, Wei, Zhao, Hanbin, Fei, Hao, Yin, Yifang, Zimmermann, Roger
Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task, Text-Audio Incremental Learning (TAIL), for text-audio retrieval, and propose a new method, PTAT, Prompt Tuning for Audio-Text incremental learning. This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on AudioCaps, Clotho, BBC Sound Effects and Audioset datasets, and our method outperforms previous methods significantly, particularly demonstrating stronger resistance to forgetting on older datasets. Compared to the full-parameters Finetune (Sequential) method, our model only requires 2.42% of its parameters, achieving 4.46% higher performance.
- Asia > China (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Singapore (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
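The PTAT recipe above combines learnable prompts on a frozen backbone with an audio-text similarity loss and feature distillation against the previous model's features. The sketch below shows one plausible shape for those pieces; the module names, dimensions, and loss weights are illustrative assumptions rather than the released implementation.

```python
# Hedged sketch: prompt tuning on a frozen encoder plus a contrastive
# audio-text loss and a distillation term against the previous model's features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedEncoder(nn.Module):
    def __init__(self, dim=256, prompt_len=8):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        for p in self.backbone.parameters():       # backbone stays frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, x):                           # x: (B, T, dim)
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.backbone(torch.cat([prompts, x], dim=1))
        return h.mean(dim=1)                        # pooled feature

def incremental_loss(audio_feat, text_feat, old_audio_feat, alpha=1.0):
    """Contrastive audio-text similarity + distillation to the previous model."""
    a = F.normalize(audio_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = a @ t.t() / 0.07
    labels = torch.arange(a.size(0))
    contrastive = F.cross_entropy(logits, labels)
    distill = F.mse_loss(audio_feat, old_audio_feat.detach())
    return contrastive + alpha * distill

if __name__ == "__main__":
    enc = PromptedEncoder()
    feat = enc(torch.randn(4, 20, 256))
    loss = incremental_loss(feat, torch.randn(4, 256), torch.randn(4, 256))
    print(float(loss))
```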
Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM
Yu, Jiawei, Li, Yuang, Qiao, Xiaosong, Zhao, Huan, Zhao, Xiaofeng, Tang, Wei, Zhang, Min, Yang, Hao, Su, Jinsong
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems using text-only corpora, thereby reducing the cost of labeling real speech data. Existing research primarily utilizes additional text data and predefined speech styles supported by TTS models. In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data. Rather than using predefined speech styles, we introduce a hard prompt selection method with zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize. Experiments demonstrate that Hard-Synth significantly enhances the Conformer model, achieving relative word error rate (WER) reductions of 6.5%/4.4% on LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is data-efficient and capable of reducing bias in ASR.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (5 more...)
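The hard prompt selection step above can be pictured as scoring candidate speech prompts by how badly the ASR model transcribes them and keeping the hardest ones as zero-shot TTS voice prompts. Below is a small, self-contained sketch; the `asr_transcribe` callable and the WER helper are placeholders, not the paper's pipeline.

```python
# Hypothetical hard-prompt selection: keep the candidate prompts with the
# highest ASR word error rate.
from typing import Callable, List, Tuple

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def select_hard_prompts(candidates: List[Tuple[str, str]],
                        asr_transcribe: Callable[[str], str],
                        top_k: int = 2) -> List[str]:
    """Return audio paths whose ASR transcripts have the highest WER."""
    scored = [(wer(text, asr_transcribe(path)), path) for path, text in candidates]
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]

if __name__ == "__main__":
    fake_asr = lambda p: "the quick brown fox" if "easy" in p else "a quack brawn"
    pool = [("easy.wav", "the quick brown fox"), ("hard.wav", "the quick brown fox")]
    print(select_hard_prompts(pool, fake_asr, top_k=1))   # ['hard.wav']
```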
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Zhang, Yu, Jiang, Ziyue, Li, Ruiqi, Pan, Changhao, He, Jinzheng, Huang, Rongjie, Wang, Chuxin, Zhao, Zhou
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at https://tcsinger.github.io/.
- Media > Music (0.93)
- Leisure & Entertainment (0.93)
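The clustering style encoder relies on vector quantization of style features. The sketch below shows a generic VQ layer with a straight-through estimator and commitment loss, one common way to realize such a module; the codebook size, dimensions, and loss form are assumptions, not TCSinger's code.

```python
# Hedged sketch of a style vector-quantization layer: snap each style feature
# to its nearest codebook entry with a straight-through gradient.
import torch
import torch.nn as nn

class StyleVQ(nn.Module):
    def __init__(self, num_codes=64, dim=192):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, style):                              # style: (B, dim)
        dists = torch.cdist(style, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)
        quantized = self.codebook(idx)
        # straight-through: quantized values forward, gradients copied to style
        quantized = style + (quantized - style).detach()
        commit_loss = ((quantized.detach() - style) ** 2).mean()
        return quantized, idx, commit_loss

if __name__ == "__main__":
    vq = StyleVQ()
    z, idx, loss = vq(torch.randn(8, 192))
    print(z.shape, idx.tolist(), float(loss))
```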
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
Wu, Haibin, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Tompkins, Daniel, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See https://aka.ms/emoctrl-tts for demo samples.
- North America > United States (0.14)
- Asia > Taiwan (0.04)
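EmoCtrl-TTS conditions the flow-matching model on time-varying arousal/valence values and laughter embeddings. A minimal sketch of one possible per-frame conditioning interface follows; the projection layer, dimensions, and tensor shapes are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical time-varying emotion conditioner: per-frame arousal, valence,
# and laughter embeddings are concatenated and projected to the model width.
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    def __init__(self, laugh_dim=32, model_dim=256):
        super().__init__()
        self.proj = nn.Linear(2 + laugh_dim, model_dim)

    def forward(self, arousal, valence, laugh_emb):
        # arousal, valence: (B, T, 1); laugh_emb: (B, T, laugh_dim)
        cond = torch.cat([arousal, valence, laugh_emb], dim=-1)
        return self.proj(cond)                  # (B, T, model_dim), added per frame

if __name__ == "__main__":
    B, T = 2, 100
    cond = EmotionConditioner()(torch.rand(B, T, 1), torch.rand(B, T, 1),
                                torch.randn(B, T, 32))
    print(cond.shape)   # torch.Size([2, 100, 256])
```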
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Xia, Yufei, Li, Jinzhu, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
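Of the training strategies listed, random noise mixing is the simplest to picture: noise is added to the audio prompt at a randomly sampled signal-to-noise ratio during fine-tuning so the model learns to tolerate noisy prompts. The sketch below assumes a uniform SNR range in dB; the range and helper names are illustrative.

```python
# Sketch of random noise mixing for prompt-robust fine-tuning: scale noise so
# the mixture hits a randomly drawn SNR, then add it to the audio prompt.
import numpy as np

def mix_at_random_snr(prompt: np.ndarray, noise: np.ndarray,
                      snr_db_range=(0.0, 20.0), rng=None) -> np.ndarray:
    """Return prompt + scaled noise at an SNR drawn uniformly from snr_db_range."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = np.resize(noise, prompt.shape)      # crop/tile noise to prompt length
    p_pow = np.mean(prompt ** 2) + 1e-12
    n_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_pow / (n_pow * 10 ** (snr_db / 10)))
    return prompt + scale * noise

if __name__ == "__main__":
    sr = 16_000
    prompt = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    noisy = mix_at_random_snr(prompt, np.random.randn(2 * sr))
    print(noisy.shape)
```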
Meta's open-source ImageBind AI aims to mimic human perception
Meta is open-sourcing an AI tool called ImageBind that predicts connections between data similar to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link text, images / videos, audio, 3D measurements (depth), temperature data (thermal), and motion data (from inertial measurement units) -- and it does this without having to first train on every possibility. It's an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three). You could view ImageBind as moving machine learning closer to human learning.
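As a toy picture of the shared-embedding idea described above: once each modality is encoded into one space, any modality can be matched against any other by cosine similarity. The encoders below are random stand-ins, not ImageBind's; only the retrieval step is sketched.

```python
# Toy cross-modal retrieval over a shared embedding space. fake_encoder stands
# in for modality-specific encoders (text, image, audio, ...).
import numpy as np

DIM = 64

def fake_encoder(seed: int) -> np.ndarray:
    """Stand-in for a modality-specific encoder producing a unit-norm embedding."""
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def best_match(query: np.ndarray, candidates: dict) -> str:
    """Return the candidate key with the highest cosine similarity to the query."""
    return max(candidates, key=lambda k: float(query @ candidates[k]))

if __name__ == "__main__":
    text_emb = fake_encoder(1)                 # e.g. "a dog barking"
    audio_bank = {"bark.wav": fake_encoder(1), "rain.wav": fake_encoder(2)}
    print(best_match(text_emb, audio_bank))    # bark.wav (identical stand-in embedding)
```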
AudioGen: Textually Guided Audio Generation
Kreuk, Felix, Synnaeve, Gabriel, Polyak, Adam, Singer, Uriel, Défossez, Alexandre, Copet, Jade, Parikh, Devi, Taigman, Yaniv, Adi, Yossi
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen outperforms them on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
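AudioGen applies classifier-free guidance to improve adherence to text; the sketch below shows the standard guidance rule of combining conditional and unconditional logits with a guidance scale before sampling the next token. The guidance scale and the surrounding decoding loop are illustrative assumptions, not AudioGen's implementation.

```python
# Sketch of classifier-free guidance at decoding time: push logits toward the
# text-conditioned direction by a guidance scale.
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

if __name__ == "__main__":
    vocab = 1024
    cond, uncond = torch.randn(1, vocab), torch.randn(1, vocab)
    next_token = torch.argmax(cfg_logits(cond, uncond), dim=-1)
    print(int(next_token))
```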