audio prompt
Live Music Models
Lyria Team, Caillon, Antoine, McWilliams, Brian, Tarakajian, Cassie, Simon, Ian, Manco, Ilaria, Engel, Jesse, Constant, Noah, Li, Yunpeng, Denk, Timo I., Lalama, Alberto, Agostinelli, Andrea, Huang, Cheng-Zhi Anna, Manilow, Ethan, Brower, George, Erdogan, Hakan, Lei, Heidi, Rolnick, Itai, Grishchenko, Ivan, Orsini, Manu, Kastelic, Matej, Zuluaga, Mauricio, Verzetti, Mauro, Dooley, Michael, Skopek, Ondrej, Ferrer, Rafael, Petridis, Savvas, Borsos, Zalán, Oord, Aäron van den, Eck, Douglas, Collins, Eli, Baldridge, Jason, Hume, Tom, Donahue, Chris, Han, Kehang, Roberts, Adam
We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
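As a rough illustration of the "live" generation loop described in the abstract above, here is a toy chunked-streaming sketch: audio is produced chunk by chunk while the user can swap the steering prompt between chunks. The encoder, model, chunk size, and sample rate are placeholder assumptions, not the Magenta RealTime or Lyria RealTime API.

```python
# Hypothetical sketch of a live music generation loop: audio is produced chunk
# by chunk and the steering prompt can change mid-stream. All components here
# are stand-ins for illustration only.
import numpy as np

CHUNK_SECONDS = 2.0
SAMPLE_RATE = 48_000

def encode_prompt(text: str) -> np.ndarray:
    """Stand-in for a text/audio style encoder returning a style embedding."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(128)

def generate_chunk(style: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Stand-in for the model: returns the next chunk of audio samples.

    The style embedding and rolling audio context are unused by this dummy."""
    n = int(CHUNK_SECONDS * SAMPLE_RATE)
    return 0.1 * np.sin(2 * np.pi * 220 * np.arange(n) / SAMPLE_RATE)

def live_stream(prompt_queue, num_chunks: int = 8):
    """Yield audio chunks, re-reading the user's prompt before each chunk."""
    style = encode_prompt("ambient piano")
    context = np.zeros(0)
    for _ in range(num_chunks):
        if prompt_queue:                      # user changed the prompt mid-stream
            style = encode_prompt(prompt_queue.pop(0))
        chunk = generate_chunk(style, context)
        context = np.concatenate([context, chunk])[-SAMPLE_RATE * 10:]
        yield chunk

if __name__ == "__main__":
    prompts = ["lo-fi hip hop"]               # queued prompt change
    total = sum(len(c) for c in live_stream(prompts))
    print(f"streamed {total / SAMPLE_RATE:.1f} s of audio")
```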
LeVo: High-Quality Song Generation with Multi-Preference Alignment
Lei, Shun, Xu, Yaoxun, Lin, Zhiwei, Zhang, Huaicheng, Tan, Wei, Chen, Hangting, Yu, Jianwei, Zhang, Yixuan, Yang, Chenyu, Zhu, Haina, Wang, Shuai, Wu, Zhiyong, Yu, Dong
Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language model based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
- Leisure & Entertainment (1.00)
- Media > Music (0.89)
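LeVo's multi-preference alignment builds on Direct Preference Optimization; below is a minimal sketch of the standard DPO objective over preferred and rejected generations. The tensor names and the beta value are illustrative, and the paper's semi-automatic preference data construction and post-training loop are not shown.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss over summed
# token log-probabilities of preferred ("chosen") vs. rejected songs. Shapes
# and beta are illustrative assumptions.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO: maximize the margin of the policy over a frozen reference model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

if __name__ == "__main__":
    b = 4  # batch of preference pairs
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(float(loss))
```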
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling
Tang, Jingjing, Wang, Xin, Zhang, Zhe, Yamagishi, Junichi, Wiggins, Geraint, Fazekas, George
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model's generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Frechet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
- Asia > Japan (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.50)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)
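As a hedged sketch of how a codec language model like MIDI-VALLE might lay out its conditioning, the snippet below concatenates reference MIDI tokens, reference audio codec tokens, and target MIDI tokens into one prompt that an autoregressive decoder would continue with target audio tokens. The special tokens, dummy decoding rule, and function names are assumptions for illustration, not the released model.

```python
# Hypothetical conditioning layout for performance-MIDI-to-audio generation:
# [BOS] ref_midi [SEP] ref_audio [SEP] target_midi [SEP] -> target audio tokens.
from typing import List

BOS, SEP = 0, 1  # hypothetical special tokens

def build_prompt(ref_midi: List[int], ref_audio: List[int],
                 target_midi: List[int]) -> List[int]:
    """Concatenate reference MIDI, reference audio, and target MIDI tokens."""
    return [BOS] + ref_midi + [SEP] + ref_audio + [SEP] + target_midi + [SEP]

def generate_audio_tokens(prompt: List[int], steps: int = 16) -> List[int]:
    """Stand-in for autoregressive decoding of codec tokens for the target audio."""
    out = []
    for t in range(steps):
        out.append(2 + (sum(prompt) + t) % 1024)   # dummy next-token rule
    return out

if __name__ == "__main__":
    prompt = build_prompt(ref_midi=[10, 11, 12], ref_audio=[40, 41, 42, 43],
                          target_midi=[10, 13, 15])
    print(generate_audio_tokens(prompt)[:8])
```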
TAIL: Text-Audio Incremental Learning
Sun, Yingfei, Gu, Xu, Ji, Wei, Zhao, Hanbin, Fei, Hao, Yin, Yifang, Zimmermann, Roger
Many studies combine text and audio to capture multi-modal information but they overlook the model's generalization ability on new datasets. Introducing new datasets may affect the feature space of the original dataset, leading to catastrophic forgetting. Meanwhile, large model parameters can significantly impact training performance. To address these limitations, we introduce a novel task, Text-Audio Incremental Learning (TAIL), for text-audio retrieval, and propose a new method, PTAT, Prompt Tuning for Audio-Text incremental learning. This method utilizes prompt tuning to optimize the model parameters while incorporating an audio-text similarity and feature distillation module to effectively mitigate catastrophic forgetting. We benchmark our method and previous incremental learning methods on AudioCaps, Clotho, BBC Sound Effects and Audioset datasets, and our method outperforms previous methods significantly, particularly demonstrating stronger resistance to forgetting on older datasets. Compared to the full-parameters Finetune (Sequential) method, our model only requires 2.42% of its parameters, achieving 4.46% higher performance.
- Asia > China (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Singapore (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
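The PTAT recipe above combines learnable prompts on a frozen backbone with an audio-text similarity loss and feature distillation against the previous model's features. The sketch below shows one plausible shape for those pieces; the module names, dimensions, and loss weights are illustrative assumptions rather than the released implementation.

```python
# Hedged sketch: prompt tuning on a frozen encoder plus a contrastive
# audio-text loss and a distillation term against the previous model's features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedEncoder(nn.Module):
    def __init__(self, dim=256, prompt_len=8):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        for p in self.backbone.parameters():       # backbone stays frozen
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

    def forward(self, x):                           # x: (B, T, dim)
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.backbone(torch.cat([prompts, x], dim=1))
        return h.mean(dim=1)                        # pooled feature

def incremental_loss(audio_feat, text_feat, old_audio_feat, alpha=1.0):
    """Contrastive audio-text similarity + distillation to the previous model."""
    a = F.normalize(audio_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    logits = a @ t.t() / 0.07
    labels = torch.arange(a.size(0))
    contrastive = F.cross_entropy(logits, labels)
    distill = F.mse_loss(audio_feat, old_audio_feat.detach())
    return contrastive + alpha * distill

if __name__ == "__main__":
    enc = PromptedEncoder()
    feat = enc(torch.randn(4, 20, 256))
    loss = incremental_loss(feat, torch.randn(4, 256), torch.randn(4, 256))
    print(float(loss))
```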
Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM
Yu, Jiawei, Li, Yuang, Qiao, Xiaosong, Zhao, Huan, Zhao, Xiaofeng, Tang, Wei, Zhang, Min, Yang, Hao, Su, Jinsong
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems using text-only corpora, thereby reducing the cost of labeling real speech data. Existing research primarily utilizes additional text data and predefined speech styles supported by TTS models. In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data. Rather than using predefined speech styles, we introduce a hard prompt selection method with zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize. Experiments demonstrate that Hard-Synth significantly enhances the Conformer model, achieving relative word error rate (WER) reductions of 6.5%/4.4% on LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is data-efficient and capable of reducing bias in ASR.
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > Queensland > Brisbane (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (5 more...)
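The hard prompt selection step above can be pictured as scoring candidate speech prompts by how badly the ASR model transcribes them and keeping the hardest ones as zero-shot TTS voice prompts. Below is a small, self-contained sketch; the `asr_transcribe` callable and the WER helper are placeholders, not the paper's pipeline.

```python
# Hypothetical hard-prompt selection: keep the candidate prompts with the
# highest ASR word error rate.
from typing import Callable, List, Tuple

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance over word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

def select_hard_prompts(candidates: List[Tuple[str, str]],
                        asr_transcribe: Callable[[str], str],
                        top_k: int = 2) -> List[str]:
    """Return audio paths whose ASR transcripts have the highest WER."""
    scored = [(wer(text, asr_transcribe(path)), path) for path, text in candidates]
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]

if __name__ == "__main__":
    fake_asr = lambda p: "the quick brown fox" if "easy" in p else "a quack brawn"
    pool = [("easy.wav", "the quick brown fox"), ("hard.wav", "the quick brown fox")]
    print(select_hard_prompts(pool, fake_asr, top_k=1))   # ['hard.wav']
```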
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
Zhang, Yu, Jiang, Ziyue, Li, Ruiqi, Pan, Changhao, He, Jinzheng, Huang, Rongjie, Wang, Chuxin, Zhao, Zhou
Zero-shot singing voice synthesis (SVS) with style transfer and style control aims to generate high-quality singing voices with unseen timbres and styles (including singing method, emotion, rhythm, technique, and pronunciation) from audio and text prompts. However, the multifaceted nature of singing styles poses a significant challenge for effective modeling, transfer, and control. Furthermore, current SVS models often fail to generate singing voices rich in stylistic nuances for unseen singers. To address these challenges, we introduce TCSinger, the first zero-shot SVS model for style transfer across cross-lingual speech and singing styles, along with multi-level style control. Specifically, TCSinger proposes three primary modules: 1) the clustering style encoder employs a clustering vector quantization model to stably condense style information into a compact latent space; 2) the Style and Duration Language Model (S&D-LM) concurrently predicts style information and phoneme duration, which benefits both; 3) the style adaptive decoder uses a novel mel-style adaptive normalization method to generate singing voices with enhanced details. Experimental results show that TCSinger outperforms all baseline models in synthesis quality, singer similarity, and style controllability across various tasks, including zero-shot style transfer, multi-level style control, cross-lingual style transfer, and speech-to-singing style transfer. Singing voice samples can be accessed at https://tcsinger.github.io/.
- Media > Music (0.93)
- Leisure & Entertainment (0.93)
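The clustering style encoder relies on vector quantization of style features. The sketch below shows a generic VQ layer with a straight-through estimator and commitment loss, one common way to realize such a module; the codebook size, dimensions, and loss form are assumptions, not TCSinger's code.

```python
# Hedged sketch of a style vector-quantization layer: snap each style feature
# to its nearest codebook entry with a straight-through gradient.
import torch
import torch.nn as nn

class StyleVQ(nn.Module):
    def __init__(self, num_codes=64, dim=192):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, style):                              # style: (B, dim)
        dists = torch.cdist(style, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)
        quantized = self.codebook(idx)
        # straight-through: quantized values forward, gradients copied to style
        quantized = style + (quantized - style).detach()
        commit_loss = ((quantized.detach() - style) ** 2).mean()
        return quantized, idx, commit_loss

if __name__ == "__main__":
    vq = StyleVQ()
    z, idx, loss = vq(torch.randn(8, 192))
    print(z.shape, idx.tolist(), float(loss))
```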
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech
Wu, Haibin, Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Tompkins, Daniel, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. EmoCtrl-TTS leverages arousal and valence values, as well as laughter embeddings, to condition the flow-matching-based zero-shot TTS. To achieve high-quality emotional speech generation, EmoCtrl-TTS is trained using more than 27,000 hours of expressive data curated based on pseudo-labeling. Comprehensive evaluations demonstrate that EmoCtrl-TTS excels in mimicking the emotions of audio prompts in speech-to-speech translation scenarios. We also show that EmoCtrl-TTS can capture emotion changes, express strong emotions, and generate various NVs in zero-shot TTS. See https://aka.ms/emoctrl-tts for demo samples.
- North America > United States (0.14)
- Asia > Taiwan (0.04)
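EmoCtrl-TTS conditions the flow-matching model on time-varying arousal/valence values and laughter embeddings. A minimal sketch of one possible per-frame conditioning interface follows; the projection layer, dimensions, and tensor shapes are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical time-varying emotion conditioner: per-frame arousal, valence,
# and laughter embeddings are concatenated and projected to the model width.
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    def __init__(self, laugh_dim=32, model_dim=256):
        super().__init__()
        self.proj = nn.Linear(2 + laugh_dim, model_dim)

    def forward(self, arousal, valence, laugh_emb):
        # arousal, valence: (B, T, 1); laugh_emb: (B, T, laugh_dim)
        cond = torch.cat([arousal, valence, laugh_emb], dim=-1)
        return self.proj(cond)                  # (B, T, model_dim), added per frame

if __name__ == "__main__":
    B, T = 2, 100
    cond = EmotionConditioner()(torch.rand(B, T, 1), torch.rand(B, T, 1),
                                torch.randn(B, T, 32))
    print(cond.shape)   # torch.Size([2, 100, 256])
```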
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Xia, Yufei, Li, Jinzhu, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki
Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audio generated from noisy audio prompts within the context of flow-matching-based zero-shot TTS. Our investigation includes comprehensive training strategies: unsupervised pre-training with masked speech denoising, multi-speaker detection and DNSMOS-based data filtering on the pre-training data, and fine-tuning with random noise mixing. The results of our experiments demonstrate significant improvements in intelligibility, speaker similarity, and overall audio quality compared to the approach of applying speech enhancement to the audio prompt.
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
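Of the training strategies listed, random noise mixing is the simplest to picture: noise is added to the audio prompt at a randomly sampled signal-to-noise ratio during fine-tuning so the model learns to tolerate noisy prompts. The sketch below assumes a uniform SNR range in dB; the range and helper names are illustrative.

```python
# Sketch of random noise mixing for prompt-robust fine-tuning: scale noise so
# the mixture hits a randomly drawn SNR, then add it to the audio prompt.
import numpy as np

def mix_at_random_snr(prompt: np.ndarray, noise: np.ndarray,
                      snr_db_range=(0.0, 20.0), rng=None) -> np.ndarray:
    """Return prompt + scaled noise at an SNR drawn uniformly from snr_db_range."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = np.resize(noise, prompt.shape)      # crop/tile noise to prompt length
    p_pow = np.mean(prompt ** 2) + 1e-12
    n_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_pow / (n_pow * 10 ** (snr_db / 10)))
    return prompt + scale * noise

if __name__ == "__main__":
    sr = 16_000
    prompt = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    noisy = mix_at_random_snr(prompt, np.random.randn(2 * sr))
    print(noisy.shape)
```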
Meta's open-source ImageBind AI aims to mimic human perception
Meta is open-sourcing an AI tool called ImageBind that predicts connections between data similar to how humans perceive or imagine an environment. While image generators like Midjourney, Stable Diffusion and DALL-E 2 pair words with images, allowing you to generate visual scenes based only on a text description, ImageBind casts a broader net. It can link text, images / videos, audio, 3D measurements (depth), temperature data (thermal), and motion data (from inertial measurement units) -- and it does this without having to first train on every possibility. It's an early stage of a framework that could eventually generate complex environments from an input as simple as a text prompt, image or audio recording (or some combination of the three). You could view ImageBind as moving machine learning closer to human learning.
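As a toy picture of the shared-embedding idea described above: once each modality is encoded into one space, any modality can be matched against any other by cosine similarity. The encoders below are random stand-ins, not ImageBind's; only the retrieval step is sketched.

```python
# Toy cross-modal retrieval over a shared embedding space. fake_encoder stands
# in for modality-specific encoders (text, image, audio, ...).
import numpy as np

DIM = 64

def fake_encoder(seed: int) -> np.ndarray:
    """Stand-in for a modality-specific encoder producing a unit-norm embedding."""
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def best_match(query: np.ndarray, candidates: dict) -> str:
    """Return the candidate key with the highest cosine similarity to the query."""
    return max(candidates, key=lambda k: float(query @ candidates[k]))

if __name__ == "__main__":
    text_emb = fake_encoder(1)                 # e.g. "a dog barking"
    audio_bank = {"bark.wav": fake_encoder(1), "rain.wav": fake_encoder(2)}
    print(best_match(text_emb, audio_bank))    # bark.wav (identical stand-in embedding)
```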
AudioGen: Textually Guided Audio Generation
Kreuk, Felix, Synnaeve, Gabriel, Polyak, Adam, Singer, Uriel, Défossez, Alexandre, Copet, Jade, Parikh, Devi, Taigman, Yaniv, Adi, Yossi
We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at a high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Compared to the evaluated baselines, AudioGen outperforms them on both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuations, both conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
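AudioGen applies classifier-free guidance to improve adherence to text; the sketch below shows the standard guidance rule of combining conditional and unconditional logits with a guidance scale before sampling the next token. The guidance scale and the surrounding decoding loop are illustrative assumptions, not AudioGen's implementation.

```python
# Sketch of classifier-free guidance at decoding time: push logits toward the
# text-conditioned direction by a guidance scale.
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               guidance_scale: float = 3.0) -> torch.Tensor:
    """Classifier-free guidance: uncond + w * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

if __name__ == "__main__":
    vocab = 1024
    cond, uncond = torch.randn(1, vocab), torch.randn(1, vocab)
    next_token = torch.argmax(cfg_logits(cond, uncond), dim=-1)
    print(int(next_token))
```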