Music generation
The challenge of realistic music generation: modelling raw audio at scale
Dieleman, Sander, van den Oord, Aäron, Simonyan, Karen
Realistic music generation is a challenging task. When building generative models of music that are learnt from data, typically high-level representations such as scores or MIDI are used that abstract away the idiosyncrasies of a particular performance. But these nuances are very important for our perception of musicality and realism, so in this work we embark on modelling music in the raw audio domain. It has been shown that autoregressive models excel at generating raw audio waveforms of speech, but when applied to music, we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find that they allow us to unconditionally generate piano music directly in the raw audio domain, which shows stylistic consistency across tens of seconds.
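To make the ADA idea concrete: a causal encoder compresses the waveform, a vector-quantization bottleneck turns it into a discrete code sequence, and a decoder reconstructs audio from the codes; a second autoregressive model over the much shorter code sequence can then capture long-range structure cheaply. Below is a minimal PyTorch sketch of such a quantized bottleneck with illustrative sizes; it is not the authors' exact architecture (in particular, their decoder is an autoregressive WaveNet-style model, stubbed here as a transposed convolution).

```python
# Minimal sketch of an autoregressive discrete autoencoder (ADA) bottleneck.
# Sizes and module choices are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through gradient."""
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                            # z: (batch, time, dim)
        # Squared distance from every latent vector to every codebook entry.
        d = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                       # discrete code indices
        q = self.codebook(idx)                       # quantized latents
        q = z + (q - z).detach()                     # straight-through estimator
        return q, idx

class ADASketch(nn.Module):
    def __init__(self, hop=64, dim=64):
        super().__init__()
        # Strided conv compresses the waveform ~hop-fold in time.
        self.encoder = nn.Conv1d(1, dim, kernel_size=hop, stride=hop)
        self.vq = VQBottleneck(dim=dim)
        # Stand-in for the autoregressive WaveNet-style decoder.
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=hop, stride=hop)

    def forward(self, x):                            # x: (batch, 1, samples)
        z = self.encoder(x).transpose(1, 2)          # (batch, time, dim)
        q, idx = self.vq(z)
        recon = self.decoder(q.transpose(1, 2))      # back to sample rate
        return recon, idx    # idx is what a second-level autoregressive
                             # prior models to capture long-range structure
```

The straight-through estimator lets gradients flow past the non-differentiable nearest-neighbour lookup, which is what makes a discrete bottleneck like this trainable end to end.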
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Hong, Jiaying, Zhu, Ting, Markchom, Thanet, Liang, Huizhi
With the rise of AI-generated content (AIGC), generating perceptually natural and feeling-aligned music from multimodal inputs has become a central challenge. Existing approaches often rely on explicit emotion labels that require costly annotation, underscoring the need for more flexible feeling-aligned methods. To support multimodal music generation, we construct ArtiCaps, a pseudo feeling-aligned image-music-text dataset created by semantically matching descriptions from ArtEmis and MusicCaps. We further propose Art2Music, a lightweight cross-modal framework that synthesizes music from artistic images and user comments. In the first stage, images and text are encoded with OpenCLIP and fused using a gated residual module; the fused representation is decoded by a bidirectional LSTM into Mel-spectrograms with a frequency-weighted L1 loss to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms. Experiments on ArtiCaps show clear improvements in Mel-Cepstral Distortion, Fréchet Audio Distance, Log-Spectral Distance, and cosine similarity. A small LLM-based rating study further verifies consistent cross-modal feeling alignment and offers interpretable explanations of matches and mismatches across modalities. These results demonstrate improved perceptual naturalness, spectral fidelity, and semantic consistency. Art2Music also maintains robust performance with only 50k training samples, providing a scalable solution for feeling-aligned creative audio generation in interactive art, personalized soundscapes, and digital art exhibitions.
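The frequency-weighted L1 loss is the most self-contained piece of the pipeline: a plausible reading is an L1 distance over Mel bins with weights that grow toward the high-frequency bins. A small sketch, assuming PyTorch tensors and a linear weight ramp (the ramp shape and maximum weight are my assumptions; the abstract does not specify the exact weighting):

```python
# Hedged sketch of a frequency-weighted L1 loss on Mel-spectrograms:
# higher Mel bins get larger weights, so high-frequency errors cost more.
import torch

def freq_weighted_l1(pred, target, max_weight=2.0):
    """pred, target: (batch, n_mels, frames) Mel-spectrograms."""
    n_mels = pred.size(1)
    # Weights ramp linearly from 1.0 (lowest bin) to max_weight (highest).
    w = torch.linspace(1.0, max_weight, n_mels, device=pred.device)
    return (w.view(1, -1, 1) * (pred - target).abs()).mean()
```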
Learning and composing of classical music using restricted Boltzmann machines
Kobayashi, Mutsumi, Watanabe, Hiroshi
We investigate how machine learning models acquire the ability to compose music and how musical information is internally represented within such models. We develop a composition algorithm based on a restricted Boltzmann machine (RBM), a simple generative model capable of producing musical pieces of arbitrary length. We convert musical scores into piano-roll image representations and train the RBM in an unsupervised manner. We confirm that the trained RBM can generate new musical pieces; however, by analyzing the model's responses and internal structure, we find that the learned information is not stored in a form directly interpretable by humans. This study contributes to a better understanding of how machine learning models capable of music composition may internally represent musical structure and highlights issues related to the interpretability of generative models in creative tasks.
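The underlying machinery is compact enough to sketch: a binary RBM over flattened piano-roll windows, trained with one-step contrastive divergence (CD-1) and sampled with block Gibbs updates to produce new material. A minimal NumPy version with illustrative hyperparameters follows; the authors' exact configuration is not stated in this abstract.

```python
# Minimal binary RBM over flattened piano-roll windows: CD-1 training,
# block-Gibbs generation. Dimensions and learning rate are assumptions.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible (piano-roll cell) biases
        self.c = np.zeros(n_hidden)    # hidden unit biases
        self.lr = lr

    def sample_h(self, v):
        p = sigmoid(v @ self.W + self.c)
        return p, (rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        p = sigmoid(h @ self.W.T + self.b)
        return p, (rng.random(p.shape) < p).astype(float)

    def cd1(self, v0):                 # one contrastive-divergence update
        ph0, h0 = self.sample_h(v0)
        pv1, v1 = self.sample_v(h0)
        ph1, _ = self.sample_h(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - v1.T @ ph1) / n
        self.b += self.lr * (v0 - v1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)

    def generate(self, steps=200):     # block Gibbs from sparse noise
        v = (rng.random(self.b.shape) < 0.1).astype(float)[None, :]
        for _ in range(steps):
            _, h = self.sample_h(v)
            pv, v = self.sample_v(h)
        return pv.reshape(-1)          # note probabilities ~ piano-roll window
```

Chaining Gibbs windows end to end is one way such a model yields pieces of arbitrary length, which matches the capability the abstract claims.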
InstructAudio: Unified speech and music generation with natural language instruction
Qiang, Chunyu, Yin, Kang, Wang, Xiaopeng, Liang, Yuzhe, Zhao, Jiahui, Fu, Ruibo, Wang, Tianrui, Gong, Cheng, Zhang, Chen, Wang, Longbiao, Dang, Jianwu
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based control, via natural-language descriptions, of acoustic attributes including timbre (gender, age), paralinguistics (emotion, style, accent), and musical qualities (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves the best results on most metrics. To the best of our knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
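As a rough illustration of what a "standardized instruction-phoneme input format" could look like, the sketch below packs one natural-language instruction with an optional phoneme sequence, so that speech, dialogue, and music requests share one schema. All field names are hypothetical; the abstract does not disclose the actual format.

```python
# Hypothetical input schema for unified instruction-controlled speech/music
# generation. Field names are assumptions, not InstructAudio's real API.
from dataclasses import dataclass, field

@dataclass
class InstructAudioInput:
    instruction: str                      # natural-language attribute control
    phonemes: list[str] = field(default_factory=list)  # empty for pure music
    language: str = "en"                  # "en" or "zh"

speech = InstructAudioInput(
    instruction="A young female voice, cheerful, slight British accent.",
    phonemes=["HH", "AH0", "L", "OW1"],   # "hello"
)
music = InstructAudioInput(
    instruction="Lo-fi jazz with soft piano, slow rhythm, rainy atmosphere.",
)                                         # music needs no phoneme sequence
```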
Explicit Tonal Tension Conditioning via Dual-Level Beam Search for Symbolic Music Generation
Ebrahimzadeh, Maral, Bernardes, Gilberto, Stober, Sebastian
State-of-the-art symbolic music generation models have recently achieved remarkable output quality, yet explicit control over compositional features, such as tonal tension, remains challenging. We propose a novel approach that integrates a computational tonal tension model, based on tonal interval vector analysis, into a Transformer framework. Our method employs a two-level beam search strategy during inference. At the token level, generated candidates are re-ranked using model probability and diversity metrics to maintain overall quality. At the bar level, a tension-based re-ranking is applied to ensure that the generated music aligns with a desired tension curve. Objective evaluations indicate that our approach effectively modulates tonal tension, and subjective listening tests confirm that the system produces outputs that align with the target tension. These results demonstrate that explicit tension conditioning through a dual-level beam search provides a powerful and intuitive tool to guide AI-generated music. Furthermore, our experiments demonstrate that our method can generate multiple distinct musical interpretations under the same tension condition.
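The control loop itself is easy to sketch. In the version below, `model.top_k`, `completed_bar`, and `tension` are hypothetical stand-ins for the next-token model, a bar-boundary test, and the tonal-interval-vector tension measure; the paper's token-level diversity term is omitted for brevity.

```python
# Sketch of a dual-level beam search: token-level ranking by model score,
# bar-level re-ranking by closeness to a target tension curve.

def dual_level_beam_search(model, tension, target_curve,
                           completed_bar, beam_width=8, max_len=512):
    beams = [([], 0.0)]                          # (token sequence, log-prob)
    for _ in range(max_len):
        # Token level: expand each beam, keep the top candidates by score.
        candidates = [(seq + [tok], score + logp)
                      for seq, score in beams
                      for tok, logp in model.top_k(seq, k=beam_width)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]

        # Bar level: whenever a bar closes, re-rank the surviving beams by
        # how closely their realized tension tracks the target curve.
        bar = completed_bar(beams[0][0])         # bar index, or None mid-bar
        if bar is not None and bar < len(target_curve):
            beams.sort(key=lambda c: abs(tension(c[0]) - target_curve[bar]))
    return max(beams, key=lambda c: c[1])[0]     # best sequence by model score
```

The design point worth noting is the division of labor: the cheap probability ranking runs at every token, while the costlier tension computation runs only once per bar.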
MusicAIR: A Multimodal AI Music Generation Framework Powered by an Algorithm-Driven Core
Liao, Callie C., Liao, Duoduo, Zhang, Ellie L.
Recent advances in generative AI have made music generation a prominent research focus. However, many neural models rely on large datasets, raising concerns about copyright infringement and high computational costs. In contrast, we propose MusicAIR, an innovative multimodal AI music generation framework powered by a novel algorithm-driven symbolic music core, effectively mitigating copyright infringement risks. The music core algorithms connect critical lyrical and rhythmic information to automatically derive musical features, creating a complete, coherent melodic score solely from the lyrics. The MusicAIR framework facilitates music generation from lyrics, text, and images. The generated score adheres to established principles of music theory, lyrical structure, and rhythmic conventions. We developed Generate AI Music (GenAIM), a web tool using MusicAIR for lyric-to-song, text-to-music, and image-to-music generation. In our experiments, we evaluated AI-generated music scores produced by the system using both standard music metrics and a novel analysis that compares these compositions with original works. The system achieves an average key confidence of 85%, outperforming human composers at 79%, and aligns closely with established music theory standards, demonstrating its ability to generate diverse, human-like compositions. As a co-pilot tool, GenAIM can serve as a reliable music composition assistant and a possible educational composition tutor while lowering the entry barrier for aspiring musicians, an innovative and significant contribution to AI music generation.
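To illustrate what an algorithm-driven, dataset-free core can look like at its very simplest, the sketch below derives a rhythm from syllable counts and walks a fixed scale for pitches. This is emphatically not MusicAIR's algorithm, only a toy of the rule-based lyric-to-melody idea the abstract describes; the crude vowel-count syllable heuristic and the stepwise C-major walk are both assumptions.

```python
# Toy rule-based lyric-to-melody mapping: syllables set the rhythm,
# a fixed scale constrains the pitches. Illustration only.
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]      # MIDI pitches, one octave

def naive_lyric_to_melody(line):
    melody = []
    degree = 0
    for word in line.split():
        # Crude syllable estimate: count vowel letters, at least one.
        syllables = max(1, sum(ch in "aeiouy" for ch in word.lower()))
        # One note per syllable; multi-syllable words get shorter values.
        duration = 1.0 if syllables == 1 else 0.5
        for _ in range(syllables):
            melody.append((C_MAJOR[degree % len(C_MAJOR)], duration))
            degree += 1                          # stepwise motion stays tonal
    return melody                                # list of (pitch, beats)

print(naive_lyric_to_melody("twinkle twinkle little star"))
```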