AITopics | Chen, Jitong

Collaborating Authors

Chen, Jitong

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

StemGen: A music generation model that listens

Parker, Julian D., Spijkervet, Janne, Kosta, Katerina, Yesiler, Furkan, Kuznetsov, Boris, Wang, Ju-Chiang, Avent, Matt, Chen, Jitong, Le, Duc

arXiv.org Artificial IntelligenceJan-16-2024

End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2312.08723

Genre: Research Report (0.64)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models

Han, Bing, Dai, Junyu, Hao, Weituo, He, Xinyan, Guo, Dong, Chen, Jitong, Wang, Yuxuan, Qian, Yanmin, Song, Xuchen

arXiv.org Artificial IntelligenceDec-12-2023

Music editing primarily entails the modification of instrument tracks or remixing in the whole, which offers a novel reinterpretation of the original piece through a series of operations. These music processing methods hold immense potential across various applications but demand substantial expertise. Prior methodologies, although effective for image and audio modifications, falter when directly applied to music. This is attributed to music's distinctive data nature, where such methods can inadvertently compromise the intrinsic harmony and coherence of music. In this paper, we develop InstructME, an Instruction guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce chord progression matrix as condition information and incorporate it in the semantic space to improve melodic harmony while editing. For accommodating extended musical pieces, InstructME employs a chunk transformer, enabling it to discern long-term temporal dependencies within music sequences. We tested InstructME in instrument-editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that our proposed method significantly surpasses preceding systems in music quality, text relevance and harmony. Demo samples are available at https://musicedit.github.io/

artificial intelligence, machine learning, music, (17 more...)

arXiv.org Artificial Intelligence

2308.1436

Country: Europe > Germany (0.14)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.47)

Add feedback

Efficient Neural Music Generation

Lam, Max W. Y., Tian, Qiao, Li, Tang, Yin, Zongyu, Feng, Siyuan, Tu, Ming, Ji, Yuliang, Xia, Rui, Ma, Mingbo, Song, Xuchen, Chen, Jitong, Wang, Yuping, Wang, Yuxuan

arXiv.org Artificial IntelligenceMay-25-2023

Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality meanwhile reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.

arxiv preprint arxiv, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.15719

Genre: Research Report > New Finding (0.48)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Supervised Chorus Detection for Popular Music Using Convolutional Neural Network and Multi-task Learning

Wang, Ju-Chiang, Smith, Jordan B. L., Chen, Jitong, Song, Xuchen, Wang, Yuxuan

arXiv.org Artificial IntelligenceApr-21-2021

This paper presents a novel supervised approach to detecting the chorus segments in popular music. Traditional approaches to this task are mostly unsupervised, with pipelines designed to target some quality that is assumed to define "chorusness," which usually means seeking the loudest or most frequently repeated sections. We propose to use a convolutional neural network with a multi-task learning objective, which simultaneously fits two temporal activation curves: one indicating "chorusness" as a function of time, and the other the location of the boundaries. We also propose a post-processing method that jointly takes into account the chorus and boundary predictions to produce binary output. In experiments using three datasets, we compare our system to a set of public implementations of other segmentation and chorus-detection algorithms, and find our approach performs significantly better.

dataset, deep learning, neural network, (19 more...)

arXiv.org Artificial Intelligence

2103.14253

Genre: Research Report (0.82)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Neural Voice Cloning with a Few Samples

Arik, Sercan, Chen, Jitong, Peng, Kainan, Ping, Wei, Zhou, Yanqi

Neural Information Processing SystemsDec-31-2018

Voice cloning is a highly desired feature for personalized speech interfaces. We introduce a neural voice cloning system that learns to synthesize a person's voice from only a few audio samples. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model. Speaker encoding is based on training a separate model to directly infer a new speaker embedding, which will be applied to a multi-speaker generative model. In terms of naturalness of the speech and similarity to the original speaker, both approaches can achieve good performance, even with a few cloning audios. While speaker adaptation can achieve slightly better naturalness and similarity, cloning time and required memory for the speaker encoding approach are significantly less, making it more favorable for low-resource deployment.

deep learning, speaker adaptation, speech recognition, (18 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Neural Voice Cloning with a Few Samples

Arik, Sercan, Chen, Jitong, Peng, Kainan, Ping, Wei, Zhou, Yanqi

Neural Information Processing SystemsDec-31-2018

deep learning, speaker adaptation, speech recognition, (18 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Industry: Information Technology > Security & Privacy (0.94)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

Ping, Wei, Peng, Kainan, Chen, Jitong

arXiv.org Artificial IntelligenceJul-30-2018

In this work, we propose an alternative solution for parallel wave generation by WaveNet. In contrast to parallel WaveNet (Oord et al., 2018), we distill a Gaussian inverse autoregressive flow from the autoregressive WaveNet by minimizing a novel regularized KL divergence between their highly-peaked output distributions. Our method computes the KL divergence in closed-form, which simplifies the training algorithm and provides very efficient distillation. In addition, we propose the first text-to-wave neural architecture for speech synthesis, which is fully convolutional and enables fast end-to-end training from scratch. It significantly outperforms the previous pipeline that connects a text-to-spectrogram model to a separately trained WaveNet (Ping et al., 2018). We also successfully distill a parallel waveform synthesizer conditioned on the hidden representation in this end-to-end model.

artificial intelligence, neural network, wavenet, (16 more...)

arXiv.org Artificial Intelligence

1807.07281

Genre: Research Report (0.83)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback