AITopics | Ye, Zhenhui

Collaborating Authors

Ye, Zhenhui

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Jiang, Ziyue, Ren, Yi, Li, Ruiqi, Ji, Shengpeng, Ye, Zhenhui, Zhang, Chen, Jionghao, Bai, Yang, Xiaoda, Zuo, Jialong, Zhang, Yu, Liu, Rui, Yin, Xiang, Zhao, Zhou

arXiv.org Artificial IntelligenceFeb-26-2025

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces \textit{S-DiT}, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that S-DiT achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.

large language model, machine learning, megatts 3, (18 more...)

arXiv.org Artificial Intelligence

2502.18924

Country: Asia (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.88)

Add feedback

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Wang, Zehan, Zhang, Ziang, Cheng, Xize, Huang, Rongjie, Liu, Luping, Ye, Zhenhui, Huang, Haifeng, Zhao, Yang, Jin, Tao, Gao, Peng, Zhao, Zhou

arXiv.org Artificial IntelligenceMay-10-2024

Unified multi-model representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and catastrophic forgetting problems make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units, and freely augments pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVL_IB, and InternVL_IB++. These resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the advanced audio-text and image-text expert spaces.

expert space, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.04883

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.72)

Add feedback

Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech

Jiang, Ziyue, Su, Zhe, Zhao, Zhou, Yang, Qian, Ren, Yi, Liu, Jinglin, Ye, Zhenhui

arXiv.org Artificial IntelligenceOct-19-2023

Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech (TTS) systems. However, previous approaches require substantial annotated training data and additional efforts from language experts, making it difficult to extend high-quality neural TTS systems to out-of-domain daily conversations and countless languages worldwide. This paper tackles the polyphone disambiguation problem from a concise and novel perspective: we propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary (the existing prior information in the natural language). Specifically, we design a semantics-to-pronunciation attention (S2PA) module to match the semantic patterns between the input text sequence and the prior semantics in the dictionary and obtain the corresponding pronunciations; The S2PA module can be easily trained with the end-to-end TTS model without any annotated phoneme labels. Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy and improves the prosody modeling of TTS systems. Further extensive analyses demonstrate that each design in Dict-TTS is effective.

machine learning, natural language, pronunciation, (20 more...)

arXiv.org Artificial Intelligence

2206.02147

Country:

Europe (0.93)
North America > United States > Hawaii (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.92)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias

Jiang, Ziyue, Ren, Yi, Ye, Zhenhui, Liu, Jinglin, Zhang, Chen, Yang, Qian, Ji, Shengpeng, Huang, Rongjie, Wang, Chunfeng, Yin, Xiang, Ma, Zejun, Zhao, Zhou

arXiv.org Artificial IntelligenceJun-6-2023

Scaling text-to-speech to a large and wild dataset has been proven to be highly effective in achieving timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latent using audio codec and use autoregressive language models or diffusion models to generate it, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and each of them should be modeled using a module with appropriate inductive biases. From this perspective, we carefully design a novel and large zero-shot TTS system called Mega-TTS, which is trained with large-scale wild data and models different attributes in different ways: 1) Instead of using latent encoded by audio codec as the intermediate feature, we still choose spectrogram as it separates the phase and other attributes very well. Phase can be appropriately constructed by the GAN-based vocoder and does not need to be modeled by the language model. 2) We model the timbre using global vectors since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time in a sentence, and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2306.03509

Country: Europe (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Make-A-Voice: Unified Voice Synthesis With Discrete Representation

Huang, Rongjie, Zhang, Chunlei, Wang, Yongqi, Yang, Dongchao, Liu, Luping, Ye, Zhenhui, Jiang, Ziyue, Weng, Chao, Zhao, Zhou, Yu, Dong

arXiv.org Artificial IntelligenceMay-30-2023

Various applications of voice synthesis have been developed independently despite the fact that they generate "voice" as output in common. In addition, the majority of voice synthesis models currently rely on annotated audio data, but it is crucial to scale them to self-supervised datasets in order to effectively capture the wide range of acoustic variations present in human voice, including speaker identity, emotion, and prosody. In this work, we propose Make-A-Voice, a unified framework for synthesizing and manipulating voice signals from discrete representations. Make-A-Voice leverages a "coarse-to-fine" approach to model the human voice, which involves three stages: 1) semantic stage: model high-level transformation between linguistic content and self-supervised semantic tokens, 2) acoustic stage: introduce varying control signals as acoustic conditions for semantic-to-acoustic modeling, and 3) generation stage: synthesize high-fidelity waveforms from acoustic tokens. Make-A-Voice offers notable benefits as a unified voice synthesis framework: 1) Data scalability: the major backbone (i.e., acoustic and generation stage) does not require any annotations, and thus the training data could be scaled up. 2) Controllability and conditioning flexibility: we investigate different conditioning mechanisms and effectively handle three voice synthesis applications, including text-to-speech (TTS), voice conversion (VC), and singing voice synthesis (SVS) by re-synthesizing the discrete voice representations with prompt guidance. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models. Audio samples are available at https://Make-A-Voice.github.io

machine learning, make-a-voice, natural language, (16 more...)

arXiv.org Artificial Intelligence

2305.19269

Genre: Research Report > New Finding (0.88)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation

Huang, Jiawei, Ren, Yi, Huang, Rongjie, Yang, Dongchao, Ye, Zhenhui, Zhang, Chen, Liu, Jinglin, Yin, Xiang, Ma, Zejun, Zhao, Zhou

arXiv.org Artificial IntelligenceMay-29-2023

Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks, but they often suffer from common issues such as semantic misalignment and poor temporal consistency due to limited natural language understanding and data scarcity. Additionally, 2D spatial structures widely used in T2A works lead to unsatisfactory audio quality when generating variable-length audio samples since they do not adequately prioritize temporal information. To address these challenges, we propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio. Our approach includes several techniques to improve semantic alignment and temporal consistency: Firstly, we use pre-trained large language models (LLMs) to parse the text into structured pairs for better temporal information capture. We also introduce another structured-text encoder to aid in learning semantic alignment during the diffusion denoising process. To improve the performance of variable length generation and enhance the temporal information extraction, we design a feed-forward Transformer-based diffusion denoiser. Finally, we use LLMs to augment and transform a large amount of audio-label data into audio-text datasets to alleviate the problem of scarcity of temporal data. Extensive experiments show that our method outperforms baseline models in both objective and subjective metrics, and achieves significant gains in temporal information understanding, semantic consistency, and sound quality.

large language model, machine learning, make-an-audio 2, (17 more...)

arXiv.org Artificial Intelligence

2305.18474

Genre: Research Report (0.64)

Industry:

Information Technology (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

Huang, Rongjie, Liu, Huadai, Cheng, Xize, Ren, Yi, Li, Linjun, Ye, Zhenhui, He, Jinzheng, Zhang, Lichao, Liu, Jinglin, Yin, Xiang, Zhao, Zhou

arXiv.org Artificial IntelligenceMay-24-2023

Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date. Despite the recent success, current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech (AV-S2ST) translation model without relying on intermediate text. AV-TranSpeech complements the audio stream with visual information to promote system robustness and opens up a host of practical applications: dictation or dubbing archival films. To mitigate the data scarcity with limited parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representation, and 2) introduce cross-modal distillation with S2ST models trained on the audio-only corpus to further reduce the requirements of visual data. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io

machine learning, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2305.15403

Genre: Research Report (0.82)

Industry:

Media (0.34)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

He, Jinzheng, Liu, Jinglin, Ye, Zhenhui, Huang, Rongjie, Cui, Chenye, Liu, Huadai, Zhao, Zhou

arXiv.org Artificial IntelligenceMay-17-2023

We are interested in a challenging task, Realistic-Music-Score based Singing Voice Synthesis (RMS-SVS). RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types (grace, slur, rest, etc.). Though significant progress has been achieved, recent singing voice synthesis (SVS) methods are limited to fine-grained music scores, which require a complicated data collection pipeline with time-consuming manual annotation to align music notes with phonemes. Furthermore, these manual annotation destroys the regularity of note durations in music scores, making fine-grained music scores inconvenient for composing. To tackle these challenges, we propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input, eliminating most of the tedious manual annotation and avoiding the aforementioned inconvenience. Note that music scores are based on words rather than phonemes, in RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment. Furthermore, we propose the first diffusion-based pitch modeling method, which ameliorates the naturalness of existing pitch-modeling methods. To achieve these, we collect a new dataset containing realistic music scores and singing voices according to these realistic music scores from professional singers. Extensive experiments on the dataset demonstrate the effectiveness of our methods. Audio samples are available at https://rmssinger.github.io/.

artificial intelligence, machine learning, music score, (15 more...)

arXiv.org Artificial Intelligence

2305.10686

Genre: Research Report (0.50)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (0.94)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Add feedback

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Huang, Rongjie, Li, Mingze, Yang, Dongchao, Shi, Jiatong, Chang, Xuankai, Ye, Zhenhui, Wu, Yuning, Hong, Zhiqing, Huang, Jiawei, Liu, Jinglin, Ren, Yi, Zhao, Zhou, Watanabe, Shinji

arXiv.org Artificial IntelligenceApr-25-2023

Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. Despite the recent success, current LLMs are not capable of processing complex audio information or conducting spoken conversations (like Siri or Alexa). In this work, we propose a multi-modal AI system named AudioGPT, which complements LLMs (i.e., ChatGPT) with 1) foundation models to process complex audio information and solve numerous understanding and generation tasks; and 2) the input/output interface (ASR, TTS) to support spoken dialogue. With an increasing demand to evaluate multi-modal LLMs of human intention understanding and cooperation with foundation models, we outline the principles and processes and test AudioGPT in terms of consistency, capability, and robustness. Experimental results demonstrate the capabilities of AudioGPT in solving AI tasks with speech, music, sound, and talking head understanding and generation in multi-round dialogues, which empower humans to create rich and diverse audio content with unprecedented ease.

artificial intelligence, audiogpt, natural language, (16 more...)

arXiv.org Artificial Intelligence

2304.12995

Country: Europe (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Huang, Rongjie, Huang, Jiawei, Yang, Dongchao, Ren, Yi, Liu, Luping, Li, Mingze, Ye, Zhenhui, Liu, Jinglin, Yin, Xiang, Zhao, Zhou

arXiv.org Artificial IntelligenceJan-29-2023

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, it alleviates data scarcity with orders of magnitude concept compositions by using language-free audios; 2) leveraging spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io

artificial intelligence, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2301.12661

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback