Wang, Yuping
Equivariant Map and Agent Geometry for Autonomous Driving Motion Prediction
Wang, Yuping, Chen, Jier
In autonomous driving, deep learning enabled motion prediction is a popular topic. A critical gap in traditional motion prediction methodologies lies in ensuring equivariance under Euclidean geometric transformations and maintaining invariant interaction relationships. This research introduces a groundbreaking solution by employing EqMotion, a theoretically geometric equivariant and interaction invariant motion prediction model for particles and humans, plus integrating agent-equivariant high-definition (HD) map features for context aware motion prediction in autonomous driving. The use of EqMotion as backbone marks a significant departure from existing methods by rigorously ensuring motion equivariance and interaction invariance. Equivariance here implies that an output motion must be equally transformed under the same Euclidean transformation as an input motion, while interaction invariance preserves the manner in which agents interact despite transformations. These properties make the network robust to arbitrary Euclidean transformations and contribute to more accurate prediction. In addition, we introduce an equivariant method to process the HD map to enrich the spatial understanding of the network while preserving the overall network equivariance property. By applying these technologies, our model is able to achieve high prediction accuracy while maintain a lightweight design and efficient data utilization.
Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models
Elaraby, Mohamed, Lu, Mengyin, Dunn, Jacob, Zhang, Xueying, Wang, Yu, Liu, Shizhu, Tian, Pingchuan, Wang, Yuping, Wang, Yuxuan
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP). Although convenient for research and practical applications, open-source LLMs with fewer parameters often suffer from severe hallucinations compared to their larger counterparts. This paper focuses on measuring and reducing hallucinations in BLOOM 7B, a representative of such weaker open-source LLMs that are publicly available for research and commercial applications. We introduce HaloCheck, a lightweight BlackBox knowledge-free framework designed to quantify the severity of hallucinations in LLMs. Additionally, we explore techniques like knowledge injection and teacher-student approaches to alleviate hallucinations in low-parameter LLMs. Our experiments effectively demonstrate the reduction of hallucinations in challenging domains for these LLMs.
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining
Liu, Haohe, Tian, Qiao, Yuan, Yi, Liu, Xubo, Mei, Xinhao, Kong, Qiuqiang, Wang, Yuping, Wang, Wenwu, Wang, Yuxuan, Plumbley, Mark D.
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at https://audioldm.github.io/audioldm2.
PolyVoice: Language Models for Speech to Speech Translation
Dong, Qianqian, Huang, Zhiying, Tian, Qiao, Xu, Chen, Ko, Tom, Zhao, Yunlong, Feng, Siyuan, Li, Tang, Wang, Kexin, Cheng, Xuxin, Yue, Fengpeng, Bai, Ye, Chen, Xi, Lu, Lu, Ma, Zejun, Wang, Yuping, Wang, Mingxuan, Wang, Yuxuan
We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST) system. Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese $\rightarrow$ English and English $\rightarrow$ Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
Efficient Neural Music Generation
Lam, Max W. Y., Tian, Qiao, Li, Tang, Yin, Zongyu, Feng, Siyuan, Tu, Ming, Ji, Yuliang, Xia, Rui, Ma, Mingbo, Song, Xuchen, Chen, Jitong, Wang, Yuping, Wang, Yuxuan
Recent progress in music generation has been remarkably advanced by the state-of-the-art MusicLM, which comprises a hierarchy of three LMs, respectively, for semantic, coarse acoustic, and fine acoustic modelings. Yet, sampling with the MusicLM requires processing through these LMs one by one to obtain the fine-grained acoustic tokens, making it computationally expensive and prohibitive for a real-time generation. Efficient music generation with a quality on par with MusicLM remains a significant challenge. In this paper, we present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audios of state-of-the-art quality meanwhile reducing 95.7% or 99.6% forward passes in MusicLM, respectively, for sampling 10s or 30s music. MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform. DPD is proposed to simultaneously model the coarse and fine acoustics by incorporating the semantic information into segments of latents effectively via cross-attention at each denoising step. Our experimental results suggest the superiority of MeLoDy, not only in its practical advantages on sampling speed and infinitely continuable generation, but also in its state-of-the-art musicality, audio quality, and text correlation. Our samples are available at https://Efficient-MeLoDy.github.io/.