Wang, Yu-Chiang Frank
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Liu, Alexander H., Lee, Sang-gil, Yang, Chao-Han Huck, Gong, Yuan, Wang, Yu-Chiang Frank, Glass, James R., Valle, Rafael, Catanzaro, Bryan
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are designed either for discriminative or for generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and a generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training for representation learning and generation. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves performance comparable to that of existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
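The abstract does not detail the architecture, but the core idea of jointly optimizing a representation encoder and a generative decoder under one objective can be illustrated with a minimal sketch; the modules, targets, and loss weighting below are placeholders and not UniWav's actual design.

```python
# Illustrative-only sketch: one combined objective trains a representation
# encoder (via a prediction head) and a generative decoder (via reconstruction).
# Every module, target, and weight here is a placeholder, not UniWav itself.
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=80, hidden_size=256, batch_first=True)  # stand-in encoder
decoder = nn.Linear(256, 80)                                        # stand-in generative decoder
proj = nn.Linear(256, 128)                                          # head for the representation loss

features = torch.randn(4, 100, 80)   # e.g. mel-spectrogram frames (toy data)
targets = torch.randn(4, 100, 128)   # e.g. self-supervised prediction targets (placeholder)

hidden, _ = encoder(features)
repr_loss = nn.functional.mse_loss(proj(hidden), targets)  # representation objective
gen_loss = nn.functional.l1_loss(decoder(hidden), features)  # generation/reconstruction objective
loss = repr_loss + 0.5 * gen_loss                            # joint pre-training loss (weight is arbitrary)
print(loss.item())
```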
Plan2Align: Predictive Planning Based Test-Time Preference Alignment in Paragraph-Level Machine Translation
Wang, Kuang-Da, Chen, Teng-Ruei, Hung, Yu Heng, Ding, Shuoyang, Wu, Yueh-Hua, Wang, Yu-Chiang Frank, Yang, Chao-Han Huck, Peng, Wen-Chih, Hsieh, Ping-Chun
Machine Translation (MT) has predominantly been designed for sentence-level translation using transformer-based architectures. While next-token-prediction-based Large Language Models (LLMs) demonstrate strong capabilities in long-text translation, smaller language models often suffer from omissions and semantic inconsistencies when processing paragraphs. Existing preference alignment methods improve sentence-level translation but fail to ensure coherence over extended contexts due to the myopic nature of next-token generation. We introduce Plan2Align, a test-time alignment framework that treats translation as a predictive planning problem, adapting Model Predictive Control to iteratively refine translation outputs. Experiments on WMT24 Discourse-Level Literary Translation show that Plan2Align significantly improves paragraph-level translation, achieving performance surpassing or on par with existing training-time and test-time alignment methods on LLaMA-3.1 8B.
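As a rough illustration of the predictive-planning idea (not Plan2Align's actual implementation), the sketch below runs an MPC-style loop that samples candidate revisions, scores them with a preference model, commits to the best one, and re-plans; `generate_candidates` and `score_preference` are hypothetical stand-ins.

```python
# MPC-style test-time refinement for paragraph translation (toy sketch):
# sample candidates, score with a preference model, keep the best, re-plan.
# `generate_candidates` and `score_preference` are hypothetical placeholders.
import random


def generate_candidates(source: str, draft: str, n: int):
    # Stand-in: a real system would sample n revised translations from an LLM.
    return [draft + f" [revision {i}]" for i in range(n)]


def score_preference(source: str, candidate: str) -> float:
    # Stand-in for a learned preference / quality-estimation model.
    return random.random()


def mpc_style_refine(source: str, n_candidates: int = 4, n_iters: int = 3) -> str:
    translation = ""  # start from an empty (or first-pass) draft
    for _ in range(n_iters):
        candidates = generate_candidates(source, translation, n_candidates)
        # Commit to the highest-scoring candidate, then re-plan next round.
        translation = max(candidates, key=lambda c: score_preference(source, c))
    return translation


print(mpc_style_refine("source paragraph ..."))
```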
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching
Wu, Yen-Siang, Huang, Chi-Pin, Yang, Fu-En, Wang, Yu-Chiang Frank
To control the pacing and flow of AI-generated videos, users need control over the dynamics and composition of videos produced by generative models. To this end, numerous motion control methods [25, 33, 57, 59, 61, 63, 72] have been proposed to control moving object trajectories in videos generated by text-to-video (T2V) diffusion models [4, 17]. Motion customization, in particular, aims to control T2V diffusion models with the motion of a reference video [26, 31, 36, 71, 76]. With the assistance of the reference video, users are able to specify the desired object movements and camera framing in detail. Formally, given a reference video, motion customization aims to adjust a pre-trained T2V diffusion model so that the output videos sampled from the adjusted model follow the object movements and camera framing of the reference video (see Figure 1 for an example). Given that motion is a high-level concept involving both spatial and temporal dimensions [65, 71], motion customization is considered a non-trivial task. Recently, many motion customization methods have been proposed to eliminate the influence of visual appearance in the reference video. Among them, a standout strategy is fine-tuning the pre-trained T2V diffusion model to reconstruct the frame differences of the reference video.
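The frame-difference reconstruction strategy mentioned above (a technique of prior work, not MotionMatcher's feature matching itself) can be sketched as follows; the video tensor and loss are toy placeholders.

```python
# Toy illustration of the "reconstruct frame differences" fine-tuning target:
# supervise with the temporal difference of the reference video rather than
# its raw frames, which suppresses static appearance and emphasizes motion.
import torch

reference_video = torch.randn(16, 3, 64, 64)                    # (frames, channels, H, W)
frame_diffs = reference_video[1:] - reference_video[:-1]        # (15, 3, 64, 64)

# A fine-tuning objective would compare the model's predicted frame
# differences against `frame_diffs`, e.g. with an L2 loss:
predicted_diffs = torch.randn_like(frame_diffs)                 # stand-in for model output
loss = torch.nn.functional.mse_loss(predicted_diffs, frame_diffs)
print(loss.item())
```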
V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multi-Modal Large Language Models
Chiu, Hsu-kuang, Hachiuma, Ryo, Wang, Chien-Yi, Smith, Stephen F., Wang, Yu-Chiang Frank, Chen, Min-Hung
Current autonomous driving vehicles rely mainly on their individual sensors to understand surrounding scenes and plan for future trajectories, which can be unreliable when the sensors are malfunctioning or occluded. To address this problem, cooperative perception methods via vehicle-to-vehicle (V2V) communication have been proposed, but they have tended to focus on detection and tracking. How those approaches contribute to overall cooperative planning performance is still under-explored. Inspired by recent progress using Large Language Models (LLMs) to build autonomous driving systems, we propose a novel problem setting that integrates an LLM into cooperative autonomous driving, with the proposed Vehicle-to-Vehicle Question-Answering (V2V-QA) dataset and benchmark. We also propose our baseline method Vehicle-to-Vehicle Large Language Model (V2V-LLM), which uses an LLM to fuse perception information from multiple connected autonomous vehicles (CAVs) and answer driving-related questions: grounding, notable object identification, and planning. Experimental results show that our proposed V2V-LLM can be a promising unified model architecture for performing various tasks in cooperative autonomous driving, and outperforms other baseline methods that use different fusion approaches. Our work also creates a new research direction that can improve the safety of future autonomous driving systems. Our project website: https://eddyhkchiu.github.io/v2vllm.github.io/ .
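For intuition, a V2V-QA style sample might bundle perception outputs shared by several CAVs with a natural-language driving question; the field names and values below are purely illustrative and not the dataset's actual schema (see the project website for the real format).

```python
# Hypothetical shape of a cooperative question-answering sample: perception
# shared by multiple connected autonomous vehicles (CAVs) plus a question for
# the LLM. All keys and values are illustrative placeholders.
sample = {
    "ego_vehicle_id": "cav_0",
    "perception": {
        "cav_0": {"detections": [[12.3, -4.1, 0.8, "car"]], "occluded": False},
        "cav_1": {"detections": [[30.5, 2.7, 0.6, "pedestrian"]], "occluded": False},
    },
    "question": "Is there any notable object near location (30, 3) that the "
                "ego vehicle should consider when planning its trajectory?",
    "answer": "Yes, cav_1 observes a pedestrian near (30.5, 2.7); slow down.",
}
print(sample["question"])
```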
3D Gaussian Inpainting with Depth-Guided Cross-View Consistency
Huang, Sheng-Yu, Chou, Zi-Ting, Wang, Yu-Chiang Frank
When performing 3D inpainting using novel-view rendering methods like Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS), achieving texture and geometry consistency across camera views has been a challenge. In this paper, we propose a framework of 3D Gaussian Inpainting with Depth-Guided Cross-View Consistency (3DGIC) for cross-view consistent 3D inpainting. Guided by the rendered depth information from each training view, our 3DGIC exploits background pixels visible across different views for updating the inpainting mask, allowing us to refine the 3DGS for inpainting purposes. Through extensive experiments on benchmark datasets, we confirm that our 3DGIC outperforms current state-of-the-art 3D inpainting methods quantitatively and qualitatively.
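A simplified sketch of the depth-guided cross-view check described above, assuming idealized pinhole cameras with known intrinsics and poses: masked pixels of one view whose depth-backprojected 3D points land on unmasked background in another view are removed from the inpainting mask. The projection helpers are toy code, not the 3DGIC implementation.

```python
# Toy depth-guided mask refinement: if a masked pixel of view A projects onto
# unmasked background in view B, it can be observed rather than hallucinated,
# so it is removed from view A's inpainting mask.
import numpy as np


def backproject(depth, K):
    """Lift a depth map (H, W) to camera-space 3D points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)


def refine_mask(mask_a, depth_a, pose_a2b, K, mask_b):
    """Unmark pixels of view A whose 3D points land on background in view B."""
    pts = backproject(depth_a, K).reshape(-1, 3)
    pts_b = (pose_a2b[:3, :3] @ pts.T + pose_a2b[:3, 3:]).T        # A -> B
    uv = (K @ pts_b.T).T
    z = np.maximum(uv[:, 2], 1e-6)
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    H, W = mask_b.shape
    visible = (uv[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    refined = mask_a.reshape(-1).copy()
    idx = np.where(visible)[0]
    is_background_in_b = ~mask_b[v[idx], u[idx]]
    refined[idx[is_background_in_b]] = False
    return refined.reshape(mask_a.shape)


# Example with dummy data: all masked pixels project onto background in B.
K = np.array([[100.0, 0, 32], [0, 100.0, 32], [0, 0, 1]])
depth_a = np.full((64, 64), 2.0)
mask_a = np.zeros((64, 64), dtype=bool)
mask_a[20:40, 20:40] = True
mask_b = np.zeros((64, 64), dtype=bool)
print(refine_mask(mask_a, depth_a, np.eye(4), K, mask_b).sum())  # 400 -> 0 masked pixels
```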
Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits
Huang, Sung-Feng, Kuo, Heng-Cheng, Chen, Zhehuai, Yang, Xuesong, Yang, Chao-Han Huck, Tsao, Yu, Wang, Yu-Chiang Frank, Lee, Hung-yi, Fu, Szu-Wei
Neural speech editing advancements have raised concerns about their misuse in spoofing attacks. Traditional partially edited speech corpora primarily focus on cut-and-paste edits, which, while maintaining speaker consistency, often introduce detectable discontinuities. Recent methods, like A³T and Voicebox, improve transitions by leveraging contextual information. To foster spoofing detection research, we introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. We detail the process of re-implementing Voicebox training and creating the dataset. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite this difficulty for human listeners, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization across different edit methods. The dataset and related models will be made publicly available.
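A hypothetical sketch of how an infilling-style edit and its frame-level localization labels could be assembled; `infill_model` is a placeholder stand-in, not the actual Voicebox pipeline used to build SINE.

```python
# Hypothetical assembly of an infilling edit: mask a span of the utterance and
# let an infilling model (stand-in function here) regenerate it conditioned on
# the edited text, keeping the surrounding context intact.
import numpy as np


def infill_model(context_audio: np.ndarray, mask: np.ndarray, text: str) -> np.ndarray:
    # Stand-in: a real system would run a Voicebox-style infilling model here.
    edited = context_audio.copy()
    edited[mask] = 0.0
    return edited


sr = 16000
utterance = np.random.randn(5 * sr).astype(np.float32)  # 5 s of audio
mask = np.zeros_like(utterance, dtype=bool)
mask[2 * sr:3 * sr] = True                               # edit seconds 2-3

edited = infill_model(utterance, mask, text="the edited transcript goes here")
labels = mask.astype(np.int8)  # frame-level labels for spoof localization
print(edited.shape, labels.sum() / sr, "seconds edited")
```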
EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Liu, Shih-Yang, Yang, Huck, Wang, Chien-Yi, Fung, Nai Chit, Yin, Hongxu, Sakr, Charbel, Muralidharan, Saurav, Cheng, Kwang-Ting, Kautz, Jan, Wang, Yu-Chiang Frank, Molchanov, Pavlo, Chen, Min-Hung
Although Large Language Models (LLMs) exhibit superior performance across diverse applications, their practical deployment remains challenging due to their considerable model size and high inference costs. To mitigate these challenges, model compression research such as post-training compression (Ashkboos et al., 2024; Ma et al., 2023) and compression-aware training (Alvarez & Salzmann, 2017; Lym et al., 2019; Liu et al., 2024, 2023c) has been extensively explored to reduce the computational resource demands of serving LLMs (Zhu et al., 2023). However, most existing methods either incur significant accuracy degradation compared to uncompressed models or require long training times. Additionally, their flexibility is often limited by a discrete set of compression formats (e.g., 2:4 sparsity, 3/4-bit quantization), making it challenging to meet the diverse capacity and efficiency requirements of different users. To overcome this flexibility limitation, we re-formulate model compression as a customized compensation problem: given a compressed model, we aim to introduce residual low-rank paths that compensate for compression errors under customized requirements from users, such as tasks, compression ratios, etc. Rather than focusing solely on producing compressed models with minimal performance degradation, incorporating these residual paths gives the compensated model greater flexibility in adjusting overall capacity, without being constrained by specific compression formats.
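As a toy illustration of the residual-path idea, the sketch below approximates the compression error with a plain truncated SVD and adds it back as a low-rank branch; EoRA's actual eigenspace projection, which uses calibration activations, is intentionally omitted here.

```python
# Toy compensation of a compressed weight with a residual low-rank path:
# approximate the compression error dW = W - W_c with a rank-r factorization
# and add it back at inference. Plain truncated SVD stands in for EoRA's
# activation-informed eigenspace projection, which is omitted in this sketch.
import torch


def low_rank_compensation(W: torch.Tensor, W_c: torch.Tensor, rank: int):
    """Return (B, A) with B @ A approximating W - W_c."""
    dW = W - W_c
    U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # (out, r), singular values folded into B
    A = Vh[:rank, :]             # (r, in)
    return B, A


W = torch.randn(256, 512)
W_c = torch.round(W * 4) / 4     # crude stand-in for a quantized weight
B, A = low_rank_compensation(W, W_c, rank=32)

x = torch.randn(8, 512)
y_compensated = x @ (W_c + B @ A).T   # compressed path + residual low-rank path
print((y_compensated - x @ W.T).norm() / (x @ W.T).norm())  # relative error shrinks
```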
NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts
Lin, Yen-Ting, Yang, Chao-Han Huck, Chen, Zhehuai, Zelasko, Piotr, Yang, Xuesong, Chen, Zih-Ching, Puvvada, Krishna C, Fu, Szu-Wei, Hu, Ke, Chiu, Jun Wei, Balam, Jagadeesh, Ginsburg, Boris, Wang, Yu-Chiang Frank
Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer lies in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by training separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text, and vision-to-text datasets by learning to route each dataset's tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we set a new state of the art, achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-Opus with 15.5% to 27.6% relative WER reductions on the Hyporadise benchmark. NeKo also performs competitively on grammar and post-OCR correction as a multi-task model.
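A minimal sketch of the dataset-to-expert routing idea, where tokens can be hard-assigned during training to the expert mapped to their source dataset while a learned router handles inference; the layer below is illustrative and omits NeKo's actual gating and loss details.

```python
# Toy MoE feed-forward layer with dataset-mapped hard routing during training
# and learned routing at inference. Gating/auxiliary losses are omitted.
import torch
import torch.nn as nn


class TaskRoutedMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor, task_id: int | None = None) -> torch.Tensor:
        # x: (batch, seq, d_model). With task_id, tokens are hard-routed to the
        # expert mapped to their source dataset; otherwise the router decides.
        if task_id is not None:
            expert_idx = torch.full(x.shape[:2], task_id, device=x.device)
        else:
            expert_idx = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e
            if sel.any():
                out[sel] = expert(x[sel])
        return out


moe = TaskRoutedMoE(d_model=64, n_experts=3)
y = moe(torch.randn(2, 10, 64), task_id=1)  # e.g. 1 = the "speech-to-text" dataset
print(y.shape)
```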
Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
Lu, Ke-Han, Chen, Zhehuai, Fu, Szu-Wei, Yang, Chao-Han Huck, Balam, Jagadeesh, Ginsburg, Boris, Wang, Yu-Chiang Frank, Lee, Hung-yi
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge the gap between speech and text modalities. This requires significant annotation efforts and risks catastrophic forgetting of the original language capabilities. In this work, we present a simple yet effective automatic process for creating speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data, achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our model exhibits the ability to follow complex instructions derived from LLMs, such as specific output formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for more efficient and capable speech understanding systems.
DoRA: Weight-Decomposed Low-Rank Adaptation
Liu, Shih-Yang, Wang, Chien-Yi, Yin, Hongxu, Molchanov, Pavlo, Wang, Yu-Chiang Frank, Cheng, Kwang-Ting, Chen, Min-Hung
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because they avoid additional inference costs. However, there often remains an accuracy gap between these methods and full fine-tuning (FT). In this work, we first introduce a novel weight decomposition analysis to investigate the inherent differences between FT and LoRA. Building on these findings and aiming to resemble the learning capacity of FT, we propose Weight-Decomposed Low-Rank Adaptation (DoRA). DoRA decomposes the pre-trained weight into two components, magnitude and direction, for fine-tuning, specifically employing LoRA for directional updates to efficiently minimize the number of trainable parameters. By employing DoRA, we enhance both the learning capacity and training stability of LoRA while avoiding any additional inference overhead. DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding. Code is available at https://github.com/NVlabs/DoRA.
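A minimal sketch of the magnitude/direction decomposition described above, with a LoRA-style update applied to the direction; the normalization convention and initialization here are assumptions for illustration, not the official NVlabs/DoRA implementation.

```python
# Sketch of the DoRA idea: decompose the frozen weight into a trainable
# magnitude and a direction, and update the direction with LoRA factors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DoRALinearSketch(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        out_f, in_f = base.weight.shape
        # Frozen pre-trained weight W0 and its bias.
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.bias = base.bias
        # Trainable magnitude m, initialized to the per-output-unit norm of W0.
        self.m = nn.Parameter(self.weight.norm(p=2, dim=1, keepdim=True))
        # LoRA factors for the directional update (B starts at zero).
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Direction V = W0 + scaled BA, re-normalized and rescaled by m.
        v = self.weight + self.scaling * (self.B @ self.A)
        direction = v / v.norm(p=2, dim=1, keepdim=True)
        return F.linear(x, self.m * direction, self.bias)


layer = DoRALinearSketch(nn.Linear(64, 32))
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 32])
```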