Luo, Run
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis
Luo, Run, Lin, Ting-En, Zhang, Haonan, Wu, Yuchuan, Liu, Xiong, Yang, Min, Li, Yongbin, Chen, Longze, Li, Jiaming, Zhang, Lei, Chen, Yangyi, Alinejad-Rokny, Hamid, Huang, Fei
Recent advances in omnimodal learning have improved understanding and generation across images, text, and speech, though mainly within proprietary models. Limited omnimodal datasets and the inherent challenges associated with real-time emotional speech generation have hindered open-source progress. To address these issues, we propose OpenOmni, a two-stage training method combining omnimodal alignment and speech generation to develop a state-of-the-art omnimodal large language model. In the alignment phase, a pre-trained speech model is further trained on text-image tasks to generalize from vision to speech in a (near) zero-shot manner, outperforming models trained on tri-modal datasets. In the speech generation phase, a lightweight decoder facilitates real-time emotional speech through training on speech tasks and preference learning. Experiments demonstrate that OpenOmni consistently improves across omnimodal, vision-language, and speech-language evaluations, enabling natural, emotion-rich dialogues and real-time emotional speech generation.
PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation
Luo, Jing, Luo, Run, Chen, Longze, Zhu, Liang, Ao, Chang, Li, Jiaming, Chen, Yukun, Cheng, Xin, Yang, Wen, Su, Jiayuan, Li, Chengming, Yang, Min
While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second stage is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset's quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points (merely 17.8% of MetaMathQA and 27% of MathInstruct), yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training.
"There are a thousand Hamlets in a thousand people's eyes"
Among these tasks, solving math problems stands out as particularly challenging due to its complexity and the requirement for multi-step reasoning to reach a solution. While some closed-source models, such as GPT-4o (OpenAI, 2024a), Claude 3.5 Sonnet (Anthropic, 2024), and Gemini 1.5 Pro (Reid et al., 2024), have demonstrated strong math-solving capabilities, current open-source models (e.g., LLaMA (Touvron et al., 2023; Dubey et al., 2024)) continue to struggle in this area. Therefore, enhancing the math problem-solving abilities of open-source models is a prominent desideratum.
A widely adopted and effective approach for improving the math-solving capabilities of open-source models is fine-tuning, owing to the accessibility of their weights (Yuan et al., 2023; Yue et al., 2023;
Figure: The method consists of two stages: Stage 1 (top) and Stage 2 (bottom). Stage 1 focuses on using closed-source LLMs to automatically generate detailed CoT solutions and apply our persona-driven rewriting method to rephrase the questions.
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
Li, Jiaming, Zhang, Lei, Li, Yunshui, Liu, Ziqiang, Bai, Yuelin, Luo, Run, Chen, Longze, Yang, Min
The instruction-following ability of large language models enables humans to interact with AI agents in a natural way. However, when required to generate responses of a specific length, large language models often struggle to meet users' needs due to their inherent difficulty in accurately perceiving numerical constraints. To explore the ability of large language models to control the length of generated responses, we propose the Target Length Generation Task (TLG) and design two metrics, Precise Match (PM) and Flexible Match (FM), to evaluate the model's performance in adhering to specified response lengths. Furthermore, we introduce a novel, model-agnostic approach called Ruler, which employs Meta Length Tokens (MLTs) to enhance the instruction-following ability of large language models under length-constrained instructions. Specifically, Ruler equips LLMs with the ability to generate responses of a specified length based on length constraints within the instructions. Moreover, Ruler can automatically generate an appropriate MLT when length constraints are not explicitly provided, demonstrating excellent versatility and generalization. Comprehensive experiments show the effectiveness of Ruler across different LLMs on the Target Length Generation Task, e.g., an average gain of 27.97 on PM and 29.57 on FM at All Level. In addition, we conduct extensive ablation experiments to further substantiate the efficacy and generalization of Ruler. Our code and data are available at https://github.com/Geaming2002/Ruler.
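As a rough illustration of how range-based length metrics such as PM and FM could be computed, here is a minimal Python sketch. The abstract does not define either metric, so everything here is an assumption: the `LengthTarget` class, the word-based length count, and the idea that PM uses a tight range while FM uses a looser one are all hypothetical and may differ from the definitions in the Ruler paper and repository.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class LengthTarget:
    """Hypothetical target: inclusive word-count ranges (assumed, not from the paper)."""
    precise_lo: int
    precise_hi: int
    flexible_lo: int
    flexible_hi: int


def word_count(text: str) -> int:
    # Assumption: length is measured in whitespace-separated words.
    return len(text.split())


def match_rates(responses: Sequence[str],
                targets: Sequence[LengthTarget]) -> Tuple[float, float]:
    """Return (PM, FM) as percentages over paired responses and targets."""
    if len(responses) != len(targets) or not responses:
        raise ValueError("responses and targets must be non-empty and aligned")
    pm_hits = 0
    fm_hits = 0
    for response, target in zip(responses, targets):
        n = word_count(response)
        # A response counts toward PM if it falls in the tight range,
        # and toward FM if it falls in the looser range.
        pm_hits += target.precise_lo <= n <= target.precise_hi
        fm_hits += target.flexible_lo <= n <= target.flexible_hi
    total = len(responses)
    return 100.0 * pm_hits / total, 100.0 * fm_hits / total
```

Under these assumptions, a response that lands slightly outside the tight range still counts toward FM, which is why FM can never be lower than PM when the flexible range contains the precise one.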
Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Wang, Minzheng, Chen, Longze, Fu, Cheng, Liao, Shengyi, Zhang, Xinghua, Wu, Bingli, Yu, Haiyang, Xu, Nan, Zhang, Lei, Luo, Run, Li, Yunshui, Yang, Min, Huang, Fei, Li, Yongbin
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases each document is relevant to the final answer, so ignoring any document leads to an incorrect answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model's long-context modeling capabilities.
Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models
Chen, Longze, Liu, Ziqiang, He, Wanwei, Li, Yunshui, Luo, Run, Yang, Min
Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability since some training samples do not exhibit strong semantic dependencies across long contexts. In this study, we propose a data mining framework, ProLong, that assigns each training sample a long-dependency score, which can be used to rank and filter samples that are more advantageous for enhancing long-context modeling abilities in LLM training. Specifically, we first use delta perplexity scores to measure the Dependency Strength between text segments in a given document. Then we refine this metric based on the Dependency Distance of these segments to incorporate spatial relationships across long contexts. Final results are calibrated with a Dependency Specificity metric to prevent trivial dependencies introduced by repetitive patterns. Moreover, a random sampling approach is proposed to optimize the computational efficiency of ProLong. Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies and LLMs trained on these documents exhibit significantly enhanced long-context modeling capabilities.
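To make the idea of a distance-weighted delta-perplexity score concrete, here is a toy Python sketch. It is not ProLong's actual formula, which the abstract does not give: the function names, the linear distance weight, and the averaging over segment pairs are assumptions, and the Dependency Specificity calibration and random sampling steps are left out entirely. The negative log-likelihoods are assumed to be precomputed by some language model.

```python
import math
from typing import Sequence


def delta_perplexity(nll_alone: float, nll_with_context: float) -> float:
    """Perplexity drop for a segment when an earlier segment is given as context."""
    return math.exp(nll_alone) - math.exp(nll_with_context)


def long_dependency_score(nll_alone: Sequence[float],
                          nll_cond: Sequence[Sequence[float]]) -> float:
    """Toy document-level score.

    nll_alone[j]   : mean token NLL of segment j with no context.
    nll_cond[i][j] : mean token NLL of segment j given segment i (used for i < j).
    """
    n = len(nll_alone)
    total = 0.0
    pairs = 0
    for j in range(1, n):
        for i in range(j):
            # Dependency strength: how much segment i lowers the perplexity of j.
            strength = max(0.0, delta_perplexity(nll_alone[j], nll_cond[i][j]))
            # Assumed distance weight: grows linearly with the segment gap.
            distance = (j - i) / (n - 1)
            total += strength * distance
            pairs += 1
    return total / pairs if pairs else 0.0
```

With this weighting, a document in which a distant segment helps predict a later one scores higher than a document where only the adjacent segment helps, which is the ranking behaviour the abstract describes.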
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue
Li, Yunshui, Hui, Binyuan, Yin, Zhaochao, He, Wanwei, Luo, Run, Long, Yuxing, Yang, Min, Huang, Fei, Li, Yongbin
Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, in order to provide a comprehensive assessment of the model's performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialog systems and lead to the development of more sophisticated and effective pre-trained models.
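For readers unfamiliar with the Analytic Hierarchy Process, the following Python sketch shows one common way to turn a pairwise importance matrix into weights and combine per-task scores into a single number. It uses the row geometric-mean approximation of AHP priorities. How VDscore actually builds its comparison matrix and aggregates tasks is not stated in the abstract, so `ahp_weights`, `weighted_score`, and the three-task example are illustrative assumptions only.

```python
import math
from typing import List, Sequence


def ahp_weights(pairwise: Sequence[Sequence[float]]) -> List[float]:
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important criterion i is than j,
    so pairwise[j][i] is expected to be its reciprocal.
    """
    n = len(pairwise)
    if n == 0 or any(len(row) != n for row in pairwise):
        raise ValueError("pairwise must be a non-empty square matrix")
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def weighted_score(task_scores: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Combine per-task scores into one number using the given weights."""
    if len(task_scores) != len(weights):
        raise ValueError("task_scores and weights must have the same length")
    return sum(s * w for s, w in zip(task_scores, weights))


# Hypothetical 3-task comparison: task 0 is judged most important.
example_matrix = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
example_weights = ahp_weights(example_matrix)
```

For a perfectly consistent comparison matrix like the example, the geometric-mean weights coincide with the principal-eigenvector weights that AHP is usually defined with; for inconsistent matrices they are only an approximation.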