Huang, Wenhao
A Comprehensive Survey on Long Context Language Modeling
Liu, Jiaheng, Zhu, Dawei, Bai, Zhiqi, He, Yancheng, Liao, Huanxuan, Que, Haoran, Wang, Zekun, Zhang, Chenchen, Zhang, Ge, Zhang, Jiebin, Zhang, Yuanxing, Chen, Zhuo, Guo, Hangyu, Li, Shilong, Liu, Ziqiang, Shan, Yong, Song, Yifan, Tian, Jiayi, Wu, Wenhao, Zhou, Zhejian, Zhu, Ruijie, Feng, Junlan, Gao, Yang, He, Shizhu, Li, Zhoujun, Liu, Tianyu, Meng, Fanyu, Su, Wenbo, Tan, Yingshui, Wang, Zili, Yang, Jian, Ye, Wei, Zheng, Bo, Zhou, Wangchunshu, Huang, Wenhao, Li, Sujian, Zhang, Zhaoxiang
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: \href{https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling}{\color[RGB]{175,36,67}{LCLM-Horizon}}.
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike
We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines
Team, M-A-P, Du, Xinrun, Yao, Yifan, Ma, Kaijing, Wang, Bingli, Zheng, Tianyu, Zhu, Kang, Liu, Minghao, Liang, Yiming, Jin, Xiaolong, Wei, Zhenlin, Zheng, Chujie, Deng, Kaixin, Jia, Shian, Jiang, Sichao, Liao, Yiyan, Li, Rui, Li, Qinrui, Li, Sirun, Li, Yizhi, Li, Yunwen, Ma, Dehua, Ni, Yuansheng, Que, Haoran, Wang, Qiyao, Wen, Zhoufutu, Wu, Siwei, Xing, Tianshun, Xu, Ming, Yang, Zhenzhu, Wang, Zekun Moore, Zhou, Junting, Bai, Yuelin, Bu, Xingyuan, Cai, Chenglin, Chen, Liang, Chen, Yifan, Cheng, Chengtuo, Cheng, Tianhao, Ding, Keyi, Huang, Siming, Huang, Yun, Li, Yaoru, Li, Yizhe, Li, Zhaoqun, Liang, Tianhao, Lin, Chengdong, Lin, Hongquan, Ma, Yinghao, Pang, Tianyang, Peng, Zhongyuan, Peng, Zifan, Qi, Qige, Qiu, Shi, Qu, Xingwei, Quan, Shanghaoran, Tan, Yizhou, Wang, Zili, Wang, Chenqing, Wang, Hao, Wang, Yiya, Wang, Yubo, Xu, Jiajun, Yang, Kexin, Yuan, Ruibin, Yue, Yuanhao, Zhan, Tianyang, Zhang, Chun, Zhang, Jinyang, Zhang, Xiyue, Zhang, Xingjian, Zhang, Yue, Zhao, Yongchi, Zheng, Xiangyu, Zhong, Chenghua, Gao, Yang, Li, Zhoujun, Liu, Dayiheng, Liu, Qian, Liu, Tianyu, Ni, Shiwen, Peng, Junran, Qin, Yujia, Su, Wenbo, Wang, Guoyin, Wang, Shi, Yang, Jian, Yang, Min, Cao, Meng, Yue, Xiang, Zhang, Zhaoxiang, Zhou, Wangchunshu, Liu, Jiaheng, Lin, Qunshu, Huang, Wenhao, Zhang, Ge
Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning
Cai, Mingsheng, Jiang, Jiuming, Huang, Wenhao, Liu, Che, Arcucci, Rossella
Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) recordings are critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose $\textbf{SuPreME}$, a $\textbf{Su}$pervised $\textbf{Pre}$-training framework for $\textbf{M}$ultimodal $\textbf{E}$CG representation learning. SuPreME applies Large Language Models (LLMs) to extract structured clinical entities from free-text ECG reports, filter out noise and irrelevant content, enhance clinical representation learning, and build a high-quality, fine-grained labeled dataset. By using text-based cardiac queries instead of traditional categorical labels, SuPreME enables zero-shot classification of unseen diseases without additional fine-tuning. We evaluate SuPreME on six downstream datasets covering 127 cardiac conditions, achieving superior zero-shot AUC performance over state-of-the-art eSSL and multimodal methods by over 1.96\%. Results demonstrate the effectiveness of SuPreME in leveraging structured, clinically relevant knowledge for high-quality ECG representations. All code and data will be released upon acceptance.
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Zhang, Alexander, Dong, Marcus, Liu, Jiaheng, Zhang, Wei, Wang, Yejie, Yang, Jian, Zhang, Ge, Liu, Tianyu, Peng, Zhongyuan, Tan, Yingshui, Zhang, Yuanxing, Wang, Zhexu, Wang, Weixun, He, Yancheng, Deng, Ken, Zhou, Wangchunshu, Huang, Wenhao, Zhang, Zhaoxiang
The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of LLMs has drawn great attention and several critique benchmarks have been proposed. However, existing critique benchmarks usually have the following limitations: (1). Focusing on diverse reasoning tasks in general domains and insufficient evaluation on code tasks (e.g., only covering code generation task), where the difficulty of queries is relatively easy (e.g., the code queries of CriticBench are from Humaneval and MBPP). (2). Lacking comprehensive evaluation from different dimensions. To address these limitations, we introduce a holistic code critique benchmark for LLMs called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics, where fine-grained evaluation checklists are well-designed for advanced settings. Finally, we conduct extensive experimental results of existing LLMs, which show the effectiveness of CodeCriticBench.
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Cheng, Xianfu, Zhang, Wei, Zhang, Shiwei, Yang, Jian, Guan, Xiangyuan, Wu, Xianjie, Li, Xiang, Zhang, Ge, Liu, Jiaheng, Mai, Yuying, Zeng, Yutao, Wen, Zhoufutu, Jin, Ke, Wang, Baorui, Zhou, Weixiao, Lu, Yunhong, Li, Tongliang, Huang, Wenhao, Li, Zhoujun
The increasing application of multi-modal large language models (MLLMs) across various sectors have spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality ability of MLLMs to answer natural language short questions. SimpleVQA is characterized by six key features: it covers multiple tasks and multiple scenarios, ensures high quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 topics. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of leading 18 MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
Aligning Instruction Tuning with Pre-training
Liang, Yiming, Zheng, Tianyu, Du, Xinrun, Zhang, Ge, Qu, Xingwei, Yue, Xiang, Zheng, Chujie, Liu, Jiaheng, Ma, Lei, Chen, Wenhu, Wang, Guoyin, Zhang, Zhaoxiang, Huang, Wenhao, Zhang, Jiajun
Instruction tuning enhances large language models (LLMs) to follow human instructions across diverse tasks, relying on high-quality datasets to guide behavior. However, these datasets, whether manually curated or synthetically generated, are often narrowly focused and misaligned with the broad distributions captured during pre-training, limiting LLM generalization and effective use of pre-trained knowledge. We propose *Aligning Instruction Tuning with Pre-training* (AITP), a method that bridges this gap by identifying coverage shortfalls in instruction-tuning datasets and rewriting underrepresented pre-training data into high-quality instruction-response pairs. This approach enriches dataset diversity while preserving task-specific objectives. Evaluations on three fully open LLMs across eight benchmarks demonstrate consistent performance improvements with AITP. Ablations highlight the benefits of adaptive data selection, controlled rewriting, and balanced integration, emphasizing the importance of aligning instruction tuning with pre-training distributions to unlock the full potential of LLMs.
MIO: A Foundation Model on Multimodal Tokens
Wang, Zekun, Zhu, King, Xu, Chunpu, Zhou, Wangchunshu, Liu, Jiaheng, Zhang, Yibo, Wang, Jiashuo, Shi, Ning, Li, Siyu, Li, Yizhi, Que, Haoran, Zhang, Zhaoxiang, Zhang, Yuanxing, Zhang, Ge, Xu, Ke, Fu, Jie, Huang, Wenhao
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc. Codes and models are available at https://github.com/MIO-Team/MIO. The advent of Large Language Models (LLMs) is commonly considered the dawn of artificial general intelligence (AGI) (OpenAI et al., 2023; Bubeck et al., 2023), given their generalist capabilities such as complex reasoning (Wei et al., 2022), role playing (Wang et al., 2023c), and creative writing (Wang et al., 2024a). These MM-LLMs typically involve an external multimodal encoder, such as EVA-CLIP (Sun et al., 2023b) or CLAP (Elizalde et al., 2022), with an alignment module such as Q-Former (Li et al., 2023b) or MLP (Liu et al., 2023b) for multimodal understanding. These modules align non-textual-modality data features into the embedding space of the LLM backbone. Another line of work involves building any-to-any and end-to-end MM-LLMs that can input and output non-textual modality data. I/O Consistency indicates whether the model ensures that the input and output representations for the same data remain consistent. SFT refers to whether the model undergoes a unified (Uni.)
SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
Yin, Sheng, Pang, Xianghe, Ding, Yuanzhuo, Chen, Menglan, Bi, Yutong, Xiong, Yichen, Huang, Wenhao, Xiang, Zhen, Shao, Jing, Chen, Siheng
With the integration of large language models (LLMs), embodied agents have strong capabilities to execute complicated instructions in natural language, paving a way for the potential deployment of embodied robots. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damages in real world. To study this issue, we present SafeAgentBench -- a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline gets 69% success rate for safe tasks, but only 5% rejection rate for hazardous tasks, indicating significant safety risks. More details and codes are available at https://github.com/shengyin1224/SafeAgentBench.
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
Yan, Xin, Cai, Yuxuan, Wang, Qiuyue, Zhou, Yuan, Huang, Wenhao, Yang, Huan
We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/.