Wang, Xintao
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Shi, Minglei, Yuan, Ziyang, Yang, Haotian, Wang, Xintao, Zheng, Mingwu, Tao, Xin, Zhao, Wenliang, Zheng, Wenzhao, Zhou, Jie, Lu, Jiwen, Wan, Pengfei, Zhang, Di, Gai, Kun
Diffusion models have demonstrated remarkable success in various image generation tasks, but their performance is often limited by the uniform processing of inputs across varying conditions and noise levels. To address this limitation, we propose a novel approach that leverages the inherent heterogeneity of the diffusion process. Our method, DiffMoE, introduces a batch-level global token pool that enables experts to access global token distributions during training, promoting specialized expert behavior. To unleash the full potential of the diffusion process, DiffMoE incorporates a capacity predictor that dynamically allocates computational resources based on noise levels and sample complexity. Through comprehensive evaluation, DiffMoE achieves state-of-the-art performance among diffusion models on ImageNet benchmark, substantially outperforming both dense architectures with 3x activated parameters and existing MoE approaches while maintaining 1x activated parameters. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation, demonstrating its broad applicability across different diffusion model applications. Project Page: https://shiml20.github.io/DiffMoE/
PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
Chen, Lida, Xu, Dong, An, Chenxin, Wang, Xintao, Zhang, Yikai, Chen, Jiangjie, Liang, Zujie, Wei, Feng, Liang, Jiaqing, Xiao, Yanghua, Wang, Wei
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex implementation of pipeline. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the respective of receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension through the theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a comparable time complexity to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attentions and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Wang, Xintao, Wang, Heng, Zhang, Yifei, Yuan, Xinfeng, Xu, Rui, Huang, Jen-tse, Yuan, Siyu, Guo, Haoran, Chen, Jiangjie, Wang, Wei, Xiao, Yanghua, Zhou, Shuchang
Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, i.e., achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.
Improving Video Generation with Human Feedback
Liu, Jie, Liu, Gongye, Liang, Jiajun, Yuan, Ziyang, Liu, Xiaokun, Zheng, Mingwu, Wu, Xiele, Wang, Qiulin, Qin, Wenyu, Xia, Menghan, Wang, Xintao, Liu, Xiaohong, Yang, Fei, Wan, Pengfei, Zhang, Di, Gai, Kun, Yang, Yujiu, Ouyang, Wanli
Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.
MINDECHO: Role-Playing Language Agents for Key Opinion Leaders
Xu, Rui, Lu, Dakuan, Tan, Xiaoyu, Wang, Xintao, Yuan, Siyu, Chen, Jiangjie, Chu, Wei, Yinghui, Xu
Large language models~(LLMs) have demonstrated impressive performance in various applications, among which role-playing language agents (RPLAs) have engaged a broad user base. Now, there is a growing demand for RPLAs that represent Key Opinion Leaders (KOLs), \ie, Internet celebrities who shape the trends and opinions in their domains. However, research in this line remains underexplored. In this paper, we hence introduce MINDECHO, a comprehensive framework for the development and evaluation of KOL RPLAs. MINDECHO collects KOL data from Internet video transcripts in various professional fields, and synthesizes their conversations leveraging GPT-4. Then, the conversations and the transcripts are used for individualized model training and inference-time retrieval, respectively. Our evaluation covers both general dimensions (\ie, knowledge and tones) and fan-centric dimensions for KOLs. Extensive experiments validate the effectiveness of MINDECHO in developing and evaluating KOL RPLAs.
Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works
Yuan, Xinfeng, Yuan, Siyu, Cui, Yuhan, Lin, Tianhe, Wang, Xintao, Xu, Rui, Chen, Jiangjie, Yang, Deqing
Large language models (LLMs) have demonstrated impressive performance and spurred numerous AI applications, in which role-playing agents (RPAs) are particularly popular, especially for fictional characters. The prerequisite for these RPAs lies in the capability of LLMs to understand characters from fictional works. Previous efforts have evaluated this capability via basic classification tasks or characteristic imitation, failing to capture the nuanced character understanding with LLMs. In this paper, we propose evaluating LLMs' character understanding capability via the character profiling task, i.e., summarizing character profiles from corresponding materials, a widely adopted yet understudied practice for RPA development. Specifically, we construct the CroSS dataset from literature experts and assess the generated profiles by comparing ground truth references and their applicability in downstream tasks. Our experiments, which cover various summarization methods and LLMs, have yielded promising results. These results strongly validate the character understanding capability of LLMs. Resources are available at https://github.com/Joanna0123/character_profiling.
Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation
Xia, Sirui, Wang, Xintao, Liang, Jiaqing, Zhang, Yifei, Zhou, Weikang, Deng, Jiaji, Yu, Fei, Xiao, Yanghua
Retrieval-Augmented Generation (RAG) has been widely adopted to enhance Large Language Models (LLMs) in knowledge-intensive tasks. Recently, Attributed Text Generation (ATG) has attracted growing attention, which provides citations to support the model's responses in RAG, so as to enhance the credibility of LLM-generated content and facilitate verification. Prior methods mainly adopt coarse-grained attributions, linking to passage-level references or providing paragraph-level citations. However, these methods still fall short in verifiability and require certain time costs for fact checking. This paper proposes a fine-grained ATG method called ReClaim(Refer & Claim), which alternates the generation of references and answers step by step. Unlike traditional coarse-grained attribution, ReClaim allows the model to add sentence-level fine-grained citations to each answer sentence in long-form question-answering tasks. Our experiments encompass various training and inference methods and multiple LLMs, verifying the effectiveness of our approach.
Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data
Ran, Yiting, Wang, Xintao, Xu, Rui, Yuan, Xinfeng, Liang, Jiaqing, Xiao, Yanghua, Yang, Deqing
Role-playing agents (RPA) have been a popular application area for large language models (LLMs), attracting significant interest from both industry and academia.While existing RPAs well portray the characters' knowledge and tones, they face challenges in capturing their minds, especially for small role-playing language models (RPLMs). In this paper, we propose to enhance RPLMs via personality-indicative data. Specifically, we leverage questions from psychological scales and distill advanced RPAs to generate dialogues that grasp the minds of characters. Experimental results validate that RPLMs trained with our dataset exhibit advanced role-playing capabilities for both general and personality-related evaluations. Code and data are available at \href{https://github.com/alienet1109/RolePersonality}{this URL}.
Image Conductor: Precision Control for Interactive Video Synthesis
Li, Yaowei, Wang, Xintao, Zhang, Zhaoyang, Wang, Zhouxia, Yuan, Ziyang, Xie, Liangbin, Zou, Yuexian, Shan, Ying
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. An well-cultivated training strategy is proposed to separate distinct camera and object motion by camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals
Chen, Lida, Liang, Zujie, Wang, Xintao, Liang, Jiaqing, Xiao, Yanghua, Wei, Feng, Chen, Jinglei, Hao, Zhenghong, Han, Bing, Wang, Wei
Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance due to inadequate training on knowledge boundaries. We call it a limitation of LLMs that they can not accurately express their knowledge boundary, answering questions they know while admitting ignorance to questions they do not know. In this paper, we aim to teach LLMs to recognize and express their knowledge boundary, so they can reduce hallucinations caused by fabricating when they do not know. We propose CoKE, which first probes LLMs' knowledge boundary via internal confidence given a set of questions, and then leverages the probing results to elicit the expression of the knowledge boundary. Extensive experiments show CoKE helps LLMs express knowledge boundaries, answering known questions while declining unknown ones, significantly improving in-domain and out-of-domain performance.