Li, Peng
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Zhang, Wenqi, Wang, Mengna, Liu, Gangao, Huixin, Xu, Jiang, Yiwei, Shen, Yongliang, Hou, Guiyang, Zheng, Zhe, Zhang, Hang, Li, Xin, Lu, Weiming, Li, Peng, Zhuang, Yueting
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains which require continuous interaction with environments through image action interleaved trajectories remains largely -unexplored. We present Embodied Reasoner, a model that extends o1 style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms those advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9\%, 24\%, and +13\%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world environments also show our superiority while exhibiting fewer repeated searches and logical inconsistency cases.
Gradient-guided Attention Map Editing: Towards Efficient Contextual Hallucination Mitigation
Wang, Yu, Zhang, Jiaxin, Gao, Xiang, Cui, Wendi, Li, Peng, Das, Kamalika
In tasks like summarization and open-book question answering (QA), Large Language Models (LLMs) often encounter "contextual hallucination", where they produce irrelevant or incorrect responses despite having access to accurate source information. This typically occurs because these models tend to prioritize self-generated content over the input context, causing them to disregard pertinent details. To address this challenge, we introduce a novel method called "Guided Attention Map Editing" (GAME), which dynamically adjusts attention maps to improve contextual relevance. During inference, GAME employs a trained classifier to identify attention maps prone to inducing hallucinations and executes targeted interventions. These interventions, guided by gradient-informed "edit directions'', strategically redistribute attention weights across various heads to effectively reduce hallucination. Comprehensive evaluations on challenging summarization and open-book QA tasks show that GAME consistently reduces hallucinations across a variety of open-source models. Specifically, GAME reduces hallucinations by 10% in the XSum summarization task while achieving a 7X speed-up in computational efficiency compared to the state-of-the-art baselines.
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Yuan, Ruibin, Lin, Hanfeng, Guo, Shuyue, Zhang, Ge, Pan, Jiahao, Zang, Yongyi, Liu, Haohe, Liang, Yiming, Ma, Wenye, Du, Xingjian, Du, Xinrun, Ye, Zhen, Zheng, Tianyu, Ma, Yinghao, Liu, Minghao, Tian, Zeyue, Zhou, Ziya, Xue, Liumeng, Qu, Xingwei, Li, Yizhi, Wu, Shangda, Shen, Tianhao, Ma, Ziyang, Zhan, Jun, Wang, Chunhui, Wang, Yatian, Chi, Xiaowei, Zhang, Xinyue, Yang, Zhenzhu, Wang, Xiangzhou, Liu, Shansong, Mei, Lingrui, Li, Peng, Wang, Junjie, Yu, Jianwei, Pang, Guojian, Li, Xu, Wang, Zihao, Zhou, Xiaohuan, Yu, Lijun, Benetos, Emmanouil, Chen, Yong, Lin, Chenghua, Chen, Xie, Xia, Gus, Zhang, Zhaoxiang, Zhang, Chao, Chen, Wenhu, Zhou, Xinyu, Qiu, Xipeng, Dannenberg, Roger, Liu, Jiaheng, Yang, Jian, Huang, Wenhao, Xue, Wei, Tan, Xu, Guo, Yike
We tackle the task of long-form music generation--particularly the challenging \textbf{lyrics-to-song} problem--by introducing YuE, a family of open foundation models based on the LLaMA2 architecture. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through (1) track-decoupled next-token prediction to overcome dense mixture signals, (2) structural progressive conditioning for long-context lyrical alignment, and (3) a multitask, multiphase pre-training recipe to converge and generalize. In addition, we redesign the in-context learning technique for music generation, enabling versatile style transfer (e.g., converting Japanese city pop into an English rap while preserving the original accompaniment) and bidirectional generation. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some of the proprietary systems in musicality and vocal agility. In addition, fine-tuning YuE enables additional controls and enhanced support for tail languages. Furthermore, beyond generation, we show that YuE's learned representations can perform well on music understanding tasks, where the results of YuE match or exceed state-of-the-art methods on the MARBLE benchmark. Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective
Shao, Ruichen, Li, Bei, Liu, Gangao, Chen, Yang, Zhou, Xiang, Wang, Jingang, Cai, Xunliang, Li, Peng
A BSTRACT Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MA TH) confirm that our method enhances performance without compromising general capabilities. Our codebase would be available at https://github.com/LotuSrc/D2PO . 1 I NTRODUCTION Direct Preference Optimization (DPO) (Rafailov et al., 2023) has recently emerged as a highly efficient alternative for aligning large language models (LLMs) with human preferences (Askell et al., 2021; Ouyang et al., 2022). Unlike reinforcement learning from human feedback (RLHF), which involves training a reward model followed by iterative policy updates, DPO reframes the problem as a binary classification task directly over human preference data. Compared to supervised fine-tuning, DPO enables the model not only to learn what is good but also to be aware of what is bad. This formulation allows DPO to optimize preference alignment in a single-stage training process, bypassing the complexities of reinforcement learning, such as policy sampling or extensive hyperparameter tuning.
Enhancing Language Multi-Agent Learning with Multi-Agent Credit Re-Assignment for Interactive Environment Generalization
He, Zhitao, Liu, Zijun, Li, Peng, Fung, May, Yan, Ming, Zhang, Ji, Huang, Fei, Liu, Yang
LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer using. Current multi-agent systems universally excel in performance, compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceed strong closed-source models, and the LLM that guides the CR. We also provide insights in using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi-agent systems.
LLM-USO: Large Language Model-based Universal Sizing Optimizer
S, Karthik Somayaji N., Li, Peng
The design of analog circuits is a cornerstone of integrated circuit (IC) development, requiring the optimization of complex, interconnected sub-structures such as amplifiers, comparators, and buffers. Traditionally, this process relies heavily on expert human knowledge to refine design objectives by carefully tuning sub-components while accounting for their interdependencies. Existing methods, such as Bayesian Optimization (BO), offer a mathematically driven approach for efficiently navigating large design spaces. However, these methods fall short in two critical areas compared to human expertise: (i) they lack the semantic understanding of the sizing solution space and its direct correlation with design objectives before optimization, and (ii) they fail to reuse knowledge gained from optimizing similar sub-structures across different circuits. To overcome these limitations, we propose the Large Language Model-based Universal Sizing Optimizer (LLM-USO), which introduces a novel method for knowledge representation to encode circuit design knowledge in a structured text format. This representation enables the systematic reuse of optimization insights for circuits with similar sub-structures. LLM-USO employs a hybrid framework that integrates BO with large language models (LLMs) and a learning summary module. This approach serves to: (i) infuse domain-specific knowledge into the BO process and (ii) facilitate knowledge transfer across circuits, mirroring the cognitive strategies of expert designers. Specifically, LLM-USO constructs a knowledge summary mechanism to distill and apply design insights from one circuit to related ones. It also incorporates a knowledge summary critiquing mechanism to ensure the accuracy and quality of the summaries and employs BO-guided suggestion filtering to identify optimal design points efficiently.
Contrastive Private Data Synthesis via Weighted Multi-PLM Fusion
Zou, Tianyuan, Liu, Yang, Li, Peng, Xiong, Yufei, Zhang, Jianqing, Liu, Jingjing, Ye, Xiaozhou, Ouyang, Ye, Zhang, Ya-Qin
Substantial quantity and high quality are the golden rules of making a good training dataset with sample privacy protection equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods relying on pre-trained models for data synthesis %that avoid fine-tuning large pre-trained generative models often struggle in data-deficient scenarios, suffering from limited sample size, inevitable generation noise and existing pre-trained model bias. To address these challenges, we propose a novel contrAstive private data Synthesis via Weighted multiple Pre-trained language models (PLM) framework, named as WASP. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-Q voting mechanism, and leverages low-quality synthetic samples for contrastive generation via collaboration among dynamically weighted multiple pre-trained models.Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance over diverse downstream tasks. Code is available at https://anonymous.4open.science/r/WASP.
Perspective Transition of Large Language Models for Solving Subjective Tasks
Wang, Xiaolong, Zhang, Yuanchi, Wang, Ziyue, Xu, Yuzhuang, Luo, Fuwen, Wang, Yile, Li, Peng, Liu, Yang
Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. Different from objective tasks such as commonsense reasoning and arithmetic question-answering, the performance of LLMs on subjective tasks is still limited, where the perspective on the specific problem plays crucial roles for better interpreting the context and giving proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives for the best way to solve corresponding subjective problem. Through extensive experiments on totally 12 subjective tasks by using both closed-source and open-source LLMs including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used single fixed perspective based methods such as chain-of-thought prompting and expert prompting, highlights the intricate ways that LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
Gu, Zekai, Yan, Rui, Lu, Jiahao, Li, Peng, Dou, Zhiyang, Si, Chenyang, Dong, Zhen, Liu, Qifeng, Lin, Cheng, Liu, Ziwei, Wang, Wenping, Liu, Yuan
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
Towards 3D Acceleration for low-power Mixture-of-Experts and Multi-Head Attention Spiking Transformers
Xu, Boxun, Hwang, Junyoung, Vanna-iampikul, Pruek, Yin, Yuxuan, Lim, Sung Kyu, Li, Peng
Spiking Neural Networks(SNNs) provide a brain-inspired and event-driven mechanism that is believed to be critical to unlock energy-efficient deep learning. The mixture-of-experts approach mirrors the parallel distributed processing of nervous systems, introducing conditional computation policies and expanding model capacity without scaling up the number of computational operations. Additionally, spiking mixture-of-experts self-attention mechanisms enhance representation capacity, effectively capturing diverse patterns of entities and dependencies between visual or linguistic tokens. However, there is currently a lack of hardware support for highly parallel distributed processing needed by spiking transformers, which embody a brain-inspired computation. This paper introduces the first 3D hardware architecture and design methodology for Mixture-of-Experts and Multi-Head Attention spiking transformers. By leveraging 3D integration with memory-on-logic and logic-on-logic stacking, we explore such brain-inspired accelerators with spatially stackable circuitry, demonstrating significant optimization of energy efficiency and latency compared to conventional 2D CMOS integration.