Zhang, Haifeng
PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Deng, Cheng, Sun, Luoyang, Jiang, Jiwen, Zeng, Yongcheng, Wu, Xinjian, Zhao, Wenxin, Xiao, Qingfa, Wang, Jiachuan, Li, Haoyang, Chen, Lei, Ni, Lionel M., Zhang, Haifeng, Wang, Jun
While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing the peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% on general tasks, 9% on GSM8K, and 11% on coding tasks. Beyond its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while activating the fewest parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM's suitability for peripheral applications. The PLM series of models is publicly available at https://github.com/plm-team/PLM.
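As a minimal PyTorch sketch of the sparsity idea named in the abstract, the block below uses a squared ReLU activation (relu(x)**2), which keeps the exact zeros of ReLU so that a large fraction of intermediate neurons are inactive per token; the layer names and dimensions are illustrative assumptions, not PLM's actual configuration.

```python
import torch
import torch.nn as nn

class SquaredReLUFFN(nn.Module):
    """Feed-forward block with a squared ReLU activation. Negative
    pre-activations become exact zeros, so the intermediate tensor is
    sparse and zero rows can be skipped by a sparsity-aware kernel."""

    def __init__(self, d_model: int = 1024, d_hidden: int = 4096):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) ** 2  # sparse intermediate activations
        return self.down(h)

if __name__ == "__main__":
    ffn = SquaredReLUFFN()
    x = torch.randn(2, 8, 1024)
    h = torch.relu(ffn.up(x)) ** 2
    print("fraction of zero activations:", (h == 0).float().mean().item())
    print("output shape:", tuple(ffn(x).shape))
```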
Variational Stochastic Games
Zhao, Zhiyu, Zhang, Haifeng
The Control as Inference (CAI) framework has successfully transformed single-agent reinforcement learning (RL) by reframing control tasks as probabilistic inference problems. However, the extension of CAI to multi-agent, general-sum stochastic games (SGs) remains underexplored, particularly in decentralized settings where agents operate independently without centralized coordination. In this paper, we propose a novel variational inference framework tailored to decentralized multi-agent systems. Our framework addresses the challenges posed by non-stationarity and unaligned agent objectives, proving that the resulting policies form an $\epsilon$-Nash equilibrium. Additionally, we establish theoretical convergence guarantees for the proposed decentralized algorithms. Leveraging this framework, we instantiate multiple algorithms to solve for Nash equilibrium, mean-field Nash equilibrium, and correlated equilibrium, with rigorous theoretical convergence analysis.
ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization
Zeng, Yongcheng, Cui, Xinyu, Jin, Xuanfa, Liu, Guoqing, Sun, Zexu, He, Quan, Li, Dong, Yang, Ning, Hao, Jianye, Zhang, Haifeng, Wang, Jun
A truly intelligent Large Language Model (LLM) should be capable of correcting errors in its responses through external interactions. However, even the most advanced models often face challenges in improving their outputs. In this paper, we explore how to cultivate LLMs with a self-refinement capability through iterative preference training, and how this ability can be leveraged to improve model performance during inference. To this end, we introduce a novel post-training and inference framework, called ARIES: Adaptive Refinement and Iterative Enhancement Structure. This method iteratively performs preference training and self-refinement-based data collection. During training, ARIES strengthens the model's direct question-answering capability while simultaneously unlocking its self-refinement potential. During inference, ARIES harnesses this self-refinement capability to generate a series of progressively refined responses, which are then filtered using either Reward Model Scoring or a simple yet effective Rule-Based Selection mechanism tailored to our approach, to construct a dataset for the next round of preference training. Experimental results demonstrate the remarkable performance of ARIES. When applied to the Llama-3.1-8B model under the self-refinement setting, ARIES surpasses powerful models such as GPT-4o, achieving a 62.3% length-controlled (LC) win rate and a 63.3% raw win rate on AlpacaEval 2, outperforming Iterative DPO by 27.8% and 35.5% respectively, as well as a 50.3% win rate on Arena-Hard, surpassing Iterative DPO by 26.6%. Furthermore, ARIES consistently enhances performance on mathematical reasoning tasks like GSM8K and MATH.
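A hedged sketch of the inference-time loop described above: the model answers, then repeatedly refines its own previous response, and the candidates are filtered by a scoring function (a reward model or a rule-based selector). The callables `generate`, `refine_prompt`, and `score` are hypothetical stand-ins, not the actual ARIES components.

```python
from typing import Callable, List, Tuple

def self_refine(
    question: str,
    generate: Callable[[str], str],            # hypothetical: LLM call, prompt -> response
    refine_prompt: Callable[[str, str], str],  # hypothetical: builds a "refine your answer" prompt
    score: Callable[[str, str], float],        # hypothetical: reward-model or rule-based score
    num_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Generate an answer, iteratively refine it, and keep the best candidate."""
    candidates = [generate(question)]
    for _ in range(num_rounds):
        # Ask the model to improve its most recent response.
        candidates.append(generate(refine_prompt(question, candidates[-1])))
    # Filter with the scoring mechanism (reward model or rule-based selection).
    best = max(candidates, key=lambda resp: score(question, resp))
    return best, candidates
```

In the training loop described in the abstract, the filtered candidates would then form the preference data for the next round of preference optimization.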
A New Approach to Solving SMAC Task: Generating Decision Tree Code from Large Language Models
Deng, Yue, Ma, Weiyu, Fan, Yuxin, Zhang, Yin, Zhang, Haifeng, Zhao, Jian
StarCraft Multi-Agent Challenge (SMAC) is one of the most commonly used experimental environments in multi-agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for up to 1 million steps to train a model, and the resulting policies are typically non-interpretable and transfer poorly. In this paper, we propose a novel approach to solving SMAC tasks called LLM-SMAC. In our framework, agents leverage large language models (LLMs) to generate decision tree code from task descriptions. The model then refines the generated code through self-reflection, using feedback from the rewards provided by the environment. We conduct experiments in SMAC and demonstrate that our method can produce high-quality, interpretable decision trees with minimal environmental exploration. Moreover, these models exhibit strong transferability, successfully applying to similar SMAC environments without modification. We believe this approach offers a new direction for solving decision-making tasks in the future.
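A rough sketch of the generate-and-reflect loop suggested by the abstract: an LLM writes decision-tree code from a task description, the code is evaluated in the environment, and the episode reward is fed back so the LLM can revise its program. Both callables (`query_llm`, `run_episode`) are placeholders, not the paper's actual interface.

```python
from typing import Callable

def llm_decision_tree_loop(
    task_description: str,
    query_llm: Callable[[str], str],      # placeholder: prompt -> decision-tree code
    run_episode: Callable[[str], float],  # placeholder: code -> episode reward from SMAC
    iterations: int = 5,
) -> str:
    """Iteratively generate and self-reflect on decision-tree code for an SMAC task."""
    prompt = (
        "Write a Python decision tree controlling allied units for this task:\n"
        + task_description
    )
    best_code, best_reward = "", float("-inf")
    for _ in range(iterations):
        code = query_llm(prompt)
        reward = run_episode(code)        # minimal environment interaction per iteration
        if reward > best_reward:
            best_code, best_reward = code, reward
        # Self-reflection: append the environment feedback and ask for a revision.
        prompt += (
            f"\n\nYour previous program obtained reward {reward:.2f}. "
            "Reflect on its weaknesses and return an improved decision tree."
        )
    return best_code
```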
Efficient Reinforcement Learning with Large Language Model Priors
Yan, Xue, Song, Yan, Feng, Xidong, Yang, Mengyue, Zhang, Haifeng, Ammar, Haitham Bou, Wang, Jun
In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios. Traditional approaches to SDM, such as optimal control (Garcia et al., 1989), heuristic search (Świechowski et al., 2023) and reinforcement learning (RL) (Mnih, 2013), have seen substantial success. Notably, AlphaGo (Silver et al., 2016) and AlphaStar (Vinyals et al., 2019), both based on deep reinforcement learning (DRL), have achieved human-level proficiency in the games of Go and StarCraft II, respectively. However, these methods still suffer from high computational complexity, along with poor generalizability and limited applicability across diverse domains (Dulac-Arnold et al., 2015; Cobbe et al., 2019). Recently, Large Language Models (LLMs) have emerged as effective tools for tackling diverse general-purpose tasks, such as in dialogue systems (Brooks et al., 2023), decision-making (Zhao et al., 2024a), and mathematical reasoning (Imani et al., 2023).
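A small numerical sketch of the core idea: treat the LLM's per-action probabilities as a prior and combine them with learned Q-values via posterior sampling, so that actions the LLM deems plausible are explored first. The softmax-posterior form below is one common instantiation chosen for illustration, not necessarily the paper's exact update.

```python
from typing import Optional

import numpy as np

def posterior_action_sample(
    llm_prior: np.ndarray,   # LLM's probability over a discrete action set, sums to 1
    q_values: np.ndarray,    # learned Q(s, a) for the current state
    temperature: float = 1.0,
    rng: Optional[np.random.Generator] = None,
) -> int:
    """Sample an action from a posterior proportional to prior(a) * exp(Q(s, a) / T)."""
    rng = rng or np.random.default_rng()
    logits = np.log(llm_prior + 1e-12) + q_values / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Example: the LLM prior strongly favours action 2, the Q-values mildly favour action 0;
# the posterior trades the two sources of evidence off against each other.
prior = np.array([0.2, 0.1, 0.6, 0.1])
q = np.array([1.0, 0.2, 0.8, 0.1])
print(posterior_action_sample(prior, q))
```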
Token-level Direct Preference Optimization
Zeng, Yongcheng, Liu, Guoqing, Ma, Weiyu, Yang, Ning, Zhang, Haifeng, Wang, Jun
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, in a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO on controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
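A simplified PyTorch sketch in the spirit of the token-level objective: a DPO-style preference margin built from per-token log-ratios, regularized by a sequential (summed per-token) forward KL between the reference and current policies. This is an illustrative reading of the abstract; the weighting and stop-gradient placement are assumptions made for clarity, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def seq_logratio_and_forward_kl(logits, ref_logits, ids):
    """Summed per-token log-ratio of `ids` (policy vs. reference) and the
    summed per-token forward KL(ref || policy) along the sequence."""
    logp = F.log_softmax(logits, dim=-1)          # (B, T, V)
    ref_logp = F.log_softmax(ref_logits, dim=-1)  # (B, T, V)
    token_logp = logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1)          # (B, T)
    ref_token_logp = ref_logp.gather(-1, ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    log_ratio = (token_logp - ref_token_logp).sum(-1)                    # (B,)
    forward_kl = (ref_logp.exp() * (ref_logp - logp)).sum(-1).sum(-1)    # (B,)
    return log_ratio, forward_kl

def token_level_pref_loss(pol_w, ref_w, ids_w, pol_l, ref_l, ids_l,
                          beta=0.1, alpha=0.5):
    """Preference loss with a token-level forward-KL correction (illustrative sketch)."""
    ratio_w, kl_w = seq_logratio_and_forward_kl(pol_w, ref_w, ids_w)  # chosen response
    ratio_l, kl_l = seq_logratio_and_forward_kl(pol_l, ref_l, ids_l)  # rejected response
    margin = beta * (ratio_w - ratio_l) - alpha * beta * (kl_l - kl_w.detach())
    return -F.logsigmoid(margin).mean()
```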
Learning Macroeconomic Policies based on Microfoundations: A Dynamic Stackelberg Mean Field Game Approach
Mi, Qirui, Zhao, Zhiyu, Xia, Siyu, Song, Yan, Wang, Jun, Zhang, Haifeng
The Lucas critique emphasizes the importance of considering the impact of policy changes on the expectations of micro-level agents in macroeconomic policymaking. However, the inherently self-interested nature of large-scale micro-agents, who pursue long-term benefits, complicates the formulation of optimal macroeconomic policies. This paper proposes a novel general framework named Dynamic Stackelberg Mean Field Games (Dynamic SMFG) to model such policymaking within sequential decision-making processes, with the government as the leader and households as dynamic followers. Dynamic SMFGs capture the dynamic interactions among large-scale households and their responses to macroeconomic policy changes. To solve dynamic SMFGs, we propose the Stackelberg Mean Field Reinforcement Learning (SMFRL) algorithm, which leverages the population distribution of followers to represent high-dimensional joint state and action spaces. In experiments, our method surpasses real-world macroeconomic policies as well as existing AI-based and economic methods. It allows the leader to approach the social optimum with the highest performance, while large-scale followers converge toward their best responses to the leader's policy. Moreover, we demonstrate that our approach remains effective even when some households do not adopt the SMFG policy. In summary, this paper contributes to the field of AI for economics by offering an effective tool for modeling and solving macroeconomic policy-making issues.
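A small PyTorch sketch of the representational trick mentioned above: instead of conditioning on the full joint state-action of every household, a leader-side critic conditions on a fixed-size, permutation-invariant summary of the follower population. The network shape and encoder are assumptions for illustration, not the SMFRL architecture itself.

```python
import torch
import torch.nn as nn

class MeanFieldLeaderCritic(nn.Module):
    """Leader critic that replaces the high-dimensional joint follower
    state/action with a fixed-size population-distribution summary."""

    def __init__(self, leader_obs_dim=8, leader_act_dim=4,
                 follower_dim=6, dist_dim=32, hidden=128):
        super().__init__()
        # Encode each follower's (state, action) pair, then average, so the
        # summary has a fixed size regardless of the population size.
        self.follower_enc = nn.Linear(follower_dim, dist_dim)
        self.net = nn.Sequential(
            nn.Linear(leader_obs_dim + leader_act_dim + dist_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, leader_obs, leader_act, follower_sa):
        # follower_sa: (N, follower_dim) for an arbitrary number N of households.
        pop = self.follower_enc(follower_sa).mean(dim=0)  # permutation-invariant summary
        return self.net(torch.cat([leader_obs, leader_act, pop], dim=-1))

# Works for any population size without changing the network.
critic = MeanFieldLeaderCritic()
q = critic(torch.randn(8), torch.randn(4), torch.randn(1000, 6))
print(q.shape)  # torch.Size([1])
```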
Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf
Jin, Xuanfa, Wang, Ziyan, Du, Yali, Fang, Meng, Zhang, Haifeng, Wang, Jun
Communication is a fundamental aspect of human society, facilitating the exchange of information and beliefs among people. Despite the advancements in large language models (LLMs), recent agents built on them often neglect control over discussion tactics, which are essential in communication scenarios and games. As a variant of the famous communication game Werewolf, One Night Ultimate Werewolf (ONUW) requires players to develop strategic discussion policies due to the potential role changes that increase the uncertainty and complexity of the game. In this work, we first establish the existence of Perfect Bayesian Equilibria (PBEs) in two scenarios of the ONUW game: one with discussion and one without. The results show that discussion greatly changes players' utilities by affecting their beliefs, emphasizing the significance of discussion tactics. Based on the insights obtained from these analyses, we propose an RL-instructed language agent framework, where a discussion policy trained by reinforcement learning (RL) is employed to determine appropriate discussion tactics to adopt. Our experimental results on several ONUW game settings demonstrate the effectiveness and generalizability of our proposed framework.
AI-Olympics: Exploring the Generalization of Agents through Open Competitions
Wang, Chen, Song, Yan, Wu, Shuai, Wu, Sa, Zhang, Ruizhi, Lin, Shu, Zhang, Haifeng
Between 2021 and 2023, AI-Olympics, a series of online AI competitions, was hosted by the online evaluation platform Jidi in collaboration with the IJCAI committee. In these competitions, an agent is required to accomplish diverse sports tasks in a two-dimensional continuous world while competing against an opponent. This paper provides a brief overview of the competition series and highlights notable findings. We aim to contribute insights to the field of multi-agent decision-making and to explore the generalization of agents through engineering efforts.
Correlated Mean Field Imitation Learning
Zhao, Zhiyu, Yang, Ning, Yan, Xue, Zhang, Haifeng, Wang, Jun, Yang, Yaodong
We investigate multi-agent imitation learning (IL) within the framework of mean field games (MFGs), considering the presence of time-varying correlated signals. Existing MFG IL algorithms assume demonstrations are sampled from Mean Field Nash Equilibria (MFNE), limiting their adaptability to real-world scenarios. For example, in a traffic network equilibrium influenced by public routing recommendations, the recommendations introduce time-varying correlated signals into the game that are not captured by MFNE or other existing correlated equilibrium concepts. To address this gap, we propose Adaptive Mean Field Correlated Equilibrium (AMFCE), a general equilibrium concept incorporating time-varying correlated signals. We establish the existence of AMFCE under mild conditions and prove that MFNE is a subclass of AMFCE. We further propose Correlated Mean Field Imitation Learning (CMFIL), a novel IL framework designed to recover the AMFCE, accompanied by a theoretical guarantee on the quality of the recovered policy. Experimental results, including a real-world traffic flow prediction problem, demonstrate the superiority of CMFIL over state-of-the-art IL baselines, highlighting the potential of CMFIL in understanding large-population behavior under correlated signals.