Fu, Haobo
Optimizing Latent Goal by Learning from Trajectory Preference
Zhao, Guangyu, Lian, Kewei, Lin, Haowei, Fu, Haobo, Fu, Qiang, Cai, Shaofei, Wang, Zihao, Liang, Yitao
Recently, pre-training foundation policies in open-world environments with web-scale unlabeled datasets has become an increasingly popular trend in the domain of sequential control (Baker et al., 2022; Brohan et al., 2023a; Collaboration et al., 2024; Yang et al., 2023; Zhang et al., 2022). These foundation policies possess broad world knowledge that can be transferred to downstream tasks. Within foundation policies, there exists a category known as goal-conditioned policies, which can process input goals (instructions) and execute the corresponding tasks (Chane-Sane et al., 2021; Ding et al., 2019). The goal can come in different modalities, such as text instructions (Lifshitz et al., 2024), video demonstrations (Cai et al., 2023b), or multi-modal instructions (Brohan et al., 2023a,b; Cai et al., 2024). However, much like large language models, these instruction-following policies are highly sensitive to the choice of "prompt" (Kim et al., 2024; Lifshitz et al., 2024; Wang et al., 2023a,b). Researchers rely on manual trial and error to find a good prompt, and prompt quality does not always align with human judgment. For instance, OpenVLA (Kim et al., 2024) shows a large performance gap when prompted with "Pepsi can" versus "Pepsi"; for the same task of collecting wood logs, GROOT's performance varies significantly depending on the reference video used. Moreover, it is unclear whether an agent's failure to complete a task stems from the foundation policy's inherent limitations or from the lack of a suitable prompt. A common viewpoint in the LLM community is that most abilities are learned during the pre-training phase (Ouyang et al., 2022; Zhao et al., 2023a), while post-training is a method to
Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning
Yang, Hanlin, Yao, Jian, Liu, Weiming, Wang, Qing, Qin, Hanmin, Kong, Hansheng, Tang, Kirk, Xiong, Jiechao, Yu, Chao, Li, Kai, Xing, Junliang, Chen, Hongwu, Zhuo, Juchao, Fu, Qiang, Wei, Yang, Fu, Haobo
Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous methods for recovering diverse policies usually employ a vanilla behavioral cloning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on the observation that, in many scenarios, behavioral styles are often highly relevant to only a subset of state-action pairs, this paper presents a new principled method for diverse policy recovery. In particular, after inferring or assigning a latent style for a trajectory, we enhance vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects how much each state-action pair contributes to learning the style, allowing our method to focus on the state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse policies from expert data.
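As a rough illustration of the idea (a sketch, not the paper's exact objective), the weighted behavioral-cloning loss for state-action pairs with latent style z could take the form

    w(s, a; z) = \mathrm{PMI}\big((s, a); z\big) = \log \frac{p(s, a \mid z)}{p(s, a)}, \qquad
    \mathcal{L}(\theta) = -\,\mathbb{E}_{(s, a, z)}\big[\, w(s, a; z)\, \log \pi_\theta(a \mid s, z) \,\big],

so that state-action pairs carrying little information about the style contribute little to the gradient, while style-defining pairs dominate the imitation signal.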
Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent
Xu, Hang, Li, Kai, Liu, Bingyun, Fu, Haobo, Fu, Qiang, Xing, Junliang, Cheng, Jian
Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. It decomposes the total regret into counterfactual regrets, utilizing local regret minimization algorithms, such as Regret Matching (RM) or RM+, to minimize them. Recent research establishes a connection between Online Mirror Descent (OMD) and RM+, paving the way for an optimistic variant PRM+ and its extension PCFR+. However, PCFR+ assigns uniform weights for each iteration when determining regrets, leading to substantial regrets when facing dominated actions. This work explores minimizing weighted counterfactual regret with optimistic OMD, resulting in a novel CFR variant PDCFR+. It integrates PCFR+ and Discounted CFR (DCFR) in a principled manner, swiftly mitigating negative effects of dominated actions and consistently leveraging predictions to accelerate convergence. Theoretical analyses prove that PDCFR+ converges to a Nash equilibrium, particularly under distinct weighting schemes for regrets and average strategies. Experimental results demonstrate PDCFR+'s fast convergence in common imperfect-information games. The code is available at https://github.com/rpSebastian/PDCFRPlus.
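To make the combination of prediction and discounting concrete, below is a minimal per-information-set sketch; the function names, default exponent, and exact weighting scheme are illustrative assumptions rather than the paper's precise update rules.

import numpy as np

def predicted_strategy(cum_regret, prediction):
    # Optimistic regret matching: act on cumulative regret plus a prediction
    # of the next instantaneous regret (e.g. the most recent one observed).
    positive = np.maximum(cum_regret + prediction, 0.0)
    total = positive.sum()
    n = len(cum_regret)
    return positive / total if total > 0 else np.full(n, 1.0 / n)

def discounted_regret_update(cum_regret, inst_regret, t, alpha=2.3):
    # Discounted RM+-style accumulation: regret from earlier iterations is
    # down-weighted, which quickly erases regret attributed to dominated actions.
    discount = t**alpha / (t**alpha + 1.0)
    return np.maximum(cum_regret * discount + inst_regret, 0.0)

In a full CFR implementation these local updates run at every information set, and the average strategy is additionally accumulated with its own (possibly different) iteration weights.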
Reaching Consensus in Cooperative Multi-Agent Reinforcement Learning with Goal Imagination
Wang, Liangzhou, Zhu, Kaiwen, Zhu, Fengming, Yao, Xinghu, Zhang, Shujie, Ye, Deheng, Fu, Haobo, Fu, Qiang, Yang, Wei
Reaching consensus is key to multi-agent coordination. To accomplish a cooperative task, agents need to coherently select optimal joint actions to maximize the team reward. However, current cooperative multi-agent reinforcement learning (MARL) methods usually do not explicitly take consensus into consideration, which may cause miscoordination problems. In this paper, we propose a model-based consensus mechanism to explicitly coordinate multiple agents. The proposed Multi-agent Goal Imagination (MAGI) framework guides agents to reach consensus with an imagined common goal. The common goal is an achievable state with high value, obtained by sampling from the distribution of future states. We directly model this distribution with a self-supervised generative model, thus alleviating the "curse of dimensionality" induced by the multi-agent multi-step policy rollouts commonly used in model-based methods. We show that such an efficient consensus mechanism can guide all agents toward cooperatively reaching valuable future states. Results on the Multi-Agent Particle Environment and the Google Research Football environment demonstrate the superiority of MAGI in both sample efficiency and performance.
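A minimal sketch of the goal-imagination step, assuming a hypothetical generative_model.sample interface and a learned value function value_fn (both names are illustrative, not the paper's API):

import torch

def imagine_common_goal(generative_model, value_fn, obs, num_candidates=64):
    # Sample candidate future states from the learned generative model,
    # then pick the highest-value candidate as the shared goal.
    candidates = generative_model.sample(obs, num_candidates)   # (N, state_dim)
    values = value_fn(candidates)                               # (N,)
    return candidates[torch.argmax(values)]

The imagined goal can then be shared across agents, e.g. as an additional conditioning input or as a shaping signal for reaching states close to it.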
Enhance Reasoning for Large Language Models in the Game Werewolf
Wu, Shuang, Zhu, Liwen, Yang, Tao, Xu, Shiwei, Fu, Qiang, Wei, Yang, Fu, Haobo
This paper presents an innovative framework that integrates Large Language Models (LLMs) with an external Thinker module to enhance the reasoning capabilities of LLM-based agents. Unlike augmenting LLMs with prompt engineering, the Thinker directly harnesses knowledge from databases and employs various optimization techniques. The framework forms a reasoning hierarchy in which LLMs handle intuitive System-1 tasks such as natural language processing, while the Thinker focuses on cognitive System-2 tasks that require complex logical analysis and domain-specific knowledge. Our framework is demonstrated with a 9-player Werewolf game that demands dual-system reasoning. We introduce a communication protocol between LLMs and the Thinker, and train the Thinker using data from 18,800 human sessions and reinforcement learning. Experiments demonstrate the framework's effectiveness in deductive reasoning, speech generation, and online game evaluation. Additionally, we fine-tune a 6B LLM that surpasses GPT-4 when integrated with the Thinker. This paper also contributes the largest dataset for social deduction games to date.
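Schematically, the division of labor could look like the loop below, where the interfaces (parse_to_state, decide, verbalize) are placeholders standing in for the paper's communication protocol rather than its real API:

def play_turn(llm, thinker, game_history, utterances):
    # System 1: the LLM turns free-form speech into a structured game state.
    state = llm.parse_to_state(game_history, utterances)
    # System 2: the Thinker performs deductive reasoning over the structured
    # state (trained on human sessions and with reinforcement learning).
    intent = thinker.decide(state)
    # System 1 again: the LLM verbalizes the Thinker's intent as natural speech.
    return llm.verbalize(state, intent)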
Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing
He, Jinmin, Li, Kai, Zang, Yifan, Fu, Haobo, Fu, Qiang, Xing, Junliang, Cheng, Jian
Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks with varying difficulties commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns strategic skipping of certain intermediate modules, thereby flexibly choosing different numbers of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotics manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.
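An illustrative sketch (not the authors' implementation) of depth routing with per-task skip gates over a stack of shared modules:

import torch
import torch.nn as nn

class DepthRoutedPolicy(nn.Module):
    # Sketch: each task has a gate per shared block deciding whether the block
    # is applied or skipped, so different tasks use different effective depths.
    def __init__(self, num_tasks, num_modules, hidden_dim, action_dim):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
            for _ in range(num_modules)
        )
        # One routing logit per (task, module); in practice the router is a
        # learned network trained jointly with the policy.
        self.route_logits = nn.Parameter(torch.zeros(num_tasks, num_modules))
        self.head = nn.Linear(hidden_dim, action_dim)

    def forward(self, h, task_id):
        gates = torch.sigmoid(self.route_logits[task_id])  # soft skip decisions
        for gate, block in zip(gates, self.blocks):
            h = gate * block(h) + (1.0 - gate) * h          # pass through when the gate is near 0
        return self.head(h)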
Diversity from Human Feedback
Wang, Ren-Jian, Xue, Ke, Wang, Yutong, Yang, Peng, Fu, Haobo, Fu, Qiang, Qian, Chao
Diversity plays a significant role in many problems, such as ensemble learning, reinforcement learning, and combinatorial optimization. How to define the diversity measure is a longstanding problem. Many methods rely on expert experience to define a proper behavior space and then obtain the diversity measure, which is, however, challenging in many scenarios. In this paper, we propose the problem of learning a behavior space from human feedback and present a general method called Diversity from Human Feedback (DivHF) to solve it. DivHF learns a behavior descriptor consistent with human preference by querying human feedback. The learned behavior descriptor can be combined with any distance measure to define a diversity measure. We demonstrate the effectiveness of DivHF by integrating it with the Quality-Diversity optimization algorithm MAP-Elites and conducting experiments on the QDax suite. The results show that DivHF learns a behavior space that aligns better with human requirements compared to direct data-driven approaches and leads to more diverse solutions under human preference. Our contributions include formulating the problem, proposing the DivHF method, and demonstrating its effectiveness through experiments.
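One way such preference learning could be instantiated (a sketch only; the paper's actual query format may differ) is a Bradley-Terry model over descriptor distances: shown a reference solution \tau_0 and two candidates \tau_1, \tau_2, the human indicates which candidate behaves more differently from the reference, and the descriptor f_\phi is trained so that

    P_\phi(\tau_1 \succ \tau_2) = \frac{\exp\big( d(f_\phi(\tau_1), f_\phi(\tau_0)) \big)}{\exp\big( d(f_\phi(\tau_1), f_\phi(\tau_0)) \big) + \exp\big( d(f_\phi(\tau_2), f_\phi(\tau_0)) \big)},

minimized with a cross-entropy loss against the human labels. Any distance d in the learned descriptor space then yields a diversity measure usable by MAP-Elites.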
Policy Space Diversity for Non-Transitive Games
Yao, Jian, Liu, Weiming, Fu, Haobo, Yang, Yaodong, McAleer, Stephen, Fu, Qiang, Yang, Wei
Policy-Space Response Oracles (PSRO) is an influential algorithmic framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have attempted to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse population (according to those metrics) does not necessarily mean, as we prove in this paper, a better approximation to an NE. To alleviate this problem, we propose a new diversity metric whose improvement guarantees a better approximation to an NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best-response solving in PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on various games demonstrate that PSD-PSRO is more effective at producing significantly less exploitable policies than state-of-the-art PSRO variants.
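Schematically, the diversity-regularized best-response step can be written as

    \pi_i^{\text{new}} \in \arg\max_{\pi}\; \mathbb{E}_{\pi_{-i} \sim \sigma_{-i}}\big[ u_i(\pi, \pi_{-i}) \big] \;+\; \lambda\, D\big(\pi, \Pi_i\big),

where \Pi_i is player i's current population, \sigma_{-i} is the opponents' meta-strategy, and D stands in for the paper's policy-space diversity term with weight \lambda; the specific form of D is the paper's contribution and is only abstracted here.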
Maximum Entropy Heterogeneous-Agent Reinforcement Learning
Liu, Jiarong, Zhong, Yifan, Hu, Siyi, Fu, Haobo, Fu, Qiang, Chang, Xiaojun, Yang, Yaodong
Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash Equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove that HASAC enjoys monotonic improvement and convergence to a quantal response equilibrium (QRE). Furthermore, we generalize this approach into a unified template for MaxEnt algorithmic design, named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration. See our project page at \url{https://sites.google.com/view/meharl}.
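For n agents, the underlying MaxEnt objective takes the standard form (written here with per-agent entropy bonuses; the paper's exact conditioning and notation may differ slightly):

    J(\boldsymbol{\pi}) = \mathbb{E}_{\boldsymbol{\pi}}\Big[ \sum_{t} \gamma^{t} \Big( r(s_t, \boldsymbol{a}_t) + \alpha \sum_{i=1}^{n} \mathcal{H}\big( \pi^{i}(\cdot \mid o^{i}_t) \big) \Big) \Big],

where the entropy terms encourage stochastic, exploratory per-agent policies while the shared reward drives cooperation.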
L2E: Learning to Exploit Your Opponent
Wu, Zhe, Li, Kai, Zhao, Enmin, Xu, Hang, Zhang, Meng, Fu, Haobo, An, Bo, Xing, Junliang
Opponent modeling is essential for exploiting sub-optimal opponents in strategic interactions. Most previous works focus on building explicit models to directly predict the opponents' styles or strategies, which require a large amount of training data and lack adaptability to unknown opponents. In this work, we propose a novel Learning to Exploit (L2E) framework for implicit opponent modeling. L2E acquires the ability to exploit opponents through a few interactions with different opponents during training, and can thus quickly adapt to new opponents with unknown styles during testing. We also propose a novel opponent strategy generation algorithm that automatically produces effective opponents for training. We evaluate L2E on two poker games and one grid soccer game, commonly used benchmarks for opponent modeling. Comprehensive experimental results indicate that L2E quickly adapts to diverse styles of unknown opponents.
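A minimal sketch of this training recipe, in the spirit of adapt-then-meta-update; the interfaces and names below are illustrative assumptions, not the released code:

def l2e_training_step(policy, opponent_generator, env, adapt_steps=3):
    # Sample a training opponent produced by the opponent strategy
    # generation procedure.
    opponent = opponent_generator.sample()
    # Few-shot adaptation: a handful of interactions against this opponent.
    adapted = policy.clone()
    for _ in range(adapt_steps):
        trajectories = env.rollout(adapted, opponent)
        adapted.inner_update(trajectories)
    # Meta-update: improve the initial policy so that this quick adaptation
    # exploits many different opponents well.
    eval_trajectories = env.rollout(adapted, opponent)
    policy.meta_update(eval_trajectories)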