Markov Models
Multi-agent cooperation through learning-aware policy gradients
Meulemans, Alexander, Kobayashi, Seijin, von Oswald, Johannes, Scherrer, Nino, Elmoznino, Eric, Richards, Blake, Lajoie, Guillaume, Arcas, Blaise Agüera y, Sacramento, João
Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.
Hierarchical Multi-agent Reinforcement Learning for Cyber Network Defense
Singh, Aditya Vikram, Rathbun, Ethan, Graham, Emma, Oakley, Lisa, Boboila, Simona, Oprea, Alina, Chin, Peter
Recent advances in multi-agent reinforcement learning (MARL) have created opportunities to solve complex real-world tasks. Cybersecurity is a notable application area, where defending networks against sophisticated adversaries remains a challenging task typically performed by teams of security operators. In this work, we explore novel MARL strategies for building autonomous cyber network defenses that address challenges such as large policy spaces, partial observability, and stealthy, deceptive adversarial strategies. To facilitate efficient and generalized learning, we propose a hierarchical Proximal Policy Optimization (PPO) architecture that decomposes the cyber defense task into specific sub-tasks like network investigation and host recovery. Our approach involves training sub-policies for each sub-task using PPO enhanced with domain expertise. These sub-policies are then leveraged by a master defense policy that coordinates their selection to solve complex network defense tasks. Furthermore, the sub-policies can be fine-tuned and transferred with minimal cost to defend against shifts in adversarial behavior or changes in network settings. We conduct extensive experiments using CybORG Cage 4, the state-of-the-art MARL environment for cyber defense. Comparisons with multiple baselines across different adversaries show that our hierarchical learning approach achieves top performance in terms of convergence speed, episodic return, and several interpretable metrics relevant to cybersecurity, including the fraction of clean machines on the network, precision, and false positives on recoveries.
Leveraging Graph Neural Networks and Multi-Agent Reinforcement Learning for Inventory Control in Supply Chains
Kotecha, Niki, Chanona, Antonio del Rio
Inventory control in modern supply chains has attracted significant attention due to the increasing number of disruptive shocks and the challenges posed by complex dynamics, uncertainties, and limited collaboration. Traditional methods, which often rely on static parameters, struggle to adapt to changing environments. This paper proposes a Multi-Agent Reinforcement Learning (MARL) framework with Graph Neural Networks (GNNs) for state representation to address these limitations. Our approach redefines the action space by parameterizing heuristic inventory control policies, making it adaptive as the parameters dynamically adjust based on system conditions. By leveraging the inherent graph structure of supply chains, our framework enables agents to learn the system's topology, and we employ a centralized learning, decentralized execution scheme that allows agents to learn collaboratively while overcoming information-sharing constraints. Additionally, we incorporate global mean pooling and regularization techniques to enhance performance. We test the capabilities of our proposed approach on four different supply chain configurations and conduct a sensitivity analysis. This work paves the way for utilizing MARL-GNN frameworks to improve inventory management in complex, decentralized supply chain environments.
Tuning-free coreset Markov chain Monte Carlo
Chen, Naitong, Huggins, Jonathan H., Campbell, Trevor
A Bayesian coreset is a small, weighted subset of a data set that replaces the full data during inference to reduce computational cost. The state-of-the-art coreset construction algorithm, Coreset Markov chain Monte Carlo (Coreset MCMC), uses draws from an adaptive Markov chain targeting the coreset posterior to train the coreset weights via stochastic gradient optimization. However, the quality of the constructed coreset, and thus the quality of its posterior approximation, is sensitive to the stochastic optimization learning rate. In this work, we propose a learning-rate-free stochastic gradient optimization procedure, Hot-start Distance over Gradient (Hot DoG), for training coreset weights in Coreset MCMC without user tuning effort. Empirical results demonstrate that Hot DoG provides higher quality posterior approximations than other learning-rate-free stochastic gradient methods, and performs competitively to optimally-tuned ADAM. Figure 1: Relative Coreset MCMC posterior approximation error (average squared coordinate-wise z-score) using ADAM with different learning rates versus the proposed Hot DoG method (with fixed r = 0.001). Median values after 200,000 optimization iterations across 10 trials are used for the relative comparison for a variety of datasets, models, and coreset sizes.
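To make the idea of a weighted subset standing in for the full data concrete, here is a minimal sketch of a coreset log-likelihood. It assumes a unit-variance Gaussian model and fixed uniform weights N/M. The data, the model, and all function names are hypothetical illustrations, and the sketch implements neither Coreset MCMC nor Hot DoG, which train the weights rather than fixing them.

```python
import math


def gaussian_loglik(x, theta):
    """Log-density of one observation x under N(theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2


def full_loglik(data, theta):
    """Log-likelihood of the full data set: one term per observation."""
    return sum(gaussian_loglik(x, theta) for x in data)


def coreset_loglik(points, weights, theta):
    """Weighted log-likelihood over the coreset points only."""
    return sum(w * gaussian_loglik(x, theta) for x, w in zip(points, weights))


# Hypothetical toy data set with N = 100 observations.
data = [0.1 * i for i in range(100)]

# Coreset of M = 10 points with untrained uniform weights N / M (assumption).
core_points = data[::10]
core_weights = [len(data) / len(core_points)] * len(core_points)

theta = 4.0
full = full_loglik(data, theta)
approx = coreset_loglik(core_points, core_weights, theta)
print(f"full: {full:.2f}  coreset: {approx:.2f}")
```

The coreset evaluation touches 10 terms instead of 100. How closely it tracks the full log-likelihood depends on the weights, which is the quantity the abstract describes as being trained by stochastic gradient optimization.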
Learning to Look: Seeking Information for Decision Making via Policy Factorization
Dass, Shivin, Hu, Jiaheng, Abbatematteo, Ben, Stone, Peter, Martín-Martín, Roberto
Intelligent decisions can only be made based on the right information. When operating in the environment, an intelligent agent actively seeks the information that enables it to select the right actions and proceeds with the task only when it is confident enough. For example, when following a video recipe, a chef would look at the TV to obtain information about the next ingredient to grasp, and later look at a timer to decide when to turn off the stove. In contrast, current learning robots assume that the information needed for manipulation is readily available in their sensor signals (e.g., from a stationary camera looking at a tabletop manipulation setting) or rely on a given low-dimensional state representation predefined by a human (e.g., object pose) that also has to provide the means for the robot to perceive it. In this work, our goal is to endow robots with the capabilities to learn to perform information-seeking actions to find the information that enables manipulation, using as supervision the quality of the informed actions and switching between active perception and manipulation only based on the uncertainty about what manipulation action should come next. Performing actions to reveal information has been previously explored in the subfields of active and interactive perception. In active perception [1, 2, 3], an agent changes the parameters of its sensors (e.g., camera pose [4, 5, 6] or parameters [7, 8, 9]) to infer information such as object pose, shape, or material. Interactive perception [10] solutions go one step further and enable agents to change the state of the environment to create information-rich signals to perceive kinematics [11, 12], material [13], or other properties [14, 15, 16, 17].
SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment
Garrett, Caelan, Mandlekar, Ajay, Wen, Bowen, Fox, Dieter
Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion. We also propose a Hybrid Skill Policy (HSP) framework for learning skill initiation, control, and termination components from SkillGen datasets, enabling skills to be sequenced using motion planning at test-time. We demonstrate that SkillGen greatly improves data generation and policy learning performance over a state-of-the-art data generation framework, resulting in the capability to produce data for large scene variations, including clutter, and agents that are on average 24% more successful. We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations, and training proficient, often near-perfect, HSP agents. Finally, we apply SkillGen to 3 real-world manipulation tasks and also demonstrate zero-shot sim-to-real transfer on a long-horizon assembly task. Videos, and more at https://skillgen.github.io.
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Jang, Lawrence, Li, Yinheng, Ding, Charles, Lin, Justin, Liang, Paul Pu, Zhao, Dan, Bonatti, Rogerio, Koishida, Kazuhito
Videos are often used to learn or extract the necessary information to complete tasks in ways different than what text and static imagery alone can provide. However, many existing agent benchmarks neglect long-context video understanding, instead focusing on text or static image inputs. To bridge this gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the capabilities of long-context multimodal agents for video understanding. VideoWA consists of 2,021 web agent tasks based on manually crafted video tutorials, which total almost four hours of content. For our benchmark, we define a taxonomy of long-context video-based agent tasks with two main areas of focus: skill retention and factual retention. While skill retention tasks evaluate whether an agent can use a given human demonstration to complete a task efficiently, the factual retention task evaluates whether an agent can retrieve instruction-relevant information from a video to complete a task. We find that the best model achieves 13.3% success on factual retention tasks and 45.8% on factual retention QA pairs, far below human performance at 73.9% and 79.3%, respectively. On skill retention tasks, long-context models perform worse with tutorials than without, exhibiting a 5% performance decrease in WebArena tasks and a 10.3% decrease in VisualWebArena tasks. Our work highlights the need to improve the agentic abilities of long-context multimodal models and provides a testbed for future development with long-context video agents.
PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control
Effective Traffic Signal Control (TSC) is fundamental to urban traffic management, responsible for guiding the movement of vehicles through intersections by controlling traffic lights. The primary goals of TSC are to minimize traffic congestion, enhance traffic flow, and improve safety for both vehicles and pedestrians. Poor TSC optimization leads to increased congestion, fuel consumption, and pollution. Longer wait times at signals lead to increased fuel consumption, which not only exacerbates environmental issues through higher emissions but also results in economic losses due to delays. Moreover, inefficient TSC negatively impacts the quality of life in urban areas, contributing to increased noise and air pollution.
Markov Chain of Thought for Efficient Mathematical Reasoning
Yang, Wen, Fan, Kai, Liao, Minpeng
Multi-step Chain of Thought (CoT) benefits from the logical structure of the reasoning steps and task-specific actions, significantly enhancing the mathematical reasoning capabilities of large language models. With the growing prevalence of long CoT, the number of reasoning steps exceeds manageable token limits and leads to higher computational demands. Inspired by the fundamental logic of human cognition, "derive, then reduce", we conceptualize the standard multi-step CoT as a novel Markov Chain of Thought (MCoT). In this study, we consider the mathematical reasoning task, defining each reasoning step as text accompanied by a Python code snippet. To facilitate a longer reasoning path, self-correction is enabled through interactions with the code interpreter. Our MCoT aims to compress previous reasoning steps into a simplified question, enabling efficient next-step inference without relying on a lengthy KV cache. In our experiments, we curate the MCoTInstruct dataset, and the empirical results indicate that MCoT not only significantly enhances efficiency but also maintains comparable accuracy. While much remains to be explored, this work paves the way for exploring the long CoT reasoning abilities of LLMs.
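The "derive, then reduce" loop can be illustrated with a toy in which a deterministic arithmetic rewriter stands in for the language model. This is only a sketch of the Markov property described above: each step sees the current simplified question and nothing of the earlier steps. The token format and the function names `derive_then_reduce` and `solve` are hypothetical and are not taken from the paper.

```python
def derive_then_reduce(tokens):
    """One Markov step: derive a single intermediate result, then rewrite
    the question with that result substituted in.

    `tokens` alternates numbers and operator strings, e.g. [2, "+", 3].
    """
    # Handle "*" before "+" and "-" so operator precedence is respected.
    for ops in (("*",), ("+", "-")):
        for i, tok in enumerate(tokens):
            if tok in ops:
                a, b = tokens[i - 1], tokens[i + 1]
                if tok == "*":
                    value = a * b
                elif tok == "+":
                    value = a + b
                else:
                    value = a - b
                # The reduced question replaces "a op b" with its value.
                return tokens[: i - 1] + [value] + tokens[i + 2:]
    return tokens  # nothing left to derive


def solve(tokens):
    """Iterate until the question is reduced to a single number.

    Only the current question is carried forward; no history is kept.
    """
    steps = 0
    while len(tokens) > 1:
        tokens = derive_then_reduce(tokens)
        steps += 1
    return tokens[0], steps


answer, n_steps = solve([2, "+", 3, "*", 4, "-", 5])
print(answer, n_steps)  # 9 3
```

Because the state passed between steps is just the reduced question, the memory needed per step does not grow with the number of steps already taken, which mirrors the motivation for avoiding a lengthy KV cache.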
Incremental Learning of Affordances using Markov Logic Networks
Potter, George, Burghouts, Gertjan, Sijs, Joris
Abstract--Affordances enable robots to have a semantic understanding of their surroundings. [...] Challenges are contradicting formulas and [...] Affordances play an important role in semantic understanding of scenes in robotics. These affordances, first introduced by Gibson [Gibson, 1979], are the potential actions that an object affords to an agent depending on object properties and state, action effects, situational context and agent capabilities. [...] the robot, an object, and the possible interactions between the two [Andries et al., 2018]. These affordances allow the robot to reason about its beliefs of the world in relation to the tasks and actions it may execute within the environment. [...] in partially known environments, these affordances, in combination with reasoning about them, may result in more options for the robot to choose from. As a result affordances increase [...] Markov Logic Networks can solve these problems [Richardson and Domingos, 2006], [Domingos and Lowd, 2019]. A Markov Logic Network (MLN) is a knowledge base of first-order logic formulas with a weight attached [...]. MLNs can compactly represent regularities in the world and allow reasoning over these regularities. The weight of a formula in the knowledge base is a measure of how likely that formula is to occur given [...]. Table I provides an example MLN that consists of three formulas. The formulas do not conflict logically, but semantically seem incorrect when taking into account that each formula is x, y.
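As a hedged illustration of how weighted formulas define a distribution, the sketch below uses the standard MLN form from Richardson and Domingos, in which a world x has probability proportional to exp(sum_i w_i * n_i(x)), with n_i(x) the number of true groundings of formula i. The atoms, the single formula, and its weight of 1.5 are hypothetical and are not taken from the paper's Table I.

```python
import itertools
import math

# Hypothetical ground atoms for a single object (assumption for illustration).
ATOMS = ("Graspable(cup)", "Liftable(cup)")


def implication_holds(world):
    """Ground formula: Graspable(cup) => Liftable(cup)."""
    return (not world["Graspable(cup)"]) or world["Liftable(cup)"]


# Knowledge base: (weight, formula) pairs. The weight 1.5 is an assumed value.
FORMULAS = [(1.5, implication_holds)]


def unnormalized(world):
    """exp of the summed weights of the formulas satisfied in this world.

    Each formula here has exactly one grounding, so n_i(x) is 0 or 1.
    """
    return math.exp(sum(w for w, f in FORMULAS if f(world)))


# Enumerate all truth assignments (possible worlds) over the ground atoms.
worlds = [
    dict(zip(ATOMS, values))
    for values in itertools.product([False, True], repeat=len(ATOMS))
]
Z = sum(unnormalized(w) for w in worlds)  # partition function
probs = [unnormalized(w) / Z for w in worlds]

for world, p in zip(worlds, probs):
    print(world, round(p, 3))
```

The world that violates the formula (graspable but not liftable) is not impossible, only less likely by a factor of exp(1.5) than each world that satisfies it. This soft treatment of formulas is what distinguishes an MLN from a purely logical knowledge base.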