Markov Models
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
Sun, Zeyi, Cao, Yuhang, Liang, Jianze, Sun, Qiushi, Liu, Ziyu, Zhang, Zhixiong, Zang, Yuhang, Dong, Xiaoyi, Chen, Kai, Lin, Dahua, Wang, Jiaqi
Autonomous agents for Graphical User Interfaces (GUIs) face significant challenges in specialized domains such as scientific computing, where both long-horizon planning and precise execution are required. Existing approaches suffer from a trade-off: generalist agents excel at planning but perform poorly in execution, while specialized agents demonstrate the opposite weakness. Recent compositional frameworks attempt to bridge this gap by combining a planner and an actor, but they are typically static and non-trainable, which prevents adaptation from experience. This is a critical limitation given the scarcity of high-quality data in scientific domains. To address these limitations, we introduce CODA, a novel and trainable compositional framework that integrates a generalist planner (Cerebrum) with a specialist executor (Cerebellum), trained via a dedicated two-stage pipeline. In the first stage, Specialization, we apply a decoupled GRPO approach to train an expert planner for each scientific application individually, bootstrapping from a small set of task trajectories. In the second stage, Generalization, we aggregate all successful trajectories from the specialized experts to build a consolidated dataset, which is then used for supervised fine-tuning of the final planner. This equips CODA with both robust execution and cross-domain generalization. Evaluated on four challenging applications from the ScienceBoard benchmark, CODA significantly outperforms baselines and establishes a new state of the art among open-source models.
PoolFlip: A Multi-Agent Reinforcement Learning Security Environment for Cyber Defense
Cadet, Xavier, Boboila, Simona, Dharmawan, Sie Hendrata, Oprea, Alina, Chin, Peter
Cyber defense requires automating defensive decision-making under stealthy, deceptive, and continuously evolving adversarial strategies. The FlipIt game provides a foundational framework for modeling interactions between a defender and an advanced adversary that compromises a system without being immediately detected. In FlipIt, the attacker and defender compete to control a shared resource by performing a Flip action and paying a cost. However, the existing FlipIt frameworks rely on a small number of heuristics or specialized learning techniques, which can lead to brittleness and the inability to adapt to new attacks. To address these limitations, we introduce PoolFlip, a multi-agent gym environment that extends the FlipIt game to allow efficient learning for attackers and defenders. Furthermore, we propose Flip-PSRO, a multi-agent reinforcement learning (MARL) approach that leverages population-based training to train defender agents equipped to generalize against a range of unknown, potentially adaptive opponents. Our empirical results suggest that Flip-PSRO defenders are $2\times$ more effective than baselines to generalize to a heuristic attack not exposed in training. In addition, our newly designed ownership-based utility functions ensure that Flip-PSRO defenders maintain a high level of control while optimizing performance.
Hierarchical Decentralized Stochastic Control for Cyber-Physical Systems
Kaza, Kesav, Anantharaman, Ramachandran, Meshram, Rahul
This paper introduces a two-timescale hierarchical decentralized control architecture for Cyber-Physical Systems (CPS). The system consists of a global controller (GC), and N local controllers (LCs). The GC operates at a slower timescale, imposing budget constraints on the actions of LCs, which function at a faster timescale. Applications can be found in energy grid planning, wildfire management, and other decentralized resource allocation problems. We propose and analyze two optimization frameworks for this setting: COpt and FOpt. In COpt, both GC and LCs together optimize infinite-horizon discounted rewards, while in FOpt the LCs optimize finite-horizon episodic rewards, and the GC optimizes infinite-horizon rewards. Although both frameworks share identical reward functions, their differing horizons can lead to different optimal policies. In particular, FOpt grants greater autonomy to LCs by allowing their policies to be determined only by local objectives, unlike COpt. To our knowledge, these frameworks have not been studied in the literature. We establish the formulations, prove the existence of optimal policies, and prove the convergence of their value iteration algorithms. We further show that COpt always achieves a higher value function than FOpt and derive explicit bounds on their difference. Finally, we establish a set of sufficient structural conditions under which the two frameworks become equivalent.
Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games
Contemporary artificial intelligence (AI) development largely centers on rational decision-making, valued for its measurability and suitability for objective evaluation. Y et in real-world contexts, an intelligent agent's decisions are shaped not only by logic but also by deeper influences such as beliefs, values, and preferences. The diversity of human decision-making styles emerges from these differences, highlighting that "style" is an essential but often overlooked dimension of intelligence. This dissertation introduces playstyle as an alternative lens for observing and analyzing the decision-making behavior of intelligent agents, and examines its foundational meaning and historical context from a philosophical perspective. By analyzing how beliefs and values drive intentions and actions, we construct a two-tier framework for style formation: the external interaction loop with the environment and the internal cognitive loop of deliberation. On this basis, we formalize style-related characteristics and propose measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The study focuses on three core research directions: (1) Defining and measuring playstyle, proposing a general playstyle metric based on discretized state spaces, and extending it to quantify strategic diversity and competitive balance; (2) Expressing and generating playstyle, exploring how reinforcement learning and imitation learning can be used to train agents exhibiting specific stylistic tendencies, and introducing a novel approach for human-like style learning and modeling; and (3) Practical applications, analyzing the potential of these techniques in domains such as game design and interactive entertainment. Finally, the dissertation outlines future extensions, including the role of style as a core element in building artificial general intelligence (AGI). By investigating stylistic variation, we aim to rethink autonomy, value expression, and even offer a tangible perspective on the ultimate i philosophical question: What is the soul?
Uncertainty-Resilient Active Intention Recognition for Robotic Assistants
Saborรญo, Juan Carlos, Vinci, Marc, Lima, Oscar, Stock, Sebastian, Niecksch, Lennart, Gรผnther, Martin, Sung, Alexander, Hertzberg, Joachim, Atzmรผller, Martin
-- Purposeful behavior in robotic assistants requires the integration of multiple components and technological advances. Often, the problem is reduced to recognizing explicit prompts, which limits autonomy, or is oversimplified through assumptions such as near-perfect information. We argue that a critical gap remains unaddressed - specifically, the challenge of reasoning about the uncertain outcomes and perception errors inherent to human intention recognition. In response, we present a framework designed to be resilient to uncertainty and sensor noise, integrating real-time sensor data with a combination of planners. Our integrated framework has been successfully tested on a physical robot with promising results. Robotic assistants may be integrated into modern industrial environments, e.g., delivering tools, parts or modules interleaved with tidying the workspace. Such tasks, however, require a combination of robust planning, navigation, grasping, and perception-particularly when explicit commands are not available and the robot must identify and pursue goals, in collaborative spaces shared with people.
HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance
Li, Ziyue, Chang, Yuan, Yu, Gaihong, Le, Xiaoqiu
Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents'decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.
Multi-Component VAE with Gaussian Markov Random Field
Oubari, Fouad, El-Baha, Mohamed, Meunier, Raphael, Dรฉcatoire, Rodrigue, Mougeot, Mathilde
Multi-component datasets with intricate dependencies, like industrial assemblies or multi-modal imaging, challenge current generative modeling techniques. Existing Multi-component Variational AutoEncoders typically rely on simplified aggregation strategies, neglecting critical nuances and consequently compromising structural coherence across generated components. To explicitly address this gap, we introduce the Gaussian Markov Random Field Multi-Component Variational AutoEncoder , a novel generative framework embedding Gaussian Markov Random Fields into both prior and posterior distributions. This design choice explicitly models cross-component relationships, enabling richer representation and faithful reproduction of complex interactions. Empirically, our GMRF MCVAE achieves state-of-the-art performance on a synthetic Copula dataset specifically constructed to evaluate intricate component relationships, demonstrates competitive results on the PolyMNIST benchmark, and significantly enhances structural coherence on the real-world BIKED dataset. Our results indicate that the GMRF MCVAE is especially suited for practical applications demanding robust and realistic modeling of multi-component coherence
Limitations of refinement methods for weak to strong generalization
Somerstep, Seamus, Ritov, Ya'acov, Yurochkin, Mikhail, Maity, Subha, Sun, Yuekai
Standard techniques for aligning large language models (LLMs) utilize human-produced data, which could limit the capability of any aligned LLM to human level. Label refinement and weak training have emerged as promising strategies to address this superalignment problem. In this work, we adopt probabilistic assumptions commonly used to study label refinement and analyze whether refinement can be outperformed by alternative approaches, including computationally intractable oracle methods. We show that both weak training and label refinement suffer from irreducible error, leaving a performance gap between label refinement and the oracle. These results motivate future research into developing alternative methods for weak to strong generalization that synthesize the practicality of label refinement or weak training and the optimality of the oracle procedure.
Efficient Computation of Blackwell Optimal Policies using Rational Functions
Mukherjee, Dibyangshu, Kalyanakrishnan, Shivaram
Markov Decision Problems (MDPs) provide a founda-tional framework for modelling sequential decision-making across diverse domains, guided by optimality criteria such as discounted and average rewards. However, these criteria have inherent limitations: discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks. Despite its theoretical appeal, existing algorithms for computing Blackwell Optimal (BO) policies are computationally expensive or hard to implement. In this paper we describe procedures for computing BO policies using an ordering of rational functions in the vicinity of 1 . We adapt state-of-the-art algorithms for deterministic and general MDPs, replacing numerical evaluations with symbolic operations on rational functions to derive bounds independent of bit complexity. For deterministic MDPs, we give the first strongly polynomial-time algorithms for computing BO policies, and for general MDPs we obtain the first subexponential-time algorithm. We further generalise several policy iteration algorithms, extending the best known upper bounds from the discounted to the Blackwell criterion.
Arnold: a generalist muscle transformer policy
Chiappa, Alberto Silvio, An, Boshi, Simos, Merkourios, Li, Chengkun, Mathis, Alexander
Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.