Markov Models
A note on concentration inequalities for the overlapped batch mean variance estimators for Markov chains
Moulines, Eric, Naumov, Alexey, Samsonov, Sergey
In this paper, we study the concentration properties of quadratic forms associated with Markov chains using the martingale decomposition method introduced by Atchadé and Cattaneo (2014). In particular, we derive concentration inequalities for the overlapped batch mean (OBM) estimators of the asymptotic variance for uniformly geometrically ergodic Markov chains. Our main result provides an explicit control of the $p$-th moment of the difference between the OBM estimator and the asymptotic variance of the Markov chain, with explicit dependence on $p$ and the mixing time of the underlying Markov chain.
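For readers unfamiliar with the OBM estimator, the following is a minimal sketch of its textbook form, assuming a scalar trajectory `xs` of length n and a batch size `b`: the estimate is n·b / ((n − b)(n − b + 1)) times the sum of squared deviations of the n − b + 1 overlapping batch means from the overall mean. The function name `obm_variance` and this particular normalization are assumptions for illustration; the paper may use a slightly different scaling.

```python
def obm_variance(xs, b):
    """Overlapped batch mean estimate of the asymptotic variance.

    xs: sequence of scalar observations f(X_1), ..., f(X_n)
    b:  batch size, with 1 <= b < n
    """
    n = len(xs)
    if not 1 <= b < n:
        raise ValueError("batch size must satisfy 1 <= b < n")
    mean = sum(xs) / n

    # Sum of the first batch; later batches are obtained by sliding the window.
    window = sum(xs[:b])
    total = (window / b - mean) ** 2
    for j in range(1, n - b + 1):
        window += xs[j + b - 1] - xs[j - 1]
        total += (window / b - mean) ** 2

    # Textbook normalization (an assumption about the exact constant used).
    return n * b / ((n - b) * (n - b + 1)) * total
```

In practice `b` is taken to grow with n (for example, proportionally to a fractional power of n), trading bias against variance of the estimate.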
Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning
Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning--where known components are reconfigured to handle new situations--we introduce World Modeling with Compositional Causal Components (WM3C). This novel framework enhances RL generalization by learning and leveraging compositional causal components. Unlike previous approaches focusing on invariant representation learning or meta-learning, WM3C identifies and utilizes causal dynamics among composable elements, facilitating robust adaptation to new tasks. Our approach integrates language as a compositional modality to decompose the latent space into meaningful components and provides theoretical guarantees for their unique identification under mild assumptions. Our practical implementation uses a masked autoencoder with mutual information constraints and adaptive sparsity regularization to capture high-level semantic information and effectively disentangle transition dynamics. Experiments on numerical simulations and real-world robotic manipulation tasks demonstrate that WM3C significantly outperforms existing methods in identifying latent processes, improving policy learning, and generalizing to unseen tasks. Reinforcement learning (RL) has rapidly progressed, driving innovations in domains such as game playing, robotics, and autonomous driving (Silver et al., 2018; Vinyals et al., 2019; Shi et al., 2022; Kiran et al., 2020). Deep reinforcement learning (DRL) methods, including Deep Q-Networks (DQN), Soft Actor-Critic (SAC), and Proximal Policy Optimization (PPO), have addressed various challenges in RL, such as stability in training, exploration in large state spaces, and efficient policy optimization (Haarnoja et al., 2018; Schulman et al., 2017; Mnih et al., 2015; 2016; Fujimoto et al., 2018).
These breakthroughs underscore the pivotal role of DRL in advancing artificial intelligence. Despite these substantial advancements, one of the most pressing issues in DRL is the generalization of learned policies to novel, unseen environments (Gamrian & Goldberg, 2018; Song et al., 2019; Cobbe et al., 2018). For example, a policy that excels at the task "push ball to place A" might perform poorly on the task "push ball to place B".
Credit Assignment and Efficient Exploration based on Influence Scope in Multi-agent Reinforcement Learning
Han, Shuai, Dastani, Mehdi, Wang, Shihan
Training cooperative agents in sparse-reward scenarios poses significant challenges for multi-agent reinforcement learning (MARL). Without clear feedback on actions at each step in sparse-reward settings, previous methods struggle with precise credit assignment among agents and effective exploration. In this paper, we introduce a novel method to deal with both credit assignment and exploration problems in reward-sparse domains. Accordingly, we propose an algorithm that calculates the Influence Scope of Agents (ISA) on states by taking the specific values of the dimensions/attributes of states that can be influenced by individual agents. The mutual dependence between agents' actions and state attributes is then used to calculate the credit assignment and to delimit the exploration space for each individual agent. We then evaluate ISA in a variety of sparse-reward multi-agent scenarios. The results show that our method significantly outperforms the state-of-the-art baselines.
Edge-Optimized Deep Learning & Pattern Recognition Techniques for Non-Intrusive Load Monitoring of Energy Time Series
The growing global energy demand and the urgent need for sustainability call for innovative ways to boost energy efficiency. While advanced energy-saving systems exist, they often fall short without user engagement. Providing feedback on energy consumption behavior is key to promoting sustainable practices. Non-Intrusive Load Monitoring (NILM) offers a promising solution by disaggregating total household energy usage, recorded by a central smart meter, into appliance-level data. This empowers users to optimize consumption. Advances in AI, IoT, and smart meter adoption have further enhanced NILM's potential. Despite this promise, real-world NILM deployment faces major challenges. First, existing datasets mainly represent regions like the USA and UK, leaving places like the Mediterranean underrepresented. This limits understanding of regional consumption patterns, such as heavy use of air conditioners and electric water heaters. Second, deep learning models used in NILM require high computational power, often relying on cloud services. This increases costs, raises privacy concerns, and limits scalability, especially for households with poor connectivity. This thesis tackles these issues with key contributions. It presents an interoperable data collection framework and introduces the Plegma Dataset, focused on underrepresented Mediterranean energy patterns. It also explores advanced deep neural networks and model compression techniques for efficient edge deployment. By bridging theoretical advances with practical needs, this work aims to make NILM scalable, efficient, and adaptable for global energy sustainability.
Reinforcement Learning for Game-Theoretic Resource Allocation on Graphs
Game-theoretic resource allocation on graphs (GRAG) involves two players competing over multiple steps to control nodes of interest on a graph, a problem modeled as a multi-step Colonel Blotto Game (MCBG). Finding optimal strategies is challenging due to the dynamic action space and structural constraints imposed by the graph. To address this, we formulate the MCBG as a Markov Decision Process (MDP) and apply Reinforcement Learning (RL) methods, specifically Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). To enforce graph constraints, we introduce an action-displacement adjacency matrix that dynamically generates valid action sets at each step. We evaluate RL performance across a variety of graph structures and initial resource distributions, comparing against random, greedy, and learned RL policies. Experimental results show that both DQN and PPO consistently outperform baseline strategies and converge to a balanced $50\%$ win rate when competing against the learned RL policy. Particularly, on asymmetric graphs, RL agents successfully exploit structural advantages and adapt their allocation strategies, even under disadvantageous initial resource distributions.
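The abstract does not define the action-displacement adjacency matrix, so the following is only a hypothetical illustration of the general idea of deriving valid action sets from graph structure. It assumes a 0/1 adjacency matrix, assumes that each resource may either stay at its node or cross one edge per step, and uses the made-up names `valid_moves` and `valid_joint_actions`.

```python
from itertools import product


def valid_moves(adjacency, node):
    """Nodes a resource at `node` may occupy next: stay put or cross one edge.

    Allowing a resource to stay in place is an assumption of this sketch.
    """
    neighbours = [j for j, linked in enumerate(adjacency[node]) if linked and j != node]
    return [node] + neighbours


def valid_joint_actions(adjacency, positions):
    """All joint moves for resources currently located at `positions`."""
    return list(product(*(valid_moves(adjacency, p) for p in positions)))


# Path graph 0 - 1 - 2 (illustrative example only).
PATH_GRAPH = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
```

Because the set of joint moves depends on where the resources currently sit, the action space changes from step to step, which is the difficulty the abstract refers to as a dynamic action space.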
Realistic Counterfactual Explanations for Machine Learning-Controlled Mobile Robots using 2D LiDAR
Remman, Sindre Benjamin, Lekkas, Anastasios M.
This paper presents a novel method for generating realistic counterfactual explanations (CFEs) in machine learning (ML)-based control for mobile robots using 2D LiDAR. ML models, especially artificial neural networks (ANNs), can provide advanced decision-making and control capabilities by learning from data. However, they often function as black boxes, making it challenging to interpret them. This is especially a problem in safety-critical control applications. To generate realistic CFEs, we parameterize the LiDAR space with simple shapes such as circles and rectangles, whose parameters are chosen by a genetic algorithm, and the configurations are transformed into LiDAR data by raycasting. Our model-agnostic approach generates CFEs in the form of synthetic LiDAR data that resembles a base LiDAR state but is modified to produce a pre-defined ML model control output based on a query from the user. We demonstrate our method on a mobile robot, the TurtleBot3, controlled using deep reinforcement learning (DRL) in real-world and simulated scenarios. Our method generates logical and realistic CFEs, which helps to interpret the DRL agent's decision making. This paper contributes towards advancing explainable AI in mobile robotics, and our method could be a tool for understanding, debugging, and improving ML-based autonomous control.
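The raycasting step that turns a shape configuration into LiDAR data can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the sensor sits at the origin, handles circles only, and uses hypothetical defaults for the number of beams (`n_beams`) and the maximum range (`max_range`).

```python
import math


def raycast_circles(circles, n_beams=360, max_range=3.5):
    """Simulate a 2D LiDAR scan of circular obstacles from the origin.

    circles: iterable of (cx, cy, r) tuples in the sensor frame
    Returns one range reading per beam; beams that hit nothing report max_range.
    """
    scan = []
    for k in range(n_beams):
        theta = 2.0 * math.pi * k / n_beams
        dx, dy = math.cos(theta), math.sin(theta)
        best = max_range
        for cx, cy, r in circles:
            # Solve |t*d - c|^2 = r^2 for the distance t along the unit ray d.
            proj = dx * cx + dy * cy
            disc = proj * proj - (cx * cx + cy * cy - r * r)
            if disc < 0:
                continue  # the ray's line misses this circle
            root = math.sqrt(disc)
            for t in (proj - root, proj + root):
                if t >= 0:  # nearest intersection in front of the sensor
                    best = min(best, t)
                    break
        scan.append(best)
    return scan
```

In a setup like the one described, a search procedure such as a genetic algorithm would propose the `(cx, cy, r)` parameters, and the resulting scan would be fed to the controller to check whether it produces the queried output.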
Online Episodic Convex Reinforcement Learning
Moreno, Bianca Marin, Eldowa, Khaled, Gaillard, Pierre, Brégère, Margaux, Oudjane, Nadia
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
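As background on online mirror descent, here is a minimal sketch of a single update on the probability simplex with the negative-entropy regularizer (the exponentiated-gradient update). It is a generic textbook step, not the paper's algorithm, which additionally works over state-action distributions with varying constraint sets and an exploration bonus; the name `omd_entropic_step` and the step size `eta` are assumptions for illustration.

```python
import math


def omd_entropic_step(p, grad, eta):
    """One entropic mirror descent step on the probability simplex.

    p:    current distribution (non-negative, sums to 1)
    grad: gradient of the convex loss at p
    eta:  step size (> 0)
    """
    # Multiplicative update followed by renormalization (the KL projection).
    weights = [pi * math.exp(-eta * gi) for pi, gi in zip(p, grad)]
    total = sum(weights)
    return [w / total for w in weights]
```

Starting from the uniform distribution, coordinates with larger gradient entries lose mass at each step, while the iterate stays a valid distribution.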
A Multi-Agent Reinforcement Learning Approach for Cooperative Air-Ground-Human Crowdsensing in Emergency Rescue
Lu, Wenhao, Zhu, Zhengqiu, Zhao, Yong, Tian, Yonglin, Zeng, Junjie, Zhang, Jun, Liu, Zhong, Wang, Fei-Yue
Mobile crowdsensing is evolving beyond traditional human-centric models by integrating heterogeneous entities like unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs). Optimizing task allocation among these diverse agents is critical, particularly in challenging emergency rescue scenarios characterized by complex environments, limited communication, and partial observability. This paper tackles the Heterogeneous-Entity Collaborative-Sensing Task Allocation (HECTA) problem specifically for emergency rescue, considering humans, UAVs, and UGVs. We introduce a novel "Hard-Cooperative" policy where UGVs prioritize recharging low-battery UAVs, alongside performing their sensing tasks. The primary objective is maximizing the task completion rate (TCR) under strict time constraints. We rigorously formulate this NP-hard problem as a decentralized partially observable Markov decision process (Dec-POMDP) to effectively handle sequential decision-making under uncertainty. To solve this, we propose HECTA4ER, a novel multi-agent reinforcement learning algorithm built upon a Centralized Training with Decentralized Execution architecture. HECTA4ER incorporates tailored designs, including specialized modules for complex feature extraction, utilization of action-observation history via hidden states, and a mixing network integrating global and local information, specifically addressing the challenges of partial observability. Furthermore, theoretical analysis confirms the algorithm's convergence properties. Extensive simulations demonstrate that HECTA4ER significantly outperforms baseline algorithms, achieving an average 18.42% increase in TCR. Crucially, a real-world case study validates the algorithm's effectiveness and robustness in dynamic sensing scenarios, highlighting its strong potential for practical application in emergency response.
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
Qiang, Rushi, Zhuang, Yuchen, Li, Yinghao, K, Dingu Sagar V, Zhang, Rongzhi, Li, Changhao, Wong, Ian Shu-Hei, Yang, Sherry, Liang, Percy, Zhang, Chao, Dai, Bo
We introduce MLE-Dojo, a Gym-style framework for systematically training (via reinforcement learning), evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks that primarily rely on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment enabling agents to iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture seamlessly integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning
Xu, Chengkai, Liu, Jiaqi, Guo, Yicheng, Zhang, Yuhang, Hang, Peng, Sun, Jian
Autonomous driving has made significant strides through data-driven techniques, achieving robust performance in standardized tasks. However, existing methods frequently overlook user-specific preferences, offering limited scope for interaction and adaptation with users. To address these challenges, we propose a "fast-slow" decision-making framework that integrates a Large Language Model (LLM) for high-level instruction parsing with a Reinforcement Learning (RL) agent for low-level real-time decision. In this dual system, the LLM operates as the "slow" module, translating user directives into structured guidance, while the RL agent functions as the "fast" module, making time-critical maneuvers under stringent latency constraints. By decoupling high-level decision making from rapid control, our framework enables personalized user-centric operation while maintaining robust safety margins. Experimental evaluations across various driving scenarios demonstrate the effectiveness of our method. Compared to baseline algorithms, the proposed architecture not only reduces collision rates but also aligns driving behaviors more closely with user preferences, thereby achieving a human-centric mode. By integrating user guidance at the decision level and refining it with real-time control, our framework bridges the gap between individual passenger needs and the rigor required for safe, reliable driving in complex traffic environments.