Reinforcement Learning
Cloud-Fog-Edge Collaborative Computing for Sequential MIoT Workflow: A Two-Tier DDPG-Based Scheduling Framework
Fu, Yuhao, Zhang, Yinghao, Liu, Yalin, Tao, Bishenghui, Ruan, Junhong
The Medical Internet of Things (MIoT) demands stringent end-to-end latency guarantees for sequential healthcare workflows deployed over heterogeneous cloud-fog-edge infrastructures. Scheduling these sequential workflows to minimize makespan is an NP-hard problem. To tackle this challenge, we propose a Two-tier DDPG-based scheduling framework that decomposes the scheduling decision into a hierarchical process: a global controller performs layer selection (edge, fog, or cloud), while specialized local controllers handle node assignment within the chosen layer. The primary optimization objective is the minimization of the workflow makespan. Experimental results validate our approach, demonstrating increasingly superior performance over baselines as workflow complexity rises. This trend highlights the framework's ability to learn effective long-term strategies, which is critical for complex, large-scale MIoT scheduling scenarios.
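The two-tier decomposition described above can be illustrated with a minimal sketch. It does not use DDPG: both controllers are greedy stand-ins for the learned policies, and the `Node` fields, the `LAYERS` table, and the finish-time model (link delay plus work divided by speed) are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    speed: float       # work units per second (hypothetical)
    link_delay: float  # seconds to reach this node (hypothetical)


# Hypothetical infrastructure: nodes grouped by layer.
LAYERS = {
    "edge": [Node("edge-0", 1.0, 0.01), Node("edge-1", 1.5, 0.02)],
    "fog": [Node("fog-0", 4.0, 0.05)],
    "cloud": [Node("cloud-0", 20.0, 0.30)],
}


def finish_time(node, work):
    """Assumed cost model: transfer delay plus compute time."""
    return node.link_delay + work / node.speed


def local_controller(layer, work):
    """Tier 2: choose a node inside the already selected layer."""
    return min(LAYERS[layer], key=lambda n: finish_time(n, work))


def global_controller(work):
    """Tier 1: choose a layer (greedy stand-in for a learned policy)."""
    return min(
        LAYERS,
        key=lambda layer: finish_time(local_controller(layer, work), work),
    )


def schedule(workflow):
    """Schedule tasks one after another; makespan is the summed finish time."""
    plan, makespan = [], 0.0
    for work in workflow:
        layer = global_controller(work)
        node = local_controller(layer, work)
        makespan += finish_time(node, work)
        plan.append((layer, node.name))
    return plan, makespan


plan, makespan = schedule([0.01, 1.0, 100.0])
print(plan, makespan)
```

Under these assumed numbers, small tasks stay at the edge and heavy tasks move toward the cloud, which is the kind of layer-then-node decision the framework learns rather than hard-codes.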
Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks
Tong, Bowei, Kang, Hui, Li, Jiahui, Sun, Geng, Wang, Jiacheng, Yang, Yaoqi, Xu, Bo, Niyato, Dusit
Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensors, thereby maintaining the network connectivity while minimizing the energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots, which presents NP-hard computational complexity with long-term temporal dependencies that make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, we reveal that the LSTM-enhanced policy network achieves 25% faster convergence compared to conventional neural networks, and the time-varying evaluation method adapts effectively to changing network conditions with improved long-term performance stability.
Bowei Tong, Hui Kang, and Jiahui Li are with the College of Computer Science and Technology, Jilin University, Changchun 130012, China (e-mails: tongbw25@mails.jlu.edu.cn; Geng Sun is with the College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China, and also with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 (e-mail: sungeng@jlu.edu.cn). Jiacheng Wang and Dusit Niyato are with the College of Computing and Data Science, Nanyang Technological University, Singapore 639798 (e-mail: jiacheng.wang@ntu.edu.sg; Bo Xu is with the School of Information and Communication Engineering, Hainan University, Haikou 570228, China (e-mail: 996458@hainanu.edu.cn).
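As a small illustration of what "Pareto-optimal" means for the two objectives named in the abstract, the sketch below filters a set of candidate policies down to the non-dominated ones. It is not the authors' evaluation method; the `(survival_rate, energy_efficiency)` pairs are made-up numbers, and both objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (survival_rate, energy_efficiency) scores of candidate policies.
candidates = [(0.9, 0.4), (0.8, 0.6), (0.7, 0.5), (0.6, 0.9), (0.9, 0.3)]
print(pareto_front(candidates))
```

Each surviving point represents a different trade-off between the two objectives; none can be improved in one objective without losing in the other.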
Confounding Robust Deep Reinforcement Learning: A Causal Approach
Li, Mingxuan, Zhang, Junzhe, Bareinboim, Elias
A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed inputs to the behavioral and target policies mismatch and unobserved confounders exist.
Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments
Qian, Yiyu, Nguyen, Su, Chen, Chao, Zhou, Qinyue, Zhao, Liyuan
Deep reinforcement learning (RL) achieves remarkable performance but lacks interpretability, limiting trust in policy behavior. The existing SILVER framework (Li, Siddique, and Cao 2025) explains RL policies via Shapley-based regression but remains restricted to low-dimensional, binary-action domains. We propose SILVER with RL-guided labeling, an enhanced variant that extends SILVER to multi-action and high-dimensional environments by incorporating the RL policy's own action outputs into the identification of boundary points. Our method first extracts compact feature representations from image observations, performs SHAP-based feature attribution, and then employs RL-guided labeling to generate behaviorally consistent boundary datasets. Surrogate models, such as decision trees and regression-based functions, are subsequently trained to interpret the RL policy's decision structure. We evaluate the proposed framework on two Atari environments using three deep RL algorithms and conduct a human-subject study to assess the clarity and trustworthiness of the derived interpretable policy. Results show that our approach maintains competitive task performance while substantially improving transparency and human understanding of agent behavior.
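The Shapley-based attribution mentioned above rests on the Shapley value, which averages a feature's marginal contribution over all subsets of the other features. The sketch below computes it exactly by enumeration for a toy value function. It is not the SILVER pipeline and does not use the SHAP library; `shapley_values` and `toy_value` are hypothetical names, and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n):
    """Exact Shapley values for n features; value_fn maps a frozenset of indices to a number."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that excludes feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(subset):
    """Hypothetical model output: pays 1 only when features 0 and 1 are both present."""
    return 1.0 if {0, 1} <= subset else 0.0


print(shapley_values(toy_value, 3))
```

In this toy case the two interacting features split the credit equally and the irrelevant third feature receives none.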
P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records
Kashani, Naama, Cohen, Mira, Shaham, Uri
Electronic Health Records (EHRs) serve as comprehensive digital repositories of patient health information, encompassing both structured and unstructured data (Bates et al., 2014). A thorough understanding of EHR data can significantly enhance various aspects of patient care, including disease prediction, healthcare quality improvement, and resource allocation (Shickel et al., 2018; Kim et al., 2019). However, EHR data presents unique challenges: it is often high-dimensional, multimodal, sparse, and temporal (Wu et al., 2010; Menachemi and Collum, 2011; Xiao et al., 2018). Records typically include a diverse array of modalities, such as demographics, diagnoses, procedures, medications, prescriptions, radiological images, clinical notes, and laboratory results. The data is inherently sparse, as medical events occur irregularly, and sequential, as patient histories accumulate over time. To address these complexities, many approaches employ feature selection (FS) -- the process of identifying the most informative variables from high-dimensional input to improve model performance, interpretability, and robustness (Remeseiro and Bolon-Canedo, 2019; Chandrashekar and Sahin, 2014). Yet, to the best of our knowledge, existing FS methods applied to EHRs either ignore multimodality or fail to capture temporal dynamics.
Online Learning for Dynamic Vickrey-Clarke-Groves Mechanism in Unknown Environments
Leon, Vincent, Etesami, S. Rasoul
We consider the problem of online dynamic mechanism design for sequential auctions in unknown environments, where the underlying market and, thus, the bidders' values vary over time as interactions between the seller and the bidders progress. We model the sequential auctions as an infinite-horizon average-reward Markov decision process (MDP). In each round, the seller determines an allocation and sets a payment for each bidder, while each bidder receives a private reward and submits a sealed bid to the seller. The state, which represents the underlying market, evolves according to an unknown transition kernel and the seller's allocation policy without episodic resets. We first extend the Vickrey-Clarke-Groves (VCG) mechanism to sequential auctions, thereby obtaining a dynamic counterpart that preserves the desired properties: efficiency, truthfulness, and individual rationality. We then focus on the online setting and develop a reinforcement learning algorithm for the seller to learn the underlying MDP and implement a mechanism that closely resembles the dynamic VCG mechanism. We show that the learned mechanism approximately satisfies efficiency, truthfulness, and individual rationality and achieves guaranteed performance in terms of various notions of regret.
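For readers unfamiliar with VCG, the sketch below shows the classical single-round mechanism that the dynamic version generalizes: pick the outcome with the highest total reported value, then charge each bidder the loss their presence imposes on the others. It is not the paper's dynamic or learned mechanism; the function `vcg` and the bid table are hypothetical.

```python
def vcg(outcomes, bids):
    """Single-round VCG.

    outcomes: list of outcome labels.
    bids[i][a]: bidder i's reported value for outcome index a.
    Returns the chosen outcome and one payment per bidder.
    """
    n = len(bids)

    def welfare(a, exclude=None):
        # Total reported value of outcome a, optionally leaving one bidder out.
        return sum(bids[i][a] for i in range(n) if i != exclude)

    best = max(range(len(outcomes)), key=welfare)
    payments = []
    for i in range(n):
        best_without_i = max(welfare(a, exclude=i) for a in range(len(outcomes)))
        payments.append(best_without_i - welfare(best, exclude=i))
    return outcomes[best], payments


# Hypothetical single-item auction: outcome a means "bidder a wins the item".
outcomes = ["bidder 0 wins", "bidder 1 wins", "bidder 2 wins"]
bids = [[10, 0, 0], [0, 7, 0], [0, 0, 3]]
print(vcg(outcomes, bids))
```

With a single item this reduces to a second-price auction: the highest bidder wins and pays the second-highest bid, while the losers pay nothing.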
A Smooth Sea Never Made a Skilled SAILOR: Robust Imitation via Learning to Search
Jain, Arnav Kumar, Mohta, Vibhakar, Kim, Subin, Bhardwaj, Atiksh, Ren, Juntao, Feng, Yunhai, Choudhury, Sanjiban, Swamy, Gokul
The fundamental limitation of the behavioral cloning (BC) approach to imitation learning is that it only teaches an agent what the expert did at states the expert visited. This means that when a BC agent makes a mistake which takes them out of the support of the demonstrations, they often don't know how to recover from it. In this sense, BC is akin to giving the agent the fish -- giving them dense supervision across a narrow set of states -- rather than teaching them to fish: to be able to reason independently about achieving the expert's outcome even when faced with unseen situations at test-time. In response, we explore learning to search (L2S) from expert demonstrations, i.e. learning the components required to, at test time, plan to match expert outcomes, even after making a mistake. These include (1) a world model and (2) a reward model. We carefully ablate the set of algorithmic and design decisions required to combine these and other components for stable and sample/interaction-efficient learning of recovery behavior without additional human corrections. Across a dozen visual manipulation tasks from three benchmarks, our approach SAILOR consistently out-performs state-of-the-art Diffusion Policies trained via BC on the same data. Furthermore, scaling up the amount of demonstrations used for BC by 5-10x still leaves a performance gap. We find that SAILOR can identify nuanced failures and is robust to reward hacking. Our code is available at https://github.com/arnavkj1995/SAILOR .
Learning Decentralized Routing Policies via Graph Attention-based Multi-Agent Reinforcement Learning in Lunar Delay-Tolerant Networks
Lozano-Cuadra, Federico, Soret, Beatriz, Net, Marc Sanchez, Cauligi, Abhishek, Rossi, Federico
We present a fully decentralized routing framework for multi-robot exploration missions operating under the constraints of a Lunar Delay-Tolerant Network (LDTN). In this setting, autonomous rovers must relay collected data to a lander under intermittent connectivity and unknown mobility patterns. We formulate the problem as a Partially Observable Markov Decision Problem (POMDP) and propose a Graph Attention-based Multi-Agent Reinforcement Learning (GAT-MARL) policy that performs Centralized Training, Decentralized Execution (CTDE). Our method relies only on local observations and does not require global topology updates or packet replication, unlike classical approaches such as shortest path and controlled flooding-based algorithms. Through Monte Carlo simulations in randomized exploration environments, GAT-MARL provides higher delivery rates, no duplications, and fewer packet losses, and is able to leverage short-term mobility forecasts, offering a scalable solution for future space robotic systems for planetary exploration, as demonstrated by successful generalization to larger rover teams. The renewed interest in planetary and lunar surface exploration has accelerated the development of autonomous multi-robot systems.
Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards
Cheng, Yuwei, Zhao, Zifeng, Xu, Haifeng
Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of $\tilde{O}(dH^2\sqrt{T})$, where $d$ is the contextual dimension, $H$ is the number of rounds, and $T$ is the number of customers. Our theoretical findings are validated by simulation experiments.
Integrating Machine Learning into Belief-Desire-Intention Agents: Current Advances and Open Challenges
Agiollo, Andrea, Omicini, Andrea
Thanks to the remarkable human-like capabilities of machine learning (ML) models in perceptual and cognitive tasks, frameworks integrating ML within rational agent architectures are gaining traction. Yet, the landscape remains fragmented and incoherent, often focusing on embedding ML into generic agent containers while overlooking the expressive power of rational architectures--such as Belief-Desire-Intention (BDI) agents. This paper presents a fine-grained systematisation of existing approaches, using the BDI paradigm as a reference. Our analysis illustrates the fast-evolving literature on rational agents enhanced by ML, and identifies key research opportunities and open challenges for designing effective rational ML agents.