Markov Models
Safe and Efficient Robot Action Planning in the Presence of Unconcerned Humans
Amiri, Mohsen, Hosseinzadeh, Mehdi
This paper proposes a robot action planning scheme that provides an efficient and probabilistically safe plan for a robot interacting with an unconcerned human -- someone who is either unaware of the robot's presence or unwilling to engage in ensuring safety. The proposed scheme is predictive, meaning that the robot is required to predict human actions over a finite future horizon; such predictions are often inaccurate in real-world scenarios. One possible approach to reduce the uncertainties is to provide the robot with the capability of reasoning about the human's awareness of potential dangers. This paper discusses that by using a binary variable, so-called danger awareness coefficient, it is possible to differentiate between concerned and unconcerned humans, and provides a learning algorithm to determine this coefficient by observing human actions. Moreover, this paper argues how humans rely on predictions of other agents' future actions (including those of robots in human-robot interaction) in their decision-making. It also shows that ignoring this aspect in predicting human's future actions can significantly degrade the efficiency of the interaction, causing agents to deviate from their optimal paths. The proposed robot action planning scheme is verified and validated via extensive simulation and experimental studies on a LoCoBot WidowX-250.
An Offline Multi-Agent Reinforcement Learning Framework for Radio Resource Management
Offline multi-agent reinforcement learning (MARL) addresses key limitations of online MARL, such as safety concerns, expensive data collection, extended training intervals, and high signaling overhead caused by online interactions with the environment. In this work, we propose an offline MARL algorithm for radio resource management (RRM), focusing on optimizing scheduling policies for multiple access points (APs) to jointly maximize the sum and tail rates of user equipment (UEs). We evaluate three training paradigms: centralized, independent, and centralized training with decentralized execution (CTDE). Our simulation results demonstrate that the proposed offline MARL framework outperforms conventional baseline approaches, achieving over a 15\% improvement in a weighted combination of sum and tail rates. Additionally, the CTDE framework strikes an effective balance, reducing the computational complexity of centralized methods while addressing the inefficiencies of independent training. These results underscore the potential of offline MARL to deliver scalable, robust, and efficient solutions for resource management in dynamic wireless networks.
Optimizing Return Distributions with Distributional Dynamic Programming
Pires, Bernardo รvila, Rowland, Mark, Borsa, Diana, Guo, Zhaohan Daniel, Khetarpal, Khimya, Barreto, Andrรฉ, Abel, David, Munos, Rรฉmi, Dabney, Will
We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond expected utilities, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained so far (since the first time step). We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example maximizing conditional value-at-risk, and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we combine the core ideas of distributional value iteration with the deep RL agent DQN, and empirically evaluate it for solving instances of the applications discussed.
The regret lower bound for communicating Markov Decision Processes
Boone, Victor, Maillard, Odalric-Ambrym
This paper is devoted to the extension of the regret lower bound beyond ergodic Markov decision processes (MDPs) in the problem dependent setting. While the regret lower bound for ergodic MDPs is well-known and reached by tractable algorithms, we prove that the regret lower bound becomes significatively more complex in communicating MDPs. Our lower bound revisits the necessary explorative behavior of consistent learning agents and further explains that all optimal regions of the environment must be overvisited compared to sub-optimal ones, a phenomenon that we refer to as co-exploration. In tandem, we show that these two explorative and co-explorative behaviors are intertwined with navigation constraints obtained by scrutinizing the navigation structure at logarithmic scale. The resulting lower bound is expressed as the solution of an optimization problem that, in many standard classes of MDPs, can be specialized to recover existing results. From a computational perspective, it is provably $\Sigma_2^\textrm{P}$-hard in general and as a matter of fact, even testing the membership to the feasible region is coNP-hard. We further provide an algorithm to approximate the lower bound in a constructive way.
Reviews: Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
The paper contributes useful structural results in regret minimization in the Markov Decision Process setting of RL, specifically for the class of tabular (i.e., unstructured) finite-horizon episodic MDPs. The paper is likely to stimulate the finite-sample analysis of online learning in MDPs via its new theoretical techniques.
Review for NeurIPS paper: Planning in Markov Decision Processes with Gap-Dependent Sample Complexity
Additional Feedback: Post-rebuttal The authors addressed some of my concerns. As the authors would redesign some of the experiments in the revision, I'd raise my score to 6. Comments and questions: 1. Are there any lower bound results on the sample complexity of planning? Are there any particular reasons, and what is the high-level idea of this algorithm? If I understand correctly this rule is to get the gap-dependent sample complexity. What if we use the simple greedy policy for the first action, and what will go wrong in the proof?
Reviews: Differentially Private Markov Chain Monte Carlo
This work provides a detailed Renyi DP analysis of a modified MCMC acceptance test, and empirically demonstrates its efficacy. Originality: the RDP analysis and modified acceptance test is a novel contribution. Quality: the work is a complete piece on exploring this MCMC method, with a detailed analysis and experiments. Clarity: the work is fairly clearly written, but it can be easy to lose track of exactly what parameters remain as choices to be tuned in a list of various corrective factors and approximations. Significance: the work gives an MCMC method with privacy without convergence, which permits privacy guarantees to be given over a multitude of problems without doubts or guess work about when to stop the chain.
Reviews: Differentially Private Markov Chain Monte Carlo
Although the analysis follows some known ideas from the literature on private SGD, there are a number of new tricks which make the current approach interesting, most notably the observation that one can use randomized acceptance tests to preserve privacy in an MCMC algorithm. When preparing the final version of this manuscript the authors should carefully consider the points raised in the reviews regarding: clarifying where the contributions lie with respect to previous work; provide high-level intuitions of the proofs to help a reader navigate the derivations; discuss the role of approximations used in the paper, where they affect the privacy or utility of the method, and where there is some room left for improvement.
Reviews: State Aggregation Learning from Markov Transition Data
This paper studies the problem of learning soft state aggregation of a Markov model, where there are r hidden meta states, each corresponds to a distribution over the observed state of the Markov model. Under the anchor state assumption, the authors propose an algorithm that provably learns the state aggregation model from the Markov chain's trajectory. They evaluated their algorithm on a Manhattan taxi-trip dataset which yields interesting discoveries. There has been lots of work on estimating the low rank transition matrix itself and on matrix factorization in the topic modelling setting, and this work seems to be connecting the two problems. The paper is presented well and easy to follow. I have the following questions regarding the novelty and impact of this paper.