Undirected Networks
Bayesian Sequential Optimal Experimental Design for Nonlinear Models Using Policy Gradient Reinforcement Learning
We present a mathematical framework and computational methods to optimally design a finite number of sequential experiments. We formulate this sequential optimal experimental design (sOED) problem as a finite-horizon partially observable Markov decision process (POMDP) in a Bayesian setting and with information-theoretic utilities. It is built to accommodate continuous random variables, general non-Gaussian posteriors, and expensive nonlinear forward models. sOED then seeks an optimal design policy that incorporates elements of both feedback and lookahead, generalizing the suboptimal batch and greedy designs. We solve for the sOED policy numerically via policy gradient (PG) methods from reinforcement learning, and derive and prove the PG expression for sOED. Adopting an actor-critic approach, we parameterize the policy and value functions using deep neural networks and improve them using gradient estimates produced from simulated episodes of designs and observations. The overall PG-sOED method is validated on a linear-Gaussian benchmark, and its advantages over batch and greedy designs are demonstrated through a contaminant source inversion problem in a convection-diffusion field.
Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes
Bennett, Andrew, Kallus, Nathan
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived under the assumption of a perfect Markov decision process (MDP) model. Here we tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, we consider estimating the value of a given target policy in a POMDP given trajectories with only partial state observations generated by a different and unknown policy that may depend on the unobserved state. We tackle two questions: what conditions allow us to identify the target policy value from the observed data and, given identification, how to best estimate it. To answer these, we extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible by the existence of so-called bridge functions. We then show how to construct semiparametrically efficient estimators in these settings. We term the resulting framework proximal reinforcement learning (PRL). We demonstrate the benefits of PRL in an extensive simulation study.
Coresets for Time Series Clustering
Huang, Lingxiao, Sudhir, K., Vishnoi, Nisheeth K.
We study the problem of constructing coresets for clustering problems with time series data. This problem has gained importance across many fields including biology, medicine, and economics due to the proliferation of sensors facilitating real-time measurement and rapid drop in storage costs. In particular, we consider the setting where the time series data on $N$ entities is generated from a Gaussian mixture model with autocorrelations over $k$ clusters in $\mathbb{R}^d$. Our main contribution is an algorithm to construct coresets for the maximum likelihood objective for this mixture model. Our algorithm is efficient, and under a mild boundedness assumption on the covariance matrices of the underlying Gaussians, the size of the coreset is independent of the number of entities $N$ and the number of observations for each entity, and depends only polynomially on $k$, $d$ and $1/\varepsilon$, where $\varepsilon$ is the error parameter. We empirically assess the performance of our coreset with synthetic data.
Universal Decision Models
Humans are universal decision makers: we reason causally to understand the world; we act competitively to gain advantage in commerce, games, and war; and we are able to learn to make better decisions through trial and error. In this paper, we propose Universal Decision Model (UDM), a mathematical formalism based on category theory. Decision objects in a UDM correspond to instances of decision tasks, ranging from causal models and dynamical systems such as Markov decision processes and predictive state representations, to network multiplayer games and Witsenhausen's intrinsic models, which generalizes all these previous formalisms. A UDM is a category of objects, which include decision objects, observation objects, and solution objects. Bisimulation morphisms map between decision objects that capture structure-preserving abstractions. We formulate universal properties of UDMs, including information integration, decision solvability, and hierarchical abstraction. We describe universal functorial representations of UDMs, and propose an algorithm for computing the minimal object in a UDM using algebraic topology. We sketch out an application of UDMs to causal inference in network economics, using a complex multiplayer producer-consumer two-sided marketplace.
D2RLIR : an improved and diversified ranking function in interactive recommendation systems based on deep reinforcement learning
Baghi, Vahid, Motehayeri, Seyed Mohammad Seyed, Moeini, Ali, Abedian, Rooholah
Recently, interactive recommendation systems based on reinforcement learning have been attended by researchers due to the consider recommendation procedure as a dynamic process and update the recommendation model based on immediate user feedback, which is neglected in traditional methods. The existing works have two significant drawbacks. Firstly, inefficient ranking function to produce the Top-N recommendation list. Secondly, focusing on recommendation accuracy and inattention to other evaluation metrics such as diversity. This paper proposes a deep reinforcement learning based recommendation system by utilizing Actor-Critic architecture to model dynamic users' interaction with the recommender agent and maximize the expected long-term reward. Furthermore, we propose utilizing Spotify's ANNoy algorithm to find the most similar items to generated action by actor-network. After that, the Total Diversity Effect Ranking algorithm is used to generate the recommendations concerning relevancy and diversity. Moreover, we apply positional encoding to compute representations of the user's interaction sequence without using sequence-aligned recurrent neural networks. Extensive experiments on the MovieLens dataset demonstrate that our proposed model is able to generate a diverse while relevance recommendation list based on the user's preferences.
Finite Horizon Q-learning: Stability, Convergence and Simulations
VP, Vivek, Bhatnagar, Dr. Shalabh
Q-learning is a popular reinforcement learning algorithm. This algorithm has however been studied and analysed mainly in the infinite horizon setting. There are several important applications which can be modeled in the framework of finite horizon Markov decision processes. We develop a version of Q-learning algorithm for finite horizon Markov decision processes (MDP) and provide a full proof of its stability and convergence. Our analysis of stability and convergence of finite horizon Q-learning is based entirely on the ordinary differential equations (O.D.E) method. We also demonstrate the performance of our algorithm on a setting of random MDP.
Play to Grade: Testing Coding Games as Classifying Markov Decision Process
Nie, Allen, Brunskill, Emma, Piech, Chris
Contemporary coding education often presents students with the task of developing programs that have user interaction and complex dynamic systems, such as mouse based games. While pedagogically compelling, there are no contemporary autonomous methods for providing feedback. Notably, interactive programs are impossible to grade by traditional unit tests. In this paper we formalize the challenge of providing feedback to interactive programs as a task of classifying Markov Decision Processes (MDPs). Each student's program fully specifies an MDP where the agent needs to operate and decide, under reasonable generalization, if the dynamics and reward model of the input MDP should be categorized as correct or broken. We demonstrate that by designing a cooperative objective between an agent and an autoregressive model, we can use the agent to sample differential trajectories from the input MDP that allows a classifier to determine membership: Play to Grade. Our method enables an automatic feedback system for interactive code assignments. We release a dataset of 711,274 anonymized student submissions to a single assignment with hand-coded bug labels to support future research.
Learning Domain Invariant Representations in Goal-conditioned Block MDPs
Han, Beining, Zheng, Chongyi, Chan, Harris, Paster, Keiran, Zhang, Michael R., Ba, Jimmy
Deep Reinforcement Learning (RL) is successful in solving many complex Markov Decision Processes (MDPs) problems. However, agents often face unanticipated environmental changes after deployment in the real world. These changes are often spurious and unrelated to the underlying problem, such as background shifts for visual input agents. Unfortunately, deep RL policies are usually sensitive to these changes and fail to act robustly against them. This resembles the problem of domain generalization in supervised learning. In this work, we study this problem for goal-conditioned RL agents. We propose a theoretical framework in the Block MDP setting that characterizes the generalizability of goal-conditioned policies to new environments. Under this framework, we develop a practical method PA-SkewFit that enhances domain generalization. The empirical evaluation shows that our goal-conditioned RL agent can perform well in various unseen test environments, improving by 50% over baselines.
Entropy-based adaptive Hamiltonian Monte Carlo
Hirt, Marcel, Titsias, Michalis K., Dellaportas, Petros
Hamiltonian Monte Carlo (HMC) is a popular Markov Chain Monte Carlo (MCMC) algorithm to sample from an unnormalized probability distribution. A leapfrog integrator is commonly used to implement HMC in practice, but its performance can be sensitive to the choice of mass matrix used therein. We develop a gradient-based algorithm that allows for the adaptation of the mass matrix by encouraging the leapfrog integrator to have high acceptance rates while also exploring all dimensions jointly. In contrast to previous work that adapt the hyperparameters of HMC using some form of expected squared jumping distance, the adaptation strategy suggested here aims to increase sampling efficiency by maximizing an approximation of the proposal entropy. We illustrate that using multiple gradients in the HMC proposal can be beneficial compared to a single gradient-step in Metropolis-adjusted Langevin proposals. Empirical evidence suggests that the adaptation method can outperform different versions of HMC schemes by adjusting the mass matrix to the geometry of the target distribution and by providing some control on the integration time.
Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey
Li, Tianxu, Zhu, Kun, Luong, Nguyen Cong, Niyato, Dusit, Wu, Qihui, Zhang, Yang, Chen, Bing
Future Internet involves several emerging technologies such as 5G and beyond 5G networks, vehicular networks, unmanned aerial vehicle (UAV) networks, and Internet of Things (IoTs). Moreover, future Internet becomes heterogeneous and decentralized with a large number of involved network entities. Each entity may need to make its local decision to improve the network performance under dynamic and uncertain network environments. Standard learning algorithms such as single-agent Reinforcement Learning (RL) or Deep Reinforcement Learning (DRL) have been recently used to enable each network entity as an agent to learn an optimal decision-making policy adaptively through interacting with the unknown environments. However, such an algorithm fails to model the cooperations or competitions among network entities, and simply treats other entities as a part of the environment that may result in the non-stationarity issue. Multi-agent Reinforcement Learning (MARL) allows each network entity to learn its optimal policy by observing not only the environments, but also other entities' policies. As a result, MARL can significantly improve the learning efficiency of the network entities, and it has been recently used to solve various issues in the emerging networks. In this paper, we thus review the applications of MARL in the emerging networks. In particular, we provide a tutorial of MARL and a comprehensive survey of applications of MARL in next generation Internet. In particular, we first introduce single-agent RL and MARL. Then, we review a number of applications of MARL to solve emerging issues in future Internet. The issues consist of network access, transmit power control, computation offloading, content caching, packet routing, trajectory design for UAV-aided networks, and network security issues.