Goto

Collaborating Authors

 Undirected Networks


Unbiased Estimation of the Gradient of the Log-Likelihood for a Class of Continuous-Time State-Space Models

arXiv.org Machine Learning

In this paper, we consider static parameter estimation for a class of continuous-time state-space models. Our goal is to obtain an unbiased estimate of the gradient of the log-likelihood (score function), which is an estimate that is unbiased even if the stochastic processes involved in the model must be discretized in time. To achieve this goal, we apply a doubly randomized scheme, that involves a novel coupled conditional particle filter (CCPF) on the second level of randomization. Our novel estimate helps facilitate the application of gradient-based estimation algorithms, such as stochastic-gradient Langevin descent. We illustrate our methodology in the context of stochastic gradient descent (SGD) in several numerical examples and compare with the Rhee & Glynn estimator.


Task-Guided Inverse Reinforcement Learning Under Partial Information

arXiv.org Artificial Intelligence

We study the problem of inverse reinforcement learning (IRL), where the learning agent recovers a reward function using expert demonstrations. Most of the existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs), where an agent cannot directly observe the current state of the POMDP. The algorithm addresses several limitations of existing techniques that do not take the \emph{information asymmetry} between the expert and the agent into account. First, it adopts causal entropy as the measure of the likelihood of the expert demonstrations as opposed to entropy in most existing IRL techniques and avoids a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori in addition to the demonstrations, and may reduce the information asymmetry between the expert and the agent. Nevertheless, the resulting formulation is still nonconvex due to the intrinsic nonconvexity of the so-called \emph{forward problem}, i.e., computing an optimal policy given a reward function, in POMDPs. We address this nonconvexity through sequential convex programming and introduce several extensions to solve the forward problem in a scalable manner. This scalability allows computing policies that incorporate memory at the expense of added computational cost yet also achieves higher performance compared to memoryless policies. We demonstrate that, even with severely limited data, the algorithm learns reward functions and policies that satisfy the task and induce a similar behavior to the expert by leveraging the side information and incorporating memory into the policy.


Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

arXiv.org Machine Learning

The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.$~$Q-learning) provably learns an $\varepsilon$-optimal policy (resp.$~$Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp.$~$$\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, and hence the title of this paper.


Exploitation vs Caution: Risk-sensitive Policies for Offline Learning

arXiv.org Artificial Intelligence

Offline model learning for planning is a branch of machine learning that trains agents to perform actions in an unknown environment using a fixed batch of previously collected experiences. The limited size of the data set hinders the estimate of the Value function of the relative Markov Decision Process (MDP), bounding the performance of the obtained policy in the real world. In this context, recent works showed that planning with a discount factor lower than the one used during the evaluation phase yields more performing policies. However, the optimal discount factor is finally chosen by cross-validation. Our aim is to show that looking for a sub-optimal solution of a Bayesian MDP might lead to better performances with respect to the current baselines that work in the offline setting. Hence, we propose Exploitation vs Caution (EvC), an algorithm that automatically selects the policy that solves a Risk-sensitive Bayesian MDP in a set of policies obtained by solving several MDPs characterized by different discount factors and transition dynamics. On one hand, the Bayesian formalism elegantly includes model uncertainty and on another hand the introduction of a risk-sensitive utility function guarantees robustness. We evaluated the proposed approach in different discrete simple environments offering a fair variety of MDP classes. We also compared the obtained results with state-of-the-art offline learning for planning baselines such as MOPO and MOReL. In the tested scenarios EvC is more robust than the said approaches suggesting that sub-optimally solving an Offline Risk-sensitive Bayesian MDP (ORBMDP) could define a sound framework for planning under model uncertainty.


Targeted stochastic gradient Markov chain Monte Carlo for hidden Markov models with rare latent states

arXiv.org Machine Learning

Markov chain Monte Carlo (MCMC) algorithms for hidden Markov models often rely on the forward-backward sampler. This makes them computationally slow as the length of the time series increases, motivating the recent development of sub-sampling-based approaches. These approximate the full posterior by using small random subsequences of the data at each MCMC iteration within stochastic gradient MCMC. In the presence of imbalanced data resulting from rare latent states, subsequences often exclude rare latent state data, leading to inaccurate inference and prediction/detection of rare events. We propose a targeted sub-sampling (TASS) approach that over-samples observations corresponding to rare latent states when calculating the stochastic gradient of parameters associated with them. TASS uses an initial clustering of the data to construct subsequence weights that reduce the variance in gradient estimation. This leads to improved sampling efficiency, in particular in settings where the rare latent states correspond to extreme observations. We demonstrate substantial gains in predictive and inferential accuracy on real and synthetic examples.


Financial Engineering and Artificial Intelligence in Python

#artificialintelligence

Have you ever thought about what would happen if you combined the power of machine learning and artificial intelligence with financial engineering? Today, you can stop imagining, and start doing. This course will teach you the core fundamentals of financial engineering, with a machine learning twist. We will learn about the greatest flub made in the past decade by marketers posing as "machine learning experts" who promise to teach unsuspecting students how to "predict stock prices with LSTMs". You will learn exactly why their methodology is fundamentally flawed and why their results are complete nonsense.


Trajectory Modeling via Random Utility Inverse Reinforcement Learning

arXiv.org Artificial Intelligence

We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. As rational agents, drivers are trying to maximize some reward function unknown to an external observer as they make up their trajectories. We apply the concept of random utility from microeconomic theory to model the unknown reward function as a function of observable features plus an error term which represents features known only to the driver. We develop a parameterized generative model for the trajectories based on a random utility Markov decision process formulation of drivers decisions. We show that maximum entropy inverse reinforcement learning is a particular case of our proposed formulation when we assume a Gumbel density function for the unobserved reward error terms. We illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city obtained from sensors placed on sparsely distributed points on the street network.


Multi-Agent Deep Reinforcement Learning using Attentive Graph Neural Architectures for Real-Time Strategy Games

arXiv.org Artificial Intelligence

In real-time strategy (RTS) game artificial intelligence research, various multi-agent deep reinforcement learning (MADRL) algorithms are widely and actively used nowadays. Most of the research is based on StarCraft II environment because it is the most well-known RTS games in world-wide. In our proposed MADRL-based algorithm, distributed MADRL is fundamentally used that is called QMIX. In addition to QMIX-based distributed computation, we consider state categorization which can reduce computational complexity significantly. Furthermore, self-attention mechanisms are used for identifying the relationship among agents in the form of graphs. Based on these approaches, we propose a categorized state graph attention policy (CSGA-policy). As observed in the performance evaluation of our proposed CSGA-policy with the most well-known StarCraft II simulation environment, our proposed algorithm works well in various settings, as expected.


Geometric variational inference

arXiv.org Machine Learning

Efficiently accessing the information contained in non-linear and high dimensional probability distributions remains a core challenge in modern statistics. Traditionally, estimators that go beyond point estimates are either categorized as Variational Inference (VI) or Markov-Chain Monte-Carlo (MCMC) techniques. While MCMC methods that utilize the geometric properties of continuous probability distributions to increase their efficiency have been proposed, VI methods rarely use the geometry. This work aims to fill this gap and proposes geometric Variational Inference (geoVI), a method based on Riemannian geometry and the Fisher information metric. It is used to construct a coordinate transformation that relates the Riemannian manifold associated with the metric to Euclidean space. The distribution, expressed in the coordinate system induced by the transformation, takes a particularly simple form that allows for an accurate variational approximation by a normal distribution. Furthermore, the algorithmic structure allows for an efficient implementation of geoVI which is demonstrated on multiple examples, ranging from low-dimensional illustrative ones to non-linear, hierarchical Bayesian inverse problems in thousands of dimensions.


Removing the mini-batching error in Bayesian inference using Adaptive Langevin dynamics

arXiv.org Machine Learning

The computational cost of usual Monte Carlo methods for sampling a posteriori laws in Bayesian inference scales linearly with the number of data points. One option to reduce it to a fraction of this cost is to resort to mini-batching in conjunction with unadjusted discretizations of Langevin dynamics, in which case only a random fraction of the data is used to estimate the gradient. However, this leads to an additional noise in the dynamics and hence a bias on the invariant measure which is sampled by the Markov chain. We advocate using the so-called Adaptive Langevin dynamics, which is a modification of standard inertial Langevin dynamics with a dynamical friction which automatically corrects for the increased noise arising from mini-batching. We investigate the practical relevance of the assumptions underpinning Adaptive Langevin (constant covariance for the estimation of the gradient), which are not satisfied in typical models of Bayesian inference; and show how to extend the approach to more general situations.