Stable Reinforcement Learning with Unbounded State Space
Shah, Devavrat, Xie, Qiaomin, Xu, Zhi
We consider the problem of reinforcement learning (RL) with unbounded state space, motivated by the classical problem of scheduling in a queueing network. Traditional policies, as well as error metrics designed for finite, bounded, or compact state spaces, require infinitely many samples to provide any meaningful performance guarantee (e.g., $\ell_\infty$ error) for unbounded state spaces. That is, we need a new notion of performance metric. As the main contribution of this work, inspired by the literature on queueing systems and control theory, we propose stability as the notion of "goodness": the state dynamics under the policy should remain in a bounded region with high probability. As a proof of concept, we propose an RL policy using a Sparse-Sampling-based Monte Carlo Oracle and argue that it satisfies the stability property as long as the system dynamics under the optimal policy respects a Lyapunov function. The assumption of the existence of a Lyapunov function is not restrictive, as it is equivalent to the positive recurrence or stability property of any Markov chain; i.e., if there is any policy that can stabilize the system, then it must possess a Lyapunov function. Moreover, our policy does not utilize knowledge of the specific Lyapunov function. To make our method sample efficient, we provide an improved, sample-efficient Sparse-Sampling-based Monte Carlo Oracle with Lipschitz value function that may be of interest in its own right. Furthermore, we design an adaptive version of the algorithm, based on carefully constructed statistical tests, which finds the correct tuning parameter automatically.
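For intuition, a minimal Python sketch of a sparse-sampling-based Monte Carlo oracle in this spirit follows. The generative-model interface step(state, action) -> (next_state, reward), the parameters depth and width, and the function names are illustrative assumptions, not the paper's exact construction.

    def sparse_sampling_q(step, actions, state, depth, width, gamma):
        """Estimate Q(state, a) for each action via recursive sparse sampling."""
        if depth == 0:
            return {a: 0.0 for a in actions}
        q = {}
        for a in actions:
            total = 0.0
            for _ in range(width):  # draw `width` i.i.d. next states per action
                next_state, reward = step(state, a)
                v_next = max(sparse_sampling_q(step, actions, next_state,
                                               depth - 1, width, gamma).values())
                total += reward + gamma * v_next
            q[a] = total / width
        return q

    def greedy_action(step, actions, state, depth=3, width=5, gamma=0.9):
        """Act greedily with respect to the sparse-sampling Q estimates."""
        q = sparse_sampling_q(step, actions, state, depth, width, gamma)
        return max(q, key=q.get)

The per-query cost scales with (number of actions x width)^depth and is independent of the size of the state space, which is what makes such oracles viable when the state space is unbounded.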
Iterative ranking from pair-wise comparisons
Negahban, Sahand, Oh, Sewoong, Shah, Devavrat
The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g., MSR's TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding ``scores'' for each object (e.g., a player's rating) is of interest for understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probabilities of this random walk.
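As a rough illustration of this random-walk interpretation, the Python sketch below builds a row-stochastic matrix from pairwise win counts and power-iterates to its stationary distribution. Normalizing by the maximum degree is a simplification of the algorithm's exact transition rates, and the names are hypothetical.

    import numpy as np

    def rank_scores(wins, n_iter=10_000, tol=1e-10):
        """wins[i, j] = number of times object j beat object i."""
        n = wins.shape[0]
        d_max = max((wins + wins.T).sum(axis=1).max(), 1.0)
        P = wins / d_max                          # walk drifts toward winners
        np.fill_diagonal(P, 0.0)
        np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # self-loops keep rows stochastic
        pi = np.full(n, 1.0 / n)
        for _ in range(n_iter):                   # power iteration: pi <- pi P
            pi_next = pi @ P
            if np.abs(pi_next - pi).sum() < tol:
                break
            pi = pi_next
        return pi                                 # stationary probabilities = scores

Objects that frequently beat strong opponents accumulate stationary mass, which is precisely the sense in which the stationary probabilities serve as scores.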
A Latent Source Model for Online Collaborative Filtering
Bresler, Guy, Chen, George H., Shah, Devavrat
Despite the prevalence of collaborative filtering in recommendation systems, there has been little theoretical development on why and how well it works, especially in the ``online'' setting, where items are recommended to users over time. We address this theoretical gap by introducing a model for online recommendation systems, cast item recommendation under the model as a learning problem, and analyze the performance of a cosine-similarity collaborative filtering method. In our model, each of $n$ users either likes or dislikes each of $m$ items. We assume there to be $k$ types of users, and all the users of a given type share a common string of probabilities determining the chance of liking each item. At each time step, we recommend an item to each user, where a key distinction from related bandit literature is that once a user consumes an item (e.g., watches a movie), then that item cannot be recommended to the same user again.
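To make the method concrete, here is a simplified Python sketch of cosine-similarity collaborative filtering with the no-repeat constraint made explicit. The {+1, -1, 0} encoding of like/dislike/unseen and the function name are assumptions for illustration.

    import numpy as np

    def recommend(ratings, user, consumed):
        """ratings: (n_users, n_items) array; +1 like, -1 dislike, 0 unseen."""
        norms = np.linalg.norm(ratings, axis=1)
        norms[norms == 0] = 1.0                  # guard users with no ratings yet
        sims = (ratings @ ratings[user]) / (norms * norms[user])
        sims[user] = 0.0                         # a user does not vote for herself
        scores = sims @ ratings                  # similarity-weighted vote per item
        scores[list(consumed)] = -np.inf         # consumed items are ineligible
        return int(np.argmax(scores))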
Structure learning of antiferromagnetic Ising models
Bresler, Guy, Gamarnik, David, Shah, Devavrat
In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. Our first result is an unconditional computational lower bound of $\Omega(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of statistical algorithms recently introduced by Feldman et al. The construction is related to the notoriously difficult learning parities with noise problem in computational learning theory. Our lower bound shows that the $\widetilde O(p^{d+2})$ runtime required by Bresler, Mossel, and Sly's exhaustive-search algorithm cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., most recent papers on structure learning assume that the model has the correlation decay property.
Iterative Collaborative Filtering for Sparse Noisy Tensor Estimation
Shah, Devavrat, Yu, Christina Lee
We consider the task of tensor estimation, i.e., estimating a low-rank, order-3 $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. In the context of matrix (order-2 tensor) estimation, a variety of algorithms have been proposed and analyzed in the literature, including the popular collaborative filtering algorithm that is extremely well utilized in practice. In the context of tensor estimation, however, progress has been limited. No natural extensions of collaborative filtering are known beyond ``flattening'' the tensor into a matrix and applying standard collaborative filtering. As the main contribution of this work, we introduce a generalization of the collaborative filtering algorithm for the setting of tensor estimation and argue that it achieves sample complexity that (nearly) matches the conjectured lower bound on the sample complexity. Interestingly, our generalization computes similarity from the matrix obtained by ``flattening'' the tensor, as in classical collaborative filtering, but by defining a novel ``graph'' using it. The algorithm recovers the tensor with mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \epsilon})$ for any arbitrarily small $\epsilon > 0$. It turns out that $p = \Omega(n^{-3/2})$ is the conjectured lower bound as well as the ``connectivity threshold'' of the graph considered to compute similarity in our algorithm.
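For intuition, the following simplified Python sketch flattens the tensor along its first mode, declares two indices similar when their flattened rows agree on commonly observed entries, and averages the neighbors' observations. The distance rule and threshold are stand-ins for the paper's actual graph construction.

    import numpy as np

    def estimate_tensor(T_obs, mask, radius=0.5):
        """T_obs, mask: (n, n, n) arrays; mask marks observed entries."""
        n = T_obs.shape[0]
        flat = T_obs.reshape(n, -1)          # mode-1 flattening: n x n^2
        seen = mask.reshape(n, -1).astype(bool)
        est = np.zeros((n, n, n))
        for u in range(n):
            neighbors = []
            for v in range(n):
                overlap = seen[u] & seen[v]  # commonly observed columns
                if overlap.any() and np.mean(
                        (flat[u, overlap] - flat[v, overlap]) ** 2) <= radius:
                    neighbors.append(v)
            nb_vals, nb_seen = flat[neighbors], seen[neighbors]
            counts = np.maximum(nb_seen.sum(axis=0), 1)
            row_est = (nb_vals * nb_seen).sum(axis=0) / counts
            est[u] = row_est.reshape(n, n)   # average similar rows' observations
        return est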
Model Agnostic High-Dimensional Error-in-Variable Regression
Agarwal, Anish, Shah, Devavrat, Shen, Dennis, Song, Dogyoon
We consider the problem of high-dimensional error-in-variable regression where we only observe a sparse, noisy version of the covariate data. We propose an algorithm that utilizes matrix estimation (ME) as a key subroutine to de-noise the corrupted data, and then performs ordinary least squares regression. When the ME subroutine is instantiated with hard singular value thresholding (HSVT), our results indicate that if the number of samples scales as $\omega( \rho^{-4} r \log^5 (p))$, then our in- and out-of-sample prediction error decays to $0$ as $p \rightarrow \infty$; $\rho$ represents the fraction of observed data, $r$ is the (approximate) rank of the true covariate matrix, and $p$ is the number of covariates. As an important byproduct of our approach, we demonstrate that HSVT with regression acts as implicit $\ell_0$-regularization since HSVT aims to find a low-rank structure within the covariance matrix. Thus, we can view the sparsity of the estimated parameter as a consequence of the covariate structure rather than a model assumption as is often considered in the literature. Moreover, our non-asymptotic bounds match (up to $\log^4(p)$ factors) the best guaranteed sample complexity results in the literature for algorithms that require precise knowledge of the underlying model; we highlight that our approach is model agnostic. In our analysis, we obtain two technical results of independent interest: first, we provide a simple bound on the spectral norm of random matrices with independent sub-exponential rows with randomly missing entries; second, we bound the max column sum error -- a nonstandard error metric -- for HSVT. Our setting enables us to apply our results to applications such as synthetic control for causal inference, time series analysis, and regression with privacy. It is important to note that the existing inventory of methods is unable to analyze these applications.
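A minimal Python sketch of this two-step procedure follows, assuming missing covariate entries are encoded as zeros and that the threshold t and observed fraction rho are supplied; the function names are illustrative.

    import numpy as np

    def hsvt(X, t):
        """Hard singular value thresholding: zero out singular values below t."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[s < t] = 0.0
        return (U * s) @ Vt

    def hsvt_ols(X_obs, y, rho, t):
        """De-noise the sparse, noisy covariates, then run ordinary least squares."""
        X_hat = hsvt(X_obs / rho, t)     # rescale for missingness, then de-noise
        beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
        return X_hat, beta

The division by rho de-biases the zero-filled matrix so its expectation matches the true covariates, after which thresholding discards the noise outside the (approximately) rank-$r$ signal subspace.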
On Reinforcement Learning Using Monte Carlo Tree Search with Supervised Learning: Non-Asymptotic Analysis
Shah, Devavrat, Xie, Qiaomin, Xu, Zhi
Inspired by the success of AlphaGo Zero (AGZ), which utilizes Monte Carlo Tree Search (MCTS) with supervised learning via a neural network to learn the optimal policy and value function, in this work we focus on formally establishing that such an approach indeed finds the optimal policy asymptotically, as well as establishing non-asymptotic guarantees in the process. We focus on the infinite-horizon discounted Markov Decision Process (MDP) to establish the results. To start with, this requires establishing a property of MCTS claimed in the literature: for any given query state, MCTS provides an approximate value function for that state with enough simulation steps of the MDP. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proof suggests that MCTS needs to be utilized with a polynomial rather than logarithmic "upper confidence bound" to establish its desired performance -- interestingly enough, AGZ chooses such a polynomial bound. Using this as a building block, combined with nearest neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator: by combining with supervised learning, it has a natural "bootstrapping" property that iteratively improves the value function approximation for all states, despite evaluating it at only finitely many states. In effect, we establish that to learn an $\varepsilon$ approximation of the value function in $\ell_\infty$ norm, MCTS combined with nearest neighbors requires samples scaling as $\widetilde{O}\big(\varepsilon^{-(d+4)}\big)$, where $d$ is the dimension of the state space. This is nearly optimal due to a minimax lower bound of $\widetilde{\Omega}\big(\varepsilon^{-(d+2)}\big).$
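To illustrate the logarithmic-versus-polynomial distinction, the sketch below shows both bonus forms for arm selection. The exponent alpha and constant beta are placeholders; the paper derives the specific polynomial form its analysis requires.

    import math

    def ucb_logarithmic(mean, t, s, c=2.0):
        """Classical UCT-style bonus: sqrt(c * log(t) / s)."""
        return mean + math.sqrt(c * math.log(t) / s)

    def ucb_polynomial(mean, t, s, alpha=0.5, beta=1.0):
        """A polynomial exploration bonus of the kind the analysis calls for."""
        return mean + beta * t ** alpha / s

    def select_arm(means, counts, bonus):
        """Pick the arm maximizing the bound; assumes each arm played once (s >= 1)."""
        t = sum(counts)
        return max(range(len(means)), key=lambda a: bonus(means[a], t, counts[a]))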
Mixture Learning from Partial Observations and Its Application to Ranking
Shah, Devavrat, Song, Dogyoon
Despite recent advances in rank aggregation and mixture learning, there has been limited success in learning a mixture model for ranking data. Motivated by the problem of learning a mixture of ranking models from pair-wise comparisons, we consider mixture learning from partial observations. The generic approaches for mixture learning do not generalize to this setting. Matrix estimation, however, provides a way to recover a structured underlying matrix from its partial, noisy observations. We utilize matrix estimation as a pre-processing step to extend the mixture learning problem to allow for partial observations. Instantiating our matrix estimation subroutine with singular value thresholding, we provide a bound on the estimation error with respect to the $\|\cdot\|_{2,\infty}$-norm. In particular, we show that if $p$ (the fraction of observed entries) scales as $\tilde{\Omega}((\frac{r}{d})^{\frac{1}{3}})$, then the normalized $\|\cdot\|_{2,\infty}$ error vanishes to $0$ as long as the underlying $N \times d$ ($N\geq d$) matrix is rank $r$; this holds true even if the noise is correlated across columns. As an application, we argue that if $\Gamma p=\tilde{\Omega}(\sqrt{r})$, then the mixture components can be correctly identified with $N=\mathrm{poly}(d)$ samples, where $\Gamma$ is the minimum gap between the mixture means. Further, we argue that a large class of popular ranking models (e.g., Mallows, Multinomial Logit (MNL)) satisfy the sub-Gaussian property when viewed through a pairwise embedding lens. Hence, our method provides a sufficient condition for efficiently recovering the mixture components for an important class of models. For example, mixtures of $r$ components can be clustered correctly using $\tilde{O}(rn^4)$ pair-wise comparisons when the components are well-separated and distributed as per either a Mallows, MNL, or any Random Utility Model over $n$ items.
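A simplified Python sketch of the pipeline: singular value thresholding de-noises the partially observed (zero-filled) matrix, after which the rows are clustered. Plain Lloyd's iterations stand in for the paper's clustering step, and the parameter names are illustrative.

    import numpy as np

    def svt(X, t):
        """Zero out singular values below threshold t."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s[s < t] = 0.0
        return (U * s) @ Vt

    def recover_components(X_obs, p, t, r, n_iter=50, seed=0):
        """X_obs: N x d matrix of partial observations (missing entries = 0)."""
        X_hat = svt(X_obs / p, t)        # rescale for missingness, then de-noise
        rng = np.random.default_rng(seed)
        centers = X_hat[rng.choice(len(X_hat), size=r, replace=False)]
        for _ in range(n_iter):          # Lloyd's iterations on de-noised rows
            d2 = ((X_hat[:, None, :] - centers[None]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for j in range(r):
                pts = X_hat[labels == j]
                if len(pts):
                    centers[j] = pts.mean(axis=0)
        return labels, centers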
Q-learning with Nearest Neighbors
Shah, Devavrat, Xie, Qiaomin
We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm to learn the optimal Q function using a nearest neighbor regression method. As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a $d$-dimensional state space and discount factor $\gamma \in (0,1)$, given an arbitrary sample path with ``covering time'' $L$, we establish that the algorithm is guaranteed to output an $\varepsilon$-accurate estimate of the optimal Q-function using $\Ot(L/(\varepsilon^3(1-\gamma)^7))$ samples. For instance, for a well-behaved MDP, the covering time of the sample path under the purely random policy scales as $\Ot(1/\varepsilon^d),$ so the sample complexity scales as $\Ot(1/\varepsilon^{d+3}).$ Indeed, we establish a lower bound that argues that the dependence of $\Omegat(1/\varepsilon^{d+2})$ is necessary.
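As a rough illustration, the Python sketch below maintains Q-values on a fixed set of anchor states and applies one-step Q-learning updates at the anchor nearest to each visited state along a single sample path. NNQL's actual nearest-neighbor averaging and step-size schedule are more refined than this fixed learning rate.

    import numpy as np

    def nnql(path, anchors, n_actions, gamma, lr=0.1):
        """path: iterable of (state, action, reward, next_state) transitions;
        anchors: (m, d) array of representative states."""
        Q = np.zeros((len(anchors), n_actions))
        nearest = lambda s: int(np.argmin(((anchors - s) ** 2).sum(axis=1)))
        for s, a, r, s_next in path:
            i, j = nearest(s), nearest(s_next)
            target = r + gamma * Q[j].max()     # one-step Bellman target
            Q[i, a] += lr * (target - Q[i, a])  # update nearest anchor's estimate
        return Q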