Goto

Collaborating Authors

 Muthukumar, Vidya


Online Model Selection for Reinforcement Learning with Function Approximation

arXiv.org Machine Learning

Deep reinforcement learning has achieved impressive successes yet often requires a very large amount of interaction data. This result is perhaps unsurprising, as using complicated function approximation often requires more data to fit, and early theoretical results on linear Markov decision processes provide regret bounds that scale with the dimension of the linear approximation. Ideally, we would like to automatically identify the minimal dimension of the approximation that is sufficient to encode an optimal policy. Towards this end, we consider the problem of model selection in RL with function approximation, given a set of candidate RL algorithms with known regret guarantees. The learner's goal is to adapt to the complexity of the optimal algorithm without knowing it \textit{a priori}. We present a meta-algorithm that successively rejects increasingly complex models using a simple statistical test. Given at least one candidate that satisfies realizability, we prove the meta-algorithm adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret compared to the optimal candidate's $\tilde{O}(\sqrt T)$ regret, where $T$ is the number of episodes and $L$ is the number of algorithms. The dimension and horizon dependencies remain optimal with respect to the best candidate, and our meta-algorithmic approach is flexible to incorporate multiple candidate algorithms and models. Finally, we show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates.


On the proliferation of support vectors in high dimensions

arXiv.org Machine Learning

The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.


Classification vs regression in overparameterized regimes: Does the loss function matter?

arXiv.org Machine Learning

Paradigmatic problems in supervised machine learning (ML) involve predicting an output response from an input, based on patterns extracted from a (training) dataset. In classification, the output response is (finitely) discrete and we need to classify input data into one of these discrete categories. In regression, the output is continuous, typically a real number or a vector. Owing to this important distinction in output response, the two tasks are typically treated differently. The differences in treatment manifest in two phases of modern ML: optimization (training), which consists of an algorithmic procedure to extract a predictor from the training data, typically by minimizing the training loss (also called empirical risk); and generalization (testing), which consists of an evaluation of the obtained predictor on a separate test, or validation, dataset. Traditionally, the choice of loss functions for both phases is starkly different across classification and regression tasks. The squared-loss function is typically used both for the training and the testing phases in regression. In contrast, the hinge or logistic (cross-entropy for multi-class problems) loss functions are typically used in the training phase of classification, while the very different 0-1 loss function is used for testing.


OSOM: A Simultaneously Optimal Algorithm for Multi-Armed and Linear Contextual Bandits

arXiv.org Machine Learning

We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden \textit{simple multi-armed bandit} structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for their alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.


Harmless interpolation of noisy data in regression

arXiv.org Machine Learning

In classification problems (i.e. when the labels Y are discrete), the scaling of the test error with respect to n is determined by characterizations of the VC-dimension [2]/Rademacher complexity [3] of the function class, which in the worst case increases with its number of parameters. In regression (i.e. when the labels Y are continuous), the mean-squared error of the ordinary least-squares estimate is characterized by the condition number of the regression matrix, which is reasonable for appropriate ratios of d/n but tends to increase astronomically as d approaches n. The qualitative fear is the same: if the function class is too complex, it starts to overfit noise and can generalize poorly to unseen test data. But there is a gap between "can" and "will" -- and indeed this conventional wisdom has been challenged by the recent advent of deeper and deeper neural networks. In particular, a thought-provoking paper [4] noted that several deep neural networks generalize well despite achieving zero or close to zero training error, and being so expressive that they even have the ability to fit pure noise. As they put it, "understanding deep learning requires rethinking generalization". How can we reconcile the fact that good interpolative solutions exist with the classical bias-variance tradeoff? These phenomena are being actively investigated in a statistical sense [5,6] and a computational sense [7-9] in classification problems and/or noiseless models.


Worst-case vs Average-case Design for Estimation from Fixed Pairwise Comparisons

arXiv.org Machine Learning

Pairwise comparison data arises in many domains, including tournament rankings, web search, and preference elicitation. Given noisy comparisons of a fixed subset of pairs of items, we study the problem of estimating the underlying comparison probabilities under the assumption of strong stochastic transitivity (SST). We also consider the noisy sorting subclass of the SST model. We show that when the assignment of items to the topology is arbitrary, these permutation-based models, unlike their parametric counterparts, do not admit consistent estimation for most comparison topologies used in practice. We then demonstrate that consistent estimation is possible when the assignment of items to the topology is randomized, thus establishing a dichotomy between worst-case and average-case designs. We propose two estimators in the average-case setting and analyze their risk, showing that it depends on the comparison topology only through the degree sequence of the topology. The rates achieved by these estimators are shown to be optimal for a large class of graphs. Our results are corroborated by simulations on multiple comparison topologies.