Goto

Collaborating Authors

 Search


Lass-0: sparse non-convex regression by local search

arXiv.org Machine Learning

We compute approximate solutions to L0 regularized linear regression using L1 regularization, also known as the Lasso, as an initialization step. Our algorithm, the Lass-0 ("Lass-zero"), uses a computationally efficient stepwise search to determine a locally optimal L0 solution given any L1 regularization solution. We present theoretical results of consistency under orthogonality and appropriate handling of redundant features. Empirically, we use synthetic data to demonstrate that Lass-0 solutions are closer to the true sparse support than L1 regularization models. Additionally, in real-world data Lass-0 finds more parsimonious solutions than L1 regularization while maintaining similar predictive accuracy.


Maximin Action Identification: A New Bandit Framework for Games

arXiv.org Machine Learning

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower-and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically.


A Minimax Theory for Adaptive Data Analysis

arXiv.org Machine Learning

In adaptive data analysis, the user makes a sequence of queries on the data, where at each step the choice of query may depend on the results in previous steps. The releases are often randomized in order to reduce overfitting for such adaptively chosen queries. In this paper, we propose a minimax framework for adaptive data analysis. Assuming Gaussianity of queries, we establish the first sharp minimax lower bound on the squared error in the order of $O(\frac{\sqrt{k}\sigma^2}{n})$, where $k$ is the number of queries asked, and $\sigma^2/n$ is the ordinary signal-to-noise ratio for a single query. Our lower bound is based on the construction of an approximately least favorable adversary who picks a sequence of queries that are most likely to be affected by overfitting. This approximately least favorable adversary uses only one level of adaptivity, suggesting that the minimax risk for 1-step adaptivity with k-1 initial releases and that for $k$-step adaptivity are on the same order. The key technical component of the lower bound proof is a reduction to finding the convoluting distribution that optimally obfuscates the sign of a Gaussian signal. Our lower bound construction also reveals a transparent and elementary proof of the matching upper bound as an alternative approach to Russo and Zou (2015), who used information-theoretic tools to provide the same upper bound. We believe that the proposed framework opens up opportunities to obtain theoretical insights for many other settings of adaptive data analysis, which would extend the idea to more practical realms.


The Automatic Statistician: A Relational Perspective

arXiv.org Machine Learning

Gaussian Processes (GPs) provide a general and analytically tractable way of modeling complex time-varying, nonparametric functions. The Automatic Bayesian Covariance Discovery (ABCD) system constructs natural-language description of time-series data by treating unknown time-series data nonparametrically using GP with a composite covariance kernel function. Unfortunately, learning a composite covariance kernel with a single time-series data set often results in less informative kernel that may not give qualitative, distinctive descriptions of data. We address this challenge by proposing two relational kernel learning methods which can model multiple time-series data sets by finding common, shared causes of changes. We show that the relational kernel learning methods find more accurate models for regression problems on several real-world data sets; US stock data, US house price index data and currency exchange rate data.


Minimax Lower Bounds for Realizable Transductive Classification

arXiv.org Machine Learning

Transductive learning considers a training set of m labeled samples and a test set of u unlabeled samples, with the goal of best labeling that particular test set. Conversely, inductive learning considers a training set of m labeled samples drawn iid from P(X,Y), with the goal of best labeling any future samples drawn iid from P(X). This comparison suggests that transduction is a much easier type of inference than induction, but is this really the case? This paper provides a negative answer to this question, by proving the first known minimax lower bounds for transductive, realizable, binary classification. Our lower bounds show that m should be at least Ω(d/ǫ log(1/δ)/ǫ) when ǫ-learning a concept class H of finite VC-dimension d with confidence 1 δ, for all m u. This result draws three important conclusions. First, general transduction is as hard as general induction, since both problems have Ω(d/m) minimax values. Second, the use of unlabeled data does not help general transduction, since supervised learning algorithms such as ERM and (Hanneke, 2015) match our transductive lower bounds while ignoring the unlabeled test set. Third, our transductive lower bounds imply lower bounds for semi-supervised learning, which add to the important discussion about the role of unlabeled data in machine learning.


Mapping Tractography Across Subjects

arXiv.org Machine Learning

Diffusion magnetic resonance imaging (dMRI) and tractography provide means to study the anatomical structures within the white matter of the brain. When studying tractography data across subjects, it is usually necessary to align, i.e. to register, tractographies together. This registration step is most often performed by applying the transformation resulting from the registration of other volumetric images (T1, FA). In contrast with registration methods that "transform" tractographies, in this work, we try to find which streamline in one tractography correspond to which streamline in the other tractography, without any transformation. In other words, we try to find a "mapping" between the tractographies. We propose a graph-based solution for the tractography mapping problem and we explain similarities and differences with the related well-known graph matching problem. Specifically, we define a loss function based on the pairwise streamline distance and reformulate the mapping problem as combinatorial optimization of that loss function. We show preliminary promising results where we compare the proposed method, implemented with simulated annealing, against a standard registration techniques in a task of segmentation of the corticospinal tract.


Minimax Structured Normal Means Inference

arXiv.org Machine Learning

The prevalence of high-dimensional signals in modern scientific investigation has inspired an influx of research on recovering structural information from noisy data. These problems arise across a variety of scientific and engineering disciplines; for example identifying cluster structure in communication or social networks, multiple hypothesis testing in genomics, or anomaly detection in sensor networking. Specific structural assumptions include sparsity [13], low-rankedness [11], cluster structure [15], and many others [8]. The literature in this direction focuses on three inference goals: detection, localization or recovery, and estimation or denoising. Detection tasks involve deciding whether an observation contains some meaningful information or is simply ambient noise, while recovery and estimation tasks involve more precisely characterizing the information contained in a signal. These problems are closely related, but also exhibit important differences, and this paper focuses on the recovery problem, where the goal is to identify, from a finite collection of signals, which signal produced the observed data. One frustration among researchers is that algorithmic and analytic techniques for these problems differ significantly for different structural assumptions. This issue was recently resolved in the context of the estimation, where the atomic norm [8] has provided a unifying algorithmic and analytical framework, but no such theory is available for detection and recovery problems. In this paper, we provide a unification for the recovery problem, leading to deeper understanding of how signal structure affects statistical performance.


Minimax Lower Bounds for Linear Independence Testing

arXiv.org Machine Learning

Linear independence testing is a fundamental information-theoretic and statistical problem that can be posed as follows: given $n$ points $\{(X_i,Y_i)\}^n_{i=1}$ from a $p+q$ dimensional multivariate distribution where $X_i \in \mathbb{R}^p$ and $Y_i \in\mathbb{R}^q$, determine whether $a^T X$ and $b^T Y$ are uncorrelated for every $a \in \mathbb{R}^p, b\in \mathbb{R}^q$ or not. We give minimax lower bound for this problem (when $p+q,n \to \infty$, $(p+q)/n \leq \kappa < \infty$, without sparsity assumptions). In summary, our results imply that $n$ must be at least as large as $\sqrt {pq}/\|\Sigma_{XY}\|_F^2$ for any procedure (test) to have non-trivial power, where $\Sigma_{XY}$ is the cross-covariance matrix of $X,Y$. We also provide some evidence that the lower bound is tight, by connections to two-sample testing and regression in specific settings.


Multimodal Hierarchical Dirichlet Process-based Active Perception

arXiv.org Machine Learning

In this paper, we propose an active perception method for recognizing object categories based on the multimodal hierarchical Dirichlet process (MHDP). The MHDP enables a robot to form object categories using multimodal information, e.g., visual, auditory, and haptic information, which can be observed by performing actions on an object. However, performing many actions on a target object requires a long time. In a real-time scenario, i.e., when the time is limited, the robot has to determine the set of actions that is most effective for recognizing a target object. We propose an MHDP-based active perception method that uses the information gain (IG) maximization criterion and lazy greedy algorithm. We show that the IG maximization criterion is optimal in the sense that the criterion is equivalent to a minimization of the expected Kullback--Leibler divergence between a final recognition state and the recognition state after the next set of actions. However, a straightforward calculation of IG is practically impossible. Therefore, we derive an efficient Monte Carlo approximation method for IG by making use of a property of the MHDP. We also show that the IG has submodular and non-decreasing properties as a set function because of the structure of the graphical model of the MHDP. Therefore, the IG maximization problem is reduced to a submodular maximization problem. This means that greedy and lazy greedy algorithms are effective and have a theoretical justification for their performance. We conducted an experiment using an upper-torso humanoid robot and a second one using synthetic data. The experimental results show that the method enables the robot to select a set of actions that allow it to recognize target objects quickly and accurately. The results support our theoretical outcomes.


Minimax Time Series Prediction

Neural Information Processing Systems

We consider an adversarial formulation of the problem ofpredicting a time series with square loss. The aim is to predictan arbitrary sequence of vectors almost as well as the bestsmooth comparator sequence in retrospect. Our approach allowsnatural measures of smoothness such as the squared norm ofincrements. More generally, we consider a linear time seriesmodel and penalize the comparator sequence through the energy ofthe implied driving noise terms. We derive the minimax strategyfor all problems of this type and show that it can be implementedefficiently. The optimal predictions are linear in the previousobservations. We obtain an explicit expression for the regret interms of the parameters defining the problem. For typical,simple definitions of smoothness, the computation of the optimalpredictions involves only sparse matrices. In the case ofnorm-constrained data, where the smoothness is defined in termsof the squared norm of the comparator's increments, we show thatthe regret grows as $T/\sqrt{\lambda_T}$, where $T$ is the lengthof the game and $\lambda_T$ is an increasing limit on comparatorsmoothness.