Statistical Learning
Locally Non-linear Embeddings for Extreme Multi-label Learning
Bhatia, Kush, Jain, Himanshu, Kar, Purushottam, Jain, Prateek, Varma, Manik
The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding-based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence that the effective number of labels can be reduced by projecting the high-dimensional label vectors onto a low-dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems, as the low-rank assumption is violated in most real-world applications. This paper develops the X-One classifier to address both limitations. The main technical contribution in X-One is a formulation for learning a small ensemble of local distance-preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows X-One to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors. We conducted extensive experiments on several real-world as well as benchmark data sets and compared our method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that X-One can make significantly more accurate predictions than the state-of-the-art methods, including both embeddings (by as much as 35%) and trees (by as much as 6%). X-One can also scale efficiently to data sets with a million labels, which are beyond the pale of leading embedding methods.
An Extragradient-Based Alternating Direction Method for Convex Minimization
Lin, Tianyi, Ma, Shiqian, Zhang, Shuzhong
In this paper, we consider the problem of minimizing the sum of two convex functions subject to linear linking constraints. The classical alternating direction type methods usually assume that the two convex functions have relatively easy proximal mappings. However, many problems arising from statistics, image processing and other fields have the structure that, while one of the two functions has an easy proximal mapping, the other function is smoothly convex but does not have an easy proximal mapping. Therefore, the classical alternating direction methods cannot be applied. To deal with this difficulty, we propose in this paper an alternating direction method based on extragradients. Under the assumption that the smooth function has a Lipschitz continuous gradient, we prove that the proposed method returns an $\epsilon$-optimal solution within $O(1/\epsilon)$ iterations. We apply the proposed method to solve a new statistical model called fused logistic regression. Our numerical experiments show that the proposed method performs very well when solving the test problems. We also test the performance of the proposed method by solving the lasso problem arising from statistics and compare the results with several existing efficient solvers for this problem; the results are very encouraging indeed.
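To make the two ingredients named in this abstract concrete, here is a minimal sketch of (a) an "easy" proximal mapping, namely soft-thresholding for the $\ell_1$ norm used in the lasso, and (b) a textbook extragradient step for a smooth function. The function names `soft_threshold` and `extragradient_step` and the step size `gamma` are illustrative assumptions; this is not the alternating direction method proposed in the paper, only the standard building blocks it refers to.

```python
def soft_threshold(v, t):
    """Proximal mapping of t * ||x||_1 at v, applied elementwise."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def extragradient_step(x, grad, gamma):
    """One textbook extragradient step for a smooth function.

    grad  : callable returning the gradient (a list) at a point
    gamma : step size (assumed small relative to 1 / Lipschitz constant)
    """
    # Prediction: take a gradient step from x.
    g = grad(x)
    y = [xi - gamma * gi for xi, gi in zip(x, g)]
    # Correction: step from x again, using the gradient evaluated at y.
    g_y = grad(y)
    return [xi - gamma * gi for xi, gi in zip(x, g_y)]


if __name__ == "__main__":
    print(soft_threshold([3.0, -0.5, -2.0], 1.0))
    # f(x) = 0.5 * ||x||^2 has gradient x.
    print(extragradient_step([4.0], lambda x: list(x), 0.5))
```

The extragradient step evaluates the gradient twice per iteration, which is what lets it work with only gradient access to the smooth term instead of its proximal mapping.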
Intrinsic Non-stationary Covariance Function for Climate Modeling
Dalal, Chintan A., Pavlovic, Vladimir, Kopp, Robert E.
Designing a covariance function that represents the underlying correlation is a crucial step in modeling complex natural systems, such as climate models. Geospatial datasets at a global scale usually suffer from non-stationarity and non-uniformly smooth spatial boundaries. A Gaussian process regression using a non-stationary covariance function has shown promise for this task, as this covariance function adapts to the variable correlation structure of the underlying distribution. In this paper, we generalize the non-stationary covariance function to address the aforementioned global scale geospatial issues. We define this generalized covariance function as an intrinsic non-stationary covariance function, because it uses intrinsic statistics of the symmetric positive definite matrices to represent the characteristic length scale and, thereby, models the local stochastic process. Experiments on a synthetic and real dataset of relative sea level changes across the world demonstrate improvements in the error metrics for the regression estimates using our newly proposed approach.
Multisection in the Stochastic Block Model using Semidefinite Programming
Agarwal, Naman, Bandeira, Afonso S., Koiliaris, Konstantinos, Kolla, Alexandra
Identifying underlying structure in graphs is a primitive question for scientists: can existing communities be located in a large graph? Is it possible to partition the vertices of a graph into strongly connected clusters? Several of these questions have been shown to be hard to answer, even approximately, so instead of looking for worst-case guarantees attention has shifted towards average-case analyses. In order to study such questions, the usual approach is to consider a random [McS01] or a semi-random [FK01, MMV14] generative model of graphs, and use it as a benchmark to test existing algorithms or to develop new ones. With respect to identifying underlying community structure, the Stochastic Block Model (SBM) (or planted partition model) has, in recent times, been one of the most popular choices. Its growing popularity is largely due to the fact that its structure is simple to describe, but at the same time it has interesting and involved phase transition properties which have only recently been discovered ([DKMZ11, MNS12, MNS13, ABH14, CX14, MNS14b, HWX14, HWX15, AS15, Ban15]). In this paper we consider the SBM on k-communities defined as follows.
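The planted partition model named above can be illustrated with a minimal sampler. This is a sketch of the generic textbook model, assuming equal-sized communities assigned round-robin, an intra-community edge probability `p` and an inter-community edge probability `q`. These names and the function `sample_sbm` are illustrative assumptions and may not match the parametrization the paper goes on to define.

```python
import random


def sample_sbm(n, k, p, q, seed=0):
    """Sample an undirected graph from a generic planted partition model.

    n : number of vertices
    k : number of communities (vertex i is placed in community i % k)
    p : probability of an edge between vertices in the same community
    q : probability of an edge between vertices in different communities
    Returns (labels, edges) where edges is a set of pairs (i, j) with i < j.
    """
    rng = random.Random(seed)
    labels = [i % k for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if labels[i] == labels[j] else q
            # Each pair is connected independently of all others.
            if rng.random() < prob:
                edges.add((i, j))
    return labels, edges


if __name__ == "__main__":
    labels, edges = sample_sbm(n=12, k=3, p=0.8, q=0.1)
    print(labels)
    print(sorted(edges))
```

Recovering `labels` from `edges` alone is the multisection task; it becomes harder as `p` and `q` get closer.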
AutoCompete: A Framework for Machine Learning Competition
Thakur, Abhishek, Krohn-Grimberghe, Artus
In this paper, we propose AutoCompete, a highly automated machine learning framework for tackling machine learning competitions. We have learned, validated and improved this framework over a period of more than two years by participating in online machine learning competitions. It aims at minimizing the human interference required to build a first useful predictive model and to assess the practical difficulty of a given machine learning challenge. The proposed system helps in identifying data types, choosing a machine learning model, tuning hyper-parameters, avoiding over-fitting, and optimizing for a provided evaluation metric. We also observe that the proposed system produces better (or comparable) results with less runtime compared to other approaches.
Ego-Object Discovery
Lifelogging devices are spreading faster every day. This growth can bring great benefits for developing methods that extract meaningful information about the user wearing the device and his/her environment. In this paper, we propose a semi-supervised strategy for easily discovering objects relevant to the person wearing a first-person camera. Given an egocentric video or image sequence acquired by the camera, our algorithm uses both the appearance extracted by means of a convolutional neural network and an object refill methodology that allows objects to be discovered even when they appear only rarely in the collection of images. An SVM filtering strategy is applied to deal with the large share of false-positive object candidates found by most state-of-the-art object detectors. We validate our method on a new egocentric dataset of 4912 daily images acquired by 4 persons as well as on both the PASCAL 2012 and MSRC datasets. On all of them we obtain results that largely outperform the state-of-the-art approach. We make public both the EDUB dataset and the algorithm code.
Feature-based tuning of simulated annealing applied to the curriculum-based course timetabling problem
Bellio, Ruggero, Ceschia, Sara, Di Gaspero, Luca, Schaerf, Andrea, Urli, Tommaso
We consider the university course timetabling problem, which is one of the most studied problems in educational timetabling. In particular, we focus our attention on the formulation known as the curriculum-based course timetabling problem, which has been tackled by many researchers and for which there are many available benchmarks. The contribution of this paper is twofold. First, we propose an effective and robust single-stage simulated annealing method for solving the problem. Secondly, we design and apply an extensive and statistically-principled methodology for the parameter tuning procedure. The outcome of this analysis is a methodology for modeling the relationship between search method parameters and instance features that allows us to set the parameters for unseen instances on the basis of a simple inspection of the instance itself. Using this methodology, our algorithm, despite its apparent simplicity, has been able to achieve high quality results on a set of popular benchmarks. A final contribution of the paper is a novel set of real-world instances, which could be used as a benchmark for future comparison.
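For readers unfamiliar with the search method named in this abstract, the following is a minimal generic simulated annealing loop with geometric cooling. It is a sketch under stated assumptions, not the authors' single-stage method: the function name `simulated_annealing`, the parameters `t0`, `cooling` and `steps`, and the toy cost function are all illustrative. Parameters such as the start temperature and cooling rate are the kind of settings a tuning methodology would choose per instance.

```python
import math
import random


def simulated_annealing(cost, neighbor, start, t0=10.0, cooling=0.95,
                        steps=1000, seed=0):
    """Generic simulated annealing with geometric cooling.

    cost     : callable mapping a state to a number (lower is better)
    neighbor : callable (state, rng) -> a randomly perturbed state
    Returns the best state seen and its cost.
    """
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    t = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worsening moves with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost


if __name__ == "__main__":
    # Toy problem: minimize (x - 3)^2 over the integers.
    best, best_cost = simulated_annealing(
        cost=lambda x: (x - 3) ** 2,
        neighbor=lambda x, rng: x + rng.choice([-1, 1]),
        start=20,
    )
    print(best, best_cost)
```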
Multi-Step Stochastic ADMM in High Dimensions: Applications to Sparse Optimization and Noisy Matrix Decomposition
Sedghi, Hanie, Anandkumar, Anima, Jonckheere, Edmond
We propose an efficient ADMM method with guarantees for high-dimensional problems. We provide explicit bounds for the sparse optimization problem and the noisy matrix decomposition problem. For sparse optimization, we establish that the modified ADMM method has an optimal convergence rate of $\mathcal{O}(s\log d/T)$, where $s$ is the sparsity level, $d$ is the data dimension and $T$ is the number of steps. This matches the minimax lower bounds for sparse estimation. For matrix decomposition into sparse and low rank components, we provide the first guarantees for any online method, and prove a convergence rate of $\tilde{\mathcal{O}}((s+r)\beta^2(p) /T) + \mathcal{O}(1/p)$ for a $p\times p$ matrix, where $s$ is the sparsity level, $r$ is the rank and $\Theta(\sqrt{p})\leq \beta(p)\leq \Theta(p)$. Our guarantees match the minimax lower bound with respect to $s,r$ and $T$. In addition, we match the minimax lower bound with respect to the matrix dimension $p$, i.e. $\beta(p)=\Theta(\sqrt{p})$, for many important statistical models including the independent noise model, the linear Bayesian network and the latent Gaussian graphical model under some conditions. Our ADMM method is based on epoch-based annealing and consists of inexpensive steps which involve projections onto simple norm balls. Experiments show that for both sparse optimization and matrix decomposition problems, our algorithm outperforms the state-of-the-art methods. In particular, we reach higher accuracy with the same time complexity.
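To illustrate why projections onto simple norm balls count as inexpensive steps, here is a minimal sketch of two closed-form projections. The abstract does not say which norms the method uses, so the choice of the Euclidean and $\ell_\infty$ balls, and the function names `project_l2_ball` and `project_linf_ball`, are assumptions made purely for illustration.

```python
import math


def project_l2_ball(x, radius):
    """Euclidean projection of x onto {z : ||z||_2 <= radius}."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)  # already inside the ball
    scale = radius / norm
    return [xi * scale for xi in x]


def project_linf_ball(x, radius):
    """Euclidean projection of x onto {z : ||z||_inf <= radius}.

    This is just elementwise clipping to [-radius, radius].
    """
    return [max(-radius, min(radius, xi)) for xi in x]


if __name__ == "__main__":
    print(project_l2_ball([3.0, 4.0], 1.0))
    print(project_linf_ball([3.0, -4.0, 0.5], 1.0))
```

Both projections cost time linear in the dimension, with no inner optimization loop.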
Semiblind Hyperspectral Unmixing in the Presence of Spectral Library Mismatches
Fu, Xiao, Ma, Wing-Kin, Bioucas-Dias, José, Chan, Tsung-Han
The dictionary-aided sparse regression (SR) approach has recently emerged as a promising alternative to hyperspectral unmixing (HU) in remote sensing. By using an available spectral library as a dictionary, the SR approach identifies the underlying materials in a given hyperspectral image by selecting a small subset of spectral samples in the dictionary to represent the whole image. A drawback with the current SR developments is that an actual spectral signature in the scene is often assumed to have zero mismatch with its corresponding dictionary sample, and such an assumption is considered too ideal in practice. In this paper, we tackle the spectral signature mismatch problem by proposing a dictionary-adjusted nonconvex sparsity-encouraging regression (DANSER) framework. The main idea is to incorporate dictionary correcting variables in an SR formulation. A simple and low per-iteration complexity algorithm is tailor-designed for practical realization of DANSER. Using the same dictionary correcting idea, we also propose a robust subspace solution for dictionary pruning. Extensive simulations and real-data experiments show that the proposed method is effective in mitigating the undesirable spectral signature mismatch effects.
Inference for determinantal point processes without spectral knowledge
Bardenet, Rémi, Titsias, Michalis K.
Determinantal point processes (DPPs) are point process models that naturally encode diversity between the points of a given realization, through a positive definite kernel $K$. DPPs possess desirable properties, such as exact sampling or analyticity of the moments, but learning the parameters of kernel $K$ through likelihood-based inference is not straightforward. First, the kernel that appears in the likelihood is not $K$, but another kernel $L$ related to $K$ through an often intractable spectral decomposition. This issue is typically bypassed in machine learning by directly parametrizing the kernel $L$, at the price of some interpretability of the model parameters. We follow this approach here. Second, the likelihood has an intractable normalizing constant, which takes the form of a large determinant in the case of a DPP over a finite set of objects, and the form of a Fredholm determinant in the case of a DPP over a continuous domain. Our main contribution is to derive bounds on the likelihood of a DPP, both for finite and continuous domains. Unlike previous work, our bounds are cheap to evaluate since they do not rely on approximating the spectrum of a large matrix or an operator. Through usual arguments, these bounds thus yield cheap variational inference and moderately expensive exact Markov chain Monte Carlo inference methods for DPPs.
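The normalizing constant discussed above can be seen in the standard finite-set formula $P(A) = \det(L_A) / \det(L + I)$, where $L_A$ is the submatrix of $L$ indexed by $A$. The sketch below evaluates this exact likelihood for a tiny kernel using a plain Gaussian-elimination determinant. It shows the quantity whose cost grows with the ground set, not the bounds derived in the paper; the function names `det` and `dpp_prob` and the example kernel are illustrative assumptions.

```python
from itertools import combinations


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    d = 1.0  # the determinant of an empty matrix is 1
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def dpp_prob(L, A):
    """Probability that a DPP with L-ensemble kernel L draws exactly A."""
    n = len(L)
    sub = [[L[i][j] for j in A] for i in A]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    # det(L + I) is the normalizing constant over all 2^n subsets.
    return det(sub) / det(L_plus_I)


if __name__ == "__main__":
    L = [[2.0, 1.0], [1.0, 2.0]]
    for size in range(3):
        for A in combinations(range(2), size):
            print(A, dpp_prob(L, list(A)))
```

Computing `det(L_plus_I)` exactly costs cubic time in the size of the ground set, which is the expense that cheaper bounds aim to avoid.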