Statistical Learning
Dropout Training as Adaptive Regularization
Wager, Stefan, Wang, Sida, Liang, Percy
Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
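As a rough illustration of the adaptive-regularization view, the sketch below evaluates a quadratic approximation of the dropout penalty for logistic regression and compares it with an L2 penalty whose per-coordinate weights are the diagonal of an estimated Fisher information matrix. The function names, the toy data, and the noise convention (each feature is zeroed with probability `delta` and otherwise rescaled by `1 / (1 - delta)`) are assumptions made for this example and are not taken from the abstract.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def quadratic_dropout_penalty(X, beta, delta):
    """0.5 * sum_i p_i (1 - p_i) * Var[noised x_i . beta] for logistic regression."""
    scale = delta / (1.0 - delta)  # variance factor of the assumed dropout noise
    total = 0.0
    for x in X:
        p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
        noise_var = scale * sum((xj * bj) ** 2 for xj, bj in zip(x, beta))
        total += 0.5 * p * (1.0 - p) * noise_var
    return total


def fisher_weighted_l2(X, beta, delta):
    """The same quantity written as an L2 penalty weighted by diag(X^T V X)."""
    scale = delta / (1.0 - delta)
    fisher_diag = [0.0] * len(beta)
    for x in X:
        p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
        for j, xj in enumerate(x):
            fisher_diag[j] += p * (1.0 - p) * xj ** 2
    return 0.5 * scale * sum(f * bj ** 2 for f, bj in zip(fisher_diag, beta))


# Hypothetical toy data: three examples, three features.
X = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [0.0, 1.0, 1.0]]
beta = [0.3, -0.7, 0.2]
penalty = quadratic_dropout_penalty(X, beta, delta=0.5)
weighted = fisher_weighted_l2(X, beta, delta=0.5)
```

The two functions compute the same number in two arrangements. The second makes the adaptive weighting visible: coordinates with large estimated Fisher information are penalized more heavily.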
Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Campbell, Trevor, Liu, Miao, Kulis, Brian, How, Jonathan P., Carin, Lawrence
This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets.
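To give a feel for the kind of hard clustering that low-variance asymptotics tends to produce, here is a minimal sketch of a static, DP-means-style rule: a point opens a new cluster when its squared distance to every existing center exceeds a penalty. This is not the algorithm of the paper, which additionally handles clusters that evolve across batches. The function name `dp_means_sketch`, the `penalty` parameter, and the first-point initialization are assumptions for illustration.

```python
def dp_means_sketch(points, penalty, max_iters=100):
    """k-means-like hard clustering that creates clusters on demand."""
    centers = [list(points[0])]  # assumed initialization: first point
    labels = [0] * len(points)
    for _ in range(max_iters):
        changed = False
        for i, x in enumerate(points):
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            k = min(range(len(centers)), key=d2.__getitem__)
            if d2[k] > penalty:
                # Too far from every center: open a new cluster at this point.
                centers.append(list(x))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Move each non-empty cluster's center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return centers, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels = dp_means_sketch(points, penalty=4.0)
```

As with k-means, each pass alternates a hard assignment step with a center update, so the number of clusters is discovered rather than fixed in advance.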
Parameterless Optimal Approximate Message Passing
Mousavi, Ali, Maleki, Arian, Baraniuk, Richard G.
Iterative thresholding algorithms are well-suited for high-dimensional problems in sparse recovery and compressive sensing. The performance of this class of algorithms depends heavily on the tuning of certain threshold parameters. In particular, both the final reconstruction error and the convergence rate of the algorithm depend crucially on how the threshold parameter is set at each step. In this paper, we propose a parameter-free approximate message passing (AMP) algorithm that sets the threshold parameter at each iteration fully automatically, requiring neither information about the signal to be reconstructed nor any tuning by the user. We show that the proposed method attains both the minimum reconstruction error and the highest convergence rate. Our method applies the Stein unbiased risk estimate (SURE) together with a modified gradient descent to find the optimal threshold in each iteration. Given the connections between AMP and LASSO, the method could also be employed to find the solution of the LASSO for the optimal regularization parameter. To the best of our knowledge, this is the first work on parameter tuning that obtains the fastest convergence rate with theoretical guarantees.
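The threshold-selection idea can be sketched on a single denoising step. The example below computes the standard SURE expression for soft thresholding under Gaussian noise and picks the candidate threshold with the smallest estimated risk. It substitutes a plain grid search for the modified gradient descent mentioned in the abstract, omits the AMP iteration itself, and assumes the noise level `sigma` is known. All function names and the toy data are hypothetical.

```python
import math


def soft_threshold(y, tau):
    """Shrink every entry of y toward zero by tau."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in y]


def sure_soft_threshold(y, tau, sigma):
    """SURE of the squared error of soft thresholding at level tau."""
    n = len(y)
    below = sum(1 for v in y if abs(v) <= tau)  # entries set to zero
    clipped = sum(min(v * v, tau * tau) for v in y)
    return n * sigma ** 2 - 2.0 * sigma ** 2 * below + clipped


def pick_threshold(y, sigma, candidates):
    """Grid search: return the candidate with the smallest SURE value."""
    return min(candidates, key=lambda t: sure_soft_threshold(y, t, sigma))


# Hypothetical noisy observation of a sparse signal.
y = [3.0, -0.2, 0.1, -2.5, 0.05]
sigma = 0.5
tau = pick_threshold(y, sigma, candidates=[0.0, 0.5, 1.0, 3.0])
denoised = soft_threshold(y, tau)
```

Because SURE depends only on the observed data and the noise level, the threshold can be chosen without access to the true signal, which is what makes tuning-free operation plausible.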
Nonlinear unmixing of hyperspectral images using a semiparametric model and spatial regularization
Chen, Jie, Richard, Cédric, Hero, Alfred O. III
Incorporating spatial information into hyperspectral unmixing procedures has been shown to have positive effects, due to the inherent spatial-spectral duality in hyperspectral scenes. Current work that considers spatial information focuses mainly on the linear mixing model. In this paper, we investigate a variational approach to incorporating spatial correlation into a nonlinear unmixing procedure. We derive a nonlinear algorithm operating in reproducing kernel Hilbert spaces, with an $\ell_1$ local variation norm as the spatial regularizer. Experimental results on both synthetic and real data illustrate the effectiveness of the proposed scheme.
A dependent partition-valued process for multitask clustering and time evolving network modelling
Palla, Konstantina, Knowles, David A., Ghahramani, Zoubin
The fundamental aim of clustering algorithms is to partition data points. We consider tasks where the discovered partition is allowed to vary with some covariate such as space or time. One approach would be to use fragmentation-coagulation processes, but these, being Markov processes, are restricted to linear or tree structured covariate spaces. We define a partition-valued process on an arbitrary covariate space using Gaussian processes. We use the process to construct a multitask clustering model which partitions datapoints in a similar way across multiple data sources, and a time series model of network data which allows cluster assignments to vary over time. We describe sampling algorithms for inference and apply our method to defining cancer subtypes based on different types of cellular characteristics, finding regulatory modules from gene expression data from multiple human populations, and discovering time varying community structure in a social network.
Spatial statistics, image analysis and percolation theory
Langovoy, Mikhail, Habeck, Michael, Schölkopf, Bernhard
We develop a novel method for detection of signals and reconstruction of images in the presence of random noise. The method uses results from percolation theory. We specifically address the problem of detecting multiple objects of unknown shapes in the case of nonparametric noise. The noise density is unknown and can be heavy-tailed. The objects of interest have unknown, varying intensities. No boundary shape constraints are imposed on the objects; only a set of weak bulk conditions is required. We view the object detection problem as a multiple hypothesis testing problem for discrete statistical inverse problems. We present an algorithm that detects greyscale objects of various shapes in noisy images. We prove results on the consistency and algorithmic complexity of our procedures. Applications to cryo-electron microscopy are presented.
Distributed k-Means and k-Median Clustering on General Topologies
Balcan, Maria Florina, Ehrlich, Steven, Liang, Yingyu
This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by \cite{har2004coresets}, we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms.
Safe and Efficient Screening For Sparse Support Vector Machine
Assume that $X \in \mathbb{R}^{n \times m}$ is a data set containing $n$ samples, $X = (x_1, \ldots$ Let $w^*(\lambda)$ be the optimal solution of Eq. (1). All the features with nonzero values in $w^*(\lambda)$ are called active … The Lagrangian multiplier [1] of the problem defined in Eq. (1) is: … Eq. (2) can be reformulated as: … Since the problem defined in Eq. (1) is convex and the optimal value of the … In the preceding equation, … and $Y$ is a diagonal matrix … When the input is given, it can be obtained in a closed form. The L1-regularized L2-loss SVM in Eq. (1) can be rewritten in an unconstrained … Eq. (22) shows that the necessary condition for a feature $f$ to be active in the … To bound the value of $\theta^\top f$, we need to first construct a closed convex set $K$ that … We first study how to construct the convex set $K$. In the following, we construct a closed convex set $K$ based on Eq. (19) and … The proof of this proposition can be found in [2]. Let $\theta_1$ and $\theta_2$ be the optimal solutions of the problem defined in Eq. (19) for … Assume that $\lambda_1$ … $\lambda_2$, and $\theta_1$ is known. In the preceding equations, $\theta_1$, $\lambda_1$, and $\lambda_2$ are known. Figure 1 shows an example of $K$ in a two-dimensional space, where $K$ is indicated by the shaded area. Besides the $n$-dimensional hyperball defined in Eq. (32), it is possible to … By applying Proposition 6.1 to the objective function defined in Eq. (33) for $\theta_1$, … Let $t \geq 0$. By substituting $\theta = \theta_2$ and $\theta = \theta_1$ into Eq. … Eq. (35), respectively, and then combining the two obtained equations, the … As the value of $t$ changes from 0 to $\infty$, Eq. (36) generates a series of hyperballs. Eq. (36) reaches its minimum when … The theorem can be proved by minimizing the $r$ defined in Eq. (36).
Para-active learning
Agarwal, Alekh, Bottou, Leon, Dudik, Miroslav, Langford, John
Training examples are not all equally informative. Active learning strategies leverage this observation to massively reduce the number of examples that need to be labeled. We leverage the same observation to build a generic strategy for parallelizing learning algorithms. This strategy is effective because the search for informative examples is highly parallelizable and because, as we show, its performance does not deteriorate when the sifting process relies on a slightly outdated model. Parallel active learning is particularly attractive for training nonlinear models with nonlinear representations, because there are few practical parallel learning algorithms for such models. We report preliminary experiments using both kernel SVMs and SGD-trained neural networks.
Online Ensemble Learning for Imbalanced Data Streams
While both cost-sensitive learning and online learning have been studied extensively, little effort has gone into dealing with the two issues simultaneously. To address this challenge, this paper proposes a novel learning framework. The key idea is to fuse online ensemble algorithms with state-of-the-art batch-mode cost-sensitive bagging and boosting algorithms. The framework bridges two separately developed research areas and yields, for the first time, a set of theoretically sound online cost-sensitive bagging and boosting algorithms. Unlike other online cost-sensitive learning algorithms, which lack theoretical analysis of their asymptotic properties, the proposed algorithms are guaranteed to converge under certain conditions, and experiments on benchmark data sets validate their effectiveness and efficiency.