Statistical Learning
Adaptive Multinomial Matrix Completion
Klopp, Olga, Lafond, Jean, Moulines, Eric, Salmon, Joseph
The task of estimating a matrix given a sample of observed entries is known as the \emph{matrix completion problem}. Most works on matrix completion have focused on recovering an unknown real-valued low-rank matrix from a random sample of its entries. Here, we investigate the case of highly quantized observations when the measurements can take only a small number of values. These quantized outputs are generated according to a probability distribution parametrized by the unknown matrix of interest. This model corresponds, for example, to ratings in recommender systems or labels in multi-class classification. We consider a general, non-uniform, sampling scheme and give theoretical guarantees on the performance of a constrained, nuclear norm penalized maximum likelihood estimator. One important advantage of this estimator is that it does not require knowledge of the rank or an upper bound on the nuclear norm of the unknown matrix and, thus, it is adaptive. We provide lower bounds showing that our estimator is minimax optimal. An efficient algorithm based on lifted coordinate gradient descent is proposed to compute the estimator. A limited Monte-Carlo experiment, using both simulated and real data is provided to support our claims.
LARSEN-ELM: Selective Ensemble of Extreme Learning Machines using LARS for Blended Data
Han, Bo, He, Bo, Nian, Rui, Ma, Mengmeng, Zhang, Shujing, Li, Minghui, Lendasse, Amaury
Extreme learning machine (ELM) as a neural network algorithm has shown its good performance, such as fast speed, simple structure etc, but also, weak robustness is an unavoidable defect in original ELM for blended data. We present a new machine learning framework called LARSEN-ELM for overcoming this problem. In our paper, we would like to show two key steps in LARSEN-ELM. In the first step, preprocessing, we select the input variables highly related to the output using least angle regression (LARS). In the second step, training, we employ Genetic Algorithm (GA) based selective ensemble and original ELM. In the experiments, we apply a sum of two sines and four datasets from UCI repository to verify the robustness of our approach. The experimental results show that compared with original ELM and other methods such as OP-ELM, GASEN-ELM and LSBoost, LARSEN-ELM significantly improve robustness performance while keeping a relatively high speed.
An application of topological graph clustering to protein function prediction
Bowman, R. Sean, Heisterkamp, Douglas, Johnson, Jesse, O'Donnol, Danielle
We use a semisupervised learning algorithm based on a topological data analysis approach to assign functional categories to yeast proteins using similarity graphs. This new approach to analyzing biological networks yields results that are as good as or better than state of the art existing approaches.
Nonconvex Statistical Optimization: Minimax-Optimal Sparse PCA in Polynomial Time
Wang, Zhaoran, Lu, Huanran, Liu, Han
Sparse principal component analysis (PCA) involves nonconvex optimization for which the global solution is hard to obtain. To address this issue, one popular approach is convex relaxation. However, such an approach may produce suboptimal estimators due to the relaxation effect. To optimally estimate sparse principal subspaces, we propose a two-stage computational framework named "tighten after relax": Within the 'relax' stage, we approximately solve a convex relaxation of sparse PCA with early stopping to obtain a desired initial estimator; For the 'tighten' stage, we propose a novel algorithm called sparse orthogonal iteration pursuit (SOAP), which iteratively refines the initial estimator by directly solving the underlying nonconvex problem. A key concept of this two-stage framework is the basin of attraction. It represents a local region within which the `tighten' stage has desired computational and statistical guarantees. We prove that, the initial estimator obtained from the 'relax' stage falls into such a region, and hence SOAP geometrically converges to a principal subspace estimator which is minimax-optimal within a certain model class. Unlike most existing sparse PCA estimators, our approach applies to the non-spiked covariance models, and adapts to non-Gaussianity as well as dependent data settings. Moreover, through analyzing the computational complexity of the two stages, we illustrate an interesting phenomenon that larger sample size can reduce the total iteration complexity. Our framework motivates a general paradigm for solving many complex statistical problems which involve nonconvex optimization with provable guarantees.
Joint Hierarchical Gaussian Process Model with Application to Forecast in Medical Monitoring
Duan, Leo L., Clancy, John P., Szczesniak, Rhonda D.
A novel extrapolation method is proposed for longitudinal forecasting. A hierarchical Gaussian process model is used to combine nonlinear population change and individual memory of the past to make prediction. The prediction error is minimized through the hierarchical design. The method is further extended to joint modeling of continuous measurements and survival events. The baseline hazard, covariate and joint effects are conveniently modeled in this hierarchical structure. The estimation and inference are implemented in fully Bayesian framework using the objective and shrinkage priors. In simulation studies, this model shows robustness in latent estimation, correlation detection and high accuracy in forecasting. The model is illustrated with medical monitoring data from cystic fibrosis (CF) patients. Estimation and forecasts are obtained in the measurement of lung function and records of acute respiratory events. Keyword: Extrapolation, Joint Model, Longitudinal Model, Hierarchical Gaussian Process, Cystic Fibrosis, Medical Monitoring
A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets
Godfrey, Daniel, Johns, Caley, Meyer, Carl, Race, Shaina, Sadek, Carol
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat this problem, we created an algorithm that combined the DBSCAN algorithm and a consensus matrix. This way we are left with the tweets that are related to those dominant themes. We then used cluster analysis to find those topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF) and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.
Uniform Sampling for Matrix Approximation
Cohen, Michael B., Lee, Yin Tat, Musco, Cameron, Musco, Christopher, Peng, Richard, Sidford, Aaron
Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time significantly. For theoretical performance guarantees, each row must be sampled with probability proportional to its statistical leverage score. Unfortunately, leverage scores are difficult to compute. A simple alternative is to sample rows uniformly at random. While this often works, uniform sampling will eliminate critical row information for many natural instances. We take a fresh look at uniform sampling by examining what information it does preserve. Specifically, we show that uniform sampling yields a matrix that, in some sense, well approximates a large fraction of the original. While this weak form of approximation is not enough for solving linear regression directly, it is enough to compute a better approximation. This observation leads to simple iterative row sampling algorithms for matrix approximation that run in input-sparsity time and preserve row structure and sparsity at all intermediate steps. In addition to an improved understanding of uniform sampling, our main proof introduces a structural result of independent interest: we show that every matrix can be made to have low coherence by reweighting a small subset of its rows.
On the Sample Complexity of Subspace Learning
Rudi, Alessandro, Canas, Guille D., Rosasco, Lorenzo
A large number of algorithms in machine learning, from principal component analysis (PCA), and its non-linear (kernel) extensions, to more recent spectral embedding and support estimation methods, rely on estimating a linear subspace from samples. In this paper we introduce a general formulation of this problem and derive novel learning error estimates. Our results rely on natural assumptions on the spectral properties of the covariance operator associated to the data distribu- tion, and hold for a wide class of metrics between subspaces. As special cases, we discuss sharp error estimates for the reconstruction properties of PCA and spectral support estimation. Key to our analysis is an operator theoretic approach that has broad applicability to spectral learning methods.
Ranking via Robust Binary Classification and Parallel Parameter Estimation in Large-Scale Data
Yun, Hyokun, Raman, Parameswaran, Vishwanathan, S. V. N.
We propose RoBiRank, a ranking algorithm that is motivated by observing a close connection between evaluation metrics for learning to rank and loss functions for robust classification. It shows competitive performance on standard benchmark datasets against a number of other representative algorithms in the literature. We also discuss extensions of RoBiRank to large scale problems where explicit feature vectors and scores are not given. We show that RoBiRank can be efficiently parallelized across a large number of machines; for a task that requires 386, 133 49, 824, 519 pairwise interactions between items to be ranked, RoBi-Rank finds solutions that are of dramatically higher quality than that can be found by a state-of-the-art competitor algorithm, given the same amount of wall-clock time for computation.
Policy Iteration Based on Stochastic Factorization
Barreto, A. M. S., Pineau, J., Precup, D.
When a transition probability matrix is represented as the product of two stochastic matrices, one can swap the factors of the multiplication to obtain another transition matrix that retains some fundamental characteristics of the original. Since the derived matrix can be much smaller than its precursor, this property can be exploited to create a compact version of a Markov decision process (MDP), and hence to reduce the computational cost of dynamic programming. Building on this idea, this paper presents an approximate policy iteration algorithm called policy iteration based on stochastic factorization, or PISF for short. In terms of computational complexity, PISF replaces standard policy iteration's cubic dependence on the size of the MDP with a function that grows only linearly with the number of states in the model. The proposed algorithm also enjoys nice theoretical properties: it always terminates after a finite number of iterations and returns a decision policy whose performance only depends on the quality of the stochastic factorization. In particular, if the approximation error in the factorization is sufficiently small, PISF computes the optimal value function of the MDP. The paper also discusses practical ways of factoring an MDP and illustrates the usefulness of the proposed algorithm with an application involving a large-scale decision problem of real economical interest.