Statistical Learning
Oracle inequalities for ranking and U-processes with Lasso penalty
Model selection is an important challenge, if one works with data sets containing many predictors. Finding relevant variables helps to understand better the problem and improves statistical inference. In t he literature there are many methods solving such problems. One of them is empiri cal risk minimization with the penalty, for instance Lasso (Tibshir ani, 1996). The main characteristic of this procedure is an ability to selec t relevant predictors and estimate unknown parameters simultaneously. In th e paper we apply these ideas to the pairwise ranking problem (ordinal r egression) that 1 relates to predicting or guessing the ordering between obje cts on the basis of their observed predictors. The problem of ranking has num erous applications in practice, for instance in information retrieval, banking or quality control.
Macau: Scalable Bayesian Multi-relational Factorization with Side Information using MCMC
Simm, Jaak, Arany, Adam, Zakeri, Pooya, Haber, Tom, Wegner, Jörg K., Chupakhin, Vladimir, Ceulemans, Hugo, Moreau, Yves
We propose Macau, a powerful and flexible Bayesian factorization method for heterogeneous data. Our model can factorize any set of entities and relations that can be represented by a relational model, including tensors and also multiple relations for each entity. Macau can also incorporate side information, specifically entity and relation features, which are crucial for predicting sparsely observed relations. Macau scales to millions of entity instances, hundred millions of observations, and sparse entity features with millions of dimensions. To achieve the scale up, we specially designed sampling procedure for entity and relation features that relies primarily on noise injection in linear regressions. We show performance and advanced features of Macau in a set of experiments, including challenging drug-protein activity prediction task.
Clustering and Inference From Pairwise Comparisons
Wu, Rui, Xu, Jiaming, Srikant, R., Massoulié, Laurent, Lelarge, Marc, Hajek, Bruce
Given a set of pairwise comparisons, the classical ranking problem computes a single ranking that best represents the preferences of all users. In this paper, we study the problem of inferring individual preferences, arising in the context of making personalized recommendations. In particular, we assume that there are $n$ users of $r$ types; users of the same type provide similar pairwise comparisons for $m$ items according to the Bradley-Terry model. We propose an efficient algorithm that accurately estimates the individual preferences for almost all users, if there are $r \max \{m, n\}\log m \log^2 n$ pairwise comparisons per type, which is near optimal in sample complexity when $r$ only grows logarithmically with $m$ or $n$. Our algorithm has three steps: first, for each user, compute the \emph{net-win} vector which is a projection of its $\binom{m}{2}$-dimensional vector of pairwise comparisons onto an $m$-dimensional linear subspace; second, cluster the users based on the net-win vectors; third, estimate a single preference for each cluster separately. The net-win vectors are much less noisy than the high dimensional vectors of pairwise comparisons and clustering is more accurate after the projection as confirmed by numerical experiments. Moreover, we show that, when a cluster is only approximately correct, the maximum likelihood estimation for the Bradley-Terry model is still close to the true preference.
A Novel Minimum Divergence Approach to Robust Speaker Identification
Basu, Ayanendranath, Bose, Smarajit, Pal, Amita, Mukherjee, Anish, Das, Debasmita
In this work, a novel solution to the speaker identification problem is proposed through minimization of statistical divergences between the probability distribution (g). of feature vectors from the test utterance and the probability distributions of the feature vector corresponding to the speaker classes. This approach is made more robust to the presence of outliers, through the use of suitably modified versions of the standard divergence measures. The relevant solutions to the minimum distance methods are referred to as the minimum rescaled modified distance estimators (MRMDEs). Three measures were considered - the likelihood disparity, the Hellinger distance and Pearson's chi-square distance. The proposed approach is motivated by the observation that, in the case of the likelihood disparity, when the empirical distribution function is used to estimate g, it becomes equivalent to maximum likelihood classification with Gaussian Mixture Models (GMMs) for speaker classes, a highly effective approach used, for example, by Reynolds [22] based on Mel Frequency Cepstral Coefficients (MFCCs) as features. Significant improvement in classification accuracy is observed under this approach on the benchmark speech corpus NTIMIT and a new bilingual speech corpus NISIS, with MFCC features, both in isolation and in combination with delta MFCC features. Moreover, the ubiquitous principal component transformation, by itself and in conjunction with the principle of classifier combination, is found to further enhance the performance.
Streaming Kernel Principal Component Analysis
Ghashami, Mina, Perry, Daniel, Phillips, Jeff M.
Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full $n \times n$ kernel matrix is constructed over $n$ data points, but this requires too much space and time for large values of $n$. Techniques such as the Nystr\"om method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in $n$, and also improves the dependence on the error parameter. Our technique combines together random feature maps with recent advances in matrix sketching, it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.
An Average Classification Algorithm
van Rooyen, Brendan, Menon, Aditya Krishna, Williamson, Robert C.
Many classification algorithms produce a classifier that is a weighted average of kernel evaluations. When working with a high or infinite dimensional kernel, it is imperative for speed of evaluation and storage issues that as few training samples as possible are used in the kernel expansion. Popular existing approaches focus on altering standard learning algorithms, such as the Support Vector Machine, to induce sparsity, as well as post-hoc procedures for sparse approximations. Here we adopt the latter approach. We begin with a very simple classifier, given by the kernel mean $$ f(x) = \frac{1}{n} \sum\limits_{i=i}^{n} y_i K(x_i,x) $$ We then find a sparse approximation to this kernel mean via herding. The result is an accurate, easily parallelized algorithm for learning classifiers.
Dimensionality-reduced subspace clustering
Heckel, Reinhard, Tschannen, Michael, Bölcskei, Helmut
Subspace clustering refers to the problem of clustering unlabeled high-dimensional data points into a union of low-dimensional linear subspaces, whose number, orientations, and dimensions are all unknown. In practice one may have access to dimensionality-reduced observations of the data only, resulting, e.g., from undersampling due to complexity and speed constraints on the acquisition device or mechanism. More pertinently, even if the high-dimensional data set is available it is often desirable to first project the data points into a lower-dimensional space and to perform clustering there; this reduces storage requirements and computational cost. The purpose of this paper is to quantify the impact of dimensionality reduction through random projection on the performance of three subspace clustering algorithms, all of which are based on principles from sparse signal recovery. Specifically, we analyze the thresholding based subspace clustering (TSC) algorithm, the sparse subspace clustering (SSC) algorithm, and an orthogonal matching pursuit variant thereof (SSC-OMP). We find, for all three algorithms, that dimensionality reduction down to the order of the subspace dimensions is possible without incurring significant performance degradation. Moreover, these results are order-wise optimal in the sense that reducing the dimensionality further leads to a fundamentally ill-posed clustering problem. Our findings carry over to the noisy case as illustrated through analytical results for TSC and simulations for SSC and SSC-OMP. Extensive experiments on synthetic and real data complement our theoretical findings.
Cross-validation of matching correlation analysis by resampling matching weights
The strength of association between a pair of data vectors is represented by a nonnegative real number, called matching weight. For dimensionality reduction, we consider a linear transformation of data vectors, and define a matching error as the weighted sum of squared distances between transformed vectors with respect to the matching weights. Given data vectors and matching weights, the optimal linear transformation minimizing the matching error is solved by the spectral graph embedding of Yan et al. (2007). This method is a generalization of the canonical correlation analysis, and will be called as matching correlation analysis (MCA). In this paper, we consider a novel sampling scheme where the observed matching weights are randomly sampled from underlying true matching weights with small probability, whereas the data vectors are treated as constants. We then investigate a cross-validation by resampling the matching weights. Our asymptotic theory shows that the cross-validation, if rescaled properly, computes an unbiased estimate of the matching error with respect to the true matching weights. Existing ideas of cross-validation for resampling data vectors, instead of resampling matching weights, are not applicable here. MCA can be used for data vectors from multiple domains with different dimensions via an embarrassingly simple idea of coding the data vectors. This method will be called as cross-domain matching correlation analysis (CDMCA), and an interesting connection to the classical associative memory model of neural networks is also discussed.
Cloud-based Electronic Health Records for Real-time, Region-specific Influenza Surveillance
Santillana, Mauricio, Nguyen, Andre, Louie, Tamara, Zink, Anna, Gray, Josh, Sung, Iyue, Brownstein, John S.
Introduction Influenza is a leading cause of death in the United States (US), where up to 50,000 are killed each year by influenza- ‐like illnesses (ILI) [1]. Therefore, monitoring, early detection, and prediction of influenza outbreaks are crucial to public health. Disease detection and surveillance systems provide epidemiologic intelligence that allows health officials to deploy preventive measures and help clinic and hospital administrators make optimal staffing and stocking decisions [2]. The US Centers for Disease Control and Prevention (CDC) monitors ILI in the US by gathering information from physicians' reports about patients with ILI seeking medical attention [3]. CDC's ILI data provides useful estimates of influenza activity; however, its availability has a known time lag of one to two weeks. This time lag is far from optimal since public health decisions need to be made based on information that is two weeks old. Systems capable of providing real- ‐time estimates of influenza activity are, thus, critical. Many attempts have been made to design methods capable of providing real- ‐time estimates of ILI activity in the US by leveraging Internet- ‐based data sources that could potentially measure ILI in an indirect manner [4, 5, 6, 7, 8, 9, 10, 11].
Active Sampler: Light-weight Accelerator for Complex Data Analytics at Scale
Gao, Jinyang, Jagadish, H. V., Ooi, Beng Chin
Recent years have witnessed amazing outcomes from "Big Models" trained by "Big Data". Most popular algorithms for model training are iterative. Due to the surging volumes of data, we can usually afford to process only a fraction of the training data in each iteration. Typically, the data are either uniformly sampled or sequentially accessed. In this paper, we study how the data access pattern can affect model training. We propose an Active Sampler algorithm, where training data with more "learning value" to the model are sampled more frequently. The goal is to focus training effort on valuable instances near the classification boundaries, rather than evident cases, noisy data or outliers. We show the correctness and optimality of Active Sampler in theory, and then develop a light-weight vectorized implementation. Active Sampler is orthogonal to most approaches optimizing the efficiency of large-scale data analytics, and can be applied to most analytics models trained by stochastic gradient descent (SGD) algorithm. Extensive experimental evaluations demonstrate that Active Sampler can speed up the training procedure of SVM, feature selection and deep learning, for comparable training quality by 1.6-2.2x.