Statistical Learning
Streamed Learning: One-Pass SVMs
Rai, Piyush, Daumé, Hal III, Venkatasubramanian, Suresh
We present a streaming model for large-scale classification (in the context of $\ell_2$-SVM) by leveraging connections between learning and computational geometry. The streaming model imposes the constraint that only a single pass over the data is allowed. The $\ell_2$-SVM is known to have an equivalent formulation in terms of the minimum enclosing ball (MEB) problem, and an efficient algorithm based on the idea of \emph{core sets} exists (Core Vector Machine, CVM). CVM learns a $(1+\varepsilon)$-approximate MEB for a set of points and yields an approximate solution to corresponding SVM instance. However CVM works in batch mode requiring multiple passes over the data. This paper presents a single-pass SVM which is based on the minimum enclosing ball of streaming data. We show that the MEB updates for the streaming case can be easily adapted to learn the SVM weight vector in a way similar to using online stochastic gradient updates. Our algorithm performs polylogarithmic computation at each example, and requires very small and constant storage. Experimental results show that, even in such restrictive settings, we can learn efficiently in just one pass and get accuracies comparable to other state-of-the-art SVM solvers (batch and online). We also give an analysis of the algorithm, and discuss some open issues and possible extensions.
The Infinite Hierarchical Factor Regression Model
We propose a nonparametric Bayesian factor regression model that accounts for uncertainty in the number of factors, and the relationship between factors. To accomplish this, we propose a sparse variant of the Indian Buffet Process and couple this with a hierarchical model over factors, based on Kingman's coalescent. We apply this model to two problems (factor analysis and factor regression) in gene-expression data analysis.
How the initialization affects the stability of the k-means algorithm
Bubeck, Sebastien, Meila, Marina, von Luxburg, Ulrike
We investigate the role of the initialization for the stability of the k-means clustering algorithm. As opposed to other papers, we consider the actual k-means algorithm and do not ignore its property of getting stuck in local optima. We are interested in the actual clustering, not only in the costs of the solution. We analyze when different initializations lead to the same local optimum, and when they lead to different local optima. This enables us to prove that it is reasonable to select the number of clusters based on stability scores.
A Unified Semi-Supervised Dimensionality Reduction Framework for Manifold Learning
Chatpatanasiri, Ratthachat, Kijsirikul, Boonserm
We present a general framework of semi-supervised dimensionality reduction for manifold learning which naturally generalizes existing supervised and unsupervised learning frameworks which apply the spectral decomposition. Algorithms derived under our framework are able to employ both labeled and unlabeled examples and are able to handle complex problems where data form separate clusters of manifolds. Our framework offers simple views, explains relationships among existing frameworks and provides further extensions which can improve existing algorithms. Furthermore, a new semi-supervised kernelization framework called ``KPCA trick'' is proposed to handle non-linear problems.
Restart Strategy Selection using Machine Learning Techniques
Restart strategies are an important factor in the performance of conflict-driven Davis Putnam style SAT solvers. Selecting a good restart strategy for a problem instance can enhance the performance of a solver. Inspired by recent success applying machine learning techniques to predict the runtime of SAT solvers, we present a method which uses machine learning to boost solver performance through a smart selection of the restart strategy. Based on easy to compute features, we train both a satisfiability classifier and runtime models. We use these models to choose between restart strategies. We present experimental results comparing this technique with the most commonly used restart strategies. Our results demonstrate that machine learning is effective in improving solver performance.
Empirical Bernstein Bounds and Sample Variance Penalization
Maurer, Andreas, Pontil, Massimiliano
W e give improved constants for data dependent and variance sensitive confidence bounds, called empirical Bernstein bounds, and extend these inequalities to hold uniformly over classes of functions whose growth function is polynomial in the sample size n . The bounds lead us to consider sample variance penalization, a novel learning method which takes into account the empirical variance of the loss function. W e give conditions under which sample variance penalization is effective. In particular, we present a bound on the excess risk incurred by the method. Using this, we argue that there are situations in which the excess risk of our method is of order 1 /n, while the excess risk of empirical risk minimization is of order 1 / n . W e show some experimental results, which confirm the theory. Finally, we discuss the potential application of our results to sample compression schemes.
Inter Genre Similarity Modelling For Automatic Music Genre Classification
Music genre classification is an essential tool for music information retrieval systems and it has been finding critical applications in various media platforms. Two important problems of the automatic music genre classification are feature extraction and classifier design. This paper investigates inter-genre similarity modelling (IGS) to improve the performance of automatic music genre classification. Inter-genre similarity information is extracted over the mis-classified feature population. Once the inter-genre similarity is modelled, elimination of the inter-genre similarity reduces the inter-genre confusion and improves the identification rates. Inter-genre similarity modelling is further improved with iterative IGS modelling(IIGS) and score modelling for IGS elimination(SMIGS). Experimental results with promising classification improvements are provided.
Online Learning of Spacecraft Simulation Models
Thomas, Justin R. (United Space Alliance) | Eick, Christoph F. (University of Houston)
Spacecraft simulation is an integral part of NASA mission planning, real-time mission support, training, and systems engineering. Existing approaches that power these simulations cannot quickly react to the dynamic and complex behavior of the International Space Station (ISS). To address this problem, this paper introduces a unique and efficient method for continuously learning highly accurate models from real-time streaming sensor data, relying on an online learning approach. This approach revolutionizes NASA simulation techniques for space missions by providing models that quickly adapt to real-world feedback without human intervention. A novel regional sliding-window technique for online learning of simulation models is proposed that regionally maintains the most recent data. We also explore a knowledge fusion approach to reduce predictive error spikes when confronted with making predictions in situations that are quite different from training scenarios. We demonstrate substantial error reductions up to 74% in our experimental evaluation on the ISS Electrical Power System and discuss the early deployment of our software in the ISS Mission Control Center (MCC) for ground-based simulations.
Not So Naive Online Bayesian Spam Filter
Su, Baojun (Zhejiang University) | Xu, Congfu (Zhejiang University)
Spam filtering, as a key problem in electronic communication, has drawn significant attention due to increasingly huge amounts of junk email on the Internet. Content-based filtering is one reliable method in combating with spammers' changing tactics. Naive Bayes (NB) is one of the earliest content-based machine learning methods both in theory and practice in combating with spammers, which is easy to implement while can achieve considerable accuracy. In this paper, the traditional online Bayesian classifier are enhanced by two ways. First, from theory's point of view, we devise a self-adaptive mechanism to gradually weaken the assumption of independence required by original NB in the online training process, and as a result of that our NSNB is no longer ``naive''. Second, we propose other engineering ways to make the classifier more robust and accuracy. The experiment results show that our NSNB does give state-of-the-art classification performance on online spam filtering on large benchmark data sets while it is extremely fast and takes up little memory in comparison with other statistical methods.
Real-time Automatic Price Prediction for eBay Online Trading
Raykhel, Ilya (Brigham Young University) | Ventura, Dan (Brigham Young University)
We develop a system for attribute-based prediction of final (online) auction pricing, focusing on the eBay laptop category. The system implements a feature-weighted k -NN algorithm, using evolutionary computation to determine feature weights, with prior trades used as training data. The resulting average prediction error is 16%. Mostly automatic trading using the system greatly reduces the time a reseller needs to spend on trading activities, since the bulk of market research is now done automatically with the help of the learned model. The result is a 562% increase in trading efficiency (measured as profit/hour).