Support Vector Machines
Feature Selection for Linear SVM with Provable Guarantees
Paul, Saurabh, Magdon-Ismail, Malik, Drineas, Petros
We give two provably accurate feature-selection techniques for the linear SVM. The algorithms run in deterministic and randomized time respectively. Our algorithms can be used in an unsupervised or supervised setting. The supervised approach is based on sampling features from support vectors. We prove that the margin in the feature space is preserved to within $\epsilon$-relative error of the margin in the full feature space in the worst-case. In the unsupervised setting, we also provide worst-case guarantees of the radius of the minimum enclosing ball, thereby ensuring comparable generalization as in the full feature space and resolving an open problem posed in Dasgupta et al. We present extensive experiments on real-world datasets to support our theory and to demonstrate that our method is competitive and often better than prior state-of-the-art, for which there are no known provable guarantees.
Mesh Learning for Classifying Cognitive Processes
Ozay, Mete, รztekin, Ilke, รztekin, Uygar, Vural, Fatos T. Yarman
A relatively recent advance in cognitive neuroscience has been multi-voxel pattern analysis (MVPA), which enables researchers to decode brain states and/or the type of information represented in the brain during a cognitive operation. MVPA methods utilize machine learning algorithms to distinguish among types of information or cognitive states represented in the brain, based on distributed patterns of neural activity. In the current investigation, we propose a new approach for representation of neural data for pattern analysis, namely a Mesh Learning Model. In this approach, at each time instant, a star mesh is formed around each voxel, such that the voxel corresponding to the center node is surrounded by its p-nearest neighbors. The arc weights of each mesh are estimated from the voxel intensity values by least squares method. The estimated arc weights of all the meshes, called Mesh Arc Descriptors (MADs), are then used to train a classifier, such as Neural Networks, k-Nearest Neighbor, Na\"ive Bayes and Support Vector Machines. The proposed Mesh Model was tested on neuroimaging data acquired via functional magnetic resonance imaging (fMRI) during a recognition memory experiment using categorized word lists, employing a previously established experimental paradigm (\"Oztekin & Badre, 2011). Results suggest that the proposed Mesh Learning approach can provide an effective algorithm for pattern analysis of brain activity during cognitive processing.
Deep learning of fMRI big data: a novel approach to subject-transfer decoding
Koyamada, Sotetsu, Shikauchi, Yumi, Nakae, Ken, Koyama, Masanori, Ishii, Shin
As a technology to read brain states from measurable brain activities, brain decoding are widely applied in industries and medical sciences. In spite of high demands in these applications for a universal decoder that can be applied to all individuals simultaneously, large variation in brain activities across individuals has limited the scope of many studies to the development of individual-specific decoders. In this study, we used deep neural network (DNN), a nonlinear hierarchical model, to construct a subject-transfer decoder. Our decoder is the first successful DNN-based subject-transfer decoder. When applied to a large-scale functional magnetic resonance imaging (fMRI) database, our DNN-based decoder achieved higher decoding accuracy than other baseline methods, including support vector machine (SVM). In order to analyze the knowledge acquired by this decoder, we applied principal sensitivity analysis (PSA) to the decoder and visualized the discriminative features that are common to all subjects in the dataset. Our PSA successfully visualized the subject-independent features contributing to the subject-transferability of the trained decoder.
Sparse Distance Weighted Discrimination
Distance weighted discrimination (DWD) was originally proposed to handle the data piling issue in the support vector machine. In this paper, we consider the sparse penalized DWD for high-dimensional classification. The state-of-the-art algorithm for solving the standard DWD is based on second-order cone programming, however such an algorithm does not work well for the sparse penalized DWD with high-dimensional data. In order to overcome the challenging computation difficulty, we develop a very efficient algorithm to compute the solution path of the sparse DWD at a given fine grid of regularization parameters. We implement the algorithm in a publicly available R package sdwd. We conduct extensive numerical experiments to demonstrate the computational efficiency and classification performance of our method.
Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs
Hajian, Amir, Alvarez, Marcelo, Bond, J. Richard
Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunya'ev-Zeldovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
Random Bits Regression: a Strong General Predictor for Big Data
Wang, Yi, Li, Yi, Xiong, Momiao, Jin, Li
We are interested in a general data - based prediction task: g iven a train ing data matrix ( TrX), a training outcome vector ( TrY) and a test data matrix ( TeX), predict test outcome vector (). In the era of big data, two practically conflicting challenges are eminent: (1) the prior knowledge on the subject (a lso known as domain specific knowledge) is largely insufficient; (2) computation and storage cost of big data is unaffordable. To meet these aforementioned challenge s, this paper is devoted to modeling large number of observations without domain specific k nowledge, using regression and classification. The methods widely used for regression and classification can be classified as: linear regression, k nearest neighbor (KNN) [1], support vector machine (SVM) [2], neural network (NN) [3, 4], extreme learning machine (ELM) [5], deep learning (DL) [6], random forest (RF) [7] and boosting (GBM) [8] among others . Each method performs well on some types of datasets but has its own limitations on others [9 - 12] . A method with reasonable performance on boarder, if not universe, datasets is highly desired .
Learning From Weakly Supervised Data by The Expectation Loss SVM (e-SVM) algorithm
Zhu, Jun, Mao, Junhua, Yuille, Alan L.
In many situations we have some measurement of confidence on ``positiveness for a binary label. The ``positiveness" is a continuous value whose range is a bounded interval. It quantifies the affiliation of each training data to the positive class. We propose a novel learning algorithm called \emph{expectation loss SVM} (e-SVM) that is devoted to the problems where only the ``positiveness" instead of a binary label of each training sample is available. Our e-SVM algorithm can also be readily extended to learn segment classifiers under weak supervision where the exact positiveness value of each training example is unobserved. In experiments, we show that the e-SVM algorithm can effectively address the segment proposal classification task under both strong supervision (e.g. the pixel-level annotations are available) and the weak supervision (e.g. only bounding-box annotations are available), and outperforms the alternative approaches. Besides, we further validate this method on two major tasks of computer vision: semantic segmentation and object detection. Our method achieves the state-of-the-art object detection performance on PASCAL VOC 2007 dataset."
Fast Prediction for Large-Scale Kernel Machines
Hsieh, Cho-Jui, Si, Si, Dhillon, Inderjit S.
Kernel machines such as kernel SVM and kernel ridge regression usually construct high quality models; however, their use in real-world applications remains limited due to the high prediction cost. In this paper, we present two novel insights for improving the prediction efficiency of kernel machines. First, we show that by adding โpseudo landmark pointsโ to the classical Nystrยจom kernel approximation in an elegant way, we can significantly reduce the prediction error without much additional prediction cost. Second, we provide a new theoretical analysis on bounding the error of the solution computed by using Nystrยจom kernel approximation method, and show that the error is related to the weighted kmeans objective function where the weights are given by the model computed from the original kernel. This theoretical insight suggests a new landmark point selection technique for the situation where we have knowledge of the original model. Based on these two insights, we provide a divide-and-conquer framework for improving the prediction speed. First, we divide the whole problem into smaller local subproblems to reduce the problem size. In the second phase, we develop a kernel approximation based fast prediction approach within each subproblem. We apply our algorithm to real world large-scale classification and regression datasets, and show that the proposed algorithm is consistently and significantly better than other competitors. For example, on the Covertype classification problem, in terms of prediction time, our algorithm achieves more than 10000 times speedup over the full kernel SVM, and a two-fold speedup over the state-of-the-art LDKL approach, while obtaining much higher prediction accuracy than LDKL (95.2% vs. 89.53%).
Latent Support Measure Machines for Bag-of-Words Data Classification
Yoshikawa, Yuya, Iwata, Tomoharu, Sawada, Hiroshi
In many classification problems, the input is represented as a set of features, e.g., the bag-of-words (BoW) representation of documents. Support vector machines (SVMs) are widely used tools for such classification problems. The performance of the SVMs is generally determined by whether kernel values between data points can be defined properly. However, SVMs for BoW representations have a major weakness in that the co-occurrence of different but semantically similar words cannot be reflected in the kernel calculation. To overcome the weakness, we propose a kernel-based discriminative classifier for BoW data, which we call the latent support measure machine (latent SMM). With the latent SMM, a latent vector is associated with each vocabulary term, and each document is represented as a distribution of the latent vectors for words appearing in the document. To represent the distributions efficiently, we use the kernel embeddings of distributions that hold high order moment information about distributions. Then the latent SMM finds a separating hyperplane that maximizes the margins between distributions of different classes while estimating latent vectors for words to improve the classification performance. In the experiments, we show that the latent SMM achieves state-of-the-art accuracy for BoW text classification, is robust with respect to its own hyper-parameters, and is useful to visualize words.
Predicting Useful Neighborhoods for Lazy Local Learning
Lazy local learning methods train a classifier on the fly" at test time, using only a subset of the training instances that are most relevant to the novel test example. The goal is to tailor the classifier to the properties of the data surrounding the test example. Existing methods assume that the instances most useful for building the local model are strictly those closest to the test example. However, this fails to account for the fact that the success of the resulting classifier depends on the full distribution of selected training instances. Rather than simply gather the test example's nearest neighbors, we propose to predict the subset of training data that is jointly relevant to training its local model. We develop an approach to discover patterns between queries and their "good" neighborhoods using large-scale multi-label classification with compressed sensing. Given a novel test point, we estimate both the composition and size of the training subset likely to yield an accurate local model. We demonstrate the approach on image classification tasks on SUN and aPascal and show it outperforms traditional global and local approaches."