Support Vector Machines
Random Fragments Classification of Microbial Marker Clades with Multi-class SVM and N-Best Algorithm
Microbial clades modeling is a challenging problem in biology based on microarray genome sequences, especially in new species gene isolates discovery and category. Marker family genome sequences play important roles in describing specific microbial clades within species, a framework of support vector machine (SVM) based microbial species classification with N-best algorithm is constructed to classify the centroid marker genome fragments randomly generated from marker genome sequences on MetaRef. A time series feature extraction method is proposed by segmenting the centroid gene sequences and mapping into different dimensional spaces. Two ways of data splitting are investigated according to random splitting fragments along genome sequence (DI) , or separating genome sequences into two parts (DII).Two strategies of fragments recognition tasks, dimension-by-dimension and sequence--by--sequence, are investigated. The k-mer size selection, overlap of segmentation and effects of random split percents are also discussed. Experiments on 12390 maker genome sequences belonging to marker families of 17 species from MetaRef show that, both for DI and DII in dimension-by-dimension and sequence-by-sequence recognition, the recognition accuracy rates can achieve above 28\% in top-1 candidate, and above 91\% in top-10 candidate both on training and testing sets overall.
Multimodal Subspace Support Vector Data Description
Sohrab, Fahad, Raitoharju, Jenni, Iosifidis, Alexandros, Gabbouj, Moncef
In this paper, we propose a novel method for projecting data from multiple modalities to a new subspace optimized for one-class classification. The proposed method iteratively transforms the data from the original feature space of each modality to a new common feature space along with finding a joint compact description of data coming from all the modalities. For data in each modality, we define a separate transformation to map the data from the corresponding feature space to the new optimized subspace by exploiting the available information from the class of interest only. The data description in the new subspace is obtained by Support Vector Data Description. We also propose different regularization strategies for the proposed method and provide both linear and non-linear formulation. We conduct experiments on two multimodal datasets and compare the proposed approach with baseline and recently proposed one-class classification methods combined with early fusion and also considering each modality separately. We show that the proposed Multimodal Subspace Support Vector Data Description outperforms all the methods using data from a single modality and performs better or equally well than the methods fusing data from all modalities.
Probabilistic Kernel Support Vector Machines
Chen, Yongxin, Georgiou, Tryphon T., Tannenbaum, Allen R.
We propose a probabilistic enhancement of standard {\em kernel Support Vector Machines} for binary classification, in order to address the case when, along with given data sets, a description of uncertainty (e.g., error bounds) may be available on each datum. In the present paper, we specifically consider Gaussian distributions to model uncertainty. Thereby, our data consist of pairs $(x_i,\Sigma_i)$, $i\in\{1,\ldots,N\}$, along with an indicator $y_i\in\{-1,1\}$ to declare membership in one of two categories for each pair. These pairs may be viewed to represent the mean and covariance, respectively, of random vectors $\xi_i$ taking values in a suitable linear space (typically ${\mathbb R}^n$). Thus, our setting may also be viewed as a modification of Support Vector Machines to classify distributions, albeit, at present, only Gaussian ones. We outline the formalism that allows computing suitable classifiers via a natural modification of the standard ``kernel trick.'' The main contribution of this work is to point out a suitable kernel function for applying Support Vector techniques to the setting of uncertain data for which a detailed uncertainty description is also available (herein, ``Gaussian points'').
OCKELM+: Kernel Extreme Learning Machine based One-class Classification using Privileged Information (or KOC+: Kernel Ridge Regression or Least Square SVM with zero bias based One-class Classification using Privileged Information)
Gautam, Chandan, Tiwari, Aruna, Tanveer, M.
Kernel method-based one-class classifier is mainly used for outlier or novelty detection. In this letter, kernel ridge regression (KRR) based one-class classifier (KOC) has been extended for learning using privileged information (LUPI). LUPI-based KOC method is referred to as KOC+. This privileged information is available as a feature with the dataset but only for training (not for testing). KOC+ utilizes the privileged information differently compared to normal feature information by using a so-called correction function. Privileged information helps KOC+ in achieving better generalization performance which is exhibited in this letter by testing the classifiers with and without privileged information. Existing and proposed classifiers are evaluated on the datasets from UCI machine learning repository and also on MNIST dataset. Moreover, experimental results evince the advantage of KOC+ over KOC and support vector machine (SVM) based one-class classifiers.
Kickstart your experiments from examples - Azure Machine Learning Studio
For example, to browse experiments that use a PCA-based anomaly detection algorithm: Under Categories click Experiment. Then, under Algorithms Used, click Show all and in the dialog box choose PCA-Based Anomaly Detection. You may have to scroll to see it. For example, to find experiments contributed by Microsoft related to digit recognition that use a two-class support vector machine algorithm, enter "digit recognition" in the search box. For example, to browse experiments that use a PCA-based anomaly detection algorithm: Under Categories click Experiment.
Exploring, Visualizing, and Modeling Big Data with R
Working with BIG DATA requires a particular suite of data analytics tools and advanced techniques, such as machine learning (ML). Many of these tools are readily and freely available in R. This full-day session will provide participants with a hands-on training on how to use data analytics tools and machine learning methods available in R to explore, visualize, and model big data. The first half of our training session will focus on organizing (manipulating and summarizing) and visualizing (both statically and dynamically) big data in R. The second half will involve a series of short lectures on ML techniques (decision trees, random forests, and support vector machines), as well as hands-on demonstrations applying these methods in R. Examples will be drawn from the OECD's Programme for International Student Assessment (PISA). Participants will get opportunities to work through several hands-on lab sessions throughout the day.
Heterogeneous Multi-task Metric Learning across Multiple Domains
Luo, Yong, Wen, Yonggang, Tao, Dacheng
Distance metric learning (DML) plays a crucial role in diverse machine learning algorithms and applications. When the labeled information in target domain is limited, transfer metric learning (TML) helps to learn the metric by leveraging the sufficient information from other related domains. Multi-task metric learning (MTML), which can be regarded as a special case of TML, performs transfer across all related domains. Current TML tools usually assume that the same feature representation is exploited for different domains. However, in real-world applications, data may be drawn from heterogeneous domains. Heterogeneous transfer learning approaches can be adopted to remedy this drawback by deriving a metric from the learned transformation across different domains. But they are often limited in that only two domains can be handled. To appropriately handle multiple domains, we develop a novel heterogeneous multi-task metric learning (HMTML) framework. In HMTML, the metrics of all different domains are learned together. The transformations derived from the metrics are utilized to induce a common subspace, and the high-order covariance among the predictive structures of these domains is maximized in this subspace. There do exist a few heterogeneous transfer learning approaches that deal with multiple domains, but the high-order statistics (correlation information), which can only be exploited by simultaneously examining all domains, is ignored in these approaches. Compared with them, the proposed HMTML can effectively explore such high-order information, thus obtaining more reliable feature transformations and metrics. Effectiveness of our method is validated by the extensive and intensive experiments on text categorization, scene classification, and social image annotation.
Feature Learning Viewpoint of AdaBoost and a New Algorithm
Wang, Fei, Li, Zhongheng, He, Fang, Wang, Rong, Yu, Weizhong, Nie, Feiping
The AdaBoost algorithm has the superiority of resisting overfitting. Understanding the mysteries of this phenomena is a very fascinating fundamental theoretical problem. Many studies are devoted to explaining it from statistical view and margin theory. In this paper, we illustrate it from feature learning viewpoint, and propose the AdaBoost+SVM algorithm, which can explain the resistant to overfitting of AdaBoost directly and easily to understand. Firstly, we adopt the AdaBoost algorithm to learn the base classifiers. Then, instead of directly weighted combination the base classifiers, we regard them as features and input them to SVM classifier. With this, the new coefficient and bias can be obtained, which can be used to construct the final classifier. We explain the rationality of this and illustrate the theorem that when the dimension of these features increases, the performance of SVM would not be worse, which can explain the resistant to overfitting of AdaBoost.
Proposing a Localized Relevance Vector Machine for Pattern Classification
Rismanchian, Farhood, Rahimian, Karim
Relevance vector machine (RVM) can be seen as a probabilistic version of support vector machines which is able to produce sparse solutions by linearly weighting a small number of basis functions instead using all of them. Regardless of a few merits of RVM such as giving probabilistic predictions and relax of parameter tuning, it has poor prediction for test instances that are far away from the relevance vectors. As a solution, we propose a new combination of RVM and k-nearest neighbor (k-NN) rule which resolves this issue with regionally dealing with every test instance. In our settings, we obtain the relevance vectors for each test instance in the local area given by k-NN rule. In this way, relevance vectors are closer and more relevant to the test instance which results in a more accurate model. This can be seen as a piece-wise learner which locally classifies test instances. The model is hence called localized relevance vector machine (LRVM). The LRVM is examined on several datasets of the University of California, Irvine (UCI) repository. Results supported by statistical tests indicate that the performance of LRVM is competitive as compared with a few state-of-the-art classifiers.
A topological data analysis based classification method for multiple measurements
Riihimäki, Henri, Chachólski, Wojciech, Theorell, Jakob, Hillert, Jan, Ramanujam, Ryan
Background, Machine learning models for repeated measurements are limited. Using topological data analysis (TDA), we present a classifier for repeated measurements which samples from the data space and builds a network graph based on the data topology. When applying this to two case studies, accuracy exceeds alternative models with additional benefits such as reporting data subsets with high purity along with feature values. Results, For 300 examples of 3 tree species, the accuracy reached 80% after 30 datapoints, which was improved to 90% after increased sampling to 400 datapoints. Using data from 100 examples of each of 6 point processes, the classifier achieved 96.8% accuracy. In both datasets, the TDA classifier outperformed an alternative model. Conclusions, This algorithm and software can be beneficial for repeated measurement data common in biological sciences, as both an accurate classifier and a feature selection tool.