Support Vector Machines
Content-based image retrieval tutorial
This paper serves as a tutorial for readers who want to enter the field of information retrieval but do not know where to begin. It describes two fundamental yet efficient image retrieval techniques: k-nearest neighbors (kNN) and support vector machines (SVM). The goal is to cover both the theoretical and the practical aspects so that the reader gains a better understanding. Along with this tutorial we have also developed the corresponding software in the MATLAB environment to illustrate the techniques, so that the reader can gain hands-on experience.
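As a rough illustration of the kNN side of such a retrieval system, the sketch below ranks database images by Euclidean distance to a query in feature space. It is not the tutorial's MATLAB software: the function name `knn_retrieve`, the two-dimensional feature vectors, and the choice of Euclidean distance are all assumptions made for this example.

```python
import numpy as np

def knn_retrieve(query, database, k=3):
    """Return indices and distances of the k database rows closest to `query`.

    `database` is an (n_images, n_features) array of precomputed feature
    vectors; how the features are extracted is outside this sketch.
    """
    dists = np.linalg.norm(database - query, axis=1)   # Euclidean distance to every image
    order = np.argsort(dists, kind="stable")[:k]       # k smallest distances, nearest first
    return order, dists[order]

# Hypothetical feature vectors for four images and one query.
database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
query = np.array([0.9, 0.1])
idx, d = knn_retrieve(query, database, k=2)
print(idx, d)
```

With these made-up vectors, the second database entry is the closest match, followed by the first.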
Semi-Supervised Prediction of Gene Regulatory Networks Using Machine Learning Algorithms
Patel, Nihir, Wang, Jason T. L.
Use of computational methods to predict gene regulatory networks (GRNs) from gene expression data is a challenging task. Many studies have been conducted using unsupervised methods to fulfill the task; however, such methods usually yield low prediction accuracies due to the lack of training data. In this article, we propose semi-supervised methods for GRN prediction by utilizing two machine learning algorithms, namely support vector machines (SVM) and random forests (RF). The semi-supervised methods make use of unlabeled data for training. We investigate inductive and transductive learning approaches, both of which adopt an iterative procedure to obtain reliable negative training data from the unlabeled data. We then apply our semi-supervised methods to gene expression data of Escherichia coli and Saccharomyces cerevisiae, and evaluate the performance of our methods using the expression data. Our analysis indicated that the transductive learning approach outperformed the inductive learning approach for both organisms. However, there was no conclusive difference identified in the performance of SVM and RF. Experimental results also showed that the proposed semi-supervised methods performed better than existing supervised methods for both organisms.
PAC-Bayesian Theorems for Domain Adaptation with Specialization to Linear Classifiers
Germain, Pascal, Habrard, Amaury, Laviolette, François, Morvant, Emilie
In this paper, we provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different target distribution. On the one hand, we propose an improvement of the previous approach proposed by Germain et al. (2013), that relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new tighter PAC-Bayesian domain adaptation bound for the stochastic Gibbs classifier. We specialize it to linear classifiers, and design a learning algorithm which shows interesting results on a synthetic problem and on a popular sentiment annotation task. On the other hand, we generalize these results to multisource domain adaptation allowing us to take into account different source domains. This study opens the door to tackle domain adaptation tasks by making use of all the PAC-Bayesian tools.
A Non-Parametric Control Chart For High Frequency Multivariate Data
Kakde, Deovrat, Peredriy, Sergriy, Chaudhuri, Arin, Mcguirk, Anya
Support Vector Data Description (SVDD) is a machine learning technique used for single-class classification and outlier detection. The SVDD-based K-chart was first introduced by Sun and Tsung for monitoring multivariate processes when the underlying distribution of process parameters or quality characteristics departs from normality. The method first trains an SVDD model on data obtained from stable or in-control operations of the process to obtain a threshold $R^2$ and kernel center $a$. For each new observation, its kernel distance from the kernel center $a$ is calculated. The kernel distance is compared against the threshold $R^2$ to determine whether the observation is within the control limits. The non-parametric K-chart provides an attractive alternative to traditional control charts such as Hotelling's $T^2$ chart when the distribution of the underlying multivariate data is either non-normal or unknown. But there are challenges when the K-chart is deployed in practice. The K-chart requires calculating the kernel distance of each new observation, but there are no guidelines on how to interpret the kernel distance plot and infer shifts in process mean or changes in process variation. This limits the application of K-charts in big-data applications such as equipment health monitoring, where observations are generated at a very high frequency. In this scenario, the analyst using the K-chart is inundated with kernel distance results at a very high frequency, generally without any recourse for detecting the presence of assignable causes of variation. We propose a new SVDD-based control chart, called the $K_T$ chart, which addresses the challenges encountered when using the K-chart for big-data applications. The $K_T$ chart can be used to simultaneously track process variation and central tendency. We illustrate the successful use of the $K_T$ chart using the Tennessee Eastman process data.
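For orientation, the kernel distance used by a K-chart can be written out under the standard SVDD formulation. This is an assumption about the abstract's setup rather than something it states: $K$ denotes the kernel function, $x_i$ the in-control training observations, and $\alpha_i$ the multipliers obtained from SVDD training, so that the center is $a = \sum_i \alpha_i \phi(x_i)$ in feature space. The squared kernel distance of a new observation $z$ from $a$ is then

$$
\operatorname{dist}^2(z) = K(z, z) - 2 \sum_i \alpha_i K(x_i, z) + \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j),
$$

and $z$ is flagged as outside the control limits when $\operatorname{dist}^2(z) > R^2$. The last term does not depend on $z$, so it can be computed once after training and reused for every new observation.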
Scatter Component Analysis: A Unified Framework for Domain Adaptation and Domain Generalization
Ghifary, Muhammad, Balduzzi, David, Kleijn, W. Bastiaan, Zhang, Mengjie
This paper addresses classification tasks on a particular target domain in which labeled training data are only available from source domains different from (but related to) the target. Two closely related frameworks, domain adaptation and domain generalization, are concerned with such tasks, where the only difference between those frameworks is the availability of the unlabeled target data: domain adaptation can leverage unlabeled target information, while domain generalization cannot. We propose Scatter Component Analysis (SCA), a fast representation learning algorithm that can be applied to both domain adaptation and domain generalization. SCA is based on a simple geometrical measure, i.e., scatter, which operates on reproducing kernel Hilbert space. SCA finds a representation that trades between maximizing the separability of classes, minimizing the mismatch between domains, and maximizing the separability of data, each of which is quantified through scatter. The optimization problem of SCA can be reduced to a generalized eigenvalue problem, which results in a fast and exact solution. Comprehensive experiments on benchmark cross-domain object recognition datasets verify that SCA performs much faster than several state-of-the-art algorithms and also provides state-of-the-art classification accuracy in both domain adaptation and domain generalization. We also show that scatter can be used to establish a theoretical generalization bound in the case of domain adaptation.
Fuzzy Least Squares Twin Support Vector Machines
Sartakhti, Javad Salimi, Ghadiri, Nasser, Afrabandpey, Homayun, Yousefnezhad, Narges
Least Squares Twin Support Vector Machine (LSTSVM) is an extremely efficient and fast version of SVM algorithm for binary classification. LSTSVM combines the idea of Least Squares SVM and Twin SVM in which two non-parallel hyperplanes are found by solving two systems of linear equations. Although the algorithm is very fast and efficient in many classification tasks, it is unable to cope with two features of real-world problems. First, in many real-world classification problems, it is almost impossible to assign data points to a single class. Second, data points in real-world problems may have different importance. In this study, we propose a novel version of LSTSVM based on fuzzy concepts to deal with these two characteristics of real-world data. The algorithm is called Fuzzy LSTSVM (FLSTSVM) which provides more flexibility than the binary classification of LSTSVM. Two models are proposed for the algorithm. In the first model, a fuzzy membership value is assigned to each data point and the hyperplanes are optimized based on these fuzzy samples. In the second model we construct fuzzy hyperplanes to classify data. Finally, we apply our proposed FLSTSVM to an artificial as well as three real-world datasets. Results demonstrate that FLSTSVM obtains better performance than SVM and LSTSVM.
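To make the "two systems of linear equations" concrete, here is a minimal sketch of the plain linear LSTSVM baseline, not the fuzzy variant proposed in the abstract. It assumes the usual formulation in which each hyperplane stays close to its own class while sitting at roughly unit distance from the other class. The function names, the penalty parameters `c1` and `c2`, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def fit_linear_lstsvm(A, B, c1=1.0, c2=1.0):
    """Fit two non-parallel hyperplanes; A holds class +1 rows, B holds class -1 rows.

    Each returned vector is [w; b] for one hyperplane w.x + b = 0.
    """
    E = np.hstack([A, np.ones((A.shape[0], 1))])   # class +1 points with a bias column
    F = np.hstack([B, np.ones((B.shape[0], 1))])   # class -1 points with a bias column
    e_a = np.ones(A.shape[0])
    e_b = np.ones(B.shape[0])
    # Plane 1: near class +1, pushed away from class -1 (first linear system).
    z1 = -np.linalg.solve(F.T @ F + (1.0 / c1) * (E.T @ E), F.T @ e_b)
    # Plane 2: near class -1, pushed away from class +1 (second linear system).
    z2 = np.linalg.solve(E.T @ E + (1.0 / c2) * (F.T @ F), E.T @ e_a)
    return z1, z2

def predict_lstsvm(X, z1, z2):
    """Assign each row of X to the class whose hyperplane is nearer."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])
    d1 = np.abs(Xa @ z1) / np.linalg.norm(z1[:-1])
    d2 = np.abs(Xa @ z2) / np.linalg.norm(z2[:-1])
    return np.where(d1 <= d2, 1, -1)

# Synthetic, well-separated clusters (made up for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
B = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
z1, z2 = fit_linear_lstsvm(A, B)
pred = predict_lstsvm(np.array([[2.0, 2.0], [-2.0, -2.0]]), z1, z2)
print(pred)
```

Because both planes come from `np.linalg.solve` on small square systems, no quadratic programming solver is needed, which is where the speed advantage mentioned in the abstract comes from.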
Clinical Utility of Machine-Learning Approaches in Schizophrenia: Improving Diagnostic Confidence for Translational Neuroimaging
Machine-learning approaches are becoming commonplace in the neuroimaging literature as potential diagnostic and prognostic tools for the study of clinical populations. However, very few studies provide clinically informative measures to aid in decision-making and resource allocation. Head-to-head comparison of neuroimaging-based multivariate classifiers is an essential first step to promote translation of these tools to clinical practice. We systematically evaluated the classifier performance using back-to-back structural MRI in two field strengths (3- and 7-T) to discriminate patients with schizophrenia (n = 19) from healthy controls (n = 20). Gray matter (GM) and white matter images were used as inputs into a support vector machine to classify patients and control subjects.
On the Application of Support Vector Machines to the Prediction of Propagation Losses at 169 MHz for Smart Metering Applications
Uccellari, Martino, Facchini, Francesca, Sola, Matteo, Sirignano, Emilio, Vitetta, Giorgio M., Barbieri, Andrea, Tondelli, Stefano
Recently, the need to deploy new wireless networks for smart gas metering has raised the problem of radio planning in the 169 MHz band. Unfortunately, software tools commonly adopted for radio planning in cellular communication systems cannot be employed to solve this problem because of the substantially lower transmission frequencies characterizing this application. In this manuscript, a novel data-centric solution, based on the use of support vector machine techniques for classification and regression, is proposed. Our method requires the availability of a limited set of received signal strength measurements and the knowledge of a three-dimensional map of the propagation environment of interest, and generates both an estimate of the coverage area and a prediction of the field strength within it. Numerical results for different Italian villages and cities show that our method is able to achieve good accuracy at the price of an acceptable computational cost and of a limited effort for the acquisition of measurements in the considered environments.
Ranking a set of classifiers based on metrics with differing units • /r/MachineLearning
Note: I posted this question to stackoverflow as well. Support Vector Machines, k-Neighbors Classifiers, Neural Networks, Decision Trees, ...) on the same training set and collects a bunch of performance metrics for each model. Now, most of these are your standard run-of-the-mill metrics like precision, recall, overall accuracy and all that, but some are more complex (or should I say "different"?), for example: I want to find a good way of ranking these models based on user-specified weights for a subset of the aforementioned performance metrics. If a user's goal was to find the model that was least "complex" while still achieving reasonable precision, they would likely assign a higher weight to the "no. of preprocessing steps" attribute and see which model gets ranked highest (probably model 2, but it really depends on the concrete values of the weights of course). So, in short, I am faced with a so-called Multiple-criteria decision-making (MCDM) problem, and I need to solve it.
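One simple way to approach this kind of MCDM problem is a weighted sum over min-max normalized metrics, which removes the differing units before the user's weights are applied. The sketch below is only one option among many MCDM methods, and everything in it is hypothetical: the function `rank_models`, the metric names, the model scores, and the weights are made up for illustration and do not come from the post.

```python
def rank_models(scores, weights, lower_is_better=()):
    """Rank models by a weighted sum of min-max normalized metrics.

    scores:  {model_name: {metric_name: value}}
    weights: {metric_name: user-specified weight}; only these metrics are used
    lower_is_better: metrics where a smaller raw value should score higher
    """
    total_weight = sum(weights.values())
    utilities = {name: 0.0 for name in scores}
    for metric, weight in weights.items():
        values = [scores[name][metric] for name in scores]
        lo, hi = min(values), max(values)
        for name in scores:
            if hi == lo:
                norm = 1.0                      # metric cannot distinguish the models
            else:
                norm = (scores[name][metric] - lo) / (hi - lo)
                if metric in lower_is_better:
                    norm = 1.0 - norm           # flip so that 1.0 is always best
            utilities[name] += (weight / total_weight) * norm
    return sorted(utilities.items(), key=lambda item: item[1], reverse=True)

# Hypothetical metrics for three models.
scores = {
    "model_1": {"precision": 0.92, "preprocessing_steps": 5},
    "model_2": {"precision": 0.88, "preprocessing_steps": 1},
    "model_3": {"precision": 0.80, "preprocessing_steps": 3},
}
# A user who values simplicity three times as much as precision.
weights = {"precision": 1.0, "preprocessing_steps": 3.0}
ranking = rank_models(scores, weights, lower_is_better={"preprocessing_steps"})
print(ranking)
```

With these made-up numbers the simplest model, `model_2`, ranks first, followed by `model_3` and `model_1`. Min-max normalization is sensitive to outliers and to which models are in the pool, so the ranking can change when a model is added or removed.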
Support Vector Machines: A Simple Explanation
In this post, we are going to introduce you to the Support Vector Machine (SVM) machine learning algorithm. We will follow a similar process to our recent post Naive Bayes for Dummies; A Simple Explanation by keeping it short and not overly technical. The aim is to give those of you who are new to machine learning a basic understanding of the key concepts of this algorithm. A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, so that is what we will focus on in this post.