Support Vector Machines
Influence of the Event Rate on Discrimination Abilities of Bankruptcy Prediction Models
Zhang, Lili, Priestley, Jennifer, Ni, Xuelei
In bankruptcy prediction, the proportion of events is very low, so the event class is often oversampled to eliminate this bias. In this paper, we study the influence of the event rate on the discrimination abilities of bankruptcy prediction models. First, the statistical association and significance of public records and firmographics indicators with bankruptcy were explored. Then the event rate was oversampled from 0.12% to 10%, 20%, 30%, 40%, and 50%, respectively. Seven models were developed: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine, Bayesian Network, and Neural Network. Under the different event rates, the models were evaluated and compared on the hold-out dataset, at their best probability cut-offs, using the Kolmogorov-Smirnov statistic, accuracy, F1 score, Type I error, Type II error, and ROC curve. Results show that Bayesian Network is the least sensitive to the event rate, while Support Vector Machine is the most sensitive.
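As a rough illustration of two ideas in this abstract, the sketch below resamples events with replacement until they reach a target event rate and computes a Kolmogorov-Smirnov statistic from model scores. It is not the authors' procedure: the function names, the 0/1 label coding, and the choice of resampling with replacement are assumptions made for the example.

```python
import random


def oversample_events(labels, target_rate, seed=0):
    """Return row indices in which events (label 1) are resampled with
    replacement until they make up roughly `target_rate` of the sample.
    If the data already exceed the target rate, nothing is removed."""
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    events = [i for i, y in enumerate(labels) if y == 1]
    non_events = [i for i, y in enumerate(labels) if y == 0]
    if not events or not non_events:
        raise ValueError("both classes must be present")
    # Solve e / (e + n) = target_rate for the event count e, keeping n fixed.
    needed = round(target_rate * len(non_events) / (1 - target_rate))
    rng = random.Random(seed)
    extra = [rng.choice(events) for _ in range(max(0, needed - len(events)))]
    return non_events + events + extra


def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of events and non-events."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels))
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, y) in enumerate(pairs):
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Only compare the CDFs once all rows tied at this score are counted.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best
```

With 12 events among 10,000 rows (an event rate of 0.12%), `oversample_events(labels, 0.10)` returns enough repeated event indices to bring the event rate close to 10%. Oversampling is normally applied to the training data only, so that the hold-out set keeps the original event rate.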
On Breast Cancer Detection: An Application of Machine Learning Algorithms on the Wisconsin Diagnostic Dataset
This paper presents a comparison of six machine learning (ML) algorithms: GRU-SVM (Agarap, 2017), Linear Regression, Multilayer Perceptron (MLP), Nearest Neighbor (NN) search, Softmax Regression, and Support Vector Machine (SVM) on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg, Street, & Mangasarian, 1992) by measuring their classification test accuracy and their sensitivity and specificity values. The dataset consists of features computed from digitized images of FNA tests on a breast mass (Wolberg, Street, & Mangasarian, 1992). For the implementation of the ML algorithms, the dataset was partitioned into 70% for the training phase and 30% for the testing phase. The hyper-parameters used for all the classifiers were manually assigned. Results show that all the presented ML algorithms performed well (all exceeded 90% test accuracy) on the classification task. The MLP algorithm stands out among the implemented algorithms with a test accuracy of ~99.04%.
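The evaluation quantities named in this abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names, the 0/1 label coding, and the seeded shuffle used for the 70/30 partition are assumptions.

```python
import random


def split_indices(n_rows, train_percent=70, seed=0):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = n_rows * train_percent // 100  # integer arithmetic avoids float rounding
    return indices[:cut], indices[cut:]


def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary predictions."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity: share of actual positives that were detected.
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    # Specificity: share of actual negatives that were correctly rejected.
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity
```

For example, `evaluate([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])` counts two true positives, one false negative, one true negative, and one false positive.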
Machine Learning Approach for Improved Downlink Coordinated Multipoint in Heterogeneous Networks
Mismar, Faris B., Evans, Brian L.
We propose a method for practical downlink coordinated multipoint (DL CoMP) implementation in 4G LTE/LTE-A systems using supervised machine learning. Contributions of this paper are: 1) demonstrating that a support vector machine classifier can learn the optimal conditions at which DL CoMP can be dynamically triggered, 2) improving user throughput in DL CoMP as a result of learning the optimal triggering conditions of DL CoMP, and 3) showing that the machine learning approach is scalable to more than a single macro. The simulation results show an improvement in the pico cell average and edge throughputs and a reduction in the downlink block error rate due to the informed triggering of the multiple radio streams as part of DL CoMP as learned from the support vector machine.
Support Vector Machine Simplified using R
There is no rule of thumb for choosing the best kernel. The only reliable approach is cross-validation: try several different kernels, evaluate a performance metric such as AUC for each, and select the kernel with the highest AUC. If speed also matters, linear kernels usually compute much faster than radial or polynomial kernels.
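The cross-validation procedure described here can be sketched in a few lines. The example below is written in Python rather than R and uses only the standard library, so it trains each candidate with kernelized Pegasos (a simple stochastic sub-gradient SVM solver without a bias term) as a stand-in for a library SVM. All function names, the toy ring data, and the hyper-parameter values are assumptions chosen for illustration.

```python
import math
import random


def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, gamma=1.0):
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def poly_kernel(a, b, degree=2):
    return (linear_kernel(a, b) + 1.0) ** degree


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def train_kernel_pegasos(X, y, kernel, lam=0.1, steps=300, seed=0):
    """Kernelized Pegasos for labels in {-1, +1}; returns a decision function."""
    rng = random.Random(seed)
    alpha = [0] * len(X)  # how often each training point violated the margin
    for t in range(1, steps + 1):
        i = rng.randrange(len(X))
        f = sum(a * yj * kernel(X[i], xj) for a, yj, xj in zip(alpha, y, X) if a)
        if y[i] * f / (lam * t) < 1:
            alpha[i] += 1

    def decision(x):
        total = sum(a * yj * kernel(x, xj) for a, yj, xj in zip(alpha, y, X) if a)
        return total / (lam * steps)

    return decision


def compare_kernels(X, y, kernels, k=4):
    """Return (best kernel name, {name: mean AUC over k stratified folds})."""
    order = sorted(range(len(X)), key=lambda i: y[i])  # group rows by class,
    folds = [order[f::k] for f in range(k)]            # then deal them out round-robin
    mean_auc = {}
    for name, kernel in kernels.items():
        fold_aucs = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            decision = train_kernel_pegasos(
                [X[i] for i in train_idx], [y[i] for i in train_idx], kernel
            )
            scores = [decision(X[i]) for i in test_idx]
            fold_aucs.append(auc(scores, [y[i] for i in test_idx]))
        mean_auc[name] = sum(fold_aucs) / k
    return max(mean_auc, key=mean_auc.get), mean_auc


def make_rings(n_per_class=20, seed=1):
    """Toy data: class -1 clustered at the origin, class +1 on a ring of radius 2."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_per_class):
        X.append([rng.gauss(0, 0.3), rng.gauss(0, 0.3)])
        y.append(-1)
        angle = rng.uniform(0, 2 * math.pi)
        X.append([2 * math.cos(angle) + rng.gauss(0, 0.1),
                  2 * math.sin(angle) + rng.gauss(0, 0.1)])
        y.append(1)
    return X, y


KERNELS = {"linear": linear_kernel, "rbf": rbf_kernel, "polynomial": poly_kernel}

if __name__ == "__main__":
    X, y = make_rings()
    best, scores = compare_kernels(X, y, KERNELS)
    print("best kernel:", best, "mean AUC per kernel:", scores)
```

In practice the same loop would usually be run with an SVM library's own solver and cross-validation utilities; the point of the sketch is only the structure: one held-out AUC per fold, averaged per kernel, with the highest mean deciding the choice.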
Specialized Support Vector Machines for open-set recognition
Júnior, Pedro Ribeiro Mendes, Boult, Terrance E., Wainer, Jacques, Rocha, Anderson
Often, when dealing with real-world recognition problems, we do not need, and often cannot have, knowledge of the entire set of possible classes that might appear during operational testing. Moreover, some of these classes may be ill-sampled, not sampled at all, or undefined. In such cases, we need robust classification methods that can deal with the "unknown" and properly reject samples belonging to classes never seen during training. Nevertheless, almost all existing classifiers to date were developed for the closed-set scenario, i.e., the classification setup in which it is assumed that all test samples belong to one of the classes with which the classifier was trained. In the open-set scenario, however, a test sample can belong to none of the known classes and the classifier must properly reject it by classifying it as unknown. In this work, we extend the well-known Support Vector Machines (SVM) classifier and introduce the Specialized Support Vector Machines (SSVM), which is suitable for recognition in open-set setups. SSVM balances the empirical risk and the risk of the unknown and ensures that the region of the feature space in which a test sample would be classified as known (one of the known classes) is always bounded, ensuring a finite risk of the unknown. The same cannot be guaranteed by the traditional SVM formulation, even when using the Radial Basis Function (RBF) kernel. In this work, we also highlight the properties of the SVM classifier related to the open-set scenario, and provide necessary and sufficient conditions for an RBF SVM to have bounded open-space risk. An extensive set of experiments compares the proposed method with existing solutions in the literature for open-set recognition and the reported results show its effectiveness.
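The abstract does not spell out the conditions it derives, so the sketch below only illustrates a general property of RBF decision functions that is relevant to boundedness: far from every support vector all kernel terms vanish, and the decision value approaches the bias term. The function name, the support vectors, and the coefficient values are hypothetical.

```python
import math


def rbf_decision(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """f(x) = sum_i coef_i * exp(-gamma * ||x - sv_i||^2) + bias."""
    total = bias
    for sv, coef in zip(support_vectors, dual_coefs):
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, sv))
        total += coef * math.exp(-gamma * sq_dist)
    return total


# Hypothetical trained model: two support vectors with signed coefficients.
SUPPORT_VECTORS = [[0.0, 0.0], [1.0, 0.0]]
DUAL_COEFS = [1.5, -0.5]
BIAS = -0.2

near = rbf_decision([0.0, 0.0], SUPPORT_VECTORS, DUAL_COEFS, BIAS)
far = rbf_decision([1000.0, 1000.0], SUPPORT_VECTORS, DUAL_COEFS, BIAS)
# `near` is positive, while `far` equals the bias because every kernel term underflows to 0.
```

In this toy setup the sign of the bias alone decides how distant points are labeled: with a negative bias they fall on the negative side, while a positive bias would place the whole far region on the positive side.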
Online Feature Ranking for Intrusion Detection Systems
Atli, Buse Gul, Jung, Alexander
Many current approaches to the design of intrusion detection systems apply feature selection in a static, nonadaptive fashion. These methods often neglect the dynamic nature of network data, which requires adaptive feature selection techniques. In this paper, we present a simple technique based on incremental learning of support vector machines in order to rank the features in real time within a streaming model for network data. Some illustrative numerical experiments with two popular benchmark datasets show that our approach can adapt to changes in normal network behaviour and to novel attack patterns which have not been experienced before. Index terms: feature selection, streaming data, SVM, SGD, intrusion detection. The design of efficient intrusion detection systems (IDS) has received considerable attention recently.
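One way to picture incremental SVM learning with a running feature ranking is the sketch below: a linear SVM updated one sample at a time by stochastic gradient descent on the regularized hinge loss, with features ranked by the magnitude of their current weights. The abstract does not give the paper's exact update or ranking rule, so the class name, the learning rate, the regularization strength, and the ranking criterion are all assumptions.

```python
class OnlineLinearSVM:
    """Linear SVM trained by per-sample SGD on the L2-regularized hinge loss."""

    def __init__(self, n_features, lam=0.01, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lam = lam  # regularization strength (assumed value)
        self.lr = lr    # constant learning rate (assumed value)

    def partial_fit(self, x, y):
        """Update the model with one sample; y must be -1 or +1."""
        margin = y * (sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)
        # Gradient step for the L2 penalty shrinks every weight slightly.
        shrink = 1.0 - self.lr * self.lam
        self.w = [wi * shrink for wi in self.w]
        # Hinge-loss step only when the sample lies inside the margin.
        if margin < 1:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * y

    def ranking(self):
        """Feature indices ordered from largest to smallest absolute weight."""
        return sorted(range(len(self.w)), key=lambda j: abs(self.w[j]), reverse=True)


# Toy stream: feature 0 carries the label, feature 1 is uncorrelated with it.
STREAM = [([1.0, 0.5], 1), ([-1.0, 0.5], -1), ([1.0, -0.5], 1), ([-1.0, -0.5], -1)]

model = OnlineLinearSVM(n_features=2)
for _ in range(50):
    for x, y in STREAM:
        model.partial_fit(x, y)
top_feature = model.ranking()[0]  # feature 0, the informative one
```

Because the ranking is read directly from the current weights, it can be queried after every update, which is what makes this kind of scheme usable on a stream. Features should be on comparable scales for weight magnitudes to be meaningful.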
Interval-based Prediction Uncertainty Bound Computation in Learning with Missing Values
Hanada, Hiroyuki, Takada, Toshiyuki, Sakuma, Jun, Takeuchi, Ichiro
The problem of machine learning with missing values is common in many areas. A simple approach is to first construct a dataset without missing values, either by discarding instances with missing entries or by imputing a fixed value for each missing entry, and then train a prediction model with the new dataset. A drawback of this naive approach is that the uncertainty in the missing entries is not properly incorporated in the prediction. In order to evaluate prediction uncertainty, the multiple imputation (MI) approach has been studied, but the performance of MI is sensitive to the choice of the probabilistic model of the true values in the missing entries, and the computational cost of MI is high because multiple models must be trained. In this paper, we propose an alternative approach called the Interval-based Prediction Uncertainty Bounding (IPUB) method. The IPUB method represents the uncertainties due to missing entries as intervals, and efficiently computes the lower and upper bounds of the prediction results when all possible training sets constructed by imputing arbitrary values in the intervals are considered. The IPUB method can be applied to a wide class of convex learning algorithms including penalized least-squares regression, support vector machine (SVM), and logistic regression. We demonstrate the advantages of the IPUB method by comparing it with an existing method in numerical experiments with benchmark datasets.
Decision functions from supervised machine learning algorithms as collective variables for accelerating molecular simulations
Sultan, Mohammad M., Pande, Vijay S.
Selection of appropriate collective variables for enhancing molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is especially challenging in higher dimensions. Which atomic coordinates, or transforms thereof, from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two-state systems and only increases in difficulty for multi-state systems. In this work, we attempt to solve the initial CV problem using a data-driven approach inspired by supervised machine learning literature. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines decision hyperplane, the output probability estimates from Logistic Regression, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.
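The two quantities named in this abstract have simple closed forms for linear models, sketched below. The weights, bias, and feature vector are hypothetical placeholders; in the setting of the abstract the features would be structural descriptors of a simulation frame and the model would be fitted to frames labeled by state.

```python
import math


def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w.x + b = 0 of a linear SVM."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def logistic_probability(x, w, b):
    """Probability of the positive state under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical fitted parameters and one frame's feature vector.
w, b = [3.0, 4.0], -5.0
frame = [3.0, 4.0]
cv_svm = signed_distance(frame, w, b)         # (9 + 16 - 5) / 5 = 4.0
cv_logit = logistic_probability(frame, w, b)  # close to 1 for this frame
```

Both functions map a high-dimensional frame to a single number that varies smoothly between the two states, which is the property that makes a decision function a candidate collective variable.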
Machine Learning for Beginners, Part 8 – Support Vector Machine
In a February 6 blog, I discussed the supervised machine learning Naive Bayes algorithm with an example that was hopefully easy to understand for beginners. During the summer of 2017, I began a five-part series on types of machine learning. That series included more details about k-Nearest Neighbor, K-means clustering, Singular Value Decomposition, Principal Component Analysis, Apriori, Frequent Pattern-Growth and more. Today I want to expand on the ideas presented in my Support Vector "Data Science in 90 Seconds" YouTube video and continue the discussion in plain language. If you recall from earlier discussions, supervised machine learning is the 'task of inferring a function to describe hidden structure from labeled data'.
Not-So-Random Features
Bullins, Brian, Zhang, Cyril, Zhang, Yi
Choosing the right kernel is a classic question that has riddled machine learning practitioners and theorists alike. Conventional wisdom instructs the user to select a kernel which captures the structure and geometric invariances in the data. Efforts to formulate this principle have inspired vibrant areas of study, going by names from feature selection to multiple kernel learning (MKL). We present a new, principled approach for selecting a translation-invariant or rotation-invariant kernel to maximize the SVM classification margin. We first describe a kernel-alignment subroutine, which finds a peak in the Fourier transform of an adversarially chosen data-dependent measure.