Support Vector Machines
Feature Selection on Lyme Disease Patient Survey Data
Vendrow, Joshua, Haddock, Jamie, Needell, Deanna, Johnson, Lorraine
Lyme disease is a rapidly growing illness that remains poorly understood within the medical community. Critical questions about when and why patients respond to treatment or stay ill, what kinds of treatments are effective, and even how to properly diagnose the disease remain largely unanswered. We investigate these questions by applying machine learning techniques to a large scale Lyme disease patient registry, MyLymeData, developed by the nonprofit LymeDisease.org. We apply various machine learning methods in order to measure the effect of individual features in predicting participants' answers to the Global Rating of Change (GROC) survey questions that assess the self-reported degree to which their condition improved, worsened, or remained unchanged following antibiotic treatment. We use basic linear regression, support vector machines, neural networks, entropy-based decision tree models, and $k$-nearest neighbors approaches. We first analyze the general performance of the model and then identify the most important features for predicting participant answers to GROC. After we identify the "key" features, we separate them from the dataset and demonstrate the effectiveness of these features at identifying GROC. In doing so, we highlight possible directions for future study both mathematically and clinically.
Automated Machine Learning -- a brief review at the end of the early years
Automated machine learning (AutoML) is the sub-field of machine learning that aims at automating, to some extend, all stages of the design of a machine learning system. In the context of supervised learning, AutoML is concerned with feature extraction, pre processing, model design and post processing. Major contributions and achievements in AutoML have been taking place during the recent decade. We are therefore in perfect timing to look back and realize what we have learned. This chapter aims to summarize the main findings in the early years of AutoML. More specifically, in this chapter an introduction to AutoML for supervised learning is provided and an historical review of progress in this field is presented. Likewise, the main paradigms of AutoML are described and research opportunities are outlined.
Let me join you! Real-time F-formation recognition by a socially aware robot
Barua, Hrishav Bakul, Pramanick, Pradip, Sarkar, Chayan, Mg, Theint Haythi
This paper presents a novel architecture to detect social groups in real-time from a continuous image stream of an ego-vision camera. F-formation defines social orientations in space where two or more person tends to communicate in a social place. Thus, essentially, we detect F-formations in social gatherings such as meetings, discussions, etc. and predict the robot's approach angle if it wants to join the social group. Additionally, we also detect outliers, i.e., the persons who are not part of the group under consideration. Our proposed pipeline consists of -- a) a skeletal key points estimator (a total of 17) for the detected human in the scene, b) a learning model (using a feature vector based on the skeletal points) using CRF to detect groups of people and outlier person in a scene, and c) a separate learning model using a multi-class Support Vector Machine (SVM) to predict the exact F-formation of the group of people in the current scene and the angle of approach for the viewing robot. The system is evaluated using two data-sets. The results show that the group and outlier detection in a scene using our method establishes an accuracy of 91%. We have made rigorous comparisons of our systems with a state-of-the-art F-formation detection system and found that it outperforms the state-of-the-art by 29% for formation detection and 55% for combined detection of the formation and approach angle.
Defending Distributed Classifiers Against Data Poisoning Attacks
Weerasinghe, Sandamal, Alpcan, Tansu, Erfani, Sarah M., Leckie, Christopher
Support Vector Machines (SVMs) are vulnerable to targeted training data manipulations such as poisoning attacks and label flips. By carefully manipulating a subset of training samples, the attacker forces the learner to compute an incorrect decision boundary, thereby cause misclassifications. Considering the increased importance of SVMs in engineering and life-critical applications, we develop a novel defense algorithm that improves resistance against such attacks. Local Intrinsic Dimensionality (LID) is a promising metric that characterizes the outlierness of data samples. In this work, we introduce a new approximation of LID called K-LID that uses kernel distance in the LID calculation, which allows LID to be calculated in high dimensional transformed spaces. We introduce a weighted SVM against such attacks using K-LID as a distinguishing characteristic that de-emphasizes the effect of suspicious data samples on the SVM decision boundary. Each sample is weighted on how likely its K-LID value is from the benign K-LID distribution rather than the attacked K-LID distribution. We then demonstrate how the proposed defense can be applied to a distributed SVM framework through a case study on an SDR-based surveillance system. Experiments with benchmark data sets show that the proposed defense reduces classification error rates substantially (10% on average).
Frank-Wolfe algorithm for learning SVM-type multi-category classifiers
Tajima, Kenya, Hirohashi, Yoshihiro, Zara, Esmeraldo Ronnie Rey, Kato, Tsuyoshi
Multi-category support vector machine (MC-SVM) is one of the most popular machine learning algorithms. There are lots of variants of MC-SVM, although different optimization algorithms were developed for different learning machines. In this study, we developed a new optimization algorithm that can be applied to many of MC-SVM variants. The algorithm is based on the Frank-Wolfe framework that requires two subproblems, direction finding and line search, in each iteration. The contribution of this study is the discovery that both subproblems have a closed form solution if the Frank-Wolfe framework is applied to the dual problem. Additionally, the closed form solutions on both for the direction finding and for the line search exist even for the Moreau envelopes of the loss functions. We use several large datasets to demonstrate that the proposed optimization algorithm converges rapidly and thereby improves the pattern recognition performance.
Estimating the time-lapse between medical insurance reimbursement with non-parametric regression models
Akinyemi, Mary, Yinka-Banjo, Chika, Ugot, Ogban-Asuquo, Nwachuku, Akwarandu Ugo
Nonparametric supervised learning algorithms represent a succinct class of supervised learning algorithms where the learning parameters are highly flexible and whose values are directly dependent on the size of the training data. In this paper, we comparatively study the properties of four nonparametric algorithms, K-Nearest Neighbours (KNNs), Support Vector Machines (SVMs), Decision trees and Random forests. The supervised learning task is a regression estimate of the time lapse in medical insurance reimbursement. Our study is concerned precisely with how well each of the nonparametric regression models fits the training data. We quantify the goodness of fit using the R-squared metric. The results are presented with a focus on the effect of the size of the training data, the feature space dimension and hyperparameter optimization. The findings suggest k-NN's and SVM's algorithms as better models in predicting welldefined output labels (i.e,
Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters
Mantovani, Rafael Gomes, Rossi, André Luis Debiaso, Alcobaça, Edesio, Gertrudes, Jadson Castro, Junior, Sylvio Barbon, de Carvalho, André Carlos Ponce de Leon Ferreira
Machine Learning (ML) algorithms have been successfully employed by a vast range of practitioners with different backgrounds. One of the reasons for ML popularity is the capability to consistently delivers accurate results, which can be further boosted by adjusting hyperparameters (HP). However, part of practitioners has limited knowledge about the algorithms and does not take advantage of suitable HP settings. In general, HP values are defined by trial and error, tuning, or by using default values. Trial and error is very subjective, time costly and dependent on the user experience. Tuning techniques search for HP values able to maximize the predictive performance of induced models for a given dataset, but with the drawback of a high computational cost and target specificity. To avoid tuning costs, practitioners use default values suggested by the algorithm developer or by tools implementing the algorithm. Although default values usually result in models with acceptable predictive performance, different implementations of the same algorithm can suggest distinct default values. To maintain a balance between tuning and using default values, we propose a strategy to generate new optimized default values. Our approach is grounded on a small set of optimized values able to obtain predictive performance values better than default settings provided by popular tools. The HP candidates are estimated through a pool of promising values tuned from a small and informative set of datasets. After performing a large experiment and a careful analysis of the results, we concluded that our approach delivers better default values. Besides, it leads to competitive solutions when compared with the use of tuned values, being easier to use and having a lower cost.Based on our results, we also extracted simple rules to guide practitioners in deciding whether using our new methodology or a tuning approach.
Positive semidefinite support vector regression metric learning
Most existing metric learning methods focus on learning a similarity or distance measure relying on similar and dissimilar relations between sample pairs. However, pairs of samples cannot be simply identified as similar or dissimilar in many real-world applications, e.g., multi-label learning, label distribution learning. To this end, relation alignment metric learning (RAML) framework is proposed to handle the metric learning problem in those scenarios. But RAML framework uses SVR solvers for optimization. It can't learn positive semidefinite distance metric which is necessary in metric learning. In this paper, we propose two methds to overcame the weakness. Further, We carry out several experiments on the single-label classification, multi-label classification, label distribution learning to demonstrate the new methods achieves favorable performance against RAML framework.
Machine-learning classification using neuroimaging data in schizophrenia, autism, ultra-high risk and first-episode psychosis
Neuropsychiatric disorders are diagnosed based on behavioral criteria, which makes the diagnosis challenging. Objective biomarkers such as neuroimaging are needed, and when coupled with machine learning, can assist the diagnostic decision and increase its reliability. Sixty-four schizophrenia, 36 autism spectrum disorder (ASD), and 106 typically developing individuals were analyzed. FreeSurfer was used to obtain the data from the participant’s brain scans. Six classifiers were utilized to classify the subjects. Subsequently, 26 ultra-high risk for psychosis (UHR) and 17 first-episode psychosis (FEP) subjects were run through the trained classifiers. Lastly, the classifiers’ output of the patient groups was correlated with their clinical severity. All six classifiers performed relatively well to distinguish the subject groups, especially support vector machine (SVM) and Logistic regression (LR). Cortical thickness and subcortical volume feature groups were most useful for the classification. LR and SVM were highly consistent with clinical indices of ASD. When UHR and FEP groups were run with the trained classifiers, majority of the cases were classified as schizophrenia, none as ASD. Overall, SVM and LR were the best performing classifiers. Cortical thickness and subcortical volume were most useful for the classification, compared to surface area. LR, SVM, and DT’s output were clinically informative. The trained classifiers were able to help predict the diagnostic category of both UHR and FEP Individuals.
Machine Learning, Data Science and Deep Learning with Python
Build artificial neural networks with Tensorflow and Keras Classify images, data, and sentiments using deep learning Make predictions using linear regression, polynomial regression, and multivariate regression Data Visualization with MatPlotLib and Seaborn Implement machine learning at massive scale with Apache Spark's MLLib Understand reinforcement learning - and how to build a Pac-Man bot Classify data using K-Means clustering, Support Vector Machines (SVM), KNN, Decision Trees, Naive Bayes, and PCA Use train/test and K-Fold cross validation to choose and tune your models Build a movie recommender system using item-based and user-based collaborative filtering Clean your input data to remove outliers Design and evaluate A/B tests using T-Tests and P-Values You'll need a desktop computer (Windows, Mac, or Linux) capable of running Anaconda 3 or newer. The course will walk you through installing the necessary free software. Some prior coding or scripting experience is required. At least high school level math skills will be required. You'll need a desktop computer (Windows, Mac, or Linux) capable of running Anaconda 3 or newer.