Support Vector Machines
Determining the best classifier for predicting the value of a boolean field on a blood donor database
Motivation: Thanks to digitization, we often have access to large databases consisting of various fields of information, ranging from numbers to text and even boolean values. Such databases lend themselves especially well to machine learning, classification and big data analysis tasks. We can train classifiers on existing data and use them to predict the value of a certain field, given information about the other fields. Specifically, in this study we look at the Electronic Health Records (EHRs) compiled by hospitals. These EHRs are a convenient means of accessing the data of individual patients, but their processing as a whole remains a challenge. However, EHRs composed of coherent, well-tabulated structures lend themselves quite well to machine learning via classifiers. In this study, we look at the Blood Transfusion Service Center Data Set (data taken from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan). We used the scikit-learn machine learning library in Python. From the Support Vector Machines (SVM) module we use Support Vector Classification (SVC), and from the linear model module we import Perceptron. We also used KNeighborsClassifier and the decision tree classifier. We split the database into two parts: the first was used to train the classifiers, and the second to verify whether the classifiers' predictions matched the actual values. Contact: ritabratamaiti@hiretrex.com
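The workflow described in this abstract (split the data in two, train several scikit-learn classifiers on one part, check predictions on the other) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are a synthetic stand-in generated with `make_classification`, and the 50/50 split, the default hyperparameters, and the helper name `compare_classifiers` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, test_size=0.5, seed=0):
    """Train four classifiers on one part of the data, score them on the other."""
    # Hypothetical helper; the split ratio is an assumption, not from the abstract.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    models = {
        "SVC": SVC(),
        "Perceptron": Perceptron(random_state=seed),
        "KNeighborsClassifier": KNeighborsClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Fraction of held-out rows whose predicted label matches the actual one.
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for a tabular dataset with a boolean target field.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
scores = compare_classifiers(X, y)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

To run this on a real table, replace `X` and `y` with the feature columns and the boolean field to be predicted.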
Simulation assisted machine learning
Deist, Timo, Patti, Andrew, Wang, Zhaoqi, Krane, David, Sorenson, Taylor, Craft, David
Predicting how a proposed cancer treatment will affect a given tumor can be cast as a machine learning problem, but the complexity of biological systems, the number of potentially relevant genomic and clinical features, and the lack of very large scale patient data repositories make this a unique challenge. "Pure data" approaches to this problem are underpowered to detect combinatorially complex interactions and are bound to uncover false correlations despite statistical precautions taken (1). To investigate this setting, we propose a method to integrate simulations, a strong form of prior knowledge, into machine learning, a combination which to date has been largely unexplored. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to train kernelized machine learning algorithms such as support vector machines, thus handling the curse-of-dimensionality that typically affects genomic machine learning. Using four synthetic datasets of complex systems--three biological models and one network flow optimization model--we demonstrate that when the number of training samples is small compared to the number of features, the simulation kernel approach dominates over no-prior-knowledge methods. In addition to biology and medicine, this approach should be applicable to other disciplines, such as weather forecasting, financial markets, and agricultural management, where predictive models are sought and informative yet approximate simulations are available. The Python SimKern software, the models (in MATLAB, Octave, and R), and the datasets are made freely available at https://github.com/davidcraft/SimKern .
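The idea of training a kernelized learner on simulation-derived similarities rather than raw features can be sketched as below. This is only an illustration under stated assumptions and does not use the SimKern software: `toy_simulation` is a hypothetical stand-in for a domain simulator, the similarity is taken to be an average of Gaussian similarities between simulated outcomes (one per parameter draw), and the data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC


def toy_simulation(X, theta):
    """Hypothetical simulator: maps each sample to one scalar outcome."""
    return np.tanh(X @ theta)


def simulation_kernel(XA, XB, thetas, scale=0.5):
    """Average similarity of simulated outcomes over several parameter draws.

    Two samples score close to 1 when they behave alike under most draws.
    The Gaussian form and the scale value are assumptions of this sketch.
    """
    K = np.zeros((XA.shape[0], XB.shape[0]))
    for theta in thetas:
        a = toy_simulation(XA, theta)
        b = toy_simulation(XB, theta)
        K += np.exp(-((a[:, None] - b[None, :]) ** 2) / scale)
    return K / len(thetas)


rng = np.random.default_rng(0)
n_train, n_test, n_features = 40, 20, 200  # few samples, many features
X = rng.normal(size=(n_train + n_test, n_features))
true_theta = rng.normal(size=n_features) / np.sqrt(n_features)
y = (X @ true_theta > 0).astype(int)

# Uncertain simulation parameters: noisy copies of the (unknown) true ones.
thetas = [
    true_theta + 0.3 * rng.normal(size=n_features) / np.sqrt(n_features)
    for _ in range(25)
]

X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

K_train = simulation_kernel(X_train, X_train, thetas)  # (n_train, n_train)
K_test = simulation_kernel(X_test, X_train, thetas)    # (n_test, n_train)

clf = SVC(kernel="precomputed").fit(K_train, y_train)
pred = clf.predict(K_test)
accuracy = float(np.mean(pred == y_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

The classifier only ever sees the similarity matrices, so its input size depends on the number of samples rather than the number of features.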
Stealing Hyperparameters in Machine Learning
Wang, Binghui, Gong, Neil Zhenqiang
Hyperparameters are critical in machine learning, as different hyperparameters often result in models with significantly different performance. Hyperparameters may be deemed confidential because of their commercial value and the confidentiality of the proprietary algorithms that the learner uses to learn them. In this work, we propose attacks on stealing the hyperparameters that are learned by a learner. We call our attacks hyperparameter stealing attacks. Our attacks are applicable to a variety of popular machine learning algorithms such as ridge regression, logistic regression, support vector machine, and neural network. We evaluate the effectiveness of our attacks both theoretically and empirically. For instance, we evaluate our attacks on Amazon Machine Learning. Our results demonstrate that our attacks can accurately steal hyperparameters. We also study countermeasures. Our results highlight the need for new defenses against our hyperparameter stealing attacks for certain machine learning algorithms.
The Error Probability of Random Fourier Features is Dimensionality Independent
We show that the error probability of reconstructing kernel matrices from Random Fourier Features for the Gaussian kernel function is at most $\mathcal{O}(R^{2/3} \exp(-D))$, where $D$ is the number of random features and $R$ is the diameter of the data domain. We also provide an information-theoretic method-independent lower bound of $\Omega(\exp(-D))$ for $R>2.1$. Compared to prior work, we are the first to show that the error probability for random Fourier features is independent of the dimensionality of data points. As applications of our theory, we obtain dimension-independent bounds for kernel ridge regression and support vector machines.
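For orientation, the object being analyzed is the approximation of a Gaussian kernel matrix by inner products of $D$ random cosine features. The sketch below uses the standard random Fourier feature construction with NumPy only; the bandwidth `sigma`, the data sizes, and the feature counts are arbitrary choices for illustration, and the printed errors are empirical observations rather than the bounds stated above.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma=1.0):
    """Exact Gaussian kernel matrix exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))


def random_fourier_features(X, n_features, sigma=1.0, seed=0):
    """Map X to n_features random cosine features for the Gaussian kernel."""
    rng = np.random.default_rng(seed)
    # Frequencies are drawn from the kernel's spectral density, N(0, I / sigma^2).
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
sigma = 3.0
K_exact = gaussian_kernel(X, X, sigma)

errors = {}
for D in (20, 200, 5000):
    Z = random_fourier_features(X, D, sigma)
    # Largest entrywise deviation between the approximate and exact kernel matrix.
    errors[D] = float(np.max(np.abs(Z @ Z.T - K_exact)))
    print(f"D={D}: max abs error {errors[D]:.4f}")
```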
5 Career Tips & Outlooks for Analytics Professionals
Most people in the field of analytics can remember writing their own analytical code. Today, our Data Scientists in the MSiA program at Northwestern can produce analytical models from regression, decision trees, support vector machines (and more), all with more or less one simple execution. The manual step is minor. In fact, the manual step is being removed as analytics moves into automation and artificial intelligence. Career Take Away: Develop skills in many model types.
Improving Mild Cognitive Impairment Prediction via Reinforcement Learning and Dialogue Simulation
Tang, Fengyi, Lin, Kaixiang, Uchendu, Ikechukwu, Dodge, Hiroko H., Zhou, Jiayu
Mild cognitive impairment (MCI) is a prodromal phase in the progression from normal aging to dementia, especially Alzheimer's disease. Even though there is mild cognitive decline in MCI patients, they have normal overall cognition and are thus challenging to distinguish from normally aging individuals. Using transcribed data obtained from recorded conversational interactions between participants and trained interviewers, and applying supervised learning models to these data, a recent clinical trial has shown a promising result in differentiating MCI from normal aging. However, the substantial amount of interaction with medical staff can still incur significant medical care expenses in practice. In this paper, we propose a novel reinforcement learning (RL) framework to train an efficient dialogue agent on existing transcripts from clinical trials. Specifically, the agent is trained to sketch a disease-specific lexical probability distribution, and thus to converse in a way that maximizes the diagnosis accuracy and minimizes the number of conversation turns. We evaluate the performance of the proposed reinforcement learning framework on MCI diagnosis from a real clinical trial. The results show that while using only a few turns of conversation, our framework can significantly outperform state-of-the-art supervised learning approaches.
D2KE: From Distance to Kernel and Embedding
Wu, Lingfei, Yen, Ian En-Hsu, Xu, Fangli, Ravikumar, Pradeep, Witbrock, Michael
For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine learning models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \R$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \R$ between inputs. In particular, we propose a general framework to derive a family of \emph{positive definite kernels} from a given dissimilarity measure, which subsumes the widely-used \emph{representative-set method} as a special case, and relates to the well-known \emph{distance substitution kernel} in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than Nearest-Neighbor estimates. Our approach draws from the literature on Random Features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map that we specify given the distance measure. We conduct classification experiments in such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method.
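One way to picture a kernel built from a random feature map over a distance is sketched below. It is a loose illustration and not the paper's algorithm: the feature form `exp(-gamma * d(x, w))`, the use of short random strings as anchor objects, the edit distance as the dissimilarity, and all names and parameter values are assumptions made for this example.

```python
import numpy as np


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def random_strings(n, alphabet="ab", max_len=5, seed=0):
    """Draw n short random strings to serve as random anchor objects."""
    rng = np.random.default_rng(seed)
    anchors = []
    for _ in range(n):
        length = int(rng.integers(1, max_len + 1))
        anchors.append("".join(rng.choice(list(alphabet), size=length)))
    return anchors


def distance_features(inputs, anchors, dist, gamma=0.5):
    """Embed each input via its distances to the random anchors."""
    D = np.array([[dist(x, w) for w in anchors] for x in inputs], dtype=float)
    # Assumed feature form: exp(-gamma * distance), scaled so Z @ Z.T is an average.
    return np.exp(-gamma * D) / np.sqrt(len(anchors))


inputs = ["abba", "abab", "bbbb", "aaaa", "ab", "ba"]
anchors = random_strings(64)
Z = distance_features(inputs, anchors, edit_distance)

# The induced kernel is a Gram matrix of the features, so it is positive
# semidefinite by construction, whatever distance was plugged in.
K = Z @ Z.T
print(np.round(K, 3))
```

Any model that accepts vector inputs can then be trained on `Z`, or a kernel method can be given `K` directly.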
Automatic Conflict Detection in Police Body-Worn Audio
Letcher, Alistair, Trišović, Jelena, Cademartori, Collin, Chen, Xi, Xu, Jason
In this paper we propose a novel method for automatic conflict detection in police body-worn audio (BWA). Methodologies from statistics, signal processing and machine learning play a burgeoning role in criminology and predictive policing [2], but such tools have not yet been explored for conflict detection in body-worn recordings. Moreover, we find that existing approaches are ineffective when applied to these data off-the-shelf. Notable papers on conflict escalation investigate speech overlap (interruption) and conversational turn-taking as indicators of conflict in political debates. In [3], overlap statistics directly present in a hand-labelled dataset are used to predict conflict, while [4] detect overlap through a Support Vector Machine (SVM) with acoustic and prosodic features. The work in [5] compares variations on both methods. Using automatic overlap detection, their method achieves 62.3% unweighted conflict accuracy at best in political debate audio. This approach is all the less effective on BWA data, which is far noisier and more diverse.
Support Vector Machines
Support vector machines (SVMs) and kernel methods are important machine learning techniques. In this short course, we will introduce their basic concepts. We then focus on the training and optimization procedures of SVMs. Examples demonstrating the practical use of SVMs will also be discussed. The focus is mainly on classification.