Goto

Collaborating Authors

 Accuracy


Machine Learning for Threat Analytics: A Boost or a Bust?

#artificialintelligence

Trying to discern drug smugglers passing through customs presents exactly the same problem as trying to discern security threats passing through our networks. Machine learning has been applied to both with varying degrees of success, but ultimately the technology reaches the same limitations. Machine learning has two basic elements: feature vectors and classification exemplars -- the data that is gathered and the corresponding classification examples. In the case of drug smugglers, we might observe number of travelers, point of origin, point of destination, number of bags, length of stay and weight of the bags. We might also flag any traveler or pair of travelers with two or more bags whose combined weight is greater than 150 pounds, whose stay is less than a week and who originated from a climate conducive to poppies.


Implementing your own k-nearest neighbour algorithm using Python

#artificialintelligence

In machine learning, you may often wish to build predictors that allows to classify things into categories based on some set of associated values. For example, it is possible to provide a diagnosis to a patient based on data from previous patients. Many algorithms have been developed for automated classification, and common ones include random forests, support vector machines, Naรฏve Bayes classifiers, and many types of neural networks. To get a feel for how classification works, we take a simple example of a classification algorithm โ€“ k-Nearest Neighbours (kNN) โ€“ and build it from scratch in Python 2. You can use a mostly imperative style of coding, rather than a declarative/functional one with lambda functions and list comprehensions to keep things simple if you are starting with Python. Here, we will provide an introduction to the latter approach.


WWE TLC 2016: Match Card, Predictions For 'SmackDown' PPV

International Business Times

The final "SmackDown" pay-per-view before the Royal Rumble is set for Sunday night in Dallas. WWE TLC: Tables, Ladders & Chairs 2016 will feature six matches, and all but one includes a stipulation. TLC is highlighted by multiple championship rematches. AJ Styles and Dean Ambrose will meet in a WWE World Championship match for a third straight "SmackDown" PPV. The same goes for The Miz and Dolph Ziggler, who have been fighting over The Intercontinental Title.


Reliably Learning the ReLU in Polynomial Time

arXiv.org Machine Learning

We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour (2009) where the learner is given access to a distribution $\cal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\cal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial-time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $\epsilon = \Omega(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $\epsilon$ must be $\Omega(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a "dual-loss" approach to convex programming. As a byproduct we obtain a number of applications including the first set of efficient algorithms for "convex piecewise-linear fitting" and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.


Stability selection for component-wise gradient boosting in multiple dimensions

arXiv.org Machine Learning

Noname manuscript No. (will be inserted by the editor) Abstract We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that allows to incorporate stability selection, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate (PFER). The model is fitted repeatedly to subsampled data and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new "noncyclical" fitting algorithm that incorporates an additional selection step of the best-fitting distribution parameter in each iteration. This new algorithms has the additional advantage that optimizing the tuning parameters of boosting is reduced from a multidimensional to a one-dimensional problem with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply this new algorithm to a study to estimate abundance of common eider in Massachusetts, USA, featuring excess zeros, overdispersion, non-linearity and spatiotemporal structures. Stability selection is used to obtain a sparse set of stable predictors. Keywords boosting ยท additive models ยท GAMLSS ยท gamboostLSS ยท Stability selection 1 Introduction In view of the growing size and complexity of modern databases, statistical modeling is increasingly faced with heteroscedasticity issues and a large number of available modeling options. In ecology, for example, it is often observed that outcome variables do not only show differences in mean conditions but also tend to be highly variable across different geographical features or states of a combination of covariates (e.g., [33]). In addition, ecological databases typically contain large numbers of correlated predictor variables that need to be carefully chosen for possible incorporation in a statistical regression model [1,8,31]. A convenient approach to address both heteroscedasticity and variable selection in statistical regression models is the combination of GAMLSS modeling with gradient boosting algorithms. GAMLSS, which refer to "generalized additive models for location, scale and shape" [34], are a modeling technique that relates not only the mean but all parameters of the outcome distribution to the available covariates.


Exploring Strategies for Classification of External Stimuli Using Statistical Features of the Plant Electrical Response

arXiv.org Machine Learning

Plants sense their environment by producing electrical signals which in essence represent changes in underlying physiological processes. These electrical signals, when monitored, show both stochastic and deterministic dynamics. In this paper, we compute 11 statistical features from the raw non-stationary plant electrical signal time series to classify the stimulus applied (causing the electrical signal). By using different discriminant analysis based classification techniques, we successfully establish that there is enough information in the raw electrical signal to classify the stimuli. In the process, we also propose two standard features which consistently give good classification results for three types of stimuli - Sodium Chloride (NaCl), Sulphuric Acid (H2SO4) and Ozone (O3). This may facilitate reduction in the complexity involved in computing all the features for online classification of similar external stimuli in future.


Machine Learning on Human Connectome Data from MRI

arXiv.org Machine Learning

Functional MRI (fMRI) and diffusion MRI (dMRI) are non-invasive imaging modalities that allow in-vivo analysis of a patient's brain network (known as a connectome). Use of these technologies has enabled faster and better diagnoses and treatments of neurological disorders and a deeper understanding of the human brain. Recently, researchers have been exploring the application of machine learning models to connectome data in order to predict clinical outcomes and analyze the importance of subnetworks in the brain. Connectome data has unique properties, which present both special challenges and opportunities when used for machine learning. The purpose of this work is to review the literature on the topic of applying machine learning models to MRI-based connectome data. This field is growing rapidly and now encompasses a large body of research. To summarize the research done to date, we provide a comparative, structured summary of 77 relevant works, tabulated according to different criteria, that represent the majority of the literature on this topic. (We also published a living version of this table online at http://connectomelearning.cs.sfu.ca that the community can continue to contribute to.) After giving an overview of how connectomes are constructed from dMRI and fMRI data, we discuss the variety of machine learning tasks that have been explored with connectome data. We then compare the advantages and drawbacks of different machine learning approaches that have been employed, discussing different feature selection and feature extraction schemes, as well as the learning models and regularization penalties themselves. Throughout this discussion, we focus particularly on how the methods are adapted to the unique nature of graphical connectome data. Finally, we conclude by summarizing the current state of the art and by outlining what we believe are strategic directions for future research.


Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data

arXiv.org Machine Learning

Background: High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets. In a clinical setting one is often interested in how mass spectra differ between patients of different classes, for example spectra from healthy patients vs. spectra from patients having a particular disease. Machine learning algorithms are needed to (a) identify these discriminating features and (b) classify unknown spectra based on this feature set. Since the acquired data is usually noisy, the algorithms should be robust against noise and outliers, while the identified feature set should be as small as possible. Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based on the theory of compressed sensing that allows us to identify a minimal discriminating set of features from mass spectrometry data-sets. We show (1) how our method performs on artificial and real-world data-sets, (2) that its performance is competitive with standard (and widely used) algorithms for analyzing proteomics data, and (3) that it is robust against random and systematic noise. We further demonstrate the applicability of our algorithm to two previously published clinical data-sets.


Introduction to Machine Learning for Developers

#artificialintelligence

Today's developers often hear about leveraging machine learning algorithms in order to build more intelligent applications, but many don't know where to start. One of the most important aspects of developing smart applications is to understand the underlying machine learning models, even if you aren't the person building them. Whether you are integrating a recommendation system into your app or building a chat bot, this guide will help you get started in understanding the basics of machine learning. This introduction to machine learning and list of resources is adapted from my October 2016 talk at ACT-W, a women's tech conference. Machine learning studies computer algorithms for learning to do stuff.


A Benchmark and Comparison of Active Learning for Logistic Regression

arXiv.org Machine Learning

Various active learning methods based on logistic regression have been proposed. In this paper, we investigate seven state-of-the-art strategies, present an extensive benchmark, and provide a better understanding of their underlying characteristics. Experiments are carried out both on 3 synthetic datasets and 43 real-world datasets, providing insights into the behaviour of these active learning methods with respect to classification accuracy and their computational cost.