 Accuracy


Practical Machine Learning Coursera

@machinelearnbot

About this course: Prediction and machine learning are among the most common tasks performed by data scientists and data analysts. This course will cover the basic components of building and applying prediction functions with an emphasis on practical applications. The course will provide basic grounding in concepts such as training and test sets, overfitting, and error rates. The course will also introduce a range of model-based and algorithmic machine learning methods including regression, classification trees, Naive Bayes, and random forests. The course will cover the complete process of building prediction functions including data collection, feature creation, algorithms, and evaluation.


Model-Powered Conditional Independence Test

arXiv.org Machine Learning

We consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables. Given i.i.d. samples from the joint distribution $f(x,y,z)$ of continuous random vectors $X$, $Y$ and $Z$, we determine whether $X \perp Y | Z$. We approach this by converting the conditional independence test into a classification problem. This allows us to harness very powerful classifiers like gradient-boosted trees and deep neural networks. These models can handle complex probability distributions and allow us to perform significantly better than the prior state of the art for high-dimensional CI testing. The main technical challenge in the classification problem is the need for samples from the conditional product distribution $f^{CI}(x,y,z) = f(x|z)f(y|z)f(z)$ (which equals the joint distribution if and only if $X \perp Y | Z$) when given access only to i.i.d. samples from the true joint distribution $f(x,y,z)$. To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to $f^{CI}$ in terms of total variation distance. We then develop theoretical results regarding the generalization bounds for classification for our problem, which translate into error bounds for CI testing. We provide a novel analysis of Rademacher-type classification bounds in the presence of non-i.i.d. near-independent samples. We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.
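The nearest neighbor bootstrap mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the procedure: split the data in half, and for each point in the first half keep its own $x$ and $z$ while borrowing $y$ from the second-half point whose $z$ is closest. The function name `nn_bootstrap`, the half-and-half split, and the Euclidean metric are assumptions made for illustration; the paper's exact splitting scheme may differ.

```python
import math
import random


def nn_bootstrap(samples, seed=0):
    """Return approximate draws from f(x|z) f(y|z) f(z).

    samples: list of (x, y, z) tuples, where z is a tuple of floats.
    The data is split in half; each point in the first half keeps its own
    x and z but takes y from the second-half point with the nearest z.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]

    resampled = []
    for x, _, z in first:
        # Nearest neighbour of z in the second half (Euclidean distance).
        _, y_nn, _ = min(second, key=lambda s: math.dist(s[2], z))
        resampled.append((x, y_nn, z))
    return resampled
```

Under the classification framing the abstract describes, original samples and resampled ones would then receive different labels and be passed to a classifier; held-out accuracy close to chance would suggest that conditional independence cannot be rejected.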


A constrained L1 minimization approach for estimating multiple Sparse Gaussian or Nonparanormal Graphical Models

arXiv.org Artificial Intelligence

Identifying context-specific entity networks from aggregated data is an important task, arising often in bioinformatics and neuroimaging. Computationally, this task can be formulated as jointly estimating multiple different, but related, sparse Undirected Graphical Models (UGMs) from aggregated samples across several contexts. Previous joint-UGM studies have mostly focused on sparse Gaussian Graphical Models (sGGMs) and can't identify context-specific edge patterns directly. We, therefore, propose a novel approach, SIMULE (detecting Shared and Individual parts of MULtiple graphs Explicitly), to learn multi-UGM via a constrained L1 minimization. SIMULE automatically infers both specific edge patterns that are unique to each context and shared interactions preserved among all the contexts. Through the L1 constrained formulation, this problem is cast as multiple independent subtasks of linear programming that can be solved efficiently in parallel. In addition to Gaussian data, SIMULE can also handle multivariate Nonparanormal data, which greatly relaxes the normality assumption that many real-world applications do not satisfy. We provide a novel theoretical proof showing that SIMULE achieves a consistent result at the rate O(log(Kp)/n_{tot}). On multiple synthetic datasets and two biomedical datasets, SIMULE shows significant improvement over state-of-the-art multi-sGGM and single-UGM baselines.


Dealing with Unbalanced Classes in Machine Learning - deep ideas

#artificialintelligence

In many real-world classification problems, we stumble upon training data with unbalanced classes. This means that the individual classes do not contain the same number of elements. For example, if we want to build an image-based skin cancer detection system using convolutional neural networks, we might encounter a dataset with about 95% negatives and 5% positives. This is for good reasons: Images associated with a negative diagnosis are way more common than images with a positive diagnosis. Rather than regarding this as a flaw in the dataset, we should leverage the additional information that we get.
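One consequence of a 95%/5% split, not spelled out in the excerpt, is that plain accuracy becomes uninformative. The sketch below uses made-up labels matching those proportions and a hypothetical classifier that always predicts "negative"; the helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive examples that were predicted positive."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_on_positives) / len(preds_on_positives)


# Hypothetical dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that never predicts the positive class.
always_negative = [0] * 100

print(accuracy(y_true, always_negative))  # 0.95
print(recall(y_true, always_negative))    # 0.0
```

The classifier scores 95% accuracy while detecting none of the positive cases, which is why metrics such as recall or F1 are usually reported alongside accuracy on unbalanced data.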


Per-instance Differential Privacy and the Adaptivity of Posterior Sampling in Linear and Ridge regression

arXiv.org Machine Learning

Differential privacy (DP), ever since its advent, has been a controversial object. On the one hand, it provides strong provable protection of individuals in a data set; on the other hand, it has been heavily criticized for being impractical, partially due to its complete independence of the actual data set it tries to protect. In this paper, we address this issue with a new and more fine-grained notion of differential privacy, per-instance differential privacy (pDP), which captures the privacy of a specific individual with respect to a fixed data set. We show that this is a strict generalization of the standard DP and inherits all its desirable properties, e.g., composition, invariance to side information and closedness under postprocessing, except that they all hold for every instance separately. When the data is drawn from a distribution, we show that per-instance DP implies generalization. Moreover, we provide explicit calculations of the per-instance DP for the output perturbation on a class of smooth learning problems. The result reveals an interesting and intuitive fact: an individual has stronger privacy if he/she has a small "leverage score" with respect to the data set and if he/she can be predicted more accurately using the leave-one-out data set. Using the developed techniques, we provide a novel analysis of the One-Posterior-Sample (OPS) estimator and show that when the data set is well-conditioned it provides $(\epsilon,\delta)$-pDP for any target individual and matches the exact lower bound up to a $1+\tilde{O}(n^{-1}\epsilon^{-2})$ multiplicative factor. We also propose AdaOPS, which uses adaptive regularization to achieve the same results with $(\epsilon,\delta)$-DP. Simulation shows a privacy and utility trade-off that is several orders of magnitude more favorable when we consider the privacy of only the users in the data set.
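To give a feel for what a "leverage score" measures, the following sketch computes the standard statistical leverage for simple linear regression with an intercept, $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. This is the textbook definition and is assumed here only for illustration; the paper's precise definition (for example, under ridge regularization) may differ, and the data values are made up.

```python
def leverage_scores(xs):
    """Leverage of each point in a simple linear regression with intercept.

    Uses h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2, the diagonal
    of the hat matrix for a one-feature model with an intercept term.
    """
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1 / n + (x - mean) ** 2 / sxx for x in xs]


# Hypothetical feature values; the last point lies far from the others.
xs = [1.0, 2.0, 3.0, 4.0, 20.0]
scores = leverage_scores(xs)

for x, h in zip(xs, scores):
    print(f"x = {x:5.1f}  leverage = {h:.3f}")
```

The outlying point receives by far the largest leverage, which matches the abstract's intuition that individuals who are unusual relative to the data set are harder to protect.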


Deep Fruit Detection in Orchards

arXiv.org Artificial Intelligence

An accurate and reliable image-based fruit detection system is critical for supporting higher-level agriculture tasks such as yield mapping and robotic harvesting. This paper presents the use of a state-of-the-art object detection framework, Faster R-CNN, in the context of fruit detection in orchards, including mangoes, almonds and apples. Ablation studies are presented to better understand the practical deployment of the detection network, including how much training data is required to capture variability in the dataset. Data augmentation techniques are shown to yield significant performance gains, resulting in a greater than twofold reduction in the number of training images required. In contrast, transferring knowledge between orchards contributed negligible performance gain over initialising the Deep Convolutional Neural Network directly from ImageNet features. Finally, to operate over orchard data containing between 100 and 1000 fruit per image, a tiling approach is introduced for the Faster R-CNN framework. The study has resulted in the best detection performance yet for these orchards relative to previous works, with an F1-score of 0.9 achieved for apples and mangoes.

I. INTRODUCTION

Vision-based fruit detection is a critical component for infield automation in agriculture. With accurate knowledge of individual fruit locations in the field, it is possible to perform yield estimation and mapping, which is important for growers as it facilitates efficient utilisation of resources and improves returns per unit area and time. Precise localisation of the fruit is also a necessary component of an automated robotic harvesting system, which can help mitigate one of the most labour-intensive tasks in an orchard [1].
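The F1-score reported in the abstract is the harmonic mean of precision and recall. As a minimal sketch, it can be computed from counts of true positives, false positives and false negatives; the counts used below are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from detection counts: 2*TP / (2*TP + FP + FN).

    Equivalent to the harmonic mean of precision TP/(TP+FP) and
    recall TP/(TP+FN). Returns 0.0 when there are no detections
    and no ground-truth objects to find.
    """
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical detector output: 90 correct detections,
# 10 spurious detections, 10 missed fruit.
print(f1_score(tp=90, fp=10, fn=10))
```

With these counts both precision and recall are 0.9, so the F1-score is also 0.9.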


Canelo Alvarez vs. Gennady Golovkin: Start Time, PPV Cost, TV Info

International Business Times

The fight that's been two years in the making and promises to be the best boxing match of 2017 is almost here. Canelo Alvarez and Gennady Golovkin will go head-to-head Saturday night at T-Mobile Arena in Las Vegas with multiple middleweight belts on the line. It won't sell as many pay-per-views as the Aug. 26 bout between Floyd Mayweather and Conor McGregor, though it'd be shocking if it didn't rank second on the year in terms of buys. The PPV starts at 8 p.m. EDT, and watching the fight on TV will cost fans $79.99. Three undercard fights will precede the main event between Alvarez and Golovkin.


Road Friction Estimation for Connected Vehicles using Supervised Machine Learning

arXiv.org Machine Learning

Connected vehicle technology is foreseen to play an important role in reducing the number of traffic accidents while being one of the main enabling components for autonomous driving. One application of such connectivity is to provide accurate information about road conditions, such as friction level, to drivers or to the intelligent systems controlling the car. Road surface friction can be defined as the grip between the car tyre and the underlying surface. In winter, when the temperature decreases dramatically, the friction level drops substantially, which can increase the risk of car accidents. Studies indicate that road conditions such as surface temperature, type of road, and structure of the road sides play an important role in the measured friction level, and some of these conditions can vary significantly within short distances under specific weather situations. Road friction prediction based on past sensor measurements available in the cars, e.g., temperature and sunlight, has the advantage of being independent of the road structure and surrounding infrastructure. Intelligent forecast systems rely on the availability of high-quality data in order to allow their multiple actors to make correct decisions in diverse traffic situations. These systems have the potential to increase the safety of road users by means of the timely sharing of road-related information. With the advances in car-to-car communication technology, Volvo cars are today equipped with a slippery road condition warning system to improve road safety and traffic flow.



Understanding Boosted Trees Models

#artificialintelligence

In the previous post, we learned about tree-based learning methods: the basics of tree-based models and the use of bagging to reduce variance. We also looked at one of the most famous learning algorithms based on the idea of bagging: random forests. In this post, we will look into the details of yet another type of tree-based learning algorithm: boosted trees. Boosting, similar to bagging, is a general class of learning algorithms where a set of weak learners is combined to produce a strong learner. For classification problems, a weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Recall that bagging involves creating multiple copies of the original training data set via bootstrapping, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model.
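The bagging recap above can be sketched in a few lines. The example below assumes the simplest possible tree, a one-feature decision stump, and combines the fitted stumps by majority vote; the function names and the toy data are illustrative and do not come from the post.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a decision stump to (x, label) pairs with labels 0 or 1.

    Returns (threshold, label_if_x_le_threshold, label_if_x_gt_threshold),
    chosen to minimise the number of training errors.
    """
    best = None
    for threshold in sorted({x for x, _ in points}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def bagged_stumps(points, n_models=25, seed=0):
    """Fit one stump per bootstrap copy of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) examples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models


def predict_bagged(models, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]


# Toy data: label 0 for x in 0..4, label 1 for x in 5..9.
points = [(x, 0) for x in range(5)] + [(x, 1) for x in range(5, 10)]
models = bagged_stumps(points)
print(predict_bagged(models, 1), predict_bagged(models, 8))
```

Each stump sees a slightly different resampled data set, so individual thresholds vary, but the vote averages that variation out. Boosting differs in that the learners are fitted sequentially, each one concentrating on the errors left by its predecessors, rather than independently on bootstrap copies.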