 Performance Analysis


How To Stop Online Harassment: Google Uses Machine Learning Tools To More Accurately Spot Abusive Content

International Business Times

A subsidiary of Google's parent company Alphabet, Jigsaw, is using machine learning to fend off online trolling, reports Wired. The New York–based think tank is building open-source AI tools, collectively called Conversation AI, to filter out harassment and abusive language. "Few things poison conversations online more than abusive language, threats, and harassment," reads the Conversation AI website. "We're studying how computers can learn to understand the nuances and context of abusive language at scale. If successful, machine learning could help publishers and moderators improve comments on their platforms and enhance the exchange of ideas on the internet."


MLDB Blog

#artificialintelligence

The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, resumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs, etc. In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
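The mapping from expected utility to ROC space can be sketched in a few lines of Python. This is the textbook formulation, not necessarily the exact derivation the article uses: the function names, the four utility parameters (`u_tp`, `u_fn`, `u_fp`, `u_tn`), and the numbers in the example are all assumptions made for illustration. Points in ROC space with equal expected utility lie on straight lines, and the slope of those indifference lines depends only on class prevalence and the utility differences.

```python
def expected_utility(fpr, tpr, prevalence, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per item for a classifier operating at (fpr, tpr)."""
    # Positives are caught with probability tpr and missed otherwise.
    positives = prevalence * (tpr * u_tp + (1.0 - tpr) * u_fn)
    # Negatives are wrongly flagged with probability fpr.
    negatives = (1.0 - prevalence) * (fpr * u_fp + (1.0 - fpr) * u_tn)
    return positives + negatives


def indifference_slope(prevalence, u_tp, u_fn, u_fp, u_tn):
    """Slope (d tpr / d fpr) of the lines of constant expected utility."""
    return ((1.0 - prevalence) * (u_tn - u_fp)) / (prevalence * (u_tp - u_fn))


def best_operating_point(roc_points, prevalence, u_tp, u_fn, u_fp, u_tn):
    """Pick the (fpr, tpr) point with the highest expected utility."""
    return max(
        roc_points,
        key=lambda p: expected_utility(p[0], p[1], prevalence, u_tp, u_fn, u_fp, u_tn),
    )


if __name__ == "__main__":
    # Hypothetical numbers: 10% of items are positive, a caught positive is
    # worth +10, a missed one costs -10, and a false alarm costs -1.
    points = [(0.0, 0.0), (0.05, 0.6), (0.2, 0.85), (0.5, 0.95), (1.0, 1.0)]
    print(best_operating_point(points, 0.1, 10.0, -10.0, -1.0, 0.0))
```

A steep indifference slope (rare positives or costly false alarms) favors operating points near the left edge of the ROC plot; a shallow slope favors points near the top.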


How to increase Naive Bayes accuracy? • /r/MachineLearning

@machinelearnbot

How to increase Naive Bayes accuracy? The total size of the dataset was 81. OK, so I ran the program against the test data and it gave me an accuracy of only 21%. Can anyone tell me why that is? Where am I going wrong?


How to make Training Data for Naive Bayes? • /r/MachineLearning

@machinelearnbot

I am learning the NB algorithm and implementing it on a real dataset that contains only 80 records. Now I want to prepare training data. I want to know whether training data is made from the actual data or from the actual pattern given in the real data. Also, does training data mean covering all cases given in the real data, or what?


Predictive modelling of football injuries

arXiv.org Machine Learning

The goal of this thesis is to investigate the potential of predictive modelling for football injuries. This work was conducted in close collaboration with Tottenham Hotspur FC (THFC) and the PGA European Tour, with the participation of Wolverhampton Wanderers (WW). Three investigations were conducted: 1. Predicting the recovery time of football injuries using the UEFA injury recordings: The UEFA recordings are a common standard for recording injuries in professional football. For this investigation, three datasets of UEFA injury recordings were available. Different machine learning algorithms were used in order to build a predictive model. The performance of the machine learning models was then improved by using feature selection conducted through correlation-based subset feature selection and random forests. 2. Predicting injuries in professional football using exposure records: The relationship between exposure (in training hours and match hours) in professional football athletes and injury incidence was studied. A common problem in football is understanding how the training schedule of an athlete can affect the chance of him getting injured. The task was to predict the number of days a player can train before he gets injured. 3. Predicting intrinsic injury incidence using in-training GPS measurements: A significant percentage of football injuries can be attributed to overtraining and fatigue. GPS data collected during training sessions might provide indicators of fatigue, or might be used to detect very intense training sessions which can lead to overtraining. This research used GPS data gathered during training sessions of the first team of THFC, in order to predict whether an injury would take place during a week.


Conformalized Kernel Ridge Regression

arXiv.org Machine Learning

General predictive models do not provide a measure of confidence in predictions without Bayesian assumptions. A way to circumvent potential restrictions is to use conformal methods for constructing non-parametric confidence regions that offer guarantees regarding validity. In this paper we provide a detailed description of a computationally efficient conformal procedure for Kernel Ridge Regression (KRR), and conduct a comparative numerical study to see how well conformal regions perform against the Bayesian confidence sets. The results suggest that conformalized KRR can yield predictive confidence regions with a specified coverage rate, which is essential in constructing anomaly detection systems based on predictive models.
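To make the idea of a conformal confidence region concrete, here is a minimal sketch of split (inductive) conformal prediction with absolute-residual scores. This is a simpler relative of the procedure the paper describes, not the paper's KRR-specific algorithm: the `predict` callable stands in for any fitted point predictor (a KRR model would be one option), and all names and data below are hypothetical.

```python
import math


def split_conformal_radius(calib_residuals, alpha):
    """Half-width of the interval from held-out absolute residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, the usual
    finite-sample correction for split conformal prediction.
    """
    n = len(calib_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(calib_residuals)[k - 1]


def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Interval around predict(x_new) targeting coverage 1 - alpha.

    `predict` must have been fitted on data disjoint from the calibration set.
    """
    residuals = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    radius = split_conformal_radius(residuals, alpha)
    center = predict(x_new)
    return center - radius, center + radius


if __name__ == "__main__":
    # Hypothetical predictor and calibration data, for illustration only.
    model = lambda x: 2 * x
    xs = [0, 1, 2, 3, 4, 5, 6]
    ys = [1, 0, 7, 2, 13, 4, 19]
    print(conformal_interval(model, xs, ys, x_new=10, alpha=0.25))
```

The coverage guarantee of this variant relies on the calibration points and the new point being exchangeable; it trades some statistical efficiency for not having to refit the model per candidate label.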


Practical advice for applying machine learning

#artificialintelligence

Sprinkled throughout Andrew Ng's machine learning class is a lot of practical advice for applying machine learning. That's what I'm trying to compile and summarize here. The key is dividing data into training, cross-validation and test sets. The test set is used only to evaluate performance, not to train parameters or select a model representation. The rationale for this is that training set error is not a good predictor of how well your hypothesis will generalize to new examples.
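A minimal sketch of such a three-way split, using only the Python standard library. The 60/20/20 proportions, the function name `split_dataset`, and the fixed seed are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random


def split_dataset(examples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle once, then cut into training, cross-validation and test sets."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_cv = int(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    # Whatever remains is the test set; touch it only for the final evaluation.
    test = shuffled[n_train + n_cv:]
    return train, cv, test


if __name__ == "__main__":
    train, cv, test = split_dataset(range(100))
    print(len(train), len(cv), len(test))
```

Parameters are fitted on `train`, model choices (features, regularization strength, and so on) are compared on `cv`, and `test` is used once at the end to estimate generalization error.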


impact to AUC if swap positive and negative during model training

#artificialintelligence

I swap the positive class and the negative class, then train a model again (I tried decision tree, AdaBoost, and SVM from the scikit-learn built-in package) for a two-class classification problem. Sometimes, I can see the AUC change slightly (around 1-2%). Does anyone have any idea why there are such changes? For the ROC curve, the x-axis is the false positive rate and the y-axis is the true positive rate. When the prediction model gives prediction scores, we order the scores from higher value to lower value, then choose thresholds according to the sorted values and calculate the FPR and TPR at each threshold.
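The threshold sweep described above can be sketched in plain Python. The function names and the toy labels and scores are made up for illustration. The sketch also checks one property that may help narrow the question down: for a fixed set of scores, flipping the labels and negating the scores leaves the AUC unchanged, so a 1-2% difference would have to come from the retrained model producing different scores, not from the metric itself.

```python
def roc_points(labels, scores):
    """ROC curve as a list of (fpr, tpr) points, sweeping thresholds high to low."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    original = auc(roc_points(labels, scores))
    # Swap the classes and use the complementary ranking.
    swapped = auc(roc_points([1 - y for y in labels], [-s for s in scores]))
    print(original, swapped)
```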


Pinnability: Machine learning in the home feed

#artificialintelligence

The home feed, a collection of Pins from the people, boards and interests followed, as well as recommendations including Picked for You, is the most heavily user-engaged part of the service, and contributes a large fraction of total repins. The more people Pin, the better Pinterest can get for each person, which puts us in a unique position to serve up inspiration as a discovery engine on an ongoing basis. The home feed is a key way to discover new content, which is valuable to the Pinner, but poses a challenging question. Given the ever increasing number of Pins from various sources, how can we surface the most personalized and relevant Pins? Pinnability is the collective name of the machine learning models we developed to help Pinners find the best content in their home feed.


Detecting weak changes in dynamic events over networks

arXiv.org Machine Learning

Large volumes of networked streaming event data are becoming increasingly available in a wide variety of applications, such as social network analysis, Internet traffic monitoring and healthcare analytics. Streaming event data are discrete observations occurring in continuous time, and the precise time interval between two events carries a great deal of information about the dynamics of the underlying systems. How can we promptly detect changes in these dynamic systems using streaming event data? In this paper, we propose a novel change-point detection framework for multi-dimensional event data over networks. We cast the problem as a sequential hypothesis test, and derive the likelihood ratios for point processes, which are computed efficiently via an EM-like algorithm that is parameter-free and can be computed in a distributed fashion. We derive a highly accurate theoretical characterization of the false-alarm rate, and show that it can achieve weak signal detection by aggregating local statistics over time and networks. Finally, we demonstrate the good performance of our algorithm on numerical examples and real-world datasets from Twitter and Memetracker.
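For intuition about sequential likelihood-ratio testing on event times, here is a sketch of the classical CUSUM procedure for a single event stream. It is a much simpler stand-in, not the paper's framework: it assumes exponential inter-event times (a homogeneous Poisson process) whose rate jumps from a known `rate_before` to a known `rate_after`, with no network structure and no EM step. The function name and the numbers are hypothetical.

```python
import math


def cusum_rate_change(inter_event_times, rate_before, rate_after, threshold):
    """Return the index of the event at which an alarm is raised, or None.

    Each gap contributes the log-likelihood ratio of Exp(rate_after) versus
    Exp(rate_before); the running sum is clipped at zero (CUSUM).
    """
    log_ratio = math.log(rate_after / rate_before)
    statistic = 0.0
    for index, gap in enumerate(inter_event_times):
        llr = log_ratio - (rate_after - rate_before) * gap
        statistic = max(0.0, statistic + llr)
        if statistic > threshold:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical stream: events one time unit apart, then ten times denser.
    gaps = [1.0] * 5 + [0.1] * 5
    print(cusum_rate_change(gaps, rate_before=1.0, rate_after=4.0, threshold=3.0))
```

A higher threshold lowers the false-alarm rate at the cost of a longer detection delay.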