Accuracy
Sampling Method for Fast Training of Support Vector Data Description
Chaudhuri, Arin, Kakde, Deovrat, Jahja, Maria, Xiao, Wei, Jiang, Hansi, Kong, Seunghyun, Peredriy, Sergiy
Support Vector Data Description (SVDD) is a popular outlier detection technique that constructs a flexible description of the input data. SVDD computation time is high for large training datasets, which limits its use in big-data process-monitoring applications. We propose a new iterative sampling-based method for SVDD training. The method incrementally learns the training data description at each iteration by computing SVDD on an independent random sample selected with replacement from the training data set. The experimental results indicate that the proposed method is extremely fast and provides a good data description.
Google's Jigsaw subsidiary is building open-source AI tools to spot trolls
Can Google bring peace to the web with machine learning? Jigsaw, a subsidiary of parent company Alphabet, is certainly trying, building open-source AI tools designed to filter out abusive language. A new feature from Wired describes how the software has been trained on some 17 million comments left underneath New York Times stories, along with 13,000 discussions on Wikipedia pages. This data is labeled and then fed into the software -- called Conversation AI -- which begins to learn what bad comments look like. According to the report, Google says Conversation AI can identify abuse with "more than 92 percent certainty and a 10 percent false-positive rate" when compared to the judgements of a human panel.
How To Stop Online Harassment: Google Uses Machine Learning Tools To More Accurately Spot Abusive Content
A subsidiary of Google's parent company Alphabet, Jigsaw, is using machine learning to fend off online trolling, reports Wired. The New York-based think tank is building open-source AI tools, collectively called Conversation AI, to filter out harassment and abusive language. "Few things poison conversations online more than abusive language, threats, and harassment," reads the Conversation AI website. "We're studying how computers can learn to understand the nuances and context of abusive language at scale. If successful, machine learning could help publishers and moderators improve comments on their platforms and enhance the exchange of ideas on the internet."
MLDB Blog
The business world is full of streams of items that need to be filtered or evaluated: parts on an assembly line, resumés in an application pile, emails in a delivery queue, transactions awaiting processing. Machine learning techniques are increasingly being used to make such processes more efficient: image processing to flag bad parts, text analysis to surface good candidates, spam filtering to sort email, fraud detection to lower transaction costs, etc. In this article, I show how you can take business factors into account when using machine learning to solve these kinds of problems with binary classifiers. Specifically, I show how the concept of expected utility from the field of economics maps onto the Receiver Operating Characteristic (ROC) space often used by machine learning practitioners to compare and evaluate models for binary classification. I begin with a parable illustrating the dangers of not taking such factors into account. This concrete story is followed by a more formal mathematical look at the use of indifference curves in ROC space to avoid this kind of problem and guide model development. I wrap up with some recommendations for successfully using binary classifiers to solve business problems.
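One way to make the mapping from expected utility to ROC space concrete is the minimal sketch below. It is not taken from the article: the function names, the utility values, the prevalence, and the ROC points are all hypothetical and chosen only for illustration. It assumes a fixed utility for each of the four outcomes (true positive, false negative, false positive, true negative) and a known fraction of positive items.

```python
def expected_utility(fpr, tpr, prevalence, u_tp, u_fn, u_fp, u_tn):
    """Expected utility per item at the ROC operating point (fpr, tpr)."""
    from_positives = prevalence * (tpr * u_tp + (1 - tpr) * u_fn)
    from_negatives = (1 - prevalence) * (fpr * u_fp + (1 - fpr) * u_tn)
    return from_positives + from_negatives


def indifference_slope(prevalence, u_tp, u_fn, u_fp, u_tn):
    """Slope dTPR/dFPR of the lines in ROC space along which expected utility is constant."""
    return ((1 - prevalence) * (u_tn - u_fp)) / (prevalence * (u_tp - u_fn))


# Hypothetical business setting: 10% of items are positive, a missed positive is
# costly, and a false alarm carries a small cost.
params = dict(prevalence=0.1, u_tp=100.0, u_fn=-100.0, u_fp=-10.0, u_tn=0.0)

# Hypothetical operating points (fpr, tpr) read off a classifier's ROC curve.
roc_points = [(0.0, 0.0), (0.05, 0.6), (0.2, 0.85), (0.5, 0.95), (1.0, 1.0)]

# Pick the operating point with the highest expected utility.
best = max(roc_points, key=lambda pt: expected_utility(pt[0], pt[1], **params))
slope = indifference_slope(**params)

print("best operating point:", best)
print("indifference curve slope:", slope)
```

Under these assumed numbers, any two ROC points joined by a line of slope `slope` have the same expected utility, so the preferred operating point is the one lying on the highest such line, which need not be the point with the best raw accuracy.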
How to make Training Data for Naive Bayes? • /r/MachineLearning
I am learning the NB algorithm and implementing it on a real dataset that contains only 80 records. Now I want to prepare training data. I want to know whether training data is made from the actual data or from the actual pattern given in the real data. Also, does training data mean covering all cases given in the real data, or what?
Predictive modelling of football injuries
The goal of this thesis is to investigate the potential of predictive modelling for football injuries. This work was conducted in close collaboration with Tottenham Hotspur FC (THFC) and the PGA European Tour, with the participation of Wolverhampton Wanderers (WW). Three investigations were conducted: 1. Predicting the recovery time of football injuries using the UEFA injury recordings: The UEFA recordings are a common standard for recording injuries in professional football. For this investigation, three datasets of UEFA injury recordings were available. Different machine learning algorithms were used in order to build a predictive model. The performance of the machine learning models was then improved by using feature selection conducted through correlation-based subset feature selection and random forests. 2. Predicting injuries in professional football using exposure records: The relationship between exposure (in training hours and match hours) in professional football athletes and injury incidence was studied. A common problem in football is understanding how the training schedule of an athlete can affect the chance of him getting injured. The task was to predict the number of days a player can train before he gets injured. 3. Predicting intrinsic injury incidence using in-training GPS measurements: A significant percentage of football injuries can be attributed to overtraining and fatigue. GPS data collected during training sessions might provide indicators of fatigue, or might be used to detect very intense training sessions which can lead to overtraining. This research used GPS data gathered during training sessions of the first team of THFC in order to predict whether an injury would take place during a week.
Conformalized Kernel Ridge Regression
Burnaev, Evgeny, Nazarov, Ivan
General predictive models do not provide a measure of confidence in predictions without Bayesian assumptions. A way to circumvent potential restrictions is to use conformal methods for constructing non-parametric confidence regions that offer guarantees regarding validity. In this paper we provide a detailed description of a computationally efficient conformal procedure for Kernel Ridge Regression (KRR), and conduct a comparative numerical study to see how well conformal regions perform against the Bayesian confidence sets. The results suggest that conformalized KRR can yield predictive confidence regions with a specified coverage rate, which is essential in constructing anomaly detection systems based on predictive models.
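To give a feel for how a conformal region is built around a KRR prediction, here is a minimal sketch. It is not the procedure described in the paper: it swaps in split (inductive) conformal prediction, a simpler variant that holds out a calibration set, and it assumes NumPy is available. The synthetic data, the RBF kernel, the hyperparameters `gamma` and `lam`, and all function names are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma=10.0):
    """RBF kernel matrix between 1-D input arrays a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-gamma * diff ** 2)


def krr_fit(x, y, lam=1e-2, gamma=10.0):
    """Dual coefficients of kernel ridge regression: (K + lam * I)^{-1} y."""
    gram = rbf_kernel(x, x, gamma)
    return np.linalg.solve(gram + lam * np.eye(len(x)), y)


def krr_predict(x_train, coef, x_new, gamma=10.0):
    """KRR predictions at x_new given training inputs and dual coefficients."""
    return rbf_kernel(x_new, x_train, gamma) @ coef


# Synthetic 1-D regression data (assumed for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 400)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(400)

# Split into a proper training set, a calibration set, and a test set.
x_fit, y_fit = x[:100], y[:100]
x_cal, y_cal = x[100:200], y[100:200]
x_test, y_test = x[200:], y[200:]

coef = krr_fit(x_fit, y_fit)

# Nonconformity scores: absolute residuals on the calibration set.
scores = np.abs(y_cal - krr_predict(x_fit, coef, x_cal))

# Conformal quantile: the ceil((n + 1) * (1 - miscoverage))-th smallest score.
miscoverage = 0.1
n = len(scores)
k = int(np.ceil((n + 1) * (1 - miscoverage)))
q = np.sort(scores)[k - 1]

# Prediction intervals of constant half-width q around the KRR prediction.
pred = krr_predict(x_fit, coef, x_test)
lower, upper = pred - q, pred + q
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print("half-width:", q, "empirical coverage:", coverage)
```

Split conformal trades some statistical efficiency for simplicity, because the calibration points are not used to fit the regressor; a full conformal procedure such as the one the abstract refers to avoids that split.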
impact to AUC if swap positive and negative during model training
I swap the positive class and the negative class, then train a model again (I tried decision tree, adaboost, and svm from the scikit-learn built-in package) for a two-class classification problem. Sometimes I can see the AUC change slightly (around 1-2%). Anyone have any ideas why there are such changes? For the ROC curve, the x-axis is the false positive rate and the y-axis is the true positive rate. When the prediction model gives prediction scores, we order the scores from higher to lower value, then choose a threshold according to the sorted values and calculate the fpr and tpr at that specific threshold point.
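A small sketch can help separate the metric from the training procedure here. It computes AUC directly from its pairwise definition (the probability that a random positive is scored above a random negative, with ties counted as one half), using made-up labels and scores and a hypothetical helper named `auc`. It shows that AUC itself is symmetric under a label swap as long as the scores are reversed too, which suggests, without proving it for any particular scikit-learn model, that small differences after retraining come from the refit model producing a different ranking rather than from the metric.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and prediction scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

original = auc(labels, scores)

# Swap the class labels but keep the same scores: AUC becomes 1 - AUC.
swapped_labels = [1 - y for y in labels]
swapped_same_scores = auc(swapped_labels, scores)

# Swap the labels and reverse the ranking (negate the scores): AUC is unchanged.
negated_scores = [-s for s in scores]
swapped_reversed = auc(swapped_labels, negated_scores)

print(original, swapped_same_scores, swapped_reversed)
```

If a retrained model's scores for the new positive class were an exact reversal of the old ranking, the AUC would match exactly; tie handling, random seeds, and asymmetries in how a learner treats the two classes are plausible reasons the ranking, and hence the AUC, can shift slightly.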
Detecting weak changes in dynamic events over networks
Li, Shuang, Xie, Yao, Farajtabar, Mehrdad, Verma, Apurv, Song, Le
Large volumes of networked streaming event data are becoming increasingly available in a wide variety of applications, such as social network analysis, Internet traffic monitoring and healthcare analytics. Streaming event data are discrete observations occurring in continuous time, and the precise time interval between two events carries a great deal of information about the dynamics of the underlying systems. How can we promptly detect changes in these dynamic systems using these streaming event data? In this paper, we propose a novel change-point detection framework for multi-dimensional event data over networks. We cast the problem as a sequential hypothesis test, and derive the likelihood ratios for point processes, which are computed efficiently via an EM-like algorithm that is parameter-free and can be computed in a distributed fashion. We derive a highly accurate theoretical characterization of the false-alarm rate, and show that the method can achieve weak signal detection by aggregating local statistics over time and networks. Finally, we demonstrate the good performance of our algorithm on numerical examples and real-world datasets from Twitter and Memetracker.