 Performance Analysis


Provable Sparse Tensor Decomposition

arXiv.org Machine Learning

We propose a novel sparse tensor decomposition method, namely Tensor Truncated Power (TTP) method, that incorporates variable selection into the estimation of decomposition components. The sparsity is achieved via an efficient truncation step embedded in the tensor power iteration. Our method applies to a broad family of high dimensional latent variable models, including high dimensional Gaussian mixture and mixtures of sparse regressions. A thorough theoretical investigation is further conducted. In particular, we show that the final decomposition estimator is guaranteed to achieve a local statistical rate, and further strengthen it to the global statistical rate by introducing a proper initialization procedure. In high dimensional regimes, the obtained statistical rate significantly improves those shown in the existing non-sparse decomposition methods. The empirical advantages of TTP are confirmed in extensive simulated results and two real applications of click-through rate prediction and high-dimensional gene clustering.
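The truncation step embedded in the power iteration can be illustrated with a small sketch. This is not the authors' implementation: it assumes a symmetric third-order tensor and a known sparsity level `s`, and it runs on a noise-free rank-one toy tensor. All function names and toy values are hypothetical.

```python
import math

def outer3(v, weight=1.0):
    """Build the rank-one symmetric tensor weight * v (x) v (x) v as nested lists."""
    d = len(v)
    return [[[weight * v[i] * v[j] * v[k] for k in range(d)]
             for j in range(d)] for i in range(d)]

def contract(T, u):
    """Compute T(I, u, u): contract the last two modes of T with u."""
    d = len(u)
    return [sum(T[i][j][k] * u[j] * u[k] for j in range(d) for k in range(d))
            for i in range(d)]

def truncate(v, s):
    """Keep the s largest-magnitude entries of v and set the rest to zero."""
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:s])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def normalize(v):
    """Scale v to unit Euclidean norm (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def truncated_power_iteration(T, u0, s, n_iter=20):
    """Alternate a power step, a truncation step, and a normalization step."""
    u = normalize(u0)
    for _ in range(n_iter):
        u = normalize(truncate(contract(T, u), s))
    return u

# Toy example (assumed data): a sparse unit vector with two nonzero entries.
v_true = normalize([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
T = outer3(v_true, weight=2.0)
u0 = [1.0, 1.0, 0.2, -0.1, 0.1, 0.3]  # dense starting point
u_hat = truncated_power_iteration(T, u0, s=2)
```

On this noise-free toy tensor the iterate recovers the sparse component; with noisy data the quality of the starting point matters, which is the role the abstract assigns to the initialization procedure.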


How do you know if your model is going to work? Part 4: Cross-validation techniques

#artificialintelligence

In this article we conclude our four-part series on basic model testing. When fitting and selecting models in a data science project, how do you know that your final model is good? And how sure are you that it's better than the models that you rejected? In this concluding Part 4 of our four-part mini-series "How do you know if your model is going to work?" we demonstrate cross-validation techniques. Cross-validation techniques attempt to improve statistical efficiency by repeatedly splitting data into train and test sets and re-performing model fit and model evaluation.
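The repeated split, fit, and evaluate loop can be sketched as plain k-fold cross-validation. This is a minimal illustration and not the article's own code: the helper names, the least-squares line model, and the toy data are all assumptions.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return one held-out mean squared error per fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        test = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in test]
        train_y = [y for i, y in enumerate(ys) if i not in test]
        model = fit(train_x, train_y)  # refit on the training part only
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Toy data (assumed): an exact line, so held-out error should be near zero.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
scores = cross_validate(xs, ys, fit_line, predict_line, k=5)
```

Every observation is held out exactly once, so all of the data contributes to the error estimate while each individual score still comes from data the model did not see during fitting.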


Generate Simple And Easy ROC Curve With AUC

#artificialintelligence

The ROC curve is a great way to determine how well your classifier is doing, but it can sometimes be tricky to actually generate the curve because of cryptic software instructions in languages such as R. In this tutorial we'll be using a few lines of Python code to generate an interactive ROC graph. To make this super simple we'll be using a Python install that has all of the statistical, mathematical and graphical packages already included. This will go much more smoothly if you start off by uninstalling Python from your system if you already have it installed. Once you've done that, head on over to https://www.continuum.io/downloads and download the Anaconda Python package, which contains absolutely everything we need.
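For readers who want to see what is being computed before plotting anything, here is a small pure-Python sketch of the ROC points and the AUC. It is not the tutorial's code (which relies on packages from the Anaconda distribution); the function names and the four-sample toy data are assumptions.

```python
def roc_curve(labels, scores):
    """Sweep a threshold from high to low and return (fpr, tpr) point lists.

    labels are 0/1 ground-truth values; scores are classifier outputs where
    higher means "more likely positive". Tied scores share one ROC point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    prev = None
    for i in order:
        # Emit a point each time the threshold drops to a new score value.
        if prev is not None and scores[i] != prev:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        prev = scores[i]
    fpr.append(fp / neg)
    tpr.append(tp / pos)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy data (assumed): two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve(labels, scores)
area = auc(fpr, tpr)  # 0.75 for this toy data
```

The resulting `fpr` and `tpr` lists are the x and y coordinates of the curve, so they can be handed to any plotting library.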


Classification of Phishing Email Using Random Forest Machine Learning Technique

#artificialintelligence

Phishing is one of the major challenges faced by the world of e-commerce today. Phishing attacks have cost many companies and individuals billions of dollars. In 2012, an online report put the loss due to phishing attacks at about 1.5 billion. The global impact of phishing attacks will continue to increase, and more efficient phishing detection techniques are therefore required to curb the menace. This paper investigates and reports the use of the random forest machine learning algorithm in the classification of phishing attacks, with the major objective of developing an improved phishing email classifier with better prediction accuracy and fewer features. From a dataset consisting of 2000 phishing and ham emails, a set of prominent phishing email features (identified from the literature) was extracted and used by the machine learning algorithm, with a resulting classification accuracy of 99.7% and low false negative (FN) and false positive (FP) rates.


Using Word2Vec document vectors as features in Naive Bayes • /r/MachineLearning

@machinelearnbot

You could learn a discretization, or codebook, of your word2vec features. For example, you could run k-means on all of them (well, all your training word2vec features), then treat each one as a single instance of one of k words. Naive Bayes proceeds naturally from documents as histograms of these words, and you don't even have to normalize the word counts. But yeah, it's adding another step, and another parameter (k), and discretization can throw away specificity.
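A rough sketch of the codebook step, assuming tiny 2-D vectors stand in for real word2vec features. The k-means here is deliberately naive (deterministic initialization from the first k vectors), and every name and value is hypothetical.

```python
import math

def nearest(centroids, v):
    """Index of the centroid closest to v in Euclidean distance."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], v))

def k_means(vectors, k, n_iter=10):
    """Plain Lloyd iterations, initialized from the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            clusters[nearest(centroids, v)].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def document_histogram(doc_vectors, centroids):
    """Count how many of a document's vectors fall on each codeword."""
    counts = [0] * len(centroids)
    for v in doc_vectors:
        counts[nearest(centroids, v)] += 1
    return counts

# Toy "documents" (assumed data): each is a list of word vectors.
doc_a = [[0.0, 0.1], [5.0, 5.1], [0.1, 0.0]]
doc_b = [[5.1, 5.0], [4.9, 5.0], [0.0, 0.0]]

codebook = k_means(doc_a + doc_b, k=2)  # learn on training vectors only
hist_a = document_histogram(doc_a, codebook)
hist_b = document_histogram(doc_b, codebook)
```

Each histogram is a vector of k non-negative counts, which is the kind of input a multinomial Naive Bayes model expects from a bag-of-words representation.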


Practical Data Science in Python: Guidebook

@machinelearnbot

Caveat: It uses the worst possible technique for spam filtering: Naive Bayes, which is responsible for extremely poor spam filtering systems, with tons of false positives and false negatives, that are still in use today. So this is definitely not a good resource for learning data science, but it is a great tutorial for learning Python, especially since Naive Bayes is extremely easy to implement, though alternative and far better techniques, such as hidden decision trees, are almost as easy to code.


An Empirical Study into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation

arXiv.org Artificial Intelligence

Although agreement between annotators has been studied in the past from a statistical viewpoint, little work has attempted to quantify the extent to which this phenomenon affects the evaluation of computer vision (CV) object detection algorithms. Many researchers utilise ground truth (GT) in experiments and more often than not this GT is derived from one annotator's opinion. How does the difference in opinion affect an algorithm's evaluation? Four examples of typical CV problems are chosen, and a methodology is applied to each to quantify the inter-annotator variance and to offer insight into the mechanisms behind agreement and the use of GT. It is found that when detecting linear objects annotator agreement is very low. The agreement in object position, linear or otherwise, can be partially explained through basic image properties. Automatic object detectors are compared to annotator agreement and it is found that a clear relationship exists. Several methods for calculating GTs from a number of annotations are applied and the resulting differences in the performance of the object detectors are quantified. It is found that the rank of a detector is highly dependent upon the method used to form the GT. It is also found that although the STAPLE and LSML GT estimation methods appear to represent the mean of the performance measured using the individual annotations, when there are few annotations, or there is a large variance in them, these estimates tend to degrade. Furthermore, one of the most commonly adopted annotation combination methods--consensus voting--accentuates more obvious features, which results in an overestimation of the algorithm's performance. Finally, it is concluded that in some datasets it may not be possible to state with any confidence that one algorithm outperforms another when evaluating upon one GT and a method for calculating confidence bounds is discussed.


Algorithm learns to identify anomalous activity online with high degree of accuracy - The Tartan

#artificialintelligence

At the IEEE International Conference on Big Data Security in New York City this month, researchers from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the machine learning start-up PatternEx presented a paper about their new security system, which combines machine learning approaches with input from human security experts. This system, called AI2 (named by merging "artificial intelligence" and "analyst intuition"), has an 85 percent success rate in identifying threats and a false positive rate of 4.4 percent over a raw data set of 3.6 billion log lines. According to the paper, the three major challenges faced by the security industry are a lack of labelled examples on which to train learning models, constant evolution of attackers' methods, and limited reliance on security analysts to determine each threat's risk factor. In fact, stand-alone analyst-driven approaches are limited in their effectiveness because attackers learn the behavior such systems use to predict possible threats, and then work their way around that learned behavior in order to bypass security systems. Furthermore, approaches based on machine learning alone can be inefficient because they require human investigation every time they come across an anomaly.


Conversational Markers of Constructive Discussions

arXiv.org Machine Learning

Group discussions are essential for organizing every aspect of modern life, from faculty meetings to senate debates, from grant review panels to papal conclaves. While costly in terms of time and organization effort, group discussions are commonly seen as a way of reaching better decisions compared to solutions that do not require coordination between the individuals (e.g. voting): through discussion, the sum becomes greater than the parts. However, this assumption is not irrefutable: anecdotal evidence of wasteful discussions abounds, and in our own experiments we find that over 30% of discussions are unproductive. We propose a framework for analyzing conversational dynamics in order to determine whether a given task-oriented discussion is worth having or not. We exploit conversational patterns reflecting the flow of ideas and the balance between the participants, as well as their linguistic choices. We apply this framework to conversations naturally occurring in an online collaborative world exploration game developed and deployed to support this research. Using this setting, we show that linguistic cues and conversational patterns extracted from the first 20 seconds of a team discussion are predictive of whether it will be a wasteful or a productive one.


Arlot, Celisse: A survey of cross-validation procedures for model selection

@machinelearnbot

Used to estimate the risk of an estimator or to perform model selection, cross-validation is a widespread strategy because of its simplicity and its (apparent) universality. Many results exist on model selection performances of cross-validation procedures. This survey intends to relate these results to the most recent advances of model selection theory, with a particular emphasis on distinguishing empirical statements from rigorous theoretical results. As a conclusion, guidelines are provided for choosing the best cross-validation procedure according to the particular features of the problem in hand.