 Performance Analysis


Discovering Human and Machine Readable Descriptions of Malware Families

AAAI Conferences

While an immense amount of work has gone into novel clustering algorithms, little work has focused on developing compact, domain-specific explanations for the results of the clustering algorithms. Attaching semantic meaning to a cluster has numerous benefits, including the ability for such a description to be both human and machine readable. In this paper, we assume that the clusters are given to us, and find the minimal set of features that can differentiate one cluster from the remaining set of samples. We formulate this problem as an integer linear program. By using samples not belonging to the cluster in the optimization formulation, the resulting description will be minimal and contain no false positives. The efficacy of this method is demonstrated on simulation data and real-world malware data run in a sandbox that collects behavioral characteristics. In the case of malware, once it has been clustered, it would have been sent to a reverse engineer who would have been tasked with creating the actual meaning of the clustering results and disseminating this information through signatures or indicators of compromise. This is a time-consuming process that can take hours to weeks depending on the complexity of the malware family. The methods presented in this paper automatically generate optimal signatures, which can then be quickly propagated to help contain the spread of a malware family.
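The idea of a minimal, false-positive-free description can be illustrated with a small sketch. It assumes binary features and a conjunctive signature (every cluster sample has all selected features, and no outside sample has all of them), and it uses exhaustive search in place of the integer linear program described in the abstract. The function name `minimal_signature` and the feature names are hypothetical.

```python
from itertools import combinations


def minimal_signature(cluster, others):
    """Return a smallest feature set shared by every cluster sample that no
    outside sample fully contains, or None if no such set exists.

    Brute-force stand-in for an ILP; exponential in the number of shared features.
    """
    if not cluster:
        return None
    # Only features present in every cluster member can be part of a
    # conjunctive signature without missing a member.
    common = sorted(set.intersection(*(set(s) for s in cluster)))
    outside = [set(o) for o in others]
    # Try candidate signatures in order of increasing size, so the first hit is minimal.
    for size in range(len(common) + 1):
        for combo in combinations(common, size):
            signature = set(combo)
            # No false positives: every outside sample must lack at least one feature.
            if all(not signature <= o for o in outside):
                return signature
    return None


# Hypothetical behavioral features observed in a sandbox.
cluster = [{"a", "b", "c"}, {"a", "b", "d"}]
others = [{"a", "c"}, {"b", "d"}, {"c", "d"}]
signature = minimal_signature(cluster, others)
```

Here neither `a` nor `b` alone excludes all outside samples, so the smallest signature is the pair. An ILP solver would replace the exhaustive loop for realistic feature counts.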


Validation of Matching

arXiv.org Machine Learning

Our matching problem setting is similar to the transductive setting for classification, from Vapnik [9], where there is a set of training examples with known inputs and class labels and a set of working examples with known inputs and unknown class labels, and the goal is to use the available training and working data to develop a classifier that classifies the working examples with a low error rate. For results on validation of network classifiers (rather than reconciliation algorithms) in transductive settings, refer to [10] and [11]. For theory and insight on why collective classification succeeds in general settings and validation methods for it, refer to [12]. For network reconciliation, we assume that we know some network data, consisting of some node data and the links, for both networks involved in the matching, and our goal is to use that network data to match nodes as accurately as possible between the networks. This paper presents a technique to compute probably approximately correct (PAC) bounds on the precision and recall of matching algorithms.
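To give a feel for what a PAC-style statement about precision looks like, the sketch below computes a generic one-sided Hoeffding lower bound. This is not the technique from the paper, whose bounds are not reproduced in the abstract. It assumes that `n` proposed matches were sampled uniformly at random and checked by hand, and that `correct` of them were right; the function name is hypothetical.

```python
import math


def hoeffding_lower_bound(correct, n, delta):
    """Lower bound on the true precision that holds with probability >= 1 - delta,
    assuming the n validated matches are an i.i.d. uniform sample."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    # One-sided Hoeffding: P(empirical - true >= eps) <= exp(-2 * n * eps^2).
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return max(0.0, correct / n - eps)


# 90 of 100 sampled matches correct, 95% confidence.
bound = hoeffding_lower_bound(90, 100, 0.05)
```

The same form applies to recall if the sample is drawn from the true matches rather than from the proposed ones.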


7 Important Model Evaluation Error Metrics Everyone should know

#artificialintelligence

Predictive modeling works on the principle of constructive feedback. Get feedback from metrics, make improvements, and continue until you achieve a desirable accuracy. Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results. Many practitioners, once they have finished building a model, hurriedly map predicted values onto unseen data. This is an incorrect approach. Simply building a predictive model is not your motive; the goal is to create and select a model that gives high accuracy on out-of-sample data.


Dissimilarity-based Sparse Subset Selection

arXiv.org Machine Learning

Finding an informative subset of a large collection of data points or models is at the center of many problems in computer vision, recommender systems, bio/health informatics as well as image and natural language processing. Given pairwise dissimilarities between the elements of a 'source set' and a 'target set', we consider the problem of finding a subset of the source set, called representatives or exemplars, that can efficiently describe the target set. We formulate the problem as a row-sparsity regularized trace minimization problem. Since the proposed formulation is, in general, NP-hard, we consider a convex relaxation. The solution of our optimization finds representatives and the assignment of each element of the target set to each representative, hence obtaining a clustering. We analyze the solution of our proposed optimization as a function of the regularization parameter. We show that when the two sets jointly partition into multiple groups, our algorithm finds representatives from all groups and reveals clustering of the sets. In addition, we show that the proposed framework can effectively deal with outliers. Our algorithm works with arbitrary dissimilarities, which can be asymmetric or violate the triangle inequality. To efficiently implement our algorithm, we consider an Alternating Direction Method of Multipliers (ADMM) framework, which results in quadratic complexity in the problem size. We show that the ADMM implementation allows the algorithm to be parallelized, hence further reducing the computational time. Finally, by experiments on real-world datasets, we show that our proposed algorithm improves the state of the art on the two problems of scene categorization using representative images and time-series modeling and segmentation using representative models.


Predicting Flights Delay Using Supervised Learning, Logistic Regression

@machinelearnbot

In this post, we'll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I would like to offer condolences to the families of the victims of the Germanwings tragedy. Note: This is a common data set in the machine learning community for testing algorithms and models, given that it is publicly available and sizable. In this blog, we will look at a small sample snapshot (2,201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large-scale predictive analytics and data mining.


Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

arXiv.org Machine Learning

This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.


Fanguard: Catching Star Wars surprises and other spoilers with Machine Learning

#artificialintelligence

Ruth Toner: Data Scientist, Twitch; Insight Fellow, 2016; Physics Postdoc, Harvard University. Ruth Toner was a Fellow in our most recent Data Science session in Silicon Valley. She's since joined the Community team at Twitch as a Data Scientist. In this post she describes Fanguard, the tool she built at Insight to protect Tumblr readers from spoilers for blockbuster movies and popular TV shows. Before attending Insight Data Science, I spent eight years of my life in the field of particle physics. Like many postdocs and grad students, when I wasn't trying to discover the basic laws of matter (i.e., debugging my code), I spent a lot of time surfing the Internet.


A U-statistic Approach to Hypothesis Testing for Structure Discovery in Undirected Graphical Models

arXiv.org Machine Learning

Structure discovery in graphical models is the determination of the topology of a graph that encodes conditional independence properties of the joint distribution of all variables in the model. For some classes of probability distributions, an edge between two variables is present if and only if the corresponding entry in the precision matrix is non-zero. For a finite sample estimate of the precision matrix, entries close to zero may be due to low sample effects, or due to an actual association between variables; these two cases are not readily distinguishable. Fisher provided a hypothesis test based on a parametric approximation to the distribution of an entry in the precision matrix of a Gaussian distribution, but this may not provide valid upper bounds on $p$-values for non-Gaussian distributions. Many related works on this topic consider potentially restrictive distributional or sparsity assumptions that may not apply to a data sample of interest, and direct estimation of the uncertainty of an estimate of the precision matrix for general distributions remains challenging. Consequently, we make use of results for $U$-statistics and apply them to the covariance matrix. By probabilistically bounding the distortion of the covariance matrix, we can apply Weyl's theorem to bound the distortion of the precision matrix, yielding a conservative but sound test threshold for a much wider class of distributions than considered in previous works. The resulting test enables one to answer with statistical significance whether an edge is present in the graph, and convergence results are known for a wide range of distributions. The computational complexity is linear in the sample size, enabling the application of the test to large data samples for which computation time becomes a limiting factor. We experimentally validate the correctness and scalability of the test on multivariate distributions for which the distributional assumptions of competing tests result in underestimates of the false positive ratio. By contrast, the proposed test remains sound, promising to be a useful tool for hypothesis testing for diverse real-world problems.


Kaggle Ensembling Guide

#artificialintelligence

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions. For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending. I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself. "This is how you win ML competitions: you take other people's work and ensemble them together." The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods -- no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up. Let's see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions. During space missions it is very important that all signals are correctly relayed. A coding solution was found in error correcting codes. The simplest error correcting code is a repetition code: relay the signal multiple times in equally sized chunks and have a majority vote. Signal corruption is a very rare occurrence and often occurs in small bursts. So then it figures that it is even rarer to have a corrupted majority vote. As long as the corruption is not completely unpredictable (has a 50% chance of occurring) then signals can be repaired. Suppose we have a test set of 10 samples. The ground truth is all positive ("1"):
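The majority-vote argument can be checked numerically with a minimal sketch. It assumes an odd number of classifiers that err independently and are each correct with the same probability; the 70% figure and the name `majority_vote_accuracy` are illustrative assumptions, not taken from the excerpt above.

```python
from math import comb


def majority_vote_accuracy(p, n_models):
    """Probability that a strict majority of n_models independent classifiers,
    each correct with probability p, is correct on a single sample."""
    if n_models % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    needed = n_models // 2 + 1  # smallest number of correct votes that forms a majority
    # Sum the binomial probabilities of having `needed` or more correct votes.
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


single = 0.7
ensemble_of_three = majority_vote_accuracy(single, 3)
ensemble_of_five = majority_vote_accuracy(single, 5)
```

Under these assumptions the vote of three models is right more often than any single model, and five models do better still. The gain shrinks as the models' errors become correlated, which is why low-correlated predictions ensemble better.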


A bit on the F1 score floor

#artificialintelligence

At the Strata Hadoop World "R Day" Tutorial (Tuesday, March 29 2016, San Jose, California), we spent some time on classifier measures derived from the so-called "confusion matrix." We repeated our usual admonition to not use "accuracy" as a project goal (business people tend to ask for it as it is the word they are most familiar with, but it usually isn't what they really want). And we worked through the usual bestiary of other metrics (precision, recall, sensitivity, specificity, AUC, balanced accuracy, and many more). We surveyed over a dozen common measures the data scientist is expected to know. While this may seem complicated, this is much better than the traditions used when trying to estimate inter-observer or tagger agreement (where there are around 100 measures, many of which combine effect size and significance, and it requires significant research to understand which measures are monotonically related to each other; see: Warrens, M. (2008).
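Several of the measures named above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `confusion_metrics` and the example counts are hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute common classifier measures from binary confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. at least one actual positive,
    one actual negative, and one predicted positive.
    """
    precision = tp / (tp + fp)      # fraction of predicted positives that are correct
    recall = tp / (tp + fn)         # also called sensitivity or true positive rate
    specificity = tn / (tn + fp)    # true negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


# Imbalanced example: 10 actual positives among 100 samples.
metrics = confusion_metrics(tp=8, fp=2, fn=2, tn=88)
```

With these counts accuracy is 0.96 while precision, recall, and F1 are all 0.8, which illustrates why accuracy alone can flatter a classifier on imbalanced data.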