Performance Analysis
Multivariate data visualization
As a fraud practitioner, I use a combination of data mining and data matching techniques to detect fraud, anomalies, outliers or other indicators of potential problems. The volumes of data in a client assignment can vary widely: 15 million records of company directors, 60,000 employees, 900,000 supplier records in accounts payable data and 11 million invoice transactions. I'm not a great fan of predictive technologies as the disparate data sets don't seem to fit with the techniques, but I'm open to alternative methodologies. I've recently tested a single fraud profile using the "Receiver Operating Characteristic" (ROC) to evaluate the sensitivity and specificity of the profile. The results fell within the ROC space.
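To make the sensitivity and specificity of a single profile concrete, here is a minimal sketch in Python. The function name `roc_point` and the ten example records are hypothetical and not taken from the assignment described above; the sketch only assumes that each record has a known outcome and a yes/no flag from the profile.

```python
def roc_point(actual, flagged):
    """Return (sensitivity, specificity, false positive rate) for one profile.

    actual  -- 1 if the record is a confirmed fraud, else 0
    flagged -- 1 if the fraud profile flagged the record, else 0
    """
    pairs = list(zip(actual, flagged))
    tp = sum(1 for a, f in pairs if a and f)          # frauds that were flagged
    fn = sum(1 for a, f in pairs if a and not f)      # frauds that were missed
    tn = sum(1 for a, f in pairs if not a and not f)  # clean records left alone
    fp = sum(1 for a, f in pairs if not a and f)      # clean records flagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, 1 - specificity


# Hypothetical toy data: 4 confirmed frauds, 6 clean records.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
flagged = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

sens, spec, fpr = roc_point(actual, flagged)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ROC point=({fpr:.2f}, {sens:.2f})")
```

A profile with a fixed yes/no output gives a single point in ROC space, (false positive rate, sensitivity). Points above the diagonal from (0, 0) to (1, 1) indicate a profile that does better than random flagging.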
Predicting winners of the Rugby World Cup
For the sake of brevity, not all the relevant data and code are displayed in this post; they can be found here. You can also visit the final working web application here. The Rugby World Cup (RWC) is here, with many fans around the world excited to see the action unfold over the next month and a half. If you've never heard of the sport, whatisrugby.com
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
Soleimani, Hossein, Miller, David J.
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns exhibit on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD
Machine Learning Advances Fight Against Cancer
Developing effective tools against cancer has been a long, complicated endeavor with successes and disappointments. Despite all this, cancer remains a leading cause of death worldwide. Now, machine learning and data analytics are being recruited as tools in the effort to fight the disease, and they show significant promise according to two recent papers. In one paper, "An Analytics Approach to Designing Combination Chemotherapy Regimens for Cancer," researchers from MIT and Stanford "propose models that use machine learning and optimization to suggest regimens to be tested in phase II and phase III trials." Their work, published in March in Management Science, could help cut costs and speed clinical trials.
Machine Unlearning: The Value of Imperfect Models
A project manager once told me that "any job worth doing is worth doing poorly." I understood exactly what she meant, and she knew that I would understand, especially when she preceded our conversation with these words: "I wouldn't say this to everyone, but I know you will understand what I mean." The message was clear to me because I was a perfectionist (and hopefully I have learned over the years to be less of a perfectionist thanks to my project manager's wise counsel). As a perfectionist, I would strive for 100% completion and perfection on every project, every analysis, and every report. It would take me longer than most people to finish the analysis and report, and my manager understood why.
Biologically Inspired Radio Signal Feature Extraction with Sparse Denoising Autoencoders
Migliori, Benjamin, Zeller-Townson, Riley, Grady, Daniel, Gebhardt, Daniel
Automatic modulation classification (AMC) is an important task for modern communication systems; however, it is a challenging problem when signal features and precise models for generating each modulation may be unknown. We present a new biologically-inspired AMC method without the need for models or manually specified features, thus removing the requirement for expert prior knowledge. We accomplish this task using regularized stacked sparse denoising autoencoders (SSDAs). Our method selects efficient classification features directly from raw in-phase/quadrature (I/Q) radio signals in an unsupervised manner. These features are then used to construct higher-complexity abstract features which can be used for automatic modulation classification. We demonstrate this process using a dataset generated with a software defined radio, consisting of random input bits encoded in 100-sample segments of various common digital radio modulations. Our results show correct classification rates of > 99% at 7.5 dB signal-to-noise ratio (SNR) and > 92% at 0 dB SNR in a 6-way classification test. Our experiments demonstrate a dramatically new and broadly applicable mechanism for performing AMC and related tasks without the need for expert-defined or modulation-specific signal information.
Advanced analytics, big data, predictive modelling, deep learning
The purpose of testing analysis in predictive analytics is to compare the response of a predictive model against the actual target values in an independent testing set. There are several techniques that can be used for testing the performance of a predictive model. The receiver operating characteristic (ROC) curve is one of the most useful testing methods for binary classification problems, since it provides a comprehensive and visually attractive way to summarize the accuracy of predictions. By moving the decision threshold, we change the number of instances classified as positives and negatives. If the score of an instance is greater than or equal to the threshold, then it will be classified as positive.
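The threshold sweep described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the article's own implementation: the names `roc_points` and `auc` and the six example scores are hypothetical, and the area is computed with the plain trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) triples.

    Every distinct score is tried as a decision threshold; an instance is
    classified as positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve (trapezoidal rule), starting from (0, 0)."""
    curve = [(0.0, 0.0)] + [(fpr, tpr) for _, fpr, tpr in points]
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))


# Hypothetical model scores and true labels from a testing set.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC={auc(points):.3f}")
```

Lowering the threshold can only add positives, so both rates are non-decreasing along the sweep and the curve runs from near (0, 0) to (1, 1).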
Classification of Big Data with Application to Imaging Genetics
Ulfarsson, Magnus O., Palsson, Frosti, Sigurdsson, Jakob, Sveinsson, Johannes R.
Recent technological achievements and globalization have increased data acquisition capability in almost all corners of human activities, ranging from scientific and engineering endeavors such as genomics, medical imaging, remote sensing, economics and finance, and all the way to people's personal lives with the emergence of social media through the world wide web and mobile networks. The enormous growth of data creates daunting challenges, not only in finding out how to store and access the data, but more importantly, how to process and make sense of it. Also, since data collection is expensive, we are somehow obliged to make good use of the data at hand, so it is obvious that for further progress, the development of efficient algorithms for processing big data is very important. Big data is usually considered in terms of the number of observations n and the number of variables p measured on each observation. In many branches of science such as genetics and medical imaging, the number of variables is very large and is often much larger than the number of observations. This scenario is often denoted as p ≫ n.
How to test classifier better than chance using k-fold cross-validation? • /r/MachineLearning
I have 400 units and 10 groups, and I'm classifying the units' group membership using a discriminant function analysis or linear discriminant analysis. During cross-validation, I want to test that my solution is doing a better job at classifying them than chance (10%). I can get an error rate, but don't know how to statistically compare. With the hold-out approach, I can test it using Press' Q statistic or Maximum Chance Criterion. But with k-fold I don't think I can use this approach.
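One possible approach, sketched here as an assumption and not as an answer from the thread: pool the held-out predictions from all k folds, count how many units were classified correctly, and compute an exact one-sided binomial tail probability against the 10% chance rate. The helper name `binomial_tail` and the figure of 80 correct classifications are hypothetical. Pooled fold predictions are not strictly independent, because the fold models share training data, so the p-value is approximate; a label-permutation test is a common alternative.

```python
from math import comb


def binomial_tail(successes, trials, chance):
    """P(X >= successes) for X ~ Binomial(trials, chance).

    This is the one-sided p-value for observing at least `successes`
    correct classifications if the classifier were guessing at `chance`.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_units = 400        # from the question
chance_rate = 0.10   # 10 equally likely groups
n_correct = 80       # hypothetical: correct held-out predictions pooled over folds

p_value = binomial_tail(n_correct, n_units, chance_rate)
print(f"accuracy={n_correct / n_units:.2%}  p-value vs. chance={p_value:.3g}")
```

If the groups are unequal in size, the chance rate should be replaced by a baseline that reflects the group proportions, for example the proportion of the largest group.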
Arimo Predictive Engine (tm) Shows Opportunity to Improve Investor Returns in Peer-to-Peer Lending - Arimo
Random forest model using Lending Club public dataset shows opportunity to improve adjusted return by 2.75%. Arimo recently performed a study using a public dataset provided by Lending Club with the goal of showing how machine learning could improve investor returns. To do this we used the PredictiveEngine component of our Data Intelligence Platform, which makes it easy to build a variety of predictive machine learning models that scale transparently when deployed on distributed parallel computing platforms. Lending Club is an online peer-to-peer lending company that connects borrowers with investors who have capital to lend. When a loan application is submitted by a borrower, Lending Club reviews it and decides whether to offer a loan at a risk-adjusted rate or to reject the application. As of the 3rd quarter of 2015, more than $12 billion in loans have been issued through Lending Club.