Accuracy
4 Reasons Your Machine Learning Model is Wrong (and How to Fix It)
There are a number of machine learning models to choose from. We can use Linear Regression to predict a value, Logistic Regression to classify distinct outcomes, and Neural Networks to model non-linear behaviors. When we build these models, we always use a set of historical data to help our machine learning algorithms learn the relationship between a set of input features and a predicted output. But even if this model can accurately predict a value from historical data, how do we know it will work as well on new data? Or more plainly, how do we evaluate whether a machine learning model is actually "good"?
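One common way to approach the question of performance on new data is to hold out part of the historical data and score the model only on that part. The sketch below is illustrative and not taken from the article: the synthetic data, the 150/50 split, and the helper names `fit_line` and `mse` are all assumptions.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic "historical" data (assumption): y = 3x + 2 plus unit Gaussian noise.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]

# Hold out the last 50 points; the model never sees them during fitting.
train_x, train_y = xs[:150], ys[:150]
test_x, test_y = xs[150:], ys[150:]

slope, intercept = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, slope, intercept)
test_mse = mse(test_x, test_y, slope, intercept)

print(f"train MSE: {train_mse:.3f}")
print(f"test MSE:  {test_mse:.3f}")
```

A test error far above the training error would suggest the model has memorized the historical data rather than learned a relationship that carries over to new data.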
Robust Contextual Outlier Detection: Where Context Meets Sparsity
Liang, Jiongqian, Parthasarathy, Srinivasan
Outlier detection is a fundamental data science task with applications ranging from data cleaning to network security. Given the fundamental nature of the task, this has been the subject of much research. Recently, a new class of outlier detection algorithms has emerged, called contextual outlier detection, and has shown improved performance when studying anomalous behavior in a specific context. However, as we point out in this article, such approaches have limited applicability in situations where the context is sparse (i.e. lacking a suitable frame of reference). Moreover, approaches developed to date do not scale to large datasets. To address these problems, here we propose a novel and robust alternative to the state of the art called RObust Contextual Outlier Detection (ROCOD). We utilize a local and global behavioral model based on the relevant contexts, which is then integrated in a natural and robust fashion. We also present several optimizations to improve the scalability of the approach. We run ROCOD on both synthetic and real-world datasets and demonstrate that it outperforms other competitive baselines on the axes of efficacy and efficiency (40X speedup compared to modern contextual outlier detection methods). We also drill down and perform a fine-grained analysis to shed light on the rationale for the performance gains of ROCOD and reveal its effectiveness when handling objects with sparse contexts.
50 Questions to Test True Data Science Knowledge
Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and LASSO? Explain what a local optimum is and why it is important in a specific context, such as k-means clustering. What are specific ways for determining if you have a local optimum problem? What can be done to avoid local optima?
4 trends in security data science
In 2015, we saw graphs dominate security data science. Graphs permeated all areas--everything from visualizations to graphical inference. It's quite easy to write about security trends for 2016--the hard part is trying to interpret what the trends could potentially mean to organizations on a day-to-day basis. This article is not the wishlist of a deluded security data scientist. Rather, these are strategic trends that you can expect to see in the field, mixed with tactical steps to capitalize on them.
Student and Faculty Guide – 10 easy steps to get up and running with Azure Machine Learning
My colleague Amy Nicholson is the UK expert on Azure Machine Learning. The following blog post came out of a quizzing session on how to get started with Azure Machine Learning. Each student receives $100 of Azure credit per month for 6 months. The faculty member receives $250 per month for 12 months. The Azure Machine Learning team provided a very nice walkthrough tutorial which covers a lot of the basics. This tutorial is really useful as it takes you through the entire process of creating an AzureML workspace, uploading data, creating an experiment to predict someone's credit risk, building, training, and evaluating the models, publishing your best model as a web service, and calling that web service. Now you need to learn how to import a data set into Azure Machine Learning, and where to find interesting data to build something amazing.
109 Commonly Asked Data Science Interview Questions
What is the Central Limit Theorem and why is it important? How many sampling methods do you know? What is the difference between a Type I and a Type II error? What do the terms P-value, coefficient, and R-Squared value mean? What is the significance of each of these components? What are the assumptions required for linear regression? There are four major assumptions: 1. There is a linear relationship between the variables, meaning the model you are creating actually fits the data, 2. The errors or residuals of the data are normally distributed and independent from each other, 3. There is minimal multicollinearity between explanatory variables, and 4. Homoscedasticity. This means the variance around the regression line is the same for all values of the predictor variable. What is an example of a dataset with a non-Gaussian distribution?
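The homoscedasticity assumption listed above can be checked crudely by comparing the spread of the residuals at low and high values of the predictor. The sketch below is only an illustration under assumed synthetic data; `variance_ratio` is a hypothetical helper, not a standard statistical test, and a formal test would be preferable in practice.

```python
import random


def ols_residuals(xs, ys):
    """Fit y = slope * x + intercept by least squares and return the residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def variance_ratio(xs, ys):
    """Mean squared residual in the upper half of x divided by the lower half.

    Values near 1 are consistent with constant variance; values far from 1
    hint that the variance changes with the predictor.
    """
    pairs = sorted(zip(xs, ols_residuals(xs, ys)))  # order residuals by x
    half = len(pairs) // 2
    low = sum(r * r for _, r in pairs[:half]) / half
    high = sum(r * r for _, r in pairs[half:]) / (len(pairs) - half)
    return high / low


rng = random.Random(1)  # fixed seed so the sketch is repeatable
xs = [rng.uniform(0, 10) for _ in range(400)]

# Constant noise: the assumption holds.
ys_steady = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]
# Noise that grows with x: the assumption is violated.
ys_growing = [2.0 * x + 1.0 + rng.gauss(0, 0.5 * x) for x in xs]

ratio_steady = variance_ratio(xs, ys_steady)
ratio_growing = variance_ratio(xs, ys_growing)
print(f"constant noise ratio: {ratio_steady:.2f}")
print(f"growing noise ratio:  {ratio_growing:.2f}")
```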
Optimal tuning for divide-and-conquer kernel ridge regression with massive data
Xu, Ganggang, Shang, Zuofeng, Cheng, Guang
We propose a first data-driven tuning procedure for divide-and-conquer kernel ridge regression (Zhang et al., 2015). While the proposed criterion is computationally scalable for massive data sets, it is also shown to be asymptotically optimal under mild conditions. The effectiveness of our method is illustrated by extensive simulations and an application to the Million Song Dataset. Some key words: Distributed GCV, divide-and-conquer, kernel ridge regression, optimal tuning.
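For readers unfamiliar with the divide-and-conquer idea, the sketch below shows its general shape: split the data into disjoint partitions, fit a kernel ridge regression on each, and average the local predictions. It is a minimal illustration under assumptions, not the authors' implementation. The regularization parameter `lam` is fixed by hand, so the distributed GCV tuning criterion proposed in the paper is not reproduced, and the Gaussian kernel, the synthetic data, and all function names are hypothetical.

```python
import math
import random


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (a - b) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_krr(xs, ys, lam):
    """Kernel ridge regression: solve (K + n * lam * I) alpha = y."""
    n = len(xs)
    K = [[rbf(xi, xj) + (n * lam if i == j else 0.0)
          for j, xj in enumerate(xs)] for i, xi in enumerate(xs)]
    alpha = solve(K, ys)
    return lambda x: sum(a * rbf(x, xi) for a, xi in zip(alpha, xs))


def fit_dc_krr(xs, ys, lam, m):
    """Fit KRR on m disjoint partitions and average the local predictors."""
    models = [fit_krr(xs[k::m], ys[k::m], lam) for k in range(m)]
    return lambda x: sum(f(x) for f in models) / m


# Synthetic data (assumption): noisy samples of sin(x) on [0, 2*pi].
rng = random.Random(0)
xs = [rng.uniform(0, 2 * math.pi) for _ in range(200)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

predict = fit_dc_krr(xs, ys, lam=1e-3, m=4)  # lam chosen by hand, not tuned
for x in (math.pi / 2, math.pi, 3 * math.pi / 2):
    print(f"x = {x:.2f}  prediction = {predict(x):+.3f}  truth = {math.sin(x):+.3f}")
```

Each partition here needs only a 50 by 50 linear solve instead of one 200 by 200 solve, which is the source of the computational savings on massive data sets.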