Statistical Learning


Logistic Regression using Python (Sklearn, NumPy, MNIST, Handwriting Recognition, Matplotlib)

@machinelearnbot

One of the first models I learned when I started my data science journey was Logistic Regression. The name Logistic Regression is highly misleading. The image above shows a bunch of training digits (observations) from the MNIST dataset whose category membership is known (labels 0–9). After training a model with logistic regression, it can be used to predict an image label (labels 0–9) given an image.
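The workflow described above (train on labeled digit images, then predict a label for a new image) can be sketched as follows. This is a minimal illustration, not the article's code: it assumes scikit-learn is installed and uses the small 8x8 digits set bundled with scikit-learn as an offline stand-in for MNIST. The split ratio, scaler, and `max_iter` value are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled 8x8 digits data stands in for MNIST (no download needed).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Scale the pixel features, then fit a multiclass logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Predict labels (0-9) for a few held-out images and measure overall accuracy.
predicted = model.predict(X_test[:5])
accuracy = model.score(X_test, y_test)
print("predicted:", predicted, "actual:", y_test[:5])
print(f"test accuracy: {accuracy:.3f}")
```

Despite the name, the model is used here as a classifier: each prediction is one of the ten digit classes, not a continuous value.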


A Large-Scale Study of Programming Languages and Code Quality in GitHub

Communications of the ACM

In this study, we gather a very large data set from GitHub (728 projects, 63 million SLOC, 29,000 authors, 1.5 million commits, in 17 languages) in an attempt to shed some empirical light on this question. This reasonably large sample size allows us to use a mixed-methods approach, combining multiple regression modeling with visualization and text analytics, to study the effect of language features such as static versus dynamic typing and allowing versus disallowing type confusion on software quality. By triangulating findings from different methods, and controlling for confounding effects such as team size, project size, and project history, we report that language design does have a significant, but modest effect on software quality. We also calculate other project-related statistics, including maximum commit age of a project and the total number of developers, used as control variables in our regression model, and discussed in Section 3.


Classical Statistics and Statistical Learning in Imaging Neuroscience (PDF Download Available)

#artificialintelligence

All dimensions in the brain data (i.e., voxel variables) are ... This is where random field theory comes to the rescue. For instance, signals from "brain regions" are ...


Classification with Scikit-Learn

#artificialintelligence

With the dataset split into training and test sets, we can start building a classification model. In practice, classifiers like Random Forest and Gradient Boosting perform best for most datasets and challenges on Kaggle (that does not mean you should rule out all other classifiers). Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers. For datasets where this is not the case, we can play around with the features in the dataset, add extra features from additional datasets, or change the parameters of the classifiers in order to improve the accuracy.
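The 70/30 split and the batch comparison can be sketched as below. This is a hedged illustration rather than the article's code: the data is synthetic (generated with `make_classification`), and the five classifiers shown are an assumed subset, not the article's list of eight.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the article's dataset.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An illustrative batch of commonly used classifiers.
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Train each classifier and record its accuracy on the held-out test set.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:20s} {score:.3f}")
```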


Predicting Portland Home Prices

#artificialintelligence

Predicting Portland home prices allowed me to do this because I was able to incorporate various web scraping techniques, natural language processing on text, deep learning models on images, and gradient boosting into tackling the problem. The Zillow metadata contained the descriptors you would expect - square footage, neighborhood, year built, etc. Okay, now that I was confident that my image model was doing a good job, I was ready to combine the Zillow metadata, realtor description word matrix, and the image feature matrix into one matrix and then implement gradient boosting in order to predict home prices. Incorporating the images into my model immediately dropped that error by $20 K. Adding in the realtor description to that dropped it by another $10 K. Finally, adding in the Zillow metadata lowered the mean absolute error to approximately $71 K. Perhaps you are wondering how well the Zillow metadata alone would do in predicting home prices?
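The step of combining the three feature sources into one matrix and fitting a gradient boosting model could look roughly like this. Everything here is an assumption made for illustration: the arrays are random synthetic stand-ins for the Zillow metadata, the description word matrix, and the image feature matrix, the toy price formula is invented, and scikit-learn's `GradientBoostingRegressor` is used because the excerpt does not name a library.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_homes = 300

# Hypothetical stand-ins for the three feature sources (synthetic values).
metadata = rng.normal(size=(n_homes, 4))                              # e.g. square footage, year built
description_words = rng.poisson(0.3, size=(n_homes, 20)).astype(float)  # word-count matrix
image_features = rng.normal(size=(n_homes, 8))                        # features from an image model

# Combine the sources column-wise into a single feature matrix.
features = np.hstack([metadata, description_words, image_features])

# Synthetic target (toy units) that depends on a few of the columns.
price = 400 + 50 * metadata[:, 0] + 20 * image_features[:, 0] + rng.normal(0, 10, n_homes)

X_train, X_test, y_train, y_test = train_test_split(
    features, price, test_size=0.25, random_state=0
)

# Fit gradient boosting and report the mean absolute error on held-out homes.
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"mean absolute error: {mae:.1f}")
```

Dropping one block from the `np.hstack` call and refitting is a simple way to see how much each feature source contributes to the error, which is the kind of comparison the excerpt reports.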


Building a Logistic Regression model from scratch

#artificialintelligence

Here is an extremely simple logistic problem. Logistic regression is an estimation of the logit function. Following are the first and second derivatives of the log-likelihood function. Here is a recap of the Newton-Raphson method.
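A minimal from-scratch sketch of this approach, assuming the standard formulation: with fitted probabilities p = sigmoid(Xβ), the first derivative of the log-likelihood is Xᵀ(y − p), the second derivative is −XᵀWX with W = diag(p(1 − p)), and each Newton-Raphson step is β ← β − H⁻¹g. The function name, the synthetic data, and the true coefficients are hypothetical choices for illustration, not taken from the article.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression coefficients with Newton-Raphson (hypothetical helper)."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                # first derivative of the log-likelihood
        hessian = -(X.T * (p * (1 - p))) @ X    # second derivative: -X^T W X
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                      # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Synthetic one-feature problem with known coefficients (assumed for the demo).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
true_beta = np.array([-0.5, 2.0])
y = (rng.random(500) < sigmoid(true_beta[0] + true_beta[1] * x[:, 0])).astype(float)

beta_hat = fit_logistic_newton(x, y)
print("estimated coefficients:", beta_hat)
```

Because the log-likelihood is concave, Newton-Raphson usually converges in a handful of iterations. It can fail when the classes are perfectly separable, since the maximum-likelihood estimate then does not exist.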


A Solution to Missing Data: Imputation Using R

@machinelearnbot

If the missing values are not MAR or MCAR then they fall into the third category of missing values known as Not Missing At Random, otherwise abbreviated as NMAR. The package provides four different methods to impute values with the default model being linear regression for continuous variables and logistic regression for categorical variables. In R, I will use the NHANES dataset (National Health and Nutrition Examination Survey data by the US National Center for Health Statistics). The NHANES data is a small dataset of 25 observations, each having 4 features - age, bmi, hypertension status and cholesterol level.
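The idea of imputing a continuous variable with a linear regression model can be sketched in Python with NumPy. This is not the R package's algorithm: it is a single, deterministic regression imputation on synthetic data, whereas multiple-imputation packages draw several plausible values per missing entry. The variable names `age` and `bmi` mirror the dataset's features, but the values are generated at random and are not NHANES data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 25 observations, with some bmi values missing.
age = rng.uniform(20, 70, size=25)
bmi = 22 + 0.1 * age + rng.normal(0, 2, size=25)
bmi[[3, 8, 15, 21]] = np.nan

# Fit a linear regression of bmi on age using only the observed rows.
observed = ~np.isnan(bmi)
design = np.column_stack([np.ones(observed.sum()), age[observed]])
coef, *_ = np.linalg.lstsq(design, bmi[observed], rcond=None)

# Fill each missing bmi with its regression prediction; observed values are untouched.
bmi_imputed = bmi.copy()
bmi_imputed[~observed] = coef[0] + coef[1] * age[~observed]
print(bmi_imputed[~observed])
```

For a categorical variable, the same pattern applies with a logistic regression in place of the least-squares fit.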


Deep Learning Prerequisites: Linear Regression in Python

@machinelearnbot

This course teaches you about one popular technique used in machine learning, data science and statistics: linear regression. Linear regression is the simplest machine learning model you can learn, yet there is so much depth that you'll be returning to it for years to come. We will apply multi-dimensional linear regression to predicting a patient's systolic blood pressure given their age and weight. If you want more than just a superficial look at machine learning models, this course is for you.
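As a rough illustration of the multi-dimensional case, the sketch below fits systolic blood pressure against age and weight by ordinary least squares. The data is synthetic and the generating coefficients (90, 0.5, 0.3) are invented for the demo; this is not the course's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic patients: age in years, weight in kg (assumed ranges).
age = rng.uniform(20, 80, n)
weight = rng.uniform(50, 110, n)
blood_pressure = 90 + 0.5 * age + 0.3 * weight + rng.normal(0, 5, n)

# Design matrix with a bias column, then solve the least-squares problem.
X = np.column_stack([np.ones(n), age, weight])
w, *_ = np.linalg.lstsq(X, blood_pressure, rcond=None)

# R-squared: fraction of variance in blood pressure explained by the fit.
predictions = X @ w
residual = np.sum((blood_pressure - predictions) ** 2)
total = np.sum((blood_pressure - blood_pressure.mean()) ** 2)
r_squared = 1 - residual / total

print("intercept, age, weight coefficients:", w)
print(f"R^2: {r_squared:.3f}")
```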


Top 3 free online courses for Artificial Intelligence and Machine Learning

#artificialintelligence

Artificial Intelligence deals with the understanding of machines and programming them to do tasks autonomously as well as helping them get smarter. The course will teach you the fundamentals of Artificial Intelligence with insights on search, simulated annealing, logical planning and more. The course is for those who have tried machine learning and data science but are having trouble putting the ideas down in code. You will learn about NumPy (the fundamental computing package for Python), where you will explore complex mathematical functions that can be performed in Python.