Basic Data Cleaning for Machine Learning (That You Must Perform)


Data cleaning is a critically important step in any machine learning project. In tabular data, there are many different statistical analysis and data visualization techniques you can use to explore your data in order to identify data cleaning operations you may want to perform. Before jumping to the sophisticated methods, there are some very basic data cleaning operations that you probably should perform on every single machine learning project. These are so basic that they are often overlooked by seasoned machine learning practitioners, yet are so critical that if skipped, models may break or report overly optimistic performance results. In this tutorial, you will discover basic data cleaning you should always perform on your dataset.
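The abstract does not list the operations it has in mind, so the following is only a minimal sketch of two checks that are commonly treated as basic cleaning: dropping exact duplicate rows (which can leak between train and test splits and inflate reported performance) and dropping columns that hold a single value. The toy DataFrame and its column names are hypothetical, and pandas is an assumed choice of library.

```python
import pandas as pd

# Hypothetical toy dataset: one exact duplicate row and one constant column.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "constant": [0, 0, 0, 0],
    "value": [1.5, 2.0, 2.0, 3.1],
})

# Remove exact duplicate rows; duplicates can end up on both sides of a
# train/test split and make evaluation look better than it is.
deduped = df.drop_duplicates()

# Find columns with at most one distinct value; they carry no signal.
single_valued = [col for col in deduped.columns if deduped[col].nunique() <= 1]
cleaned = deduped.drop(columns=single_valued)

print(f"Dropped {len(df) - len(deduped)} duplicate row(s)")
print(f"Dropped single-value columns: {single_valued}")
print(cleaned)
```

Whether a duplicate row is an error or a legitimate repeated observation depends on the dataset, so a check like this is best treated as a prompt for inspection rather than an automatic rule.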

What is Apache Spark? The big data platform that crushed Hadoop


Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities are key to the worlds of big data and machine learning, which require the marshalling of massive computing power to crunch through large data stores. Spark also takes some of the programming burdens of these tasks off the shoulders of developers with an easy-to-use API that abstracts away much of the grunt work of distributed computing and big data processing. From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing.

A Feature Selection Tool for Machine Learning in Python


Feature selection, the process of finding and selecting the most useful features in a dataset, is a crucial step of the machine learning pipeline. Unnecessary features decrease training speed, decrease model interpretability, and, most importantly, decrease generalization performance on the test set. Frustrated by the ad-hoc feature selection methods I found myself applying over and over again for machine learning problems, I built a class for feature selection in Python available on GitHub. In this article we will walk through using the FeatureSelector on an example machine learning dataset. We'll see how it allows us to rapidly implement these methods, allowing for a more efficient workflow.

The 4 steps necessary before fitting a machine learning model


There are many steps in a common machine learning pipeline, and much thought goes into architecting it: problem definition, data acquisition, error detection, data cleaning, and so on. In this story, we begin with the assumption that we have a clean, ready-to-go dataset. With that in mind, we outline the four steps necessary before fitting any machine learning model. We then implement those steps in PyTorch, using a common syntax for invoking multiple method calls: method chaining.

Regression with Deep Learning for Sensor Performance Optimization Machine Learning

Neural networks with at least two hidden layers are called deep networks. Recent developments in AI and in computer programming in general have led to the development of tools such as TensorFlow, Keras, and NumPy, making it easier to model and draw conclusions from data. In this work, we re-approach non-linear regression with deep learning enabled by Keras and TensorFlow. In particular, we use deep learning to parametrize a non-linear multivariate relationship between the inputs and outputs of an industrial sensor, with the intent of optimizing the sensor's performance based on selected key metrics.

Predicting UFC Fights With Machine Learning


As a fan of MMA, I often find myself trying to predict the outcome of fights on an upcoming fight card. The problem is that fighting, by its nature, can be very unpredictable. More so than even boxing, the outcome of an MMA fight can change in a split second, but of course that's what makes it so interesting. All the same, I wondered if there was a way to apply modern machine learning techniques to historical fight data and see how a model would perform on new fights. Of course, like any ML project, I needed data to work with.

Machine Learning: Step-By-Step


Now, how could we improve our baseline model? Using dimension reduction, we can approximate the original dataset with fewer variables while reducing the computational power needed to run our model. Using PCA, we can study the cumulative explained variance ratio of these features to understand which features explain the most variance in the data. We instantiate PCA and set the number of components (features) that we want to consider. We'll set it to 30 to see the explained variance of all the generated components before deciding where to make the cut.
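As a rough illustration of the step described above, the sketch below fits a 30-component PCA and inspects the cumulative explained variance ratio. It assumes scikit-learn's `PCA`, which the excerpt does not name. The synthetic data, the standardization step, and the 95% cutoff are all hypothetical stand-ins, not the article's dataset or threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 rows, 30 correlated features built
# from 5 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 30))

# PCA is sensitive to feature scale, so standardize first (an assumption;
# the excerpt does not say how the data was prepared).
X_scaled = StandardScaler().fit_transform(X)

# Keep all 30 components so the full variance profile is visible.
pca = PCA(n_components=30)
pca.fit(X_scaled)

# Running total of the fraction of variance explained by each component.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Hypothetical cutoff: smallest number of components reaching 95% variance.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold)) + 1

for i, total in enumerate(cumulative, start=1):
    print(f"{i:2d} components: {total:.3f}")
print(f"Components needed for {threshold:.0%} of variance: {n_keep}")
```

Plotting `cumulative` against the component count is a common way to pick the cut visually; the fixed threshold here is just one simple alternative.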

Gain State-Of-The-Art Results on Tabular Data with Deep Learning & Embedding Layers [A How To Guide]


Tree-based models like Random Forest and XGBoost have become very popular for solving tabular (structured) data problems and have gained a lot of traction in Kaggle competitions lately, for well-deserved reasons. However, in this article, I want to introduce a different approach from …

Time series modeling with Facebook Prophet


When trying to understand time series, there's so much to think about. Is it affected by seasonality? What kind of model should I use, and how well will it perform? All these questions can make time series modeling kind of intimidating, but it doesn't have to be that bad. While working on a project for my data science bootcamp recently, I tried Facebook Prophet, an open-source package for time series modeling developed by … y'know, Facebook.