What dataset is the equivalent of MNIST for regression? What dataset, if any, is widely used for regression-type problems (where the label is a continuous value rather than a discrete one)? Is there such a dataset that serves as a standard?

Regression is a modeling task that involves predicting a numerical value given an input. Algorithms used for regression tasks are also referred to as "regression" algorithms, with the most widely known and perhaps most successful being linear regression. Linear regression fits a line or hyperplane that best describes the linear relationship between inputs and the target numeric value. If the data contains outlier values, the line can become biased, resulting in worse predictive performance. Robust regression refers to a suite of algorithms that are robust in the presence of outliers in training data.
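As a rough illustration of how a single outlier can bias an ordinary least-squares line while a robust method is largely unaffected, here is a minimal sketch in plain Python. The data are made up, and the Theil-Sen estimator (median of pairwise slopes) is used only as one example of a robust regression algorithm; it is an assumption of this sketch, not a method named in the text above.

```python
from itertools import combinations
from statistics import mean, median


def ols_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def theil_sen_fit(xs, ys):
    """Theil-Sen fit: median of all pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Synthetic data on the line y = 2x + 1, with the last point replaced by an outlier.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[9] = 100.0  # the uncontaminated value would be 19.0

ols_slope, ols_intercept = ols_fit(xs, ys)
ts_slope, ts_intercept = theil_sen_fit(xs, ys)

print(f"OLS:       slope={ols_slope:.2f}, intercept={ols_intercept:.2f}")
print(f"Theil-Sen: slope={ts_slope:.2f}, intercept={ts_intercept:.2f}")
```

In this toy case the least-squares slope is pulled far above 2 by the one bad point, while the median-based fit stays on the underlying line.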

To illustrate, I will use data from gapminder, conveniently provided in an R package by Jenny Bryan. Now we're going to bring physical intuition into this by imagining these points as physical objects. For example, we can interpret the mean \((\bar x, \bar y)\) as the center of mass of the points. This is the larger, blue point in the plot above. It's not really important to think of mass specifically, just that this point is the center of the physical ensemble.
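The "center of the ensemble" idea can be checked numerically: the deviations of the points from the mean cancel out in each coordinate, which is what makes the mean a balance point. The sketch below uses a few hypothetical points, not the gapminder values, and equal weights for every point.

```python
from statistics import mean

# Hypothetical (x, y) points; these are NOT the gapminder values.
points = [(1.0, 2.0), (2.0, 3.5), (4.0, 4.0), (7.0, 8.5)]

# The mean point, treated as the center of mass of equally weighted points.
x_bar = mean(x for x, _ in points)
y_bar = mean(y for _, y in points)

# Deviations from the mean sum to zero in each coordinate,
# so the points "balance" around (x_bar, y_bar).
dx_sum = sum(x - x_bar for x, _ in points)
dy_sum = sum(y - y_bar for _, y in points)

print((x_bar, y_bar), dx_sum, dy_sum)
```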

Here, we load the chocolate data into our program using pandas; we also drop two of the columns we won't be using in our calculation: competitorname and winpercent. Our y becomes the first column in the dataset, which indicates whether a given sweet is chocolate (1) or not (0). The remaining columns are used as variables/features to predict our y and thus become our X. If you're confused about why we're using …[:, 0][:, np.newaxis] on line 5, this is to turn y into a column. Indexing with [:, 0] returns a flat, one-dimensional array; adding a new axis converts it into a two-dimensional array with a single column.
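To make the reshaping step concrete, here is a small sketch using NumPy only. The array below is a hypothetical stand-in for the loaded table (its values and column count are made up), so it shows the indexing pattern rather than the original code.

```python
import numpy as np

# Hypothetical stand-in for the table after dropping the unused columns:
# column 0 is the 0/1 label, the remaining columns are features.
data = np.array([
    [1.0, 0.0, 0.73],
    [0.0, 1.0, 0.60],
    [1.0, 0.0, 0.31],
])

first_col = data[:, 0]           # 1-D array, shape (3,)
y = data[:, 0][:, np.newaxis]    # 2-D column, shape (3, 1)
X = data[:, 1:]                  # remaining columns, shape (3, 2)

print(first_col.shape, y.shape, X.shape)
```

Keeping y as a column with shape (n, 1) means it lines up row by row with X in matrix operations, instead of relying on how a flat array happens to broadcast.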

It is a good idea to have small, well-understood datasets when getting started in machine learning and learning a new tool. The Weka machine learning workbench provides a directory of small, well-understood datasets in the installed directory. In this post you will discover some of these datasets distributed with Weka, their details, and where to learn more about them. We will focus on a handful of datasets of differing types.