A data pipeline is a series of data processing steps applied to one or more data sources. Data pipelines are important in machine learning. Traditionally, data scientists tend to work in a more "manual" way, manipulating data through files and notebooks. However, migrating a model built this way to a production environment can be difficult. Luigi is a Python package for building data pipelines.
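The core idea, a series of processing steps applied in order, can be sketched in plain Python without any framework. The step names below (`clean`, `normalize`) are illustrative, not part of Luigi's API; Luigi adds scheduling, dependency tracking, and file targets on top of this basic pattern.

```python
# A minimal sketch of a data pipeline: each step is a plain function,
# and the pipeline applies the steps in series to the input data.

def clean(records):
    """Drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def normalize(records):
    """Lowercase all string fields."""
    return [{k: v.lower() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def run_pipeline(records, steps):
    """Apply each processing step in series to the data."""
    for step in steps:
        records = step(records)
    return records

data = [{"name": "Alice", "age": 30}, {"name": None, "age": 25}]
result = run_pipeline(data, [clean, normalize])
print(result)  # [{'name': 'alice', 'age': 30}]
```

In Luigi, each step would instead be a `luigi.Task` whose `requires()` method declares the upstream steps, so the scheduler can resume a pipeline from the last successfully completed step.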

This package implements the experiments described in the paper Countering Adversarial Images Using Input Transformations. It contains implementations of adversarial attacks, defenses based on image transformations, and code for training and testing convolutional networks under adversarial attacks using our defenses. We also provide pre-trained models.
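One of the input transformations studied in this line of work, bit-depth reduction, can be sketched in a few lines of NumPy. The function name and parameters below are illustrative, not this package's actual API; the idea is that coarse quantization destroys the small pixel-level perturbations many attacks rely on.

```python
import numpy as np

def reduce_bit_depth(image, bits=3):
    """Quantize pixel values in [0, 1] down to 2**bits levels.

    Coarse quantization removes small adversarial perturbations
    at the cost of some image detail.
    """
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

image = np.array([[0.12, 0.49], [0.87, 0.99]])
print(reduce_bit_depth(image, bits=1))  # each value snapped to 0.0 or 1.0
```

In the defense setting, the transformation is applied to every input image, clean or adversarial, before it is fed to the classifier.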

Skewed data is common in data science; skewness measures a distribution's degree of asymmetry relative to a normal distribution. For example, the house prices in Kaggle's House Prices competition are right-skewed, meaning a minority of the values are very large. Why do we care if the data is skewed? If the response variable is skewed, as in the House Prices competition, the model will be trained on a much larger number of moderately priced homes and will be less likely to predict the prices of the most expensive houses successfully. The problem is analogous to training a model on imbalanced categorical classes.
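A common remedy for a right-skewed response is a log transform. The sketch below measures skewness as the third standardized moment and applies it to simulated log-normal "prices"; the distribution parameters are illustrative, not Kaggle's actual data.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment.

    Positive values indicate a right skew (a long tail of large values).
    """
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# Simulated right-skewed prices (log-normal, like many price data sets).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12, sigma=0.5, size=10_000)

print(skewness(prices))          # strongly positive: right-skewed
print(skewness(np.log(prices)))  # near 0: the log transform removes the skew
```

If the model is trained on `np.log(prices)`, its predictions must be mapped back with `np.exp` before they are interpreted as prices.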

Regression attempts to model the relationship between a dependent variable (usually denoted by Y) and one or more independent variables (usually denoted by X). Linear regression predicts a response Y on the basis of a single predictor variable X, under the assumption that there is approximately a linear relationship between X and Y. Mathematically, we can represent this relationship as Y ≈ β0 + β1X, where β0 is the intercept and β1 is the slope. Let's take the simplest possible example: just two data points. All we are trying to do when we calculate our regression line is draw a line that is as close to every point as possible.
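For a single predictor, the least-squares line has the closed form slope = cov(X, Y) / var(X) and intercept = mean(Y) − slope · mean(X). A minimal sketch (the function name is illustrative):

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares for a single predictor.

    slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope

# With only two points, the fitted line passes through both exactly.
b0, b1 = fit_line([1.0, 2.0], [3.0, 5.0])
print(b0, b1)  # 1.0 2.0  (the line y = 1 + 2x)
```

With more than two points the line generally cannot pass through every point, and least squares instead minimizes the sum of squared vertical distances from the points to the line.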