Regression
Modelling tourism demand to Spain with machine learning techniques. The impact of forecast horizon on model selection
Claveria, Oscar, Monte, Enric, Torra, Salvador
This study assesses the influence of the forecast horizon on the forecasting performance of several machine learning techniques. We compare the forecast accuracy of Support Vector Regression (SVR) to Neural Network (NN) models, using a linear model as a benchmark. We focus on international tourism demand to all seventeen regions of Spain. The SVR with a Gaussian radial basis function kernel outperforms the rest of the models for the longest forecast horizons. We also find that machine learning methods improve their forecasting accuracy with respect to linear models as forecast horizons increase. This result shows the suitability of SVR for medium and long term forecasting.
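The comparison described in this abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. The synthetic monthly series, the lag count, the horizons, and the SVR hyperparameters below are all illustrative assumptions, not the data or settings used by the authors. Each horizon gets its own direct forecasting model, and an RBF-kernel SVR is scored against a linear benchmark by out-of-sample RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR


def make_supervised(series, n_lags, horizon):
    """Pair each window of n_lags past values with the value `horizon` steps ahead."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])      # last observed index is t - 1
        y.append(series[t + horizon - 1])   # target lies `horizon` steps after t - 1
    return np.array(X), np.array(y)


def rmse_by_horizon(series, horizons=(1, 3, 6, 12), n_lags=12, n_test=36):
    """Fit one model per horizon and return the test RMSE for each model."""
    results = {}
    for h in horizons:
        X, y = make_supervised(series, n_lags, h)
        # Chronological split: the last n_test rows are held out.
        X_train, y_train = X[:-n_test], y[:-n_test]
        X_test, y_test = X[-n_test:], y[-n_test:]
        models = {
            "linear": LinearRegression(),
            # Hyperparameters are arbitrary assumptions, not tuned values.
            "svr_rbf": SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.05),
        }
        results[h] = {}
        for name, model in models.items():
            pred = model.fit(X_train, y_train).predict(X_test)
            results[h][name] = float(np.sqrt(np.mean((pred - y_test) ** 2)))
    return results


# Hypothetical monthly series: trend + yearly seasonality + noise, standardized.
rng = np.random.default_rng(0)
t = np.arange(240)
series = 0.02 * t + np.sin(2 * np.pi * t / 12) + 0.2 * rng.standard_normal(t.size)
series = (series - series.mean()) / series.std()

results = rmse_by_horizon(series)
for h, scores in results.items():
    print(h, scores)
```

Which model comes out ahead on this toy series says nothing about the tourism data; the sketch only shows how a horizon-by-horizon comparison can be set up.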
Modelling cross-dependencies between Spain's regional tourism markets with an extension of the Gaussian process regression model
Claveria, Oscar, Monte, Enric, Torra, Salvador
This study presents an extension of the Gaussian process regression model for multiple-input multiple-output forecasting. This approach allows modelling the cross-dependencies between a given set of input variables and generating a vectorial prediction. Making use of the existing correlations in international tourism demand to all seventeen regions of Spain, the performance of the proposed model is assessed in a multiple-step-ahead forecasting comparison. The results of the experiment in a multivariate setting show that the Gaussian process regression model significantly improves the forecasting accuracy of a multi-layer perceptron neural network used as a benchmark. The results reveal that incorporating the connections between different markets in the modelling process may prove very useful to refine predictions at a regional level.
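As a rough illustration of multiple-input multiple-output prediction, the following is a minimal NumPy sketch of standard Gaussian process regression with a shared RBF kernel and a matrix of targets. It does not reproduce the authors' extension. In this sketch the cross-dependencies enter only through the joint input vector, and the three synthetic series, the length scale, and the noise level are assumptions made for the example.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict_mean(X_train, Y_train, X_test, length_scale=1.0, noise=1e-2):
    """Posterior mean of a zero-mean GP; Y_train may hold several output columns."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train, length_scale)
    # Solving against the whole target matrix yields one prediction vector per test row.
    return K_star @ np.linalg.solve(K, Y_train)


# Hypothetical data: three correlated series driven by a shared seasonal signal.
rng = np.random.default_rng(0)
t = np.arange(120)
common = np.sin(2 * np.pi * t / 12)
series = np.column_stack(
    [a * common + 0.1 * rng.standard_normal(t.size) for a in (1.0, 0.8, 1.2)]
)

# Inputs: all series at time t. Outputs: all series at time t + 1.
X, Y = series[:-1], series[1:]
X_train, Y_train, X_test = X[:100], Y[:100], X[100:]
Y_pred = gp_predict_mean(X_train, Y_train, X_test)
print(Y_pred.shape)
```

Each row of `Y_pred` is a vector forecast for all three series at once, which is the "vectorial prediction" idea in its simplest form.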
Building A Linear Regression with PySpark and MLlib
Apache Spark has become one of the most commonly used and supported open-source tools for machine learning and data science. In this post, I'll help you get started using Apache Spark's spark.ml. Our data is from the Kaggle competition: Housing Values in Suburbs of Boston. AGE -- proportion of owner-occupied units built prior to 1940. BLACK -- 1000(Bk - 0.63)² where Bk is the proportion of blacks by town. This is the target variable.
Solid Harmonic Wavelet Scattering for Predictions of Molecule Properties
Eickenberg, Michael, Exarchakis, Georgios, Hirn, Matthew, Mallat, Stéphane, Thiry, Louis
We present a machine learning algorithm for the prediction of molecule properties inspired by ideas from density functional theory. Using Gaussian-type orbital functions, we create surrogate electronic densities of the molecule from which we compute invariant "solid harmonic scattering coefficients" that account for different types of interactions at different scales. Multi-linear regressions of various physical properties of molecules are computed from these invariant coefficients. Numerical experiments show that these regressions have near state of the art performance, even with relatively few training examples. Predictions over small sets of scattering coefficients can reach a DFT precision while being interpretable.
scikit-learn - Test Predictions Using Various Models
Scikit-learn has evolved as a robust library for machine learning applications in Python with support for a wide range of supervised and unsupervised learning algorithms. This course begins by taking you through videos on linear models; with scikit-learn, you will take a machine learning approach to linear regression. As you progress, you will explore logistic regression. Then you will build models with distance metrics, including clustering. You will also look at cross-validation and post-model workflows, where you will see how to select a model that predicts well.
OMG - Emotion Challenge Solution
Cui, Yuqi, Zhang, Xiao, Wang, Yang, Guo, Chenfeng, Wu, Dongrui
This short paper describes our solution to the 2018 IEEE World Congress on Computational Intelligence (IEEE WCCI 2018) One-Minute Gradual-Emotional Behavior Challenge, whose goal was to estimate continuous arousal and valence values from short videos. We designed four base regression models using visual and audio features, and then used a spectral approach to fuse them to obtain improved performance. The dataset was composed of 420 relatively long emotion videos with an average length of 1 minute, collected from a variety of YouTube channels. Videos were separated into clips based on utterances, and each utterance's valence and arousal levels were annotated by at least five independent subjects using the Amazon Mechanical Turk tool.
Steps of Modelling
The data are usually recorded in rows and columns. A column represents a variable, whereas a row represents an observation, which is a set of p + 1 values for a single subject, i.e. one value for the response variable and one value for each of the p predictors. Each of the variables can be classified as either quantitative or qualitative. A technique used in cases where the response variable is binary is called logistic regression. In regression analysis, the predictor variables can be quantitative, qualitative, or both. For the purpose of computations, however, the qualitative variables, if any, have to be coded into a set of indicator or dummy variables.
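The dummy-variable coding mentioned above can be sketched as follows. This is a minimal illustration, assuming NumPy. The helper `dummy_code`, the variable names, and the toy data are hypothetical, and dropping the first level as the baseline is one common convention rather than the only one.

```python
import numpy as np


def dummy_code(values, drop_first=True):
    """Code a qualitative variable into 0/1 indicator columns.

    Levels are sorted alphabetically; with drop_first=True the first level
    becomes the baseline and gets no column, which avoids perfect
    collinearity with the intercept.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    columns = np.array(
        [[1.0 if v == level else 0.0 for level in kept] for v in values]
    )
    return columns, kept


# Hypothetical data: one quantitative and one qualitative predictor (p = 2).
size = np.array([50.0, 60.0, 80.0, 100.0, 120.0, 70.0])
region = ["north", "south", "east", "north", "east", "south"]

indicators, kept_levels = dummy_code(region)  # baseline level: "east"

# Response built from known coefficients so the fit can be checked by eye:
# price = 5 + 2 * size + 10 * [region is north] + 20 * [region is south]
price = 5.0 + 2.0 * size + 10.0 * indicators[:, 0] + 20.0 * indicators[:, 1]

# Design matrix: intercept, quantitative predictor, indicator columns.
X = np.column_stack([np.ones(len(size)), size, indicators])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(kept_levels, np.round(coef, 6))
```

Each row of `X` together with its `price` entry is one observation: one response value plus one value per predictor, with the qualitative predictor expanded into its indicator columns.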
Simultaneous Parameter Learning and Bi-Clustering for Multi-Response Models
Yu, Ming, Ramamurthy, Karthikeyan Natesan, Thompson, Addie, Lozano, Aurélie
We consider multi-response and multitask regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, or features, or both, the last one indicating a bi-cluster or "checkerboard" structure. Discovering this grouping structure along with parameter inference makes sense in several applications, such as multi-response Genome-Wide Association Studies (GWAS). This additional structure can not only be leveraged for more accurate parameter estimation, but also provides valuable information on the underlying data mechanisms (e.g. relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations to simultaneously learn the parameter matrix and its group structures, based on convex regularization penalties. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Our approaches are validated on extensive simulations and real datasets concerning phenotypes and genotypes of plant varieties.
Top 6 errors novice machine learning engineers make
In machine learning, there are many ways to build a product or solution and each way assumes something different. Many times, it's not obvious how to navigate and identify which assumptions are reasonable. People new to machine learning make mistakes, which in hindsight will often feel silly. I've created a list of the top mistakes that novice machine learning engineers make. Hopefully, you can learn from these common errors and create more robust solutions that bring real value.
Novel Prediction Techniques Based on Clusterwise Linear Regression
Gitman, Igor, Chen, Jieshi, Lei, Eric, Dubrawski, Artur
In this paper we explore different regression models based on Clusterwise Linear Regression (CLR). CLR aims to find the partition of the data into $k$ clusters, such that linear regressions fitted to each of the clusters minimize overall mean squared error on the whole data. The main obstacle to using the fitted regression models for prediction on unseen test points is the absence of a reasonable way to obtain CLR cluster labels when the values of the target variable are unknown. In this paper we propose two novel approaches to this problem. The first approach, predictive CLR, builds a separate classification model to predict test CLR labels. The second approach, constrained CLR, utilizes a set of user-specified constraints that force certain points into the same clusters. Assuming the constraint values are known for the test points, they can be directly used to assign CLR labels. We evaluate these two approaches on three UCI ML datasets as well as on a large corpus of health insurance claims. We show that both of the proposed algorithms significantly improve over the known CLR-based regression methods. Moreover, predictive CLR consistently outperforms linear regression and random forest, and shows comparable performance to support vector regression on UCI ML datasets. The constrained CLR approach achieves the best performance on the health insurance dataset, at a cost of only $\approx 20$ times the computational time of linear regression.
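The basic CLR objective described in this abstract can be illustrated with a minimal NumPy sketch. It uses a simple alternating heuristic (refit one regression per cluster, then reassign each point to the regression with the smallest squared residual), which is an assumption for illustration and not necessarily the authors' procedure. It does not implement the predictive or constrained variants, and the function name `fit_clr` and the synthetic data are hypothetical.

```python
import numpy as np


def fit_clr(X, y, k, n_iter=50, seed=0):
    """Alternate between per-cluster least squares and residual-based reassignment."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])   # add an intercept column
    labels = rng.integers(0, k, size=n)     # random initial partition
    coefs = np.zeros((k, Xb.shape[1]))
    for _ in range(n_iter):
        # Fit step: refit each cluster that has enough points to determine a fit.
        for j in range(k):
            mask = labels == j
            if mask.sum() >= Xb.shape[1]:
                coefs[j] = np.linalg.lstsq(Xb[mask], y[mask], rcond=None)[0]
        # Assignment step: squared residual of every point under every regression.
        residuals = (y[:, None] - Xb @ coefs.T) ** 2
        new_labels = residuals.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    mse = float(residuals[np.arange(n), labels].mean())
    return coefs, labels, mse


# Hypothetical data: two noisy linear regimes mixed together.
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=200)
regime = rng.integers(0, 2, size=200)
y = np.where(regime == 0, 2.0 * x + 1.0, -3.0 * x + 4.0)
y = y + 0.1 * rng.standard_normal(200)

coefs, labels, mse_clr = fit_clr(x[:, None], y, k=2)

# Single linear regression on the same data, for comparison.
Xb = np.column_stack([np.ones(len(x)), x])
beta = np.linalg.lstsq(Xb, y, rcond=None)[0]
mse_single = float(np.mean((y - Xb @ beta) ** 2))
print(mse_clr, mse_single)
```

Note that the assignment step uses the target values `y`, which is exactly why cluster labels are unavailable for unseen test points and why the abstract's two approaches are needed for prediction.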