Goto

Collaborating Authors

 Regression


Regression analysis using Python

#artificialintelligence

This article was written by Stuart Reid. This tutorial covers regression analysis using the Python StatsModels package with Quandl integration. For motivational purposes, here is what we are working towards: a regression analysis program which receives multiple data-set names from Quandl.com, automatically downloads the data, analyses it, and plots the results in a new window. Linear regression analysis fits a straight line to some data in order to capture the linear relationship between that data. The regression line is constructed by optimizing the parameters of the straight line function such that the line best fits a sample of (x, y) observations where y is a variable dependent on the value of x.


How to Become a Machine Learning Specialist in Under 20 Hours from This FREE LinkedIn Course

#artificialintelligence

If you are interested to become a Machine Learning Specialist, you are in the right place, because here we have the best LinkedIn course that you will love it. Machine Learning proves to be the future of our civilization, something that will help us to elevate our achievements to the next level, and explore new things, and all in all increase the quality of our life. The job positions in Machine Learning areas are one of the highest paying in the whole IT industry due to the fact that it requires knowledge in Mathematics, Statistics, Computer Science, and Software Engineering all combined. Now, to gain knowledge in all of these fields can be time-consuming due to all of those are sciences in themselves. However, there are huge corporations that have a huge need for experts in these areas and do not have the time that it takes to create these experts as we've already mentioned.


Logistic Regression Through the Veil of Imprecise Data

arXiv.org Machine Learning

Logistic regression is an important statistical tool for assessing the probability of an outcome based upon some predictive variables. Standard methods can only deal with precisely known data, however many datasets have uncertainties which traditional methods either reduce to a single point or completely disregarded. In this paper we show that it is possible to include these uncertainties by considering an imprecise logistic regression model using the set of possible models that can be obtained from values from within the intervals. This has the advantage of clearly expressing the epistemic uncertainty removed by traditional methods.


What's a good imputation to predict with missing values?

arXiv.org Machine Learning

How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation may not be needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Experiments confirm that joint imputation and regression through NeuMiss is better than various two step procedures in our experiments with finite number of samples.


Collaborative Nonstationary Multivariate Gaussian Process Model

arXiv.org Machine Learning

Currently, multi-output Gaussian process regression models either do not model nonstationarity or are associated with severe computational burdens and storage demands. Nonstationary multi-variate Gaussian process models (NMGP) use a nonstationary covariance function with an input-dependent linear model of coregionalisation to jointly model input-dependent correlation, scale, and smoothness of outputs. Variational sparse approximation relies on inducing points to enable scalable computations. Here, we take the best of both worlds: considering an inducing variable framework on the underlying latent functions in NMGP, we propose a novel model called the collaborative nonstationary Gaussian process model(CNMGP). For CNMGP, we derive computationally tractable variational bounds amenable to doubly stochastic variational inference. Together, this allows us to model data in which outputs do not share a common input set, with a computational complexity that is independent of the size of the inputs and outputs. We illustrate the performance of our method on synthetic data and three real datasets and show that our model generally pro-vides better predictive performance than the state-of-the-art, and also provides estimates of time-varying correlations that differ across outputs.


LASSO regression using tidymodels and #TidyTuesday data for The Office

#artificialintelligence

Our modeling goal here is to predict the IMDB ratings for episodes of The Office based on the other characteristics of the episodes in the #TidyTuesday dataset. There are two datasets, one with the ratings and one with information like director, writer, and which character spoke which line. The episode numbers and titles are not consistent between them, so we can use regular expressions to do a better job of matching the datasets up for joining. We are going to use several different kinds of features for modeling. First, let's find out how many times characters speak per episode.


Top 10 Important Data Science Algorithms to Know About

#artificialintelligence

Being a data scientist, one should properly understand the importance of AI and machine learning (ML) algorithms. Knowing data science algorithms through and through is deemed to be one of the most important skills in data science. These algorithms are the important slices in tasks like prediction, classification, and clustering from the data set in concern. The linear regression method is used for the value prediction of the dependent variable by using the values of the independent variable. It is suitable for predicting the value of a continuous quantity.


Using machine learning for quantum annealing accuracy prediction

arXiv.org Artificial Intelligence

Quantum annealers, such as the device built by D-Wave Systems, Inc., offer a way to compute solutions of NP-hard problems that can be expressed in Ising or QUBO (quadratic unconstrained binary optimization) form. Although such solutions are typically of very high quality, problem instances are usually not solved to optimality due to imperfections of the current generations quantum annealers. In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance, and to use machine learning models to predict the accuracy of the D-Wave 2000Q annealer for solving specific problems. We focus on the Maximum Clique problem, a classic NP-hard problem with important applications in network analysis, bioinformatics, and computational chemistry. By training a machine learning classification model on basic problem characteristics such as the number of edges in the graph, or annealing parameters such as D-Wave's chain strength, we are able to rank certain features in the order of their contribution to the solution hardness, and present a simple decision tree which allows to predict whether a problem will be solvable to optimality with the D-Wave 2000Q. We extend these results by training a machine learning regression model that predicts the clique size found by D-Wave.


Review of Low-Voltage Load Forecasting: Methods, Applications, and Recommendations

arXiv.org Machine Learning

The increased digitalisation and monitoring of the energy system opens up numerous opportunities % and solutions which can help to decarbonise the energy system. Applications on low voltage (LV), localised networks, such as community energy markets and smart storage will facilitate decarbonisation, but they will require advanced control and management. Reliable forecasting will be a necessary component of many of these systems to anticipate key features and uncertainties. Despite this urgent need, there has not yet been an extensive investigation into the current state-of-the-art of low voltage level forecasts, other than at the smart meter level. This paper aims to provide a comprehensive overview of the landscape, current approaches, core applications, challenges and recommendations. Another aim of this paper is to facilitate the continued improvement and advancement in this area. To this end, the paper also surveys some of the most relevant and promising trends. It establishes an open, community-driven list of the known LV level open datasets to encourage further research and development.


A Minimax Lower Bound for Low-Rank Matrix-Variate Logistic Regression

arXiv.org Machine Learning

This paper considers the problem of matrix-variate logistic regression. The fundamental error threshold on estimating coefficient matrices in the logistic regression problem is found by deriving a lower bound on the minimax risk. The focus of this paper is on derivation of a minimax risk lower bound for low-rank coefficient matrices. The bound depends explicitly on the dimensions and distribution of the covariates, the rank and energy of the coefficient matrix, and the number of samples. The resulting bound is proportional to the intrinsic degrees of freedom in the problem, which suggests the sample complexity of the low-rank matrix logistic regression problem can be lower than that for vectorized logistic regression. \color{red}\color{black} The proof techniques utilized in this work also set the stage for development of minimax lower bounds for tensor-variate logistic regression problems.