 Regression


Scratching Linear Regression using PyTorch - Part 1

#artificialintelligence

The weights w11, w12, ..., w23 and biases b1 and b2 can also be represented as matrices, initialized with random values. The first row of w and the first element of b are used to predict the first target variable, i.e., the yield of apples; similarly, the second row and second element are used for the yield of oranges. Our model is simply a function that performs a matrix multiplication of the inputs and the weights w (transposed) and adds the bias b for each observation. Let's create a function called model that predicts the output by performing the above operation when the inputs are passed as a parameter. Now, let's compare the predictions with the actual targets. Because the weights and biases start out as random values, the difference between our model's predictions and the actual target values is vast. So, we need to improve our model to reduce the difference.
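The operation described above can be sketched without any library, as a rough illustration only. In PyTorch the same computation is usually written as `inputs @ w.t() + b`; the plain-Python version below spells out the loops. The input and target numbers are hypothetical placeholders, not values from the article.

```python
import random

def model(inputs, w, b):
    """Compute inputs @ w^T + b for a batch of observations.

    inputs: list of observations, each a list of feature values
    w:      one row of weights per target variable
    b:      one bias per target variable
    """
    predictions = []
    for x in inputs:
        row = []
        for weights, bias in zip(w, b):
            # Dot product of one observation with one weight row, plus its bias
            row.append(sum(xi * wi for xi, wi in zip(x, weights)) + bias)
        predictions.append(row)
    return predictions

random.seed(0)

# Hypothetical data: 3 input features, 2 targets (apples, oranges)
inputs = [[73.0, 67.0, 43.0], [91.0, 88.0, 64.0]]
targets = [[56.0, 70.0], [81.0, 101.0]]

# Random initialization: w has shape 2 x 3 (w11..w23), b has 2 entries (b1, b2)
w = [[random.gauss(0, 1) for _ in range(3)] for _ in range(2)]
b = [random.gauss(0, 1) for _ in range(2)]

preds = model(inputs, w, b)

# Element-wise difference between predictions and targets
diffs = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(preds, targets)]
print(preds)
print(diffs)
```

With untrained random weights the differences will generally be large, which is what motivates the training step that follows in the tutorial.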


Efficient Methods for Online Multiclass Logistic Regression

arXiv.org Machine Learning

Multiclass logistic regression is a fundamental task in machine learning with applications in classification and boosting. Previous work (Foster et al., 2018) has highlighted the importance of improper predictors for achieving "fast rates" in the online multiclass logistic regression problem without suffering exponentially from secondary problem parameters, such as the norm of the predictors in the comparison class. While Foster et al. (2018) introduced a statistically optimal algorithm, it is in practice computationally intractable due to its run-time complexity being a large polynomial in the time horizon and dimension of input feature vectors. In this paper, we develop a new algorithm, FOLKLORE, for the problem which runs significantly faster than the algorithm of Foster et al. (2018) - the running time per iteration scales quadratically in the dimension - at the cost of a linear dependence on the norm of the predictors in the regret bound. This yields the first practical algorithm for online multiclass logistic regression, resolving an open problem of Foster et al. (2018). Furthermore, we show that our algorithm can be applied to online bandit multiclass prediction and online multiclass boosting, yielding more practical algorithms for both problems compared to the ones in (Foster et al., 2018) with similar performance guarantees. Finally, we also provide an online-to-batch conversion result for our algorithm.


Fair Regression under Sample Selection Bias

arXiv.org Artificial Intelligence

Recent research on fair regression has focused on developing new fairness notions and approximation methods, because the target variables and even the sensitive attribute are continuous in the regression setting. However, all previous fair regression research assumed that the training data and testing data are drawn from the same distribution. This assumption is often violated in the real world due to sample selection bias between the training and testing data. In this paper, we develop a framework for fair regression under sample selection bias when the dependent variable values of a set of samples from the training data are missing as a result of another hidden process. Our framework adopts the classic Heckman model for bias correction and Lagrange duality to achieve fairness in regression based on a variety of fairness notions. The Heckman model describes the sample selection process and uses a derived variable called the Inverse Mills Ratio (IMR) to correct sample selection bias. We use fairness inequality and equality constraints to describe a variety of fairness notions and apply Lagrange duality theory to transform the primal problem into a dual convex optimization problem. For the two popular fairness notions, mean difference and mean squared error difference, we derive explicit formulas without iterative optimization, and for Pearson correlation, we derive the conditions for achieving strong duality. We conduct experiments on three real-world datasets, and the experimental results demonstrate the approach's effectiveness in terms of both utility and fairness metrics.
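For readers unfamiliar with the Inverse Mills Ratio mentioned above, a minimal sketch of its standard textbook form follows: the standard normal density divided by the standard normal cumulative distribution, evaluated at the linear index of the selection equation. This is only the generic definition; how the paper plugs it into its fairness framework is not shown here, and the function names are illustrative.

```python
import math

def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills_ratio(z):
    """IMR(z) = pdf(z) / cdf(z); z is the selection-equation index.

    Note: for very negative z the cdf underflows to 0, so this naive
    form is only suitable for moderate values of z.
    """
    return normal_pdf(z) / normal_cdf(z)

# Hypothetical selection indices for three samples
for z in (-1.0, 0.0, 1.0):
    print(z, inverse_mills_ratio(z))
```

In the classic two-step Heckman procedure, this ratio is added as an extra regressor in the outcome regression for the selected samples; samples that were unlikely to be selected (low z) receive a larger correction term.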


Mixability made efficient: Fast online multiclass logistic regression

arXiv.org Machine Learning

Mixability has been shown to be a powerful tool for obtaining algorithms with optimal regret. However, the resulting methods often suffer from high computational complexity, which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of $O(\log(Bn))$, whereas Online Newton Step achieves $O(e^B\log(n))$; the aggregating forecaster thus obtains a doubly exponential gain in $B$ (a bound on the norm of comparative functions). However, this high statistical performance comes at the price of a prohibitive computational complexity of $O(n^{37})$.
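To get a feel for the gap between the two regret rates quoted above, the following sketch evaluates $\log(Bn)$ and $e^B\log(n)$ for a few values of $B$. It ignores the constants hidden in the $O(\cdot)$ notation, so the numbers indicate growth only, not actual regret; the chosen values of $B$ and $n$ are arbitrary.

```python
import math

def aggregating_forecaster_rate(B, n):
    """Growth term log(B * n), constants ignored."""
    return math.log(B * n)

def online_newton_step_rate(B, n):
    """Growth term e^B * log(n), constants ignored."""
    return math.exp(B) * math.log(n)

n = 10_000  # arbitrary time horizon for illustration
for B in (1, 5, 10):
    fast = aggregating_forecaster_rate(B, n)
    slow = online_newton_step_rate(B, n)
    print(f"B={B}: log(Bn)={fast:.2f}, e^B log(n)={slow:.2f}, ratio={slow / fast:.1f}")
```

The first term grows like $\log B$ while the second grows like $e^B$, which is the sense in which the gain is doubly exponential in $B$.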


Foolish Crowds Support Benign Overfitting

arXiv.org Machine Learning

We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the ground truth is sparse. Our analysis exposes the benefit of an effect analogous to the "wisdom of the crowd", except here the harm arising from fitting the noise is ameliorated by spreading it among many directions - the variance reduction arises from a foolish crowd.


Machine Learning Project Predict Will it Rain Tomorrow in Australia - Projects Based Learning

#artificialintelligence

In this project we will work with a data set whose target column indicates whether it rained the next day in Australia (Yes or No). This column is Yes if the rainfall for that day was 1mm or more. We will try to create a model that predicts this from the available data. Welcome to this project on predicting whether it will rain tomorrow in Australia, using Apache Spark Machine Learning on the Databricks platform. The Databricks Community Edition server lets you execute your Spark code free of cost on their server just by registering with an email ID. In this project, we explore Apache Spark and Machine Learning on the Databricks platform.


Ridge and Lasso Regression

#artificialintelligence

You might have worked on simple linear regression using ordinary least squares, and on its generalization to polynomial regression. You've also seen how we can overfit models to data using polynomials and interactions. In this blog post, I want to take a look at another way to tune our linear regression models. These methods all modify the mean squared error function that you are optimizing against. The modifications add a penalty for large coefficient weights in the resulting model. Ridge and Lasso regression are two simple techniques to reduce model complexity and prevent over-fitting.
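As a rough sketch of what "adding a penalty to the mean squared error" means, the functions below compute the plain MSE and the Ridge (squared, L2) and Lasso (absolute, L1) penalized versions for a linear model. The parameter name `alpha` for the penalty strength and the toy data are assumptions for illustration; libraries differ in how they scale the two terms.

```python
def predict(w, b, x):
    """Linear model: dot(w, x) + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, X, y):
    """Mean squared error over all observations."""
    return sum((predict(w, b, x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

def ridge_loss(w, b, X, y, alpha):
    """MSE plus an L2 penalty: alpha * sum of squared weights."""
    return mse(w, b, X, y) + alpha * sum(wi ** 2 for wi in w)

def lasso_loss(w, b, X, y, alpha):
    """MSE plus an L1 penalty: alpha * sum of absolute weights."""
    return mse(w, b, X, y) + alpha * sum(abs(wi) for wi in w)

# Toy data (hypothetical): y = 1*x1 + 2*x2 exactly
X = [[1.0, 2.0], [3.0, 4.0]]
y = [5.0, 11.0]
w, b = [1.0, 2.0], 0.0

print(mse(w, b, X, y))              # fit term only
print(ridge_loss(w, b, X, y, 0.1))  # fit term + L2 penalty
print(lasso_loss(w, b, X, y, 0.1))  # fit term + L1 penalty
```

Note that the bias is left out of the penalty here, which is a common convention. Even a perfect fit pays a cost proportional to the size of its weights, so the optimizer is pushed toward smaller coefficients.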


Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

arXiv.org Artificial Intelligence

Iterating with new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those decisions. They are crucial in order to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the methodology of the library with respect to text block level quality assessment. As an extension of this technique, another contribution comes in the form of a regression model that takes the enhancement potential of a new OCR engine into account. Both are promising approaches, especially for cultural institutions dealing with historic data of lower quality.


Ship Performance Monitoring using Machine-learning

arXiv.org Machine Learning

The hydrodynamic performance of a sea-going ship varies over its lifespan due to factors like marine fouling and the condition of the anti-fouling paint system. In order to accurately estimate the power demand and fuel consumption for a planned voyage, it is important to assess the hydrodynamic performance of the ship. The current work uses machine-learning (ML) methods to estimate the hydrodynamic performance of a ship using the onboard recorded in-service data. Three ML methods, NL-PCR, NL-PLSR and probabilistic ANN, are calibrated using the data from two sister ships. The calibrated models are used to extract the varying trend in the ships' hydrodynamic performance over time and to predict the change in performance through several propeller and hull cleaning events. The predicted change in performance is compared with the corresponding values estimated using the fouling friction coefficient ($\Delta C_F$). The ML methods are found to perform well when modelling the hydrodynamic state variables of the ships, with the probabilistic ANN model performing best. The results from NL-PCR and NL-PLSR are not far behind, indicating that it may be possible to use simple methods to solve such problems with the help of domain knowledge.


Heterogeneous Overdispersed Count Data Regressions via Double Penalized Estimations

arXiv.org Machine Learning

This paper studies the non-asymptotic merits of double $\ell_1$-regularized estimation for heterogeneous overdispersed count data via negative binomial regressions. Under the restricted eigenvalue conditions, we prove the oracle inequalities for Lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, the consistency and convergence rate of the estimators, derived from the oracle inequalities, provide the theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.