 Regression


Model Selection & Validation - ROC Curve - An Example Part-7

#artificialintelligence

A lab exercise is showcased to calculate the ROC curve and AUC for a logistic regression model on a sample data set. Learn and apply the practical code to test the data. Data scientists take an enormous mass of messy data points (unstructured and structured) and use their formidable skills in math, statistics, and programming to clean, massage, and organize them. But worry not: we are here to help and teach you how to be a data scientist and, more importantly, upgrade your analytic skills to tackle any problem in the field of data science. Join us on "statinfer.com" to become a "scientist in data science". Our "Machine Learning" course is now available on Udemy: https://www.udemy.com/machine-learnin... Part 1 - Introduction to R Programming.
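
As a rough illustration of what such an exercise computes, the following minimal Python sketch builds ROC points by sweeping a threshold over predicted scores and integrates them with the trapezoidal rule. The labels and scores are made-up values, not the course's data set, and the helper names `roc_curve_points` and `auc` are hypothetical.

```python
def roc_curve_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Sort by descending score so each step admits the next-highest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        # Handle all examples tied at this score in one step.
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical true labels and model scores (e.g. predicted probabilities).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve_points(labels, scores)
roc_auc = auc(points)
print(points)
print(round(roc_auc, 4))
```

For this toy input, 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.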


End-to-End Example: Using Logistic Regression for predicting Diabetes Commonlounge

@machinelearnbot

In this tutorial, we will see how to predict whether a person has diabetes or not, based on information like blood pressure, body mass index (BMI), age, etc. The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above. We will be using Python as our programming language and making use of some popular Python machine learning and data science packages.


Efficient Algorithms and Lower Bounds for Robust Linear Regression

arXiv.org Machine Learning

We study the problem of high-dimensional linear regression in a robust model where an $\epsilon$-fraction of the samples can be adversarially corrupted. We focus on the fundamental setting where the covariates of the uncorrupted samples are drawn from a Gaussian distribution $\mathcal{N}(0, \Sigma)$ on $\mathbb{R}^d$. We give nearly tight upper bounds and computational lower bounds for this problem. Specifically, our main contributions are as follows: For the case that the covariance matrix is known to be the identity, we give a sample near-optimal and computationally efficient algorithm that outputs a candidate hypothesis vector $\widehat{\beta}$ which approximates the unknown regression vector $\beta$ within $\ell_2$-norm $O(\epsilon \log(1/\epsilon) \sigma)$, where $\sigma$ is the standard deviation of the random observation noise. An error of $\Omega (\epsilon \sigma)$ is information-theoretically necessary, even with infinite sample size. Prior work gave an algorithm for this problem with sample complexity $\tilde{\Omega}(d^2/\epsilon^2)$ whose error guarantee scales with the $\ell_2$-norm of $\beta$. For the case of unknown covariance, we show that we can efficiently achieve the same error guarantee as in the known covariance case using an additional $\tilde{O}(d^2/\epsilon^2)$ unlabeled examples. On the other hand, an error of $O(\epsilon \sigma)$ can be information-theoretically attained with $O(d/\epsilon^2)$ samples. We prove a Statistical Query (SQ) lower bound providing evidence that this quadratic tradeoff in the sample size is inherent. More specifically, we show that any polynomial time SQ learning algorithm for robust linear regression (in Huber's contamination model) with estimation complexity $O(d^{2-c})$, where $c>0$ is an arbitrarily small constant, must incur an error of $\Omega(\sqrt{\epsilon} \sigma)$.


Dynamic Advisor-Based Ensemble (dynABE): Case Study in Stock Trend Prediction of a Major Critical Metal Producer

arXiv.org Machine Learning

The demand for metals in modern technology has been shifting from common base metals to a variety of minor metals, such as cobalt or indium. The industrial importance and limited geological availability of some minor metals have led to them being considered more "critical," and there is a growing interest in such critical metals and their producing companies. In this research, we create a novel framework, Dynamic Advisor-Based Ensemble (dynABE), to predict the stock trend of major critical metal producers. Specifically, dynABE first utilizes domain knowledge to group the features into different "advisors," each advisor dealing with a particular economic sector. Then through ensembles of weak classifiers, each advisor produces a prediction result, and all the advisors are combined again in a biased online update fashion to dynamically make the final prediction. Based on a misclassification error of 32% for Jinchuan Group's stock (HKG: 2362), we further test a simple stock trading strategy, which leads to a back-tested return of 296%, or an excess return of 130% within one year. In addition, the feature set selected by dynABE also suggests potentially influential factors to metal criticality, because stock prices of major producers influence metal production. Therefore, not only does this research propose a novel framework for specialized stock trend prediction, it also provides domain insights into dynamic features that potentially influence metal criticality.


Logistic Regression Regularized with Optimization

#artificialintelligence

Logistic regression predicts the probability of the outcome being true. In this exercise, we will implement a logistic regression and apply it to two different data sets. To learn the basics of Logistic Regression in R read this post. In the first part of this exercise, we will build a logistic regression model to predict whether a student gets admitted into a university. Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams.
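
The original exercise is in R and its code is not shown here. As a hedged sketch of the idea only, the Python snippet below fits a logistic regression with an L2 (ridge) penalty by plain gradient descent. The exam scores are invented and rescaled to [0, 1], and `fit_logistic`, `lam`, `lr`, and `steps` are illustrative names and settings, not the post's.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lam=0.01, lr=0.5, steps=5000):
    """Gradient descent on mean log-loss + (lam / 2) * ||w||^2 (bias unpenalized)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + lam * w[j])
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical applicants: two exam scores scaled to [0, 1], label 1 = admitted.
X = [[0.2, 0.3], [0.3, 0.2], [0.4, 0.4], [0.6, 0.7], [0.8, 0.6], [0.9, 0.9]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y, lam=0.01)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]

# A stronger penalty shrinks the weights toward zero.
w_strong, b_strong = fit_logistic(X, y, lam=1.0)
print(preds, w, w_strong)
```

Comparing `w` with `w_strong` shows the usual effect of regularization: the larger `lam` is, the smaller the fitted weight vector.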


Introduction to Machine Learning Algorithms: Logistic Regression

#artificialintelligence

Logistic regression is the most famous machine learning algorithm after linear regression. In a lot of ways, linear regression and logistic regression are similar. But, the biggest difference lies in what they are used for. Linear regression algorithms are used to predict/forecast values but logistic regression is used for classification tasks. If you are shaky on the concepts of linear regression, check this out.


Uniform regret bounds over $R^d$ for the sequential linear regression problem with the square loss

arXiv.org Machine Learning

We consider the setting of online linear regression for arbitrary deterministic sequences, with the square loss. We are interested in regret bounds that hold uniformly over all vectors $u \in R^d$. Vovk (2001) showed a $d \ln T$ lower bound on this uniform regret. We exhibit forecasters with closed-form regret bounds that match this $d \ln T$ quantity. To the best of our knowledge, earlier works only provided closed-form regret bounds of $2d \ln T + O(1)$.


Statistical mechanical analysis of sparse linear regression as a variable selection problem

arXiv.org Machine Learning

An algorithmic limit of compressed sensing or related variable-selection problems is analytically evaluated when a design matrix is given by an overcomplete random matrix. The replica method from statistical mechanics is employed to derive the result. The analysis is conducted through evaluation of the entropy, an exponential rate of the number of combinations of variables giving a specific value of fit error to given data which is assumed to be generated from a linear process using the design matrix. This yields the typical achievable limit of the fit error when solving a representative $\ell_0$ problem and includes the presence of unfavourable phase transitions preventing local search algorithms from reaching the minimum-error configuration. The associated phase diagrams are presented. A noteworthy outcome of the phase diagrams is, however, that there exists a wide parameter region where any phase transition is absent from the high temperature to the lowest temperature at which the minimum-error configuration or the ground state is reached. This implies that certain local search algorithms can find the ground state with moderate computational costs in that region. The theoretical evaluation of the entropy is confirmed by extensive numerical methods using the exchange Monte Carlo and the multi-histogram methods. Another numerical test based on a metaheuristic optimisation algorithm called simulated annealing is conducted, which well supports the theoretical predictions on the local search algorithms and we can find the ground state with high probability in polynomial time with respect to system size.


Implicit ridge regularization provided by the minimum-norm least squares estimator when $n\ll p$

arXiv.org Machine Learning

A conventional wisdom in statistical learning is that large models require strong regularization to prevent overfitting. This rule has been recently challenged by deep neural networks: despite being expressive enough to fit any training set perfectly, they still generalize well. Here we show that the same is true for linear regression in the under-determined $n\ll p$ situation, provided that one uses the minimum-norm estimator. The case of linear model with least squares loss allows full and exact mathematical analysis. We prove that augmenting a model with many random covariates with small constant variance and using minimum-norm estimator is asymptotically equivalent to adding the ridge penalty. Using toy example simulations as well as real-life high-dimensional data sets, we demonstrate that explicit ridge penalty often fails to provide any improvement over this implicit ridge regularization. In this regime, minimum-norm estimator achieves zero training error but nevertheless has low expected error.
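
To make the minimum-norm estimator concrete, here is a small Python sketch for a toy case with $n = 2$ samples and $p = 3$ covariates (values invented for illustration). It uses the standard dual form $\hat\beta_\lambda = X^\top (X X^\top + \lambda I)^{-1} y$, which gives the minimum-norm interpolator at $\lambda = 0$ and the ridge estimator for $\lambda > 0$. It illustrates only the well-known fact that ridge tends to the minimum-norm solution as $\lambda \to 0$, not the paper's result on random covariates. The helper name `ridge_dual` is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ridge_dual(X, y, lam=0.0):
    """beta = X^T (X X^T + lam * I)^{-1} y for exactly two samples (rows of X).

    With lam = 0 this is the minimum-norm least squares solution.
    """
    r0, r1 = X
    g00 = dot(r0, r0) + lam
    g01 = dot(r0, r1)
    g11 = dot(r1, r1) + lam
    det = g00 * g11 - g01 * g01
    # Solve the 2x2 system (X X^T + lam I) a = y by Cramer's rule.
    a0 = (g11 * y[0] - g01 * y[1]) / det
    a1 = (g00 * y[1] - g01 * y[0]) / det
    return [a0 * u + a1 * v for u, v in zip(r0, r1)]


# Under-determined toy problem: 2 samples, 3 covariates.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [2.0, 3.0]

beta_min = ridge_dual(X, y, lam=0.0)
residuals = [dot(row, beta_min) - yi for row, yi in zip(X, y)]

# Any null-space direction of X can be added without changing the fit;
# (-1, -1, 1) is orthogonal to both rows of X.
null_vec = [-1.0, -1.0, 1.0]
beta_other = [b + v for b, v in zip(beta_min, null_vec)]

beta_ridge = ridge_dual(X, y, lam=1e-8)
print(beta_min, residuals, beta_other, beta_ridge)
```

Both `beta_min` and `beta_other` fit the two training points exactly, but `beta_min` has the smaller Euclidean norm, and the ridge solution with a tiny penalty is numerically almost identical to it.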


Statistical Reasoning for Public Health 2: Regression Methods Coursera

@machinelearnbot

Structure: Good structure and went through all the basic principles of statistics in detail. Appreciated how it did not have to go through the methodology of each method, but taught us how to appreciate it and understand the data as it was presented in the literature. I liked how John went through the examples in the literature, so it was good to see how it was utilised in practice. I wish there was a separate course to teach us how to use these methods with sample data; perhaps a taster of this would have been good to include, but I do understand that would be challenging for some. I think some in-video questions would have been good to check on the progress of learning.