 Regression


What is Gradient Descent in Machine Learning?

#artificialintelligence

In almost every machine learning problem that involves regression, one term comes up again and again: gradient descent. Linear regression, logistic regression, SVMs, and similar models all amount to finding a best-fit line (or decision boundary), where the slope and bias of the line are chosen to fit the points in the dataset as closely as possible. A line that passes through every point is rarely achievable, and forcing a perfect fit typically leads to overfitting. The difference between the target output and the predicted output is therefore measured by a loss function, or cost function; for regression it is commonly the squared difference between the predicted value and the actual value. When this cost function is at its minimum, we say that the model has reached the point of least error and can be used as a benchmark model. In statistics, a great deal of tuning and tweaking goes into reaching that point of least error.
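As a rough illustration of the idea, the sketch below minimizes the mean squared error of a straight line with plain batch gradient descent. The function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions made for this example and do not come from the article.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals: predicted value minus actual value for every point.
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of (1/n) * sum(err^2) with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / n * sum(errs)
        # Step against the gradient to reduce the cost.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

On this noise-free data the slope and bias should approach 2 and 1. With noisy data the minimum of the cost is no longer zero, which is the "point of least error" the text refers to.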


Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data

arXiv.org Machine Learning

The Large Scale Visual Recognition Challenge based on the well-known ImageNet dataset catalyzed an intense flurry of progress in computer vision. Benchmark tasks have propelled other sub-fields of machine learning forward at an equally impressive pace, but in healthcare it has primarily been image processing tasks, such as in dermatology and radiology, that have experienced similar benchmark-driven progress. In the present study, we performed a comprehensive review of benchmarks in medical machine learning for structured data, identifying one based on the Medical Information Mart for Intensive Care (MIMIC-III) that allows the first direct comparison of predictive performance and thus the evaluation of progress on four clinical prediction tasks: mortality, length of stay, phenotyping, and patient decompensation. We find that little meaningful progress has been made over a three-year period on these tasks, despite significant community engagement. Through our meta-analysis, we find that the performance of deep recurrent models is only superior to logistic regression on certain tasks. We conclude with a synthesis of these results, possible explanations, and a list of desirable qualities for future benchmarks in medical machine learning.


The Efficacy of $L_1$ Regularization in Two-Layer Neural Networks

arXiv.org Machine Learning

A crucial problem in neural networks is to select the most appropriate number of hidden neurons and obtain tight statistical risk bounds. In this work, we present a new perspective on the bias-variance tradeoff in neural networks. As an alternative to selecting the number of neurons, we theoretically show that $L_1$ regularization can control the generalization error and sparsify the input dimension. In particular, with an appropriate $L_1$ regularization on the output layer, the network can produce a statistical risk that is near minimax optimal. Moreover, an appropriate $L_1$ regularization on the input layer leads to a risk bound that does not involve the input data dimension. Our analysis is based on a new amalgamation of dimension-based and norm-based complexity analysis to bound the generalization error. A consequent observation from our results is that an excessively large number of neurons does not necessarily inflate generalization errors under a suitable regularization.


Linear Classifier Combination via Multiple Potential Functions

arXiv.org Machine Learning

A vital aspect of the classification-based model construction process is the calibration of the scoring function. One of the weaknesses of the calibration process is that it does not take into account the information about the relative positions of the recognized objects in the feature space. To alleviate this limitation, in this paper, we propose a novel concept of calculating a scoring function based on the distance of the object from the decision boundary and its distance to the class centroid. An important property is that the proposed score function has the same nature for all linear base classifiers, which means that outputs of these classifiers are equally represented and have the same meaning. The proposed approach is compared with other ensemble algorithms, and experiments on multiple KEEL datasets demonstrate the effectiveness of our method. To discuss the results of our experiments, we use multiple classification performance measures and statistical analysis.


Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes

arXiv.org Machine Learning

The preponderance of large-scale healthcare databases provides abundant opportunities for comparative effectiveness research. Evidence necessary for making informed treatment decisions often relies on comparing the effectiveness of multiple treatment options on outcomes of interest observed in a small number of individuals. Causal inference with multiple treatments and rare outcomes is a subject that has been treated sparingly in the literature. This paper designs three sets of simulations, representative of the structure of our healthcare database study, and proposes causal analysis strategies for such settings. We investigate and compare the operating characteristics of three types of methods and their variants: Bayesian Additive Regression Trees (BART), regression adjustment on multivariate spline of generalized propensity scores (RAMS) and inverse probability of treatment weighting (IPTW) with multinomial logistic regression or generalized boosted models. Our results suggest that BART and RAMS provide lower bias and mean squared error, and the widely used IPTW methods deliver unfavorable operating characteristics. We illustrate the methods using a case study evaluating the comparative effectiveness of robotic-assisted surgery, video-assisted thoracoscopic surgery and open thoracotomy for treating non-small cell lung cancer.


It Is Likely That Your Loss Should be a Likelihood

arXiv.org Machine Learning

Many common loss functions such as mean-squared-error, cross-entropy, and reconstruction loss are unnecessarily rigid. Under a probabilistic interpretation, these common losses correspond to distributions with fixed shapes and scales. We instead argue for optimizing full likelihoods that include parameters like the normal variance and softmax temperature. Joint optimization of these "likelihood parameters" with model parameters can adaptively tune the scales and shapes of losses in addition to the strength of regularization. We explore and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling, outlier-detection, and re-calibration. Additionally, we propose adaptively tuning $L_2$ and $L_1$ weights by fitting the scale parameters of normal and Laplace priors and introduce more flexible element-wise regularizers.
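To make the probabilistic reading concrete, the sketch below writes the Gaussian negative log-likelihood with an explicit scale parameter `sigma`. With `sigma` fixed, it reduces to a scaled squared error plus a constant; leaving `sigma` free lets the loss adapt its scale. The residual values and the closed-form estimate of `sigma` are assumptions made for this illustration. The abstract describes joint optimization with the model parameters, which this sketch does not attempt.

```python
import math


def gaussian_nll(residuals, sigma):
    """Negative log-likelihood of residuals under N(0, sigma^2)."""
    n = len(residuals)
    sq_sum = sum(r * r for r in residuals)
    # n * log(sigma * sqrt(2*pi)) is the normalizing term;
    # sq_sum / (2 * sigma^2) is the familiar squared-error term.
    return n * (math.log(sigma) + 0.5 * math.log(2.0 * math.pi)) + sq_sum / (2.0 * sigma ** 2)


# Hypothetical residuals (prediction minus target) from some fitted model.
residuals = [0.5, -1.5, 2.0, -0.5, 3.0]

# For fixed residuals, the sigma that minimizes the NLL is the root mean square residual.
sigma_mle = math.sqrt(sum(r * r for r in residuals) / len(residuals))

print(gaussian_nll(residuals, 1.0))        # scale fixed at 1, as in plain squared error
print(gaussian_nll(residuals, sigma_mle))  # scale fitted to the data
```

The second value is never larger than the first, because the fitted scale minimizes the likelihood-based loss over `sigma` for these residuals.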


Logistic Regression Clearly Explained

#artificialintelligence

Logistic Regression is one of the most widely used classification algorithms in machine learning. It is used in many real-world scenarios such as spam detection and cancer detection, and on benchmark datasets such as Iris. It is mostly used in binary classification problems, but it can also be used for multiclass classification. Logistic Regression predicts the probability that a given data point belongs to a certain class. In this article, I will be using the famous heart disease dataset from Kaggle. In this dataset, the main goal is to predict whether a given person has heart disease or not.
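The probability mentioned above comes from passing a linear score through the sigmoid function. The sketch below shows only that prediction step. The weights, bias, and feature values are made up for illustration and are not fitted to the heart disease dataset or taken from the article.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping any real score to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Probability of the positive class for one data point."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict(weights, bias, features, threshold=0.5):
    """Binary label obtained by thresholding the predicted probability."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical parameters and one hypothetical (already scaled) data point.
weights = [1.0, -2.0]
bias = 0.5
point = [1.5, 1.0]
print(predict_proba(weights, bias, point), predict(weights, bias, point))
```

In practice the weights and bias would be learned from training data, for example by minimizing the cross-entropy loss with gradient descent.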


First-order Optimization for Superquantile-based Supervised Learning

arXiv.org Machine Learning

Classical supervised learning via empirical risk (or negative log-likelihood) minimization hinges upon the assumption that the testing distribution coincides with the training distribution. This assumption can be challenged in modern applications of machine learning in which learning machines may operate at prediction time with testing data whose distribution departs from that of the training data. We revisit the superquantile regression method by proposing a first-order optimization algorithm to minimize a superquantile-based learning objective. The proposed algorithm is based on smoothing the superquantile function by infimal convolution. Promising numerical results illustrate the interest of the approach towards safer supervised learning.
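For readers unfamiliar with the term, the superquantile (also known as conditional value-at-risk) of a set of losses at level p is, roughly, the average of the worst (1 - p) fraction of them. The sketch below evaluates the empirical, unsmoothed superquantile through the Rockafellar-Uryasev variational formula. It is an assumption-laden illustration of the objective only, not the smoothed first-order algorithm that the abstract proposes; the function name and sample losses are hypothetical.

```python
def superquantile(losses, p):
    """Empirical superquantile (CVaR) of losses at level p, with 0 < p < 1.

    Uses min over eta of: eta + mean(max(loss - eta, 0)) / (1 - p).
    The objective is convex and piecewise linear in eta with kinks at the
    sample values, so checking the sample values is enough.
    """
    n = len(losses)

    def objective(eta):
        excess = sum(max(loss - eta, 0.0) for loss in losses)
        return eta + excess / ((1.0 - p) * n)

    return min(objective(eta) for eta in losses)


# Hypothetical per-example losses.
losses = [1.0, 2.0, 3.0, 4.0]
print(superquantile(losses, 0.5))   # average of the worst half: 3.5
print(superquantile(losses, 0.75))  # average of the worst quarter: 4.0
```

Minimizing this quantity instead of the plain average loss puts the emphasis on the hardest examples, which is the motivation for its use when the test distribution may shift.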


Applied Machine Learning Models For Improved Startup Valuation.

#artificialintelligence

Determining the valuation of an early-stage startup is in most cases very challenging due to limited historical data, little to no existing revenue, market uncertainty, and many other factors. Traditional valuation techniques, such as Discounted Cash Flow (DCF) or Multiples (CCA), therefore often lead to inappropriate results. On the other hand, alternative valuation methods remain subject to an individual's subjective assessment and are a black box for others. Therefore, the underlying study leverages machine learning algorithms to predict fair, data-driven and comprehensible startup valuations. Three different data sources are merged and applied to three regression models.


Regress Consistently when Oblivious Outliers Overwhelm

arXiv.org Machine Learning

We give a novel analysis of the Huber loss estimator for consistent robust linear regression, proving that it simultaneously achieves an optimal dependency on the fraction of outliers and on the dimension. We consider a linear regression model with an oblivious adversary, who may corrupt the observations in an arbitrary way but without knowing the data. (This adversary model also captures heavy-tailed noise distributions.) Given observations $y_1,\ldots,y_n$ with an $\alpha$ uncorrupted fraction, we obtain error guarantees $\tilde{O}(\sqrt{d/\alpha^2\cdot n})$, optimal up to logarithmic terms. Our algorithm works with a nearly optimal fraction of inliers $\alpha\geq \tilde{O}(\sqrt{d/n})$ and under mild restricted isometry assumptions (RIP) on the (transposed) design matrix. Prior to this work, even in the simple case of spherical Gaussian design, no estimator was known to achieve vanishing error guarantees in the high dimensional settings $d\gtrsim \sqrt{n}$, whenever the fraction of uncorrupted observations is smaller than $1/\log n$. Our analysis of the Huber loss estimator only exploits the first order optimality conditions. Furthermore, in the special case of Gaussian design $X\sim N(0,1)^{n \times d}$, we show that a strikingly simple algorithm based on computing coordinate-wise medians achieves similar guarantees in linear time. The algorithm also extends to the settings where the parameter vector $\beta^*$ is sparse.