 Regression


Statistical Reasoning for Public Health 2: Regression Methods Coursera

#artificialintelligence

This module, along with module 2B, introduces two key concepts in statistics and epidemiology: confounding and effect modification. A relation between an outcome and an exposure of interest can be confounded if another variable (or set of variables) is associated with both the outcome and the exposure. In such cases the crude outcome/exposure association may over- or underestimate the association of interest. Confounding is an ever-present threat in non-randomized studies, but results of interest can be adjusted for potential confounders.
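As a rough illustration of the difference between a crude and an adjusted association, the sketch below compares a crude risk difference with one averaged over strata of a confounder. The counts, the stratum labels `z0`/`z1`, and both function names are hypothetical and chosen only to make the effect visible; weighting strata by size is one simple adjustment among several and is not taken from the course.

```python
# Hypothetical counts per stratum of a confounder Z:
# (stratum, exposed) -> (number of events, group size)
counts = {
    ("z0", True): (10, 100),
    ("z0", False): (40, 400),
    ("z1", True): (160, 400),
    ("z1", False): (40, 100),
}


def crude_risk_difference(counts):
    """Risk in the exposed minus risk in the unexposed, ignoring strata."""
    totals = {True: [0, 0], False: [0, 0]}
    for (_, exposed), (events, size) in counts.items():
        totals[exposed][0] += events
        totals[exposed][1] += size
    return totals[True][0] / totals[True][1] - totals[False][0] / totals[False][1]


def adjusted_risk_difference(counts):
    """Stratum-specific risk differences averaged with stratum-size weights."""
    strata = sorted({stratum for stratum, _ in counts})
    weighted_sum = 0.0
    total_size = 0
    for stratum in strata:
        e1, n1 = counts[(stratum, True)]
        e0, n0 = counts[(stratum, False)]
        weight = n1 + n0
        weighted_sum += weight * (e1 / n1 - e0 / n0)
        total_size += weight
    return weighted_sum / total_size


crude = crude_risk_difference(counts)
adjusted = adjusted_risk_difference(counts)
print(f"crude: {crude:.2f}, adjusted: {adjusted:.2f}")
```

In these made-up data the exposed are concentrated in the high-risk stratum, so the crude difference is about 0.18 even though the risk is identical for exposed and unexposed within each stratum.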


Regression Analysis for Statistics & Machine Learning in R

#artificialintelligence

It is a practical, hands-on course: we will spend some time on the theoretical concepts related to both statistical and machine learning regression analysis, but the majority of the course will focus on implementing different techniques on real data and interpreting the results. After each video you will have learned a new concept or technique that you can apply to your own projects.


Mathematics for Machine Learning : Linear Regression & Least Square Regression

#artificialintelligence

As we know from basic maths, if we plot a linear relationship on an X-Y graph, we always get a straight line. The equation of a straight line is written as y = mx + b, where m is the slope (gradient) and b is the y-intercept (where the line crosses the Y axis). Once we have the equation of the line through two points in y = mx + b form, we can use it to predict y at other values of x that lie on the same line. Let's take a real-world example to demonstrate linear regression and the use of the least squares method to reduce errors: the price of agricultural products and how it varies with the location where they are sold.
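A minimal sketch of the least squares fit for a straight line follows, using the standard closed-form expressions for the slope and intercept. The function name `fit_line` and the distance/price numbers are hypothetical and are not data from the article.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b minimizing the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b


# Hypothetical data: distance from the farm (km) and price per unit.
distances_km = [1.0, 2.0, 3.0, 4.0]
prices = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(distances_km, prices)
predicted_price = m * 5.0 + b  # prediction at a new location, x = 5
print(m, b, predicted_price)
```

Because these points lie exactly on a line, the fit recovers that line; with noisy data the same formulas give the line with the smallest total squared vertical error.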


Automatic Classification of Object Code Using Machine Learning

arXiv.org Machine Learning

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that for samples of unlabeled compiled computer object code, one can apply the same type of analysis to classify important aspects of the code, such as its target architecture and endianness. We show that using simple byte-value histograms we retain enough information about the opcodes within a sample to classify the target architecture with high accuracy, and then discuss heuristic-based features that exploit information within the operands to determine endianness. We introduce a dataset with over 16000 code samples from 20 architectures and experimentally show that by using our features, classifiers can achieve very high accuracy with relatively small sample sizes.
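A byte-value histogram of the kind mentioned above can be sketched as follows. The function name `byte_histogram`, the normalization to relative frequencies, and the sample bytes are assumptions for illustration; the paper's exact feature construction may differ.

```python
from collections import Counter


def byte_histogram(sample):
    """Return a 256-element list with the relative frequency of each byte value."""
    total = len(sample)
    if total == 0:
        return [0.0] * 256
    counts = Counter(sample)  # iterating over bytes yields integers 0..255
    return [counts.get(value, 0) / total for value in range(256)]


# Hypothetical 4-byte fragment standing in for a compiled code sample.
features = byte_histogram(b"\x00\x00\xff\x90")
print(features[0x00], features[0xFF], features[0x90])
```

The resulting fixed-length vector can be passed to any standard classifier regardless of the size of the original sample.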


Interpretable Machine Learning with iml and mlr

#artificialintelligence

Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable. Fortunately, there are many methods that can make machine learning models interpretable. Feature importance: Which were the most important features? Feature effects: How does a feature influence the prediction?
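One common way to answer the feature importance question is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below is written in Python with NumPy and does not reproduce the iml or mlr APIs; the function name, the `predict` callable, and the toy model are hypothetical.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in squared error after shuffling each feature column."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


def toy_model(X):
    """Hypothetical fitted model that only uses the first feature."""
    return 3.0 * X[:, 0]


data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((200, 2))
y = 3.0 * X[:, 0]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

The first feature receives a positive score because shuffling it degrades the predictions, while the second, which the toy model ignores, scores zero.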


Modeling Dengue Vector Population Using Remotely Sensed Data and Machine Learning

arXiv.org Machine Learning

Mosquitoes are vectors of many human diseases. In particular, Aedes aegypti (Linnaeus) is the main vector for Chikungunya, Dengue, and Zika viruses in Latin America and it represents a global threat. Public health policies that aim at combating this vector require dependable and timely information, which is usually expensive to obtain with field campaigns. For this reason, several efforts have been made to use remote sensing due to its reduced cost. The present work includes the temporal modeling of the oviposition activity (measured weekly on 50 ovitraps in a north Argentinean city) of Aedes aegypti (Linnaeus), based on time series of data extracted from operational earth observation satellite images. We use NDVI, NDWI, LST night, LST day and TRMM-GPM rain from 2012 to 2016 as predictive variables. In contrast to previous works which use linear models, we employ Machine Learning techniques using completely accessible open source toolkits. These models have the advantages of being non-parametric and capable of describing nonlinear relationships between variables. Specifically, in addition to two linear approaches, we assess a Support Vector Machine, an Artificial Neural Network, a K-nearest neighbors regressor and a Decision Tree Regressor. Considerations are made on parameter tuning and the validation and training approach. The results are compared to linear models used in previous works with similar data sets for generating temporal predictive models. These new tools perform better than linear approaches; in particular, Nearest Neighbor Regression (KNNR) performs the best. These results provide better alternatives to be implemented operatively on the Argentine geospatial Risk system that has been running since 2012.
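For readers unfamiliar with nearest neighbor regression, a minimal NumPy sketch is given below. It assumes Euclidean distance and an unweighted mean of the k closest training targets; the function name and the toy numbers are hypothetical and unrelated to the study's satellite data or toolkit.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query target as the mean target of its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    # Pairwise Euclidean distances, shape (n_query, n_train).
    distances = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(distances, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


# Toy one-feature example with made-up values.
X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]
prediction = knn_predict(X_train, y_train, [[1.0]], k=3)
print(prediction)
```

No functional form is fitted, which is why the method can follow nonlinear relationships, at the cost of needing a sensible choice of k and of feature scaling.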


Estimating Learnability in the Sublinear Data Regime

arXiv.org Machine Learning

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this "learnability" even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a $d$-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with $O(\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. We extend these techniques to the setting of binary classification, where we show that in an analogous setting, the prediction error of the best linear classifier can be accurately estimated given $O(\sqrt{d})$ labeled samples. Note that in both the linear regression and binary classification settings, even if there is no noise in the labels, a sample size linear in the dimension, $d$, is required to \emph{learn} any function correlated with the underlying model. We further extend our estimation approach to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. Finally, we demonstrate the practical viability of these approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.
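One way to see why the explained variance can be estimated without first learning a model is the following identity. It is a sketch under added assumptions (zero-mean data $x$ with identity covariance, independent samples) and is not necessarily the exact estimator used in the paper; the symbol $\beta$ for the best linear predictor is introduced here for illustration.

```latex
\begin{align*}
\beta &= \arg\min_{b}\, \mathbb{E}\big[(y - b^{\top} x)^2\big]
       = \mathbb{E}[x\,y]
       && \text{since } \mathbb{E}[x x^{\top}] = I, \\
\mathbb{E}\big[(\beta^{\top} x)^2\big] &= \beta^{\top}\,\mathbb{E}[x x^{\top}]\,\beta = \|\beta\|^2
       && \text{(variance explained by the best linear function)}, \\
\mathbb{E}\big[y_i\, y_j\, x_i^{\top} x_j\big]
      &= \mathbb{E}[y_i x_i]^{\top}\, \mathbb{E}[y_j x_j] = \|\beta\|^2
       && \text{for independent samples } i \neq j .
\end{align*}
```

Averaging $y_i y_j x_i^{\top} x_j$ over pairs $i \neq j$ therefore gives an unbiased estimate of $\|\beta\|^2$ without estimating $\beta$ itself, and dividing by the variance of $y$ gives the fraction of variance explained.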


Distribution Assertive Regression

arXiv.org Machine Learning

In the regression modelling approach, the main step is to fit the regression line as closely as possible to the target variable. In this process most algorithms try to fit all of the data with a single line, and hence fit all parts of the target variable in one go. It was observed that the error between the predicted and target variable usually behaves differently across the quantiles of the dependent variable, and hence a single-point diagnostic like MAPE is limited in its ability to signify the level of fit across the distribution of Y (the dependent variable). To address this problem, a novel approach is proposed in the paper to deal with regression fitting over various quantiles of the target variable. Using this approach we have significantly improved the eccentric behavior of the distance (error) between the predicted and actual values of the regression. Our proposed solution is based on understanding the segmented behavior of the data with respect to the internal segments within the data, and on an approach for retrospectively fitting the data based on the behavior of each quantile. We believe exploring and using this approach would help in achieving better and more explainable results in most settings of real-world data modelling problems.
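The limitation of a single MAPE figure can be illustrated by computing MAPE separately within quantile bins of the target. This is only a diagnostic sketch, not the paper's fitting procedure; the function name, the bin count, and the toy values are hypothetical, and it assumes non-zero targets and no empty bins.

```python
import numpy as np


def mape_by_quantile(y_true, y_pred, n_bins=4):
    """Return MAPE (in percent) within each quantile bin of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ape = np.abs((y_true - y_pred) / y_true) * 100.0
    # Interior quantile edges, e.g. the 25th, 50th and 75th percentiles for 4 bins.
    edges = np.quantile(y_true, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(y_true, edges)
    return [float(ape[bins == b].mean()) for b in range(n_bins)]


# Toy data: a constant absolute error of 1 on targets 1..8.
y_true = np.arange(1.0, 9.0)
y_pred = y_true + 1.0

overall_mape = float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
per_bin_mape = mape_by_quantile(y_true, y_pred)
print(overall_mape, per_bin_mape)
```

Here the overall MAPE hides the fact that the lowest quartile of the target has a far larger percentage error than the highest.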


Valid Inference for $L_2$-Boosting

arXiv.org Machine Learning

We review several recently proposed post-selection inference frameworks and assess their transferability to the component-wise functional gradient descent algorithm (CFGD) under a normality assumption for model errors, also known as $L_2$-Boosting. The CFGD is one of the most versatile toolboxes for analyzing data, as it scales well to high-dimensional data sets, allows for a very flexible definition of additive regression models and incorporates inbuilt variable selection. Due to the iterative nature of the algorithm, which can repeatedly select the same component to update, an inference framework for component-wise boosting algorithms requires adaptations of existing approaches; we propose tests and confidence intervals for linear, grouped and penalized additive model components estimated using the $L_2$-boosting selection process. We apply our framework to the prostate cancer data set and investigate the properties of our concepts in simulation studies. The most general and promising selective inference framework for $L_2$-Boosting, as well as for more general gradient-descent boosting algorithms, is a sampling approach which constitutes an adoption of the recently proposed method by Yang et al. (2016).
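The selection process that the inference framework has to account for can be sketched as follows: a minimal component-wise $L_2$-boosting loop with simple linear base learners. It assumes centered feature columns and a fixed step length `nu`; the function name, the simulated data, and the defaults are hypothetical, and the sketch covers only the fitting step, not the proposed tests or confidence intervals.

```python
import numpy as np


def l2_boost(X, y, n_steps=500, nu=0.1):
    """Component-wise L2-boosting with one linear base learner per column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    intercept = y.mean()
    coef = np.zeros(X.shape[1])
    resid = y - intercept
    col_ss = (X ** 2).sum(axis=0)
    selected = []
    for _ in range(n_steps):
        # Univariate least squares fit of the residuals on each column.
        betas = X.T @ resid / col_ss
        # Pick the column whose fit reduces the residual sum of squares most.
        j = int(np.argmax(betas ** 2 * col_ss))
        coef[j] += nu * betas[j]
        resid = resid - nu * betas[j] * X[:, j]
        selected.append(j)
    return intercept, coef, selected


# Simulated, noise-free data in which only columns 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
X = X - X.mean(axis=0)  # center the columns, as assumed above
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3]

intercept, coef, selected = l2_boost(X, y)
print(intercept, np.round(coef, 2), selected[:5])
```

The `selected` list shows the same component being chosen repeatedly across iterations, which is the feature of the algorithm that complicates post-selection inference.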


Machine learning regression on hyperspectral data to estimate multiple water parameters

arXiv.org Machine Learning

In this paper, we present a regression framework involving several machine learning models to estimate water parameters based on hyperspectral data. Measurements from a multi-sensor field campaign, conducted on the River Elbe, Germany, represent the benchmark dataset. It contains hyperspectral data and the five water parameters chlorophyll a, green algae, diatoms, CDOM and turbidity. We apply a PCA for the high-dimensional data as a possible preprocessing step. Then, we evaluate the performance of the regression framework with and without this preprocessing step. The regression results of the framework clearly reveal the potential of estimating water parameters based on hyperspectral data with machine learning. The proposed framework provides the basis for further investigations, such as adapting the framework to estimate water parameters of different inland waters.
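The idea of using PCA as an optional preprocessing step before regression can be sketched with NumPy alone. The simulated "spectra", the choice of two components, the plain linear regressor, and all function names are assumptions for illustration; the paper evaluates several machine learning models on real measurements.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the column means and the leading principal directions of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X, mean, components):
    """Project X onto the principal directions."""
    return (X - mean) @ components.T


def fit_linear(Z, y):
    """Least squares fit with an intercept; returns [intercept, weights...]."""
    A = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(Z, coef):
    return coef[0] + Z @ coef[1:]


# Simulated data: 50 spectral bands driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.standard_normal((100, 2))
spectra = latent @ rng.standard_normal((2, 50))
target = 3.0 * latent[:, 0] - latent[:, 1] + 0.5  # stand-in water parameter

mean, components = pca_fit(spectra, n_components=2)
scores = pca_transform(spectra, mean, components)
coef = fit_linear(scores, target)
max_error = float(np.max(np.abs(predict_linear(scores, coef) - target)))
print(scores.shape, max_error)
```

Comparing the fit on `scores` with a fit on the raw `spectra` mirrors the with/without-preprocessing comparison described in the abstract.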