 Regression


A Binary Regression Adaptive Goodness-of-fit Test (BAGofT)

arXiv.org Machine Learning

Pearson's $\chi^2$ test and the residual deviance test are two classical goodness-of-fit tests for binary regression models such as logistic regression. These two tests cannot be applied when the data contain one or more continuous covariates, a common situation in practice. In that case, the most widely used approach is the Hosmer-Lemeshow test, which partitions the covariate space into groups according to quantiles of the fitted probabilities from all the observations. However, its grouping scheme is not flexible enough to explore how to adversarially partition the data space in order to enhance the power. In this work, we propose a new methodology, named binary regression adaptive grouping goodness-of-fit test (BAGofT), to address this concern. It is a two-stage solution: the first stage adaptively selects candidate partitions using "training" data, and the second stage performs $\chi^2$ tests with necessary corrections based on "test" data. Proper data splitting ensures that the test has desirable size and power properties. In our experiments, BAGofT performs much better than the Hosmer-Lemeshow test in many situations.
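The quantile grouping that the abstract attributes to the Hosmer-Lemeshow test can be sketched as follows. This is a minimal illustration of the classical statistic, not of BAGofT itself; the function name `hosmer_lemeshow_statistic` and the default of 10 groups are assumptions made for the example.

```python
def hosmer_lemeshow_statistic(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic for binary outcomes y (0/1)
    and fitted probabilities p. Returns (statistic, degrees_of_freedom)."""
    if len(y) != len(p):
        raise ValueError("y and p must have the same length")
    # Order observations by fitted probability, then cut into equal-sized groups.
    order = sorted(range(len(p)), key=lambda i: p[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue  # can happen when there are fewer observations than groups
        n_g = len(idx)
        observed = sum(y[i] for i in idx)   # observed number of events
        expected = sum(p[i] for i in idx)   # expected number of events
        mean_p = expected / n_g
        denom = n_g * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    # The conventional reference distribution is chi-square with groups - 2 df.
    return stat, groups - 2
```

The groups here depend only on the fitted probabilities, which is the rigidity the abstract says BAGofT's adaptive partitioning is meant to overcome.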


Impact of Narrow Lanes on Arterial Road Vehicle Crashes: A Machine Learning Approach

arXiv.org Machine Learning

In this paper, we adopted two state-of-the-art machine learning algorithms, random forest (RF) and least squares boosting, to model crash data and identify the best model for studying the impact of narrow lanes on the safety of arterial roads. Using a ten-year crash dataset from four cities in Nebraska, the two machine learning models were assessed based on prediction error. The RF model was identified as the better model. It was used to compute the importance of the lane width predictors in our regression model based on two different measures. Subsequently, the RF model was used to simulate the crash rate for different lane widths. The Kruskal-Wallis test was then conducted to determine whether simulated values from the four lane width groups have equal means. The null hypothesis of equal means was rejected. Consequently, it was concluded that the crash rate of at least one lane width group was statistically different from the others. Finally, pairwise comparisons using the Tukey-Kramer test showed that the changes in crash rates between any two lane width conditions were statistically significant.
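For readers unfamiliar with the Kruskal-Wallis test mentioned above, its rank-based H statistic can be sketched as below. This is a minimal sketch that assumes every group is non-empty and omits the usual tie correction; the function name `kruskal_wallis_h` is hypothetical and the code is not taken from the paper.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples (lists of numbers).
    Tied values receive their average rank; no tie correction is applied."""
    # Pool all values, remembering which group each one came from.
    pooled = sorted((value, g) for g, sample in enumerate(groups) for value in sample)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values; their ranks are i+1..j.
        mid_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[pooled[k][1]] += mid_rank
        i = j
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)
```

Under the null hypothesis, H is compared against a chi-square distribution with one fewer degrees of freedom than the number of groups.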


White-Box Target Attack for EEG-Based BCI Regression Problems

arXiv.org Artificial Intelligence

Machine learning has achieved great success in many applications, including electroencephalogram (EEG) based brain-computer interfaces (BCIs). Unfortunately, many machine learning models are vulnerable to adversarial examples, which are crafted by adding deliberately designed perturbations to the original inputs. Many adversarial attack approaches for classification problems have been proposed, but few have considered target adversarial attacks for regression problems. This paper proposes two such approaches. More specifically, we consider white-box target attacks for regression problems, where we know all information about the regression model to be attacked and want to design small perturbations that change the regression output by a pre-determined amount. Experiments on two BCI regression problems verified that both approaches are effective. Moreover, adversarial examples generated by both approaches are transferable, which means that we can use adversarial examples generated from one known regression model to attack an unknown regression model, i.e., to perform black-box attacks. To our knowledge, this is the first study on adversarial attacks for EEG-based BCI regression problems, and it calls for more attention to the security of BCI systems.
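The idea of a white-box target attack on a regressor can be illustrated on the simplest possible case. The sketch below assumes a linear model with known weights `w` and bias `b`, for which the smallest L2 perturbation that shifts the output by a chosen amount has a closed form. It is not either of the two approaches the paper proposes, and all names are hypothetical.

```python
def predict(w, b, x):
    """Linear regression output w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def target_perturbation(w, shift):
    """Smallest-L2-norm perturbation that changes a linear model's output
    by exactly `shift`: delta = shift * w / ||w||^2 (white-box: w is known)."""
    norm_sq = sum(wi * wi for wi in w)
    if norm_sq == 0:
        raise ValueError("a zero weight vector cannot be attacked this way")
    return [shift * wi / norm_sq for wi in w]

# Example: push the prediction up by 5 units.
w, b = [3.0, 4.0], 1.0
x = [1.0, 2.0]
delta = target_perturbation(w, 5.0)
x_adv = [xi + di for xi, di in zip(x, delta)]
```

For nonlinear regressors no such closed form exists, which is why attacks of this kind typically rely on gradient information from the model.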


Privacy Preserving Gaze Estimation using Synthetic Images via a Randomized Encoding Based Framework

arXiv.org Machine Learning

Eye tracking is regarded as one of the key technologies for applications that assess and evaluate human attention, behavior and biometrics, especially using gaze, pupillary and blink behaviors. One of the main challenges to the social acceptance of eye-tracking technology, however, is the preservation of sensitive and personal information. To tackle this challenge, we employed a privacy-preserving framework based on randomized encoding to privately train a Support Vector Regression model on synthetic eye images to estimate human gaze. During the computation, none of the parties learns about the data or the result that any other party has. Furthermore, the party that trains the model cannot reconstruct pupil, blink or visual scanpath. The experimental results showed that our privacy-preserving framework can work in real time, is as accurate as a non-private version of it, and could be extended to other eye-tracking related problems.


The gradient complexity of linear regression

arXiv.org Machine Learning

We investigate the computational complexity of several basic linear algebra primitives, including largest eigenvector computation and linear regression, in the computational model that allows access to the data via a matrix-vector product oracle. We show that for polynomial accuracy, $\Theta(d)$ calls to the oracle are necessary and sufficient even for a randomized algorithm. Our lower bound is based on a reduction to estimating the least eigenvalue of a random Wishart matrix. This simple distribution enables a concise proof, leveraging a few key properties of the random Wishart ensemble.
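The access model in this abstract, where an algorithm sees the matrix only through matrix-vector products, can be made concrete with a small sketch. The code below uses plain power iteration, a standard method chosen here for illustration rather than the algorithm or bound analyzed in the paper; `matvec` stands for the oracle, and all names are assumptions.

```python
import math
import random

def power_iteration(matvec, dim, num_calls, seed=0):
    """Estimate the top eigenpair of a symmetric PSD matrix that is only
    accessible through `matvec(v) -> A v`. Uses exactly `num_calls` oracle calls
    (fewer only if the iterate is mapped to the zero vector)."""
    if num_calls < 1:
        raise ValueError("at least one oracle call is required")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    eigenvalue = 0.0
    for _ in range(num_calls):
        w = matvec(v)                                       # one oracle call
        eigenvalue = sum(a * b for a, b in zip(v, w))       # Rayleigh quotient v^T A v
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return eigenvalue, v

# Example oracle for the matrix diag(2, 1); the matrix itself is never formed.
estimate, vector = power_iteration(lambda v: [2.0 * v[0], 1.0 * v[1]], dim=2, num_calls=50)
```

Counting oracle calls rather than arithmetic operations is what lets the paper state its $\Theta(d)$ bound independently of how the matrix is stored.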


Bias-aware model selection for machine learning of doubly robust functionals

arXiv.org Machine Learning

While model selection is a well-studied topic in parametric and nonparametric regression or density estimation, model selection of possibly high dimensional nuisance parameters in semiparametric problems is far less developed. In this paper, we propose a new model selection framework for making inferences about a finite dimensional functional defined on a semiparametric model, when the latter admits a doubly robust estimating function. The class of such doubly robust functionals is quite large, including many missing data and causal inference problems. Under double robustness, the estimated functional should incur no bias if either of two nuisance parameters is evaluated at the truth while the other spans a large collection of candidate models. We introduce two model selection criteria for bias reduction of the functional of interest, each based on a novel definition of pseudo-risk for the functional that embodies this double robustness property and thus may be used to select the candidate model that is nearest to fulfilling this property even when all models are wrong. Both selection criteria have a bias awareness property: selection of one nuisance parameter can be made to compensate for excessive bias due to poor learning of the other nuisance parameter. We establish an oracle property for a multi-fold cross-validation version of the new model selection criteria, which states that our empirical criteria perform nearly as well as an oracle with a priori knowledge of the pseudo-risk for each candidate model. We also describe a smooth approximation to the selection criteria which allows for valid post-selection inference. Finally, we perform model selection of a semiparametric estimator of average treatment effect given an ensemble of candidate machine learning methods to account for confounding in a study of right heart catheterization of critically ill patients in the ICU.


Logistic Regression

#artificialintelligence

In the previous article, we studied linear regression. I believe that we are more likely to understand a concept if we can relate it to ourselves or our lives, so I will try to explain everything by relating it to humans.


Linear regression in Python: Using numpy, scipy, and statsmodels

#artificialintelligence

The original article is no longer available. Similar (and more comprehensive) material is available below. You can access this material here.


Variable Grouping Based Bayesian Additive Regression Tree

arXiv.org Machine Learning

Ensemble methods for regression have been very successful in obtaining high-accuracy predictions. Examples are bagging, random forest, boosting, BART (Bayesian additive regression tree), and their variants. In this paper, we propose a new perspective named variable grouping to enhance the predictive performance. The main idea is to seek a grouping of variables such that there is no nonlinear interaction term between variables of different groups. Given a sum-of-learner model, each learner is responsible for only one group of variables, which is more efficient in modeling nonlinear interactions. We propose a two-stage method named variable grouping based Bayesian additive regression tree (GBART), with a well-developed Python package gbart available. The first stage searches for potential interactions and an appropriate grouping of variables. The second stage builds a final model based on the discovered groups. Experiments on synthetic and real data show that the proposed method can perform significantly better than classical approaches.