Goto

Collaborating Authors

 Regression


Conditional Selective Inference for Robust Regression and Outlier Detection using Piecewise-Linear Homotopy Continuation

arXiv.org Machine Learning

In practical data analysis under noisy environment, it is common to first use robust methods to identify outliers, and then to conduct further analysis after removing the outliers. In this paper, we consider statistical inference of the model estimated after outliers are removed, which can be interpreted as a selective inference (SI) problem. To use conditional SI framework, it is necessary to characterize the events of how the robust method identifies outliers. Unfortunately, the existing methods cannot be directly used here because they are applicable to the case where the selection events can be represented by linear/quadratic constraints. In this paper, we propose a conditional SI method for popular robust regressions by using homotopy method. We show that the proposed conditional SI method is applicable to a wide class of robust regression and outlier detection methods and has good empirical performance on both synthetic data and real data experiments.


Machine Learning Regression Masterclass in Python

#artificialintelligence

Udemy Coupon - Machine Learning Regression Masterclass in Python, Build 8 Practical Projects and Master Machine Learning Regression Techniques Using Python, Scikit Learn and Keras Created by Dr. Ryan Ahmed, Ph.D., MBA, Kirill Eremenko, Hadelin de Ponteves, SuperDataScience Team, Mitchell Bouchard English [Auto-generated] Students also bought Deep Learning Prerequisites: Linear Regression in Python Learn Regression Analysis for Business Regression Analysis / Data Analytics in Regression Regression Analysis for Statistics & Machine Learning in R Machine Learning for Beginners: Linear Regression model in R Preview this Course GET COUPON CODE Description Artificial Intelligence (AI) revolution is here! The technology is progressing at a massive scale and is being widely adopted in the Healthcare, defense, banking, gaming, transportation and robotics industries. Machine Learning is a subfield of Artificial Intelligence that enables machines to improve at a given task with experience. Machine Learning is an extremely hot topic; the demand for experienced machine learning engineers and data scientists has been steadily growing in the past 5 years. According to a report released by Research and Markets, the global AI and machine learning technology sectors are expected to grow from $1.4B to $8.8B by 2022 and it is predicted that AI tech sector will create around 2.3 million jobs by 2020.


Bayesian subset selection and variable importance for interpretable prediction and classification

arXiv.org Machine Learning

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to selection instability, computational bottlenecks, and lack of post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $\mathcal{M}$, we elicit predictively-competitive subsets using linear decision analysis. The approach is customizable for (local) prediction or classification and provides interpretable summaries of $\mathcal{M}$. A key quantity is the acceptable family of subsets, which leverages the predictive distribution from $\mathcal{M}$ to identify subsets that offer nearly-optimal prediction. The acceptable family spawns new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. Crucially, the linear coefficients for any subset inherit regularization and predictive uncertainty quantification via $\mathcal{M}$. The proposed approach exhibits excellent prediction, interval estimation, and variable selection for simulated data, including $p=400 > n$. These tools are applied to a large education dataset with highly correlated covariates, where the acceptable family is especially useful. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and features highly competitive prediction with remarkable stability.


Modeling Classroom Occupancy using Data of WiFi Infrastructure in a University Campus

arXiv.org Artificial Intelligence

Universities worldwide are experiencing a surge in enrollments, therefore campus estate managers are seeking continuous data on attendance patterns to optimize the usage of classroom space. As a result, there is an increasing trend to measure classrooms attendance by employing various sensing technologies, among which pervasive WiFi infrastructure is seen as a low cost method. In a dense campus environment, the number of connected WiFi users does not well estimate room occupancy since connection counts are polluted by adjoining rooms, outdoor walkways, and network load balancing. In this paper, we develop machine learning based models to infer classroom occupancy from WiFi sensing infrastructure. Our contributions are three-fold: (1) We analyze metadata from a dense and dynamic wireless network comprising of thousands of access points (APs) to draw insights into coverage of APs, behavior of WiFi connected users, and challenges of estimating room occupancy; (2) We propose a method to automatically map APs to classrooms using unsupervised clustering algorithms; and (3) We model classroom occupancy using a combination of classification and regression methods of varying algorithms. We achieve 84.6% accuracy in mapping APs to classrooms while the accuracy of our estimation for room occupancy is comparable to beam counter sensors with a symmetric Mean Absolute Percentage Error (sMAPE) of 13.10%.


Relationship between brain injury criteria and brain strain across different types of head impacts can be different

arXiv.org Artificial Intelligence

Multiple brain injury criteria (BIC) are developed to quickly quantify brain injury risks after head impacts. These BIC originated from different types of head impacts (e.g., sports and car crashes) are widely used in risk evaluation. However, the accuracy of using the BIC on brain injury risk estimation across different types of head impacts has not been evaluated. Physiologically, brain strain is often considered the key parameter of brain injury. To evaluate the BIC's risk estimation accuracy across five datasets comprising different head impact types, linear regression was used to model 95% maximum principal strain, 95% maximum principal strain at the corpus callosum, and cumulative strain damage (15%) on each of 18 BIC respectively. The results show a significant difference in the relationship between BIC and brain strain across datasets, indicating the same BIC value may suggest different brain strain in different head impact types. The accuracy of brain strain regression is generally decreasing if the BIC regression models are fit on a dataset with a different type of head impact rather than on the dataset with the same type. Given this finding, this study raises concerns for applying BIC to estimate the brain injury risks for head impacts different from the head impacts on which the BIC was developed.


Off-Policy Risk Assessment in Contextual Bandits

arXiv.org Machine Learning

To evaluate prospective contextual bandit policies when experimentation is not possible, practitioners often rely on off-policy evaluation, using data collected under a behavioral policy. While off-policy evaluation studies typically focus on the expected return, practitioners often care about other functionals of the reward distribution (e.g., to express aversion to risk). In this paper, we first introduce the class of Lipschitz risk functionals, which subsumes many common functionals, including variance, mean-variance, and conditional value-at-risk (CVaR). For Lipschitz risk functionals, the error in off-policy risk estimation is bounded by the error in off-policy estimation of the cumulative distribution function (CDF) of rewards. Second, we propose Off-Policy Risk Assessment (OPRA), an algorithm that (i) estimates the target policy's CDF of rewards; and (ii) generates a plug-in estimate of the risk. Given a collection of Lipschitz risk functionals, OPRA provides estimates for each with corresponding error bounds that hold simultaneously. We analyze both importance sampling and variance-reduced doubly robust estimators of the CDF. Our primary theoretical contributions are (i) the first concentration inequalities for both types of CDF estimators and (ii) guarantees on our Lipschitz risk functional estimates, which converge at a rate of O(1/\sqrt{n}). For practitioners, OPRA offers a practical solution for providing high-confidence assessments of policies using a collection of relevant metrics.


SurvNAM: The machine learning survival model explanation

arXiv.org Machine Learning

A new modification of the Neural Additive Model (NAM) called SurvNAM and its modifications are proposed to explain predictions of the black-box machine learning survival model. The method is based on applying the original NAM to solving the explanation problem in the framework of survival analysis. The basic idea behind SurvNAM is to train the network by means of a specific expected loss function which takes into account peculiarities of the survival model predictions and is based on approximating the black-box model by the extension of the Cox proportional hazards model which uses the well-known Generalized Additive Model (GAM) in place of the simple linear relationship of covariates. The proposed method SurvNAM allows performing the local and global explanation. A set of examples around the explained example is randomly generated for the local explanation. The global explanation uses the whole training dataset. The proposed modifications of SurvNAM are based on using the Lasso-based regularization for functions from GAM and for a special representation of the GAM functions using their weighted linear and non-linear parts, which is implemented as a shortcut connection. A lot of numerical experiments illustrate the SurvNAM efficiency.


Why Machine Learning Integrated Patient Flow Simulation?

arXiv.org Artificial Intelligence

Patient flow analysis can be studied from a clinical and or operational perspective using simulation. Traditional statistical methods such as stochastic distribution methods have been used to construct patient flow simulation submodels such as patient inflow, Length of Stay (LoS), Cost of Treatment (CoT) and Clinical Pathway (CP) models. However, patient inflow demonstrates seasonality, trend and variation over time. LoS, CoT and CP are significantly determined by attributes of patients and clinical and laboratory test results. For this reason, patient flow simulation models constructed using traditional statistical methods are criticized for ignoring heterogeneity and their contribution to personalized and value based healthcare. On the other hand, machine learning methods have proven to be efficient to study and predict admission rate, LoS, CoT, and CP. This paper, hence, describes why coupling machine learning with patient flow simulation is important and proposes a conceptual architecture that shows how to integrate machine learning with patient flow simulation.


Probabilistic water demand forecasting using quantile regression algorithms

arXiv.org Machine Learning

Machine and statistical learning algorithms can be reliably automated and applied at scale. Therefore, they can constitute a considerable asset for designing practical forecasting systems, such as those related to urban water demand. Quantile regression algorithms are statistical and machine learning algorithms that can provide probabilistic forecasts in a straightforward way, and have not been applied so far for urban water demand forecasting. In this work, we aim to fill this gap by automating and extensively comparing several quantile-regression-based practical systems for probabilistic one-day ahead urban water demand forecasting. For designing the practical systems, we use five individual algorithms (i.e., the quantile regression, linear boosting, generalized random forest, gradient boosting machine and quantile regression neural network algorithms), their mean combiner and their median combiner. The comparison is conducted by exploiting a large urban water flow dataset, as well as several types of hydrometeorological time series (which are considered as exogenous predictor variables in the forecasting setting). The results mostly favour the practical systems designed using the linear boosting algorithm, probably due to the presence of trends in the urban water flow time series. The forecasts of the mean and median combiners are also found to be skilful in general terms.


Towards Handling Uncertainty-at-Source in AI -- A Review and Next Steps for Interval Regression

arXiv.org Artificial Intelligence

Most of statistics and AI draw insights through modelling discord or variance between sources of information (i.e., inter-source uncertainty). Increasingly, however, research is focusing upon uncertainty arising at the level of individual measurements (i.e., within- or intra-source), such as for a given sensor output or human response. Here, adopting intervals rather than numbers as the fundamental data-type provides an efficient, powerful, yet challenging way forward -- offering systematic capture of uncertainty-at-source, increasing informational capacity, and ultimately potential for insight. Following recent progress in the capture of interval-valued data, including from human participants, conducting machine learning directly upon intervals is a crucial next step. This paper focuses on linear regression for interval-valued data as a recent growth area, providing an essential foundation for broader use of intervals in AI. We conduct an in-depth analysis of state-of-the-art methods, elucidating their behaviour, advantages, and pitfalls when applied to datasets with different properties. Specific emphasis is given to the challenge of preserving mathematical coherence -- i.e., ensuring that models maintain fundamental mathematical properties of intervals throughout -- and the paper puts forward extensions to an existing approach to guarantee this. Carefully designed experiments, using both synthetic and real-world data, are conducted -- with findings presented alongside novel visualizations for interval-valued regression outputs, designed to maximise model interpretability. Finally, the paper makes recommendations concerning method suitability for data sets with specific properties and highlights remaining challenges and important next steps for developing AI with the capacity to handle uncertainty-at-source.