Regression
Causality-based Explanation of Classification Outcomes
Bertossi, Leopoldo, Li, Jordan, Schleich, Maximilian, Suciu, Dan, Vagena, Zografoula
Machine-learning (ML) models are increasingly used today in making decisions that affect real people's lives, and, because of that, there is a huge need to ensure that the models and their decisions are interpretable by their human users. Motivated by this need, there has been a lot of interest recently in the ML community in studying interpretable models [18]. There is currently no consensus on what interpretability means, and no benchmarks for evaluating interpretability [5, 10]. The only consensus is that simpler models such as linear regression or decision trees are considered more interpretable than complex models like, say, deep neural nets. However, two general principles for approaching interpretability have emerged in the literature that are relevant to our paper.
Data-driven surrogate modelling and benchmarking for process equipment
Gonçalves, Gabriel F. N., Batchvarov, Assen, Liu, Yuyi, Liu, Yuxin, Mason, Lachlan, Pan, Indranil, Matar, Omar K.
A suite of computational fluid dynamics (CFD) simulations geared towards chemical process equipment modelling has been developed and validated with experimental results from the literature. Various regression-based active learning strategies are explored with these CFD simulators in the loop under the constraint of a limited function evaluation budget. Specifically, five different sampling strategies and five regression techniques are compared, considering a set of three test cases of industrial significance and varying complexity. Gaussian process regression was observed to perform consistently well for these applications. The present quantitative study outlines the pros and cons of the different available techniques and highlights the best practices for their adoption. The test cases and tools are available with an open-source license, to ensure reproducibility and engage the wider research community in contributing to both the CFD models and developing and benchmarking new improved algorithms tailored to this field.
A Time Series Approach To Player Churn and Conversion in Videogames
del Río, Ana Fernández, Guitart, Anna, Periáñez, África
Players of a free-to-play game are divided into three main groups: non-paying active users, paying active users and inactive users. A State Space time series approach is then used to model the daily conversion rates between the different groups, i.e., the probability of transitioning from one group to another. This allows not only for predictions of how these rates will evolve, but also for a deeper understanding of the impact that in-game planning and calendar effects have. It is also used in this work for the detection of marketing and promotion campaigns about which no information is available. In particular, two different State Space formulations are considered and compared: an Autoregressive Integrated Moving Average process and an Unobserved Components approach, in both cases with a linear regression on explanatory variables. Both yield very close estimates for covariate parameters, producing forecasts with similar performance for most transition rates. While the Unobserved Components approach is more robust and needs less human intervention with regard to model definition, it produces significantly worse forecasts for non-paying user abandonment probability. More critically, it also fails to detect a plausible marketing and promotion campaign scenario.
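As a rough illustration of what a "daily conversion rate" between the three groups means, the sketch below computes observed transition fractions for a single day from player counts. It only produces the raw rates that a time series model would then be fitted to; it is not the State Space model itself. The group labels, the `daily_transition_rates` helper and the example counts are hypothetical and do not come from the paper.

```python
# Hypothetical group labels for the three player groups described above.
STATES = ("non_paying", "paying", "inactive")


def daily_transition_rates(counts):
    """Turn one day's transition counts into observed transition fractions.

    counts[a][b] is the number of players who were in group a yesterday
    and are in group b today. Each returned row sums to 1 (or is all
    zeros when group a was empty yesterday).
    """
    rates = {}
    for origin in STATES:
        row = counts.get(origin, {})
        total = sum(row.get(dest, 0) for dest in STATES)
        rates[origin] = {
            dest: (row.get(dest, 0) / total if total else 0.0)
            for dest in STATES
        }
    return rates


# Made-up counts for a single day, for illustration only.
example_counts = {
    "non_paying": {"non_paying": 900, "paying": 20, "inactive": 80},
    "paying": {"non_paying": 30, "paying": 160, "inactive": 10},
    "inactive": {"non_paying": 50, "paying": 0, "inactive": 950},
}

rates = daily_transition_rates(example_counts)
print(rates["non_paying"]["paying"])  # observed conversion fraction for the day
```

Repeating this for every day yields one series per pair of groups, which is the kind of input a time series model of the rates would work with.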
Experimental Comparison of Semi-parametric, Parametric, and Machine Learning Models for Time-to-Event Analysis Through the Concordance Index
Fernandez, Camila, Chen, Chung Shue, Gaillard, Pierre, Silva, Alonso
In this paper, we make an experimental comparison of semi-parametric (Cox proportional hazards model, Aalen's additive regression model), parametric (Weibull AFT model), and machine learning models (Random Survival Forest, Gradient Boosting with Cox Proportional Hazards Loss, DeepSurv) through the concordance index on two different datasets (PBC and GBCSG2). We present two comparisons: one with the default hyper-parameters of these models and one with the best hyper-parameters found by randomized search.
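For readers unfamiliar with the concordance index, the following is a minimal sketch of Harrell's version of the statistic in plain Python. It is a simplified illustration, not the implementation used in the paper: pairs with tied event times are skipped, and the function name and example data are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index.

    times:       observed time for each subject (event or censoring time)
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: model output where a higher score means higher risk,
                 i.e. a shorter expected time to event

    A pair is comparable when the subject with the shorter time had an
    observed event. Ties in risk score count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied times are ignored in this simplified sketch
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored: order is unknown
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: four subjects, the third one is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.1]))  # perfectly ordered risks
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in reverse.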
A brief introduction to Logistic Regression techsocialnetwork
In our previous chapters, we mainly discussed the Linear Regression model, where the target variable to be predicted is continuous in nature and there is a linear relationship between the independent and target variables. But how do we predict a discrete variable based upon predictors that are linearly related to the target? In this case, Logistic Regression comes to the rescue. In this article, we will mainly focus on this predictive model and get to know its inner workings. So, what is Logistic Regression?
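To make the idea concrete, here is a minimal sketch of logistic regression with a single predictor, fitted by plain gradient descent on the log loss. It is only an illustration under simple assumptions: the function names, the learning rate, the number of epochs and the toy data are all made up for this example, and a real project would normally use an established library instead.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the average log loss with respect to w and b.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(err * x for err, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict_proba(x, w, b):
    """Predicted probability that the discrete target equals 1."""
    return sigmoid(w * x + b)


# Toy data: small x values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print(predict_proba(1.0, w, b), predict_proba(4.0, w, b))
```

The linear part `w * x + b` is the same as in linear regression; the sigmoid turns it into a probability, and a threshold such as 0.5 turns that probability into a discrete class.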
Top 5 Data Science Algorithms that you must know!
Today, we use many different data science algorithms to get tasks done. There are so many algorithms out there that it can be quite overwhelming for beginners. Here, we will quickly present the top 5 mainstream Machine Learning algorithms so you can get comfortable with the exciting world of Data Science! Linear Regression is probably the most famous ML algorithm. It finds the line that best fits a set of scattered data points on a graph.
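As a small illustration of "finding the line that best fits the points", the sketch below uses ordinary least squares for a single predictor, which is one common definition of the best-fitting line. The `fit_line` helper and the sample points are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Sample points that lie exactly on the line y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With noisy data the points will not fall exactly on the line, and the fit minimizes the sum of squared vertical distances between the points and the line.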
Multivariate Functional Regression via Nested Reduced-Rank Regularization
Liu, Xiaokang, Ma, Shujie, Chen, Kun
We propose a nested reduced-rank regression (NRRR) approach for fitting regression models with multivariate functional responses and predictors, to achieve tailored dimension reduction and facilitate interpretation/visualization of the resulting functional model. Our approach is based on a two-level low-rank structure imposed on the functional regression surfaces. A global low-rank structure identifies a small set of latent principal functional responses and predictors that drive the underlying regression association. A local low-rank structure then controls the complexity and smoothness of the association between the principal functional responses and predictors. Through a basis expansion approach, the functional problem boils down to an interesting integrated matrix approximation task, where the blocks or submatrices of an integrated low-rank matrix share some common row space and/or column space. An iterative algorithm with a convergence guarantee is developed. We establish the consistency of NRRR and also show through non-asymptotic analysis that it can achieve at least a comparable error rate to that of reduced-rank regression. Simulation studies demonstrate the effectiveness of NRRR. We apply NRRR to an electricity demand problem, to relate the trajectories of the daily electricity consumption to those of the daily temperatures.
Short-Term Forecasting of CO2 Emission Intensity in Power Grids by Machine Learning
Leerbeck, Kenneth, Bacher, Peder, Junker, Rune, Goranović, Goran, Corradi, Olivier, Ebrahimy, Razgar, Tveit, Anna, Madsen, Henrik
A machine learning algorithm is developed to forecast the CO2 emission intensities in electrical power grids in the Danish bidding zone DK2, distinguishing between average and marginal emissions. The analysis was done on a data set comprising a large number (473) of explanatory variables, such as power production, demand, import and weather conditions, collected from selected neighboring zones. The number was reduced to fewer than 50 using both LASSO (a penalized linear regression analysis) and a forward feature selection algorithm. Three linear regression models that capture different aspects of the data (non-linearities, coupling of variables, etc.) were created and combined into a final model using a Softmax weighted average. Cross-validation is performed for debiasing, and an autoregressive integrated moving average (ARIMA) model is implemented to correct the residuals, making the final model the variant with exogenous inputs (ARIMAX). The forecasts with the corresponding uncertainties are given for two time horizons, below and above six hours. Marginal emissions turned out to be independent of any conditions in the DK2 zone, suggesting that the marginal generators are located in the neighbouring zones. The developed methodology can be applied to any bidding zone in the European electricity network without requiring detailed knowledge about the zone.
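The abstract does not say how the Softmax weights are obtained, so the following is only a sketch of one plausible reading: each model's weight is the softmax of its negative validation error, so that models with lower error contribute more to the combined forecast. The function names, the `temperature` parameter and all numbers are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax: positive weights that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_forecasts(forecasts, validation_errors, temperature=1.0):
    """Softmax weighted average of forecasts from several models.

    Assumption: weights come from the softmax of the negative
    validation errors, so a lower error gives a larger weight.
    """
    weights = softmax([-err / temperature for err in validation_errors])
    combined = sum(w * f for w, f in zip(weights, forecasts))
    return combined, weights


# Made-up forecasts from three models and their made-up validation errors.
forecasts = [100.0, 110.0, 120.0]
validation_errors = [1.0, 2.0, 3.0]

combined, weights = combine_forecasts(forecasts, validation_errors)
print(combined, weights)
```

A smaller `temperature` concentrates the weight on the best model, while a larger one moves the result towards a plain average of the three forecasts.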
Auditing ML Models for Individual Bias and Unfairness
Xue, Songkai, Yurochkin, Mikhail, Sun, Yuekai
We consider the task of auditing ML models for individual bias/unfairness. We formalize the task in an optimization problem and develop a suite of inferential tools for the optimal value. Our tools permit us to obtain asymptotic confidence intervals and hypothesis tests that cover the target/control the Type I error rate exactly. To demonstrate the utility of our tools, we use them to reveal the gender and racial biases in Northpointe's COMPAS recidivism prediction instrument.