Regression
Ex-Ante Assessment of Discrimination in Dataset
Vasquez, Jonathan, Gitiaux, Xavier, Rangwala, Huzefa
Data owners face increasing liability for how the use of their data could harm under-priviliged communities. Stakeholders would like to identify the characteristics of data that lead to algorithms being biased against any particular demographic groups, for example, defined by their race, gender, age, and/or religion. Specifically, we are interested in identifying subsets of the feature space where the ground truth response function from features to observed outcomes differs across demographic groups. To this end, we propose FORESEE, a FORESt of decision trEEs algorithm, which generates a score that captures how likely an individual's response varies with sensitive attributes. Empirically, we find that our approach allows us to identify the individuals who are most likely to be misclassified by several classifiers, including Random Forest, Logistic Regression, Support Vector Machine, and k-Nearest Neighbors. The advantage of our approach is that it allows stakeholders to characterize risky samples that may contribute to discrimination, as well as, use the FORESEE to estimate the risk of upcoming samples.
Intuition of Multivariate Adaptive Regression Splines (MARS)
Multivariate Adaptive Regression Splines or commonly known as MARS is an algorithm best suited for high dimensional and complex non-linear relationship dataset. It can be seen as a generalised form of Stepwise Regression (Stepwise regression does a forward selection first where it starts loading the model and then pruning or backward selection to remove the variables that don't help reduce the error rate significantly). This function is similar to rectified linear function of neural network where the starting point is the input itself or 0. MARS uses this function to create knots and this is known as Spline. These functions are always generated in pair of Left function and Right function. MARS then generates multiple such functions that we call basis function for all the input variables and then runs linear regression on each basis function's output.
Semi-automatic tuning of coupled climate models with multiple intrinsic timescales: lessons learned from the Lorenz96 model
Lguensat, Redouane, Deshayes, Julie, Durand, Homer, Balaji, V.
The objective of this study is to evaluate the potential for History Matching (HM) to tune a climate system with multi-scale dynamics. By considering a toy climate model, namely, the two-scale Lorenz96 model and producing experiments in perfect-model setting, we explore in detail how several built-in choices need to be carefully tested. We also demonstrate the importance of introducing physical expertise in the range of parameters, a priori to running HM. Finally we revisit a classical procedure in climate model tuning, that consists of tuning the slow and fast components separately. By doing so in the Lorenz96 model, we illustrate the non-uniqueness of plausible parameters and highlight the specificity of metrics emerging from the coupling. This paper contributes also to bridging the communities of uncertainty quantification, machine learning and climate modeling, by making connections between the terms used by each community for the same concept and presenting promising collaboration avenues that would benefit climate modeling research.
Transformer Networks for Predictive Group Elevator Control
Zhang, Jing, Tsiligkaridis, Athanasios, Taguchi, Hiroshi, Raghunathan, Arvind, Nikovski, Daniel
We propose a Predictive Group Elevator Scheduler by using predictive information of passengers arrivals from a Transformer based destination predictor and a linear regression model that predicts remaining time to destinations. Through extensive empirical evaluation, we find that the savings of Average Waiting Time (AWT) could be as high as above 50% for light arrival streams and around 15% for medium arrival streams in afternoon down-peak traffic regimes. Such results can be obtained after carefully setting the Predicted Probability of Going to Elevator (PPGE) threshold, thus avoiding a majority of false predictions for people heading to the elevator, while achieving as high as 80% of true predictive elevator landings as early as after having seen only 60% of the whole trajectory of a passenger.
Exporting NIR regression models built in Python
Hi everyone, and thanks for tuning in to our new post on exporting NIR regression models built in Python. One of the reader of this blog asked me this question: "How can we export a model that we just build, so that we can use it over and over again without having to fit the training data every time?" I must admit, I didn't have the answer straight away, it's a very good question. Once the training part is completed, it would be good to export the model to file, store it and retrieve it at a later time. If you'd like to get started with building your calibration models in Python, take a look at some of our previous posts.
Feasibility Layer Aided Machine Learning Approach for Day-Ahead Operations
Ramesh, Arun Venkatesh, Li, Xingpeng
Day-ahead operations involves a complex and computationally intensive optimization process to determine the generator commitment schedule and dispatch. The optimization process is a mixed-integer linear program (MILP) also known as security-constrained unit commitment (SCUC). Independent system operators (ISOs) run SCUC daily and require state-of-the-art algorithms to speed up the process. Existing patterns in historical information can be leveraged for model reduction of SCUC, which can provide significant time savings. In this paper, machine learning (ML) based classification approaches, namely logistic regression, neural networks, random forest and K-nearest neighbor, were studied for model reduction of SCUC. The ML was then aided with a feasibility layer (FL) and post-process technique to ensure high-quality solutions. The proposed approach is validated on several test systems namely, IEEE 24-Bus system, IEEE-73 Bus system, IEEE 118-Bus system, 500-Bus system, and Polish 2383-Bus system. Moreover, model reduction of a stochastic SCUC (SSCUC) was demonstrated utilizing a modified IEEE 24-Bus system with renewable generation. Simulation results demonstrate a high training accuracy to identify commitment schedule while FL and post-process ensure ML predictions do not lead to infeasible solutions with minimal loss in solution quality.
Towards Coupling Full-disk and Active Region-based Flare Prediction for Operational Space Weather Forecasting
Pandey, Chetraj, Ji, Anli, Angryk, Rafal A., Georgoulis, Manolis K., Aydin, Berkay
Solar flare prediction is a central problem in space weather forecasting and has captivated the attention of a wide spectrum of researchers due to recent advances in both remote sensing as well as machine learning and deep learning approaches. The experimental findings based on both machine and deep learning models reveal significant performance improvements for task specific datasets. Along with building models, the practice of deploying such models to production environments under operational settings is a more complex and often time-consuming process which is often not addressed directly in research settings. We present a set of new heuristic approaches to train and deploy an operational solar flare prediction system for $\geq$M1.0-class flares with two prediction modes: full-disk and active region-based. In full-disk mode, predictions are performed on full-disk line-of-sight magnetograms using deep learning models whereas in active region-based models, predictions are issued for each active region individually using multivariate time series data instances. The outputs from individual active region forecasts and full-disk predictors are combined to a final full-disk prediction result with a meta-model. We utilized an equal weighted average ensemble of two base learners' flare probabilities as our baseline meta learner and improved the capabilities of our two base learners by training a logistic regression model. The major findings of this study are: (i) We successfully coupled two heterogeneous flare prediction models trained with different datasets and model architecture to predict a full-disk flare probability for next 24 hours, (ii) Our proposed ensembling model, i.e., logistic regression, improves on the predictive performance of two base learners and the baseline meta learner measured in terms of two widely used metrics True Skill Statistic (TSS) and Heidke Skill core (HSS), and (iii) Our result analysis suggests that the logistic regression-based ensemble (Meta-FP) improves on the full-disk model (base learner) by $\sim9\%$ in terms TSS and $\sim10\%$ in terms of HSS. Similarly, it improves on the AR-based model (base learner) by $\sim17\%$ and $\sim20\%$ in terms of TSS and HSS respectively. Finally, when compared to the baseline meta model, it improves on TSS by $\sim10\%$ and HSS by $\sim15\%$.
Regressing Relative Fine-Grained Change for Sub-Groups in Unreliable Heterogeneous Data Through Deep Multi-Task Metric Learning
Mahony, Niall O', Campbell, Sean, Krpalkova, Lenka, Walsh, Joseph, Riordan, Daniel
Fine-Grained Change Detection and Regression Analysis are essential in many applications of ArtificialIntelligence. In practice, this task is often challenging owing to the lack of reliable ground truth information andcomplexity arising from interactions between the many underlying factors affecting a system. Therefore,developing a framework which can represent the relatedness and reliability of multiple sources of informationbecomes critical. In this paper, we investigate how techniques in multi-task metric learning can be applied for theregression of fine-grained change in real data.The key idea is that if we incorporate the incremental change in a metric of interest between specific instancesof an individual object as one of the tasks in a multi-task metric learning framework, then interpreting thatdimension will allow the user to be alerted to fine-grained change invariant to what the overall metric isgeneralised to be. The techniques investigated are specifically tailored for handling heterogeneous data sources,i.e. the input data for each of the tasks might contain missing values, the scale and resolution of the values is notconsistent across tasks and the data contains non-independent and identically distributed (non-IID) instances. Wepresent the results of our initial experimental implementations of this idea and discuss related research in thisdomain which may offer direction for further research.
Linear Regression in Python
This tutorial will focus on two main broad topics that are Simple Linear Regression and Multiple Linear regression. Throughout the tutorial, key points are illustrated with clear, step-by-step examples for better understanding. By the end of the tutorial, you will be able to compute all of the essential outputs for simple linear regression and multiple regression. Most important, you will be able to correctly interpret the outputs you produce. Linear regression is a common Statistical Data Analysis technique that is widely being used.
#002 Machine Learning - Linear Regression Models - Master Data Science 18.07.2022
Highlights: Welcome back to the all-new series on Machine Learning. In the previous post, we gave you a sneak peak into the basics of Machine Learning, the two types of Machine Learning, viz., Supervised & Unsupervised, and implemented some examples using various algorithms in each of the techniques. In this new tutorial post, we will explore one of the most widely used Supervised Learning algorithms in the world today – Linear Regression. We will start off with some theory and go on to build a simple model in Python, from scratch. In our previous post (also the first post of this Machine Learning tutorial series), we brushed the fundamentals of Linear Regression using the example of housing price prediction, given the size of the house. If you remember, the prediction was based on the linear relationship that existed between the house price and the size of the house. Have a look at the image below. In the graph above, the size of the house is shown along the horizontal axis and the price of a house is shown along the vertical axis. Here, each data point is a house with its respective size and the price that the house was recently sold for.