Regression
A review of machine learning applications in wildfire science and management
Jain, Piyush, Coogan, Sean C P, Subramanian, Sriram Ganapathi, Crowley, Mark, Taylor, Steve, Flannigan, Mike D
Artificial intelligence has been applied in wildfire science and management since the 1990s, with early applications including neural networks and expert systems. Since then, the field has rapidly progressed congruently with the wide adoption of machine learning (ML) in the environmental sciences. Here, we present a scoping review of ML in wildfire science and management. Our objective is to improve awareness of ML among wildfire scientists and managers, as well as illustrate the challenging range of problems in wildfire science available to data scientists. We first present an overview of popular ML approaches used in wildfire science to date, and then review their use in wildfire science within six problem domains: 1) fuels characterization, fire detection, and mapping; 2) fire weather and climate change; 3) fire occurrence, susceptibility, and risk; 4) fire behavior prediction; 5) fire effects; and 6) fire management. We also discuss the advantages and limitations of various ML approaches and identify opportunities for future advances in wildfire science and management within a data science context. We identified 298 relevant publications, where the most frequently used ML methods included random forests, MaxEnt, artificial neural networks, decision trees, support vector machines, and genetic algorithms. There exist opportunities to apply more current ML methods (e.g., deep learning and agent-based learning) in wildfire science. However, despite the ability of ML models to learn on their own, expertise in wildfire science is necessary to ensure realistic modelling of fire processes across multiple scales, while the complexity of some ML methods requires sophisticated knowledge for their application. Finally, we stress that the wildfire research and management community plays an active role in providing relevant, high-quality data for use by practitioners of ML methods.
An Information-Theoretic Approach to Explainable Machine Learning
A key obstacle to the successful deployment of machine learning (ML) methods to important application domains is the (lack of) explainability of predictions. Explainable ML is challenging since explanations must be tailored (personalized) to individual users with varying backgrounds. On one extreme, users can have received graduate level education in machine learning while on the other extreme, users might have no formal education in linear algebra. Linear regression with few features might be perfectly interpretable for the first group but must be considered a black-box for the latter. Using a simple probabilistic model for the predictions and user knowledge, we formalize explainable ML using information theory. Providing an explanation is then considered as the task of reducing the "surprise" incurred by a prediction. Moreover, the effect of an explanation is measured by the conditional mutual information between the explanation and prediction, given the user background.
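As an illustration of the measure named in this abstract, the conditional mutual information I(E; Y | U) between an explanation E and a prediction Y, given a user background U, can be computed directly for a small discrete distribution. The sketch below is not taken from the paper: the function name, the dictionary layout of the joint distribution, and the toy values in the usage note are assumptions made for illustration.

```python
import math
from collections import defaultdict


def conditional_mutual_information(joint):
    """Return I(E; Y | U) in bits.

    `joint` maps (e, y, u) tuples to probabilities that sum to 1.
    The layout is an assumption of this sketch, not the paper's notation.
    """
    p_u = defaultdict(float)   # marginal of the user background
    p_eu = defaultdict(float)  # joint of explanation and background
    p_yu = defaultdict(float)  # joint of prediction and background
    for (e, y, u), p in joint.items():
        p_u[u] += p
        p_eu[(e, u)] += p
        p_yu[(y, u)] += p

    total = 0.0
    for (e, y, u), p in joint.items():
        if p > 0.0:
            # p(e,y,u) * log[ p(e,y,u) p(u) / (p(e,u) p(y,u)) ]
            total += p * math.log2(p * p_u[u] / (p_eu[(e, u)] * p_yu[(y, u)]))
    return total
```

For a single user type, an explanation that always equals a fair binary prediction carries one bit about it, while an explanation that is independent of the prediction carries none.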
Data Pre-Processing and Evaluating the Performance of Several Data Mining Methods for Predicting Irrigation Water Requirement
Khan, Mahmood A., Islam, Md Zahidul, Hafeez, Mohsin
Recent drought and population growth are placing unprecedented demand on the limited water resources available. Irrigated agriculture is one of the major consumers of freshwater. A large amount of water in irrigated agriculture is wasted due to poor water management practices. To improve water management in irrigated areas, models for estimation of future water requirements are needed. Developing a model for forecasting irrigation water demand can improve water management practices and maximise water productivity. Data mining can be used effectively to build such models. In this study, we prepare a dataset containing information on suitable attributes for forecasting irrigation water demand. The data are obtained from three different sources, namely meteorological data, remote sensing images, and water delivery statements. In order to make the prepared dataset useful for demand forecasting and pattern extraction, we pre-process the dataset using a novel approach based on a combination of irrigation and data mining knowledge. We then apply and compare the effectiveness of different data mining methods, namely decision tree (DT), artificial neural networks (ANNs), systematically developed forest (SysFor) for multiple trees, support vector machine (SVM), logistic regression, and the traditional Evapotranspiration (ETc) methods, and evaluate the performance of these models in predicting irrigation water demand. Our experimental results indicate the usefulness of data pre-processing and the effectiveness of different classifiers. Among the six methods we used, SysFor produces the best prediction with 97.5% accuracy, followed by a decision tree with 96% and ANN with 95%, closely matching the predictions with actual water usage. Therefore, we recommend using SysFor and DT models for irrigation water demand forecasting.
Quantile Regularization: Towards Implicit Calibration of Regression Models
Recent works have shown that deep learning models are often poorly calibrated, i.e., they may produce overconfident predictions that are wrong. It is therefore desirable to have models that produce reliable predictive uncertainty estimates. Several approaches have been proposed recently to calibrate classification models. However, there is relatively little work on calibrating regression models. We present a method for calibrating regression models based on a novel quantile regularizer defined as the cumulative KL divergence between two CDFs. Unlike most of the existing approaches for calibrating regression models, which are based on post-hoc processing of the model's output and require an additional dataset, our method is trainable in an end-to-end fashion without requiring an additional dataset. The proposed regularizer can be used with any training objective for regression. We also show that post-hoc calibration methods like Isotonic Calibration sometimes compound miscalibration, whereas our method provides consistently better calibration. We provide empirical results demonstrating that the proposed quantile regularizer significantly improves calibration for regression models trained using approaches such as Dropout VI and Deep Ensembles.
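To make the notion of calibration in regression concrete, the sketch below checks whether predicted Gaussian distributions are quantile-calibrated: for each level p, roughly a fraction p of the observed targets should fall below the predicted p-quantile. This is a common diagnostic only, not the paper's quantile regularizer. The Gaussian predictive form, the function names, and the squared-gap summary are assumptions of this sketch.

```python
import math


def normal_cdf(y, mu, sigma):
    """CDF of a Gaussian predictive distribution evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def calibration_error(ys, mus, sigmas, levels=None):
    """Mean squared gap between nominal and observed quantile coverage.

    ys: observed targets; mus, sigmas: predicted means and standard deviations.
    A value near zero indicates well-calibrated predictive quantiles.
    """
    levels = levels or [i / 10 for i in range(1, 10)]
    # Probability integral transform: predicted CDF at each observed target.
    pit = [normal_cdf(y, m, s) for y, m, s in zip(ys, mus, sigmas)]
    gaps = []
    for p in levels:
        observed = sum(v <= p for v in pit) / len(pit)
        gaps.append((observed - p) ** 2)
    return sum(gaps) / len(gaps)
```

An overconfident model, one whose predicted standard deviation is too small, yields a clearly larger error than a model that reports the true spread.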
Modelling High-Dimensional Categorical Data Using Nonconvex Fusion Penalties
Stokell, Benjamin G., Shah, Rajen D., Tibshirani, Ryan J.
We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.
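The penalty described in this abstract can be sketched in a few lines: sort the coefficients of one categorical variable and apply the minimax concave penalty (MCP) to the gaps between consecutive order statistics. The code uses the standard MCP definition. The function names and the unweighted sum are assumptions of this sketch, and it evaluates the penalty only, without reproducing the SCOPE optimization algorithm or the CatReg package.

```python
def mcp(x, lam, gamma):
    """Minimax concave penalty with regularization lam and concavity gamma."""
    x = abs(x)
    if x <= gamma * lam:
        return lam * x - x * x / (2.0 * gamma)
    # Beyond gamma * lam the penalty is flat, so large gaps are not shrunk.
    return 0.5 * gamma * lam * lam


def fusion_penalty(coefs, lam, gamma):
    """Sum of MCP over differences between consecutive sorted coefficients."""
    ordered = sorted(coefs)
    return sum(mcp(b - a, lam, gamma) for a, b in zip(ordered, ordered[1:]))
```

Levels whose coefficients are exactly equal contribute nothing, so the penalty rewards fusing levels into a small number of clusters, and it does not depend on the order in which the levels are listed.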
Prediction of adverse events in Afghanistan: regression analysis of time series data grouped not by geographic dependencies
Fiok, Krzysztof, Karwowski, Waldemar, Wilamowski, Maciej
The aim of this study was to approach a difficult regression task on highly unbalanced data regarding the active theater of war in Afghanistan. Our focus was on predicting the number of negative events, without distinguishing the precise nature of the events, given historical data on investment and negative events for each of 400 predefined Afghanistan districts. In contrast with previous research on the matter, we propose an approach to the analysis of time series data that benefits from non-conventional aggregation of these territorial entities. By carrying out initial exploratory data analysis, we demonstrate that dividing the data according to our proposal makes it possible to identify strong trend and seasonal components in the selected target variable. Utilizing this approach, we also tried to estimate which investment data are most important for prediction performance. Based on our exploratory analysis and previous research, we prepared 5 sets of independent variables that were fed to 3 machine learning regression models. The results, expressed as mean absolute and mean squared errors, indicate that leveraging historical data on the target variable allows for reasonable performance; however, the other proposed independent variables do not seem to improve prediction quality.
How Much Can A Retailer Sell? Sales Forecasting on Tmall
Chen, Chaochao, Liu, Ziqi, Zhou, Jun, Li, Xiaolong, Qi, Yuan, Jiao, Yujing, Zhong, Xingyu
Time-series forecasting is an important task in both academia and industry, which can be applied to solve many real forecasting problems such as stock, water-supply, and sales predictions. In this paper, we study the case of retailers' sales forecasting on Tmall, the world's leading online B2C platform. By analyzing the data, we make two main observations: sales exhibit seasonality once retailers are grouped, and the sales (the target to forecast) follow a Tweedie distribution after transformation. Based on these observations, we design two mechanisms for sales forecasting, i.e., seasonality extraction and distribution transformation. First, we adopt Fourier decomposition to automatically extract the seasonalities for different categories of retailers, which can further be used as additional features for any established regression algorithm. Second, we propose to optimize the Tweedie loss of sales after logarithmic transformations. We apply these two mechanisms to classic regression models, i.e., a neural network and a Gradient Boosting Decision Tree, and the experimental results on a Tmall dataset show that both mechanisms can significantly improve the forecasting results.
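The two mechanisms can be illustrated with a minimal sketch. The first function finds the dominant seasonal period of a series from its discrete Fourier transform. The second evaluates the Tweedie loss commonly used in gradient boosting libraries for a variance power between 1 and 2. Both are simplified stand-ins: the function names, the choice of a single dominant period, and the default power of 1.5 are assumptions, and the paper's exact feature construction and transformation are not reproduced.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in time steps) of the strongest Fourier component."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]  # remove the zero-frequency term
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return n / best_k


def tweedie_loss(y, mu, power=1.5):
    """Tweedie negative log-likelihood up to a constant, for 1 < power < 2.

    y is the observed non-negative target and mu the positive predicted mean.
    """
    return (-y * mu ** (1.0 - power) / (1.0 - power)
            + mu ** (2.0 - power) / (2.0 - power))
```

A series with a weekly cycle sampled daily over eight weeks returns a period of 7, which could then be encoded as sine and cosine features. For a fixed target, the loss is smallest when the predicted mean equals the target.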
Piecewise linear regressions for approximating distance metrics
Putman, Josiah, Oh, Lisa, Zhao, Luyang, Honnold, Evan, Brown, Galen, Wang, Weifu, Balkcom, Devin
This paper presents a data structure that summarizes distances between configurations across a robot configuration space, using a binary space partition whose cells contain parameters used for a locally linear approximation of the distance function. Querying the data structure is extremely fast, particularly when compared to the graph search required for querying Probabilistic Roadmaps, and memory requirements are promising. The paper explores the use of the data structure constructed for a single robot to provide a heuristic for challenging multi-robot motion planning problems. Potential applications also include the use of remote computation to analyze the space of robot motions, which then might be transmitted on-demand to robots with fewer computational resources.
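The idea of a space partition whose cells hold a local linear model can be sketched on a one-dimensional toy problem. The tree below splits an interval at midpoints and fits a least-squares line to sampled function values in each leaf, so a query is a short descent followed by one multiply-add. The class and function names, the midpoint split rule, the fixed depth, and the sampling scheme are all assumptions of this sketch. The paper's structure operates on robot configuration spaces and is not reproduced here.

```python
class Cell:
    """Node of a binary space partition over an interval [lo, hi]."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.split = None            # None marks a leaf
        self.left = self.right = None
        self.slope = self.intercept = 0.0


def fit_line(xs, ys):
    """Ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def build(f, lo, hi, depth, samples=8):
    """Recursively partition [lo, hi]; leaves store a local linear fit of f."""
    cell = Cell(lo, hi)
    if depth == 0:
        xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
        cell.slope, cell.intercept = fit_line(xs, [f(x) for x in xs])
    else:
        cell.split = 0.5 * (lo + hi)
        cell.left = build(f, lo, cell.split, depth - 1, samples)
        cell.right = build(f, cell.split, hi, depth - 1, samples)
    return cell


def query(cell, x):
    """Descend to the leaf containing x and evaluate its linear model."""
    while cell.split is not None:
        cell = cell.left if x < cell.split else cell.right
    return cell.slope * x + cell.intercept
```

With `f(x) = abs(x - 0.3)` as a stand-in distance to a fixed configuration, the approximation is exact in every cell where the function is linear and only slightly off in the single cell containing the kink.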
PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
Wu, Yinjun, Tannen, Val, Davidson, Susan B.
The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the machine learning model has to be retrained after removing problematic training samples during cleaning or after selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate them experimentally. Experimental results show that speed-ups of up to two orders of magnitude can be achieved by PrIU-opt compared to simply retraining the model from scratch, while obtaining highly similar models.
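The general goal of updating a fitted model after deleting training samples, without retraining from scratch, can be illustrated with a much simpler technique than PrIU: caching the sufficient statistics of a one-variable least-squares fit and subtracting the contribution of each removed sample. This sketch is not the provenance-based method of the paper. The class name and its methods are hypothetical and serve only to show the idea of an incremental update.

```python
class IncrementalLine:
    """Simple linear regression that supports removing training samples."""

    def __init__(self, xs, ys):
        # Cached sums are all that the closed-form solution needs.
        self.n = len(xs)
        self.sx = sum(xs)
        self.sy = sum(ys)
        self.sxx = sum(x * x for x in xs)
        self.sxy = sum(x * y for x, y in zip(xs, ys))

    def remove(self, x, y):
        """Delete one training sample in constant time."""
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def coefficients(self):
        """Return (slope, intercept) of the least-squares line."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept
```

Removing a corrupted sample this way gives the same coefficients as refitting on the cleaned data, at a cost that does not depend on the size of the training set.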
Off-Policy Evaluation and Learning for External Validity under a Covariate Shift
Kato, Masahiro, Uehara, Masatoshi, Yasui, Shota
We consider the evaluation and training of a new policy for the evaluation data by using the historical data obtained from a different policy. The goal of off-policy evaluation (OPE) is to estimate the expected reward of a new policy over the evaluation data, and that of off-policy learning (OPL) is to find a new policy that maximizes the expected reward over the evaluation data. Although standard OPE and OPL assume the same covariate distribution for the historical and evaluation data, a covariate shift often exists, i.e., the distribution of the covariates in the historical data differs from that in the evaluation data. In this paper, we derive the efficiency bound of OPE under a covariate shift. Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using an estimator of the density ratio between the distributions of the historical and evaluation data. We also discuss other possible estimators and compare their theoretical properties. Finally, we confirm the effectiveness of the proposed estimators through experiments.
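A doubly robust value estimate under covariate shift can be sketched as the sum of two terms: a direct term that averages a reward model over the evaluation covariates, and a correction term that reweights historical residuals by both the policy ratio and the covariate density ratio. The code below follows that general form under stated assumptions. The function signature, the callables passed in, and the absence of cross-fitting or ratio estimation are simplifications of this sketch and should not be read as the paper's exact estimator.

```python
def doubly_robust_value(hist, eval_xs, actions, pi_e, pi_b, q_hat, ratio):
    """Estimate the value of policy pi_e on the evaluation distribution.

    hist:    list of (x, a, y) logged under the behavior policy pi_b
    eval_xs: covariates sampled from the evaluation distribution
    pi_e, pi_b: callables (a, x) -> action probability
    q_hat:   callable (a, x) -> estimated expected reward
    ratio:   callable x -> evaluation density / historical density
    """
    # Correction: importance-weighted residuals of the reward model.
    correction = sum(
        ratio(x) * pi_e(a, x) / pi_b(a, x) * (y - q_hat(a, x))
        for x, a, y in hist
    ) / len(hist)
    # Direct term: reward model averaged over evaluation covariates.
    direct = sum(
        sum(pi_e(a, x) * q_hat(a, x) for a in actions) for x in eval_xs
    ) / len(eval_xs)
    return direct + correction
```

The estimate remains correct if either the reward model or the density ratio is accurate, which is the sense in which such estimators are called doubly robust.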