Goto

Collaborating Authors

 Regression


Predicting Flights Delay Using Supervised Learning, Logistic Regression

@machinelearnbot

In this post, we'll use a supervised machine learning technique called logistic regression to predict delayed flights. But before we proceed, I like to give condolences to the family of the the victims of the Germanwings tragedy. Note: This is a common data set in the machine learning community to test out algorithms and models given it's publicly available and have sizable data. In this blog, we will look at small sample snapsot(2201 flights in January 2004). In another post, we can explore using Big Data technologies such as Hadoop MapReduce or Spark machine learning libraries to do large scale predictive analytics and data mining.


On the Performance of Forecasting Models in the Presence of Input Uncertainty

arXiv.org Machine Learning

Nowadays, with the unprecedented penetration of renewable distributed energy resources (DERs), the necessity of an efficient energy forecasting model is more demanding than before. Generally, forecasting models are trained using observed weather data while the trained models are applied for energy forecasting using forecasted weather data. In this study, the performance of several commonly used forecasting methods in the presence of weather predictors with uncertainty is assessed and compared. Accordingly, both observed and forecasted weather data are collected, then the influential predictors for solar PV generation forecasting model are selected using several measures. Using observed and forecasted weather data, an analysis on the uncertainty of weather variables is represented by MAE and bootstrapping. The energy forecasting model is trained using observed weather data, and finally, the performance of several commonly used forecasting methods in solar energy forecasting is simulated and compared for a real case study.


10 Algorithms Machine Learning Engineers Need To Know About

#artificialintelligence

With the fast mechanization brought about by the technological revolution, the word manual is slowly getting lost amidst the crowd and will very soon completely vanish. As Big Data has whisked the tech industry, Machine Learning is gaining importance and has robustly handled huge amount of data making accurate predictions. In an era of constant progress, we can only guess what astounding invention and discovery is to come next. The data-crunching machines that have been seamlessly executing the advanced techniques. Machine learning is a subset of the Artificial Intelligence, which is a broader term and concept.


Kernel Method for Detecting Higher Order Interactions in multi-view Data: An Application to Imaging, Genetics, and Epigenetics

arXiv.org Machine Learning

In this study, we tested the interaction effect of multimodal datasets using a novel method called the kernel method for detecting higher order interactions among biologically relevant mulit-view data. Using a semiparametric method on a reproducing kernel Hilbert space (RKHS), we used a standard mixed-effects linear model and derived a score-based variance component statistic that tests for higher order interactions between multi-view data. The proposed method offers an intangible framework for the identification of higher order interaction effects (e.g., three way interaction) between genetics, brain imaging, and epigenetic data. Extensive numerical simulation studies were first conducted to evaluate the performance of this method. Finally, this method was evaluated using data from the Mind Clinical Imaging Consortium (MCIC) including single nucleotide polymorphism (SNP) data, functional magnetic resonance imaging (fMRI) scans, and deoxyribonucleic acid (DNA) methylation data, respectfully, in schizophrenia patients and healthy controls. We treated each gene-derived SNPs, region of interest (ROI) and gene-derived DNA methylation as a single testing unit, which are combined into triplets for evaluation. In addition, cardiovascular disease risk factors such as age, gender, and body mass index were assessed as covariates on hippocampal volume and compared between triplets. Our method identified $13$-triplets ($p$-values $\leq 0.001$) that included $6$ gene-derived SNPs, $10$ ROIs, and $6$ gene-derived DNA methylations that correlated with changes in hippocampal volume, suggesting that these triplets may be important in explaining schizophrenia-related neurodegeneration. With strong evidence ($p$-values $\leq 0.000001$), the triplet ({\bf MAGI2, CRBLCrus1.L, FBXO28}) has the potential to distinguish schizophrenia patients from the healthy control variations.


Knowledge Elicitation via Sequential Probabilistic Inference for High-Dimensional Prediction

arXiv.org Artificial Intelligence

Prediction in a small-sized sample with a large number of covariates, the "small n, large p" problem, is challenging. This setting is encountered in multiple applications, such as precision medicine, where obtaining additional samples can be extremely costly or even impossible, and extensive research effort has recently been dedicated to finding principled solutions for accurate prediction. However, a valuable source of additional information, domain experts, has not yet been efficiently exploited. We formulate knowledge elicitation generally as a probabilistic inference process, where expert knowledge is sequentially queried to improve predictions. In the specific case of sparse linear regression, where we assume the expert has knowledge about the values of the regression coefficients or about the relevance of the features, we propose an algorithm and computational approximation for fast and efficient interaction, which sequentially identifies the most informative features on which to query expert knowledge. Evaluations of our method in experiments with simulated and real users show improved prediction accuracy already with a small effort from the expert.


Generalising Random Forest Parameter Optimisation to Include Stability and Cost

arXiv.org Machine Learning

Random forests are among the most popular classification and regression methods used in industrial applications. To be effective, the parameters of random forests must be carefully tuned. This is usually done by choosing values that minimize the prediction error on a held out dataset. We argue that error reduction is only one of several metrics that must be considered when optimizing random forest parameters for commercial applications. We propose a novel metric that captures the stability of random forests predictions, which we argue is key for scenarios that require successive predictions. We motivate the need for multi-criteria optimization by showing that in practical applications, simply choosing the parameters that lead to the lowest error can introduce unnecessary costs and produce predictions that are not stable across independent runs. To optimize this multi-criteria trade-off, we present a new framework that efficiently finds a principled balance between these three considerations using Bayesian optimisation. The pitfalls of optimising forest parameters purely for error reduction are demonstrated using two publicly available real world datasets. We show that our framework leads to parameter settings that are markedly different from the values discovered by error reduction metrics.


An Introduction to the Practical and Theoretical Aspects of Mixture-of-Experts Modeling

arXiv.org Machine Learning

Mixture-of-experts (MoE) models are a powerful paradigm for modeling of data arising from complex data generating processes (DGPs). In this article, we demonstrate how different MoE models can be constructed to approximate the underlying DGPs of arbitrary types of data. Due to the probabilistic nature of MoE models, we propose the maximum quasi-likelihood (MQL) estimator as a method for estimating MoE model parameters from data, and we provide conditions under which MQL estimators are consistent and asymptotically normal. The blockwise minorization-maximizatoin (blockwise-MM) algorithm framework is proposed as an all-purpose method for constructing algorithms for obtaining MQL estimators. An example derivation of a blockwise-MM algorithm is provided. We then present a method for constructing information criteria for estimating the number of components in MoE models and provide justification for the classic Bayesian information criterion (BIC). We explain how MoE models can be used to conduct classification, clustering, and regression and we illustrate these applications via a pair of worked examples.


Predicting Surgery Duration with Neural Heteroscedastic Regression

arXiv.org Machine Learning

Scheduling surgeries is a challenging task due to the fundamental uncertainty of the clinical environment, as well as the risks and costs associated with under- and over-booking. We investigate neural regression algorithms to estimate the parameters of surgery case durations, focusing on the issue of heteroscedasticity. We seek to simultaneously estimate the duration of each surgery, as well as a surgery-specific notion of our uncertainty about its duration. Estimating this uncertainty can lead to more nuanced and effective scheduling strategies, as we are able to schedule surgeries more efficiently while allowing an informed and case-specific margin of error. Using surgery records %from the UC San Diego Health System, from a large United States health system we demonstrate potential improvements on the order of 20% (in terms of minutes overbooked) compared to current scheduling techniques. Moreover, we demonstrate that surgery durations are indeed heteroscedastic. We show that models that estimate case-specific uncertainty better fit the data (log likelihood). Additionally, we show that the heteroscedastic predictions can more optimally trade off between over and under-booking minutes, especially when idle minutes and scheduling collisions confer disparate costs.


Residual Value Forecasting Using Asymmetric Cost Functions

arXiv.org Machine Learning

Leasing is a popular channel to market new cars. Pricing a leasing contract is complicated because the leasing rate embodies an expectation of the residual value of the car after contract expiration. To aid lessors in their pricing decisions, the paper develops resale price forecasting models. A peculiarity of the leasing business is that forecast errors entail different costs. Identifying effective ways to address this characteristic is the main objective of the paper. More specifically, the paper contributes to the literature through i) consolidating and integrating previous work in forecasting with asymmetric cost of error functions, ii) systematically evaluating previous approaches and comparing them to a new approach, and iii) demonstrating that forecasting with asymmetric cost of error functions enhances the quality of decision support in car leasing. For example, under the assumption that the costs of overestimating resale prices is twice that of the opposite error, incorporating corresponding cost asymmetry into forecast model development reduces decision costs by about eight percent, compared to a standard forecasting model. Higher asymmetry produces even larger improvements.


Mean Absolute Percentage Error for regression models

arXiv.org Machine Learning

We study in this paper the consequences of using the Mean Absolute Percentage Error (MAPE) as a measure of quality for regression models. We prove the existence of an optimal MAPE model and we show the universal consistency of Empirical Risk Minimization based on the MAPE. We also show that finding the best model under the MAPE is equivalent to doing weighted Mean Absolute Error (MAE) regression, and we apply this weighting strategy to kernel regression. The behavior of the MAPE kernel regression is illustrated on simulated data.