Regression
Feature Selection for Data Integration with Mixed Multi-view Data
Baker, Yulia, Tang, Tiffany M., Allen, Genevera I.
Data integration methods that analyze multiple sources of data simultaneously can often provide more holistic insights than can separate inquiries of each data source. Motivated by the advantages of data integration in the era of "big data", we investigate feature selection for high-dimensional multi-view data with mixed data types (e.g. continuous, binary, count-valued). This heterogeneity of multi-view data poses numerous challenges for existing feature selection methods. However, after critically examining these issues through empirical and theoretically-guided lenses, we develop a practical solution, the Block Randomized Adaptive Iterative Lasso (B-RAIL), which combines the strengths of the randomized Lasso, adaptive weighting schemes, and stability selection. B-RAIL serves as a versatile data integration method for sparse regression and graph selection, and we demonstrate the effectiveness of B-RAIL through extensive simulations and a case study to infer the ovarian cancer gene regulatory network. In this case study, B-RAIL successfully identifies well-known biomarkers associated with ovarian cancer and hints at novel candidates for future ovarian cancer research.
Localized Linear Regression in Networked Data
The network Lasso (nLasso) has been proposed recently as an efficient learning algorithm for massive networked data sets (big data over networks). It extends the well-known least least absolute shrinkage and selection operator (Lasso) from learning sparse (generalized) linear models to network models. Efficient implementations of the nLasso have been obtained using convex optimization methods. These implementations naturally lend to highly scalable message passing methods. In this paper, we analyze the performance of nLasso when applied to localized linear regression problems involving networked data. Our main result is a sufficient conditions on the network structure and available label information such that nLasso accurately learns a localized linear regression model from few labeled data points.
Sparse Learning for Variable Selection with Structures and Nonlinearities
In this thesis we discuss machine learning methods performing automated variable selection for learning sparse predictive models. There are multiple reasons for promoting sparsity in the predictive models. By relying on a limited set of input variables the models naturally counteract the overfitting problem ubiquitous in learning from finite sets of training points. Sparse models are cheaper to use for predictions, they usually require lower computational resources and by relying on smaller sets of inputs can possibly reduce costs for data collection and storage. Sparse models can also contribute to better understanding of the investigated phenomenons as they are easier to interpret than full models.
Short-term Load Forecasting at Different Aggregation Levels with Predictability Analysis
Peng, Yayu, Wang, Yishen, Lu, Xiao, Li, Haifeng, Shi, Di, Wang, Zhiwei, Li, Jie
Short-term load forecasting (STLF) is essential for the reliable and economic operation of power systems. Though many STLF methods were proposed over the past decades, most of them focused on loads at high aggregation levels only. Thus, low-aggregation load forecast still requires further research and development. Compared with the substation or city level loads, individual loads are typically more volatile and much more challenging to forecast. To further address this issue, this paper first discusses the characteristics of small-and-medium enterprise (SME) and residential loads at different aggregation levels and quantifies their predictability with approximate entropy. Various STLF techniques, from the conventional linear regression to state-of-the-art deep learning, are implemented for a detailed comparative analysis to verify the forecasting performances as well as the predictability using an Irish smart meter dataset. In addition, the paper also investigates how using data processing improves individual-level residential load forecasting with low predictability. Effectiveness of the discussed method is validated with numerical results.
A unified spectra analysis workflow for the assessment of microbial contamination of ready to eat green salads: Comparative study and application of non-invasive sensors
Tsakanikas, Panagiotis, Fengou, Lemonia Christina, Manthou, Evanthia, Lianou, Alexandra, Panagou, Efstathios Z., Nychas, George John E.
The present study provides a comparative assessment of non-invasive sensors as means of estimating the microbial contamination and time-on-shelf (i.e. storage time) of leafy green vegetables, using a novel unified spectra analysis workflow. Two fresh ready-to-eat green salads were used in the context of this study for the purpose of evaluating the efficiency and practical application of the presented workflow: rocket and baby spinach salads. The employed analysis workflow consisted of robust data normalization, powerful feature selection based on random forests regression, and selection of the number of partial least squares regression coefficients in the training process by estimating the knee-point on the explained variance plot. Training processes were based on microbiological and spectral data derived during storage of green salad samples at isothermal conditions (4, 8 and 12C), whereas testing was performed on data during storage under dynamic temperature conditions (simulating real-life temperature fluctuations in the food supply chain). Since an increasing interest in the use of non-invasive sensors in food quality assessment has been made evident in recent years, the unified spectra analysis workflow described herein, by being based on the creation/usage of limited sized featured sets, could be very useful in food-specific low-cost sensor development.
Active Stacking for Heart Rate Estimation
Wu, Dongrui, Liu, Feifei, Liu, Chengyu
Heart rate estimation from electrocardiogram signals is very important for the early detection of cardiovascular diseases. However, due to large individual differences and varying electrocardiogram signal quality, there does not exist a single reliable estimation algorithm that works well on all subjects. Every algorithm may break down on certain subjects, resulting in a significant estimation error. Ensemble regression, which aggregates the outputs of multiple base estimators for more reliable and stable estimates, can be used to remedy this problem. Moreover, active learning can be used to optimally select a few trials from a new subject to label, based on which a stacking ensemble regression model can be trained to aggregate the base estimators. This paper proposes four active stacking approaches, and demonstrates that they all significantly outperform three common unsupervised ensemble regression approaches, and a supervised stacking approach which randomly selects some trials to label. Remarkably, our active stacking approaches only need three or four labeled trials from each subject to achieve an average root mean squared estimation error below three beats per minute, making them very convenient for real-world applications. To our knowledge, this is the first research on active stacking, and its application to heart rate estimation.
Gene Expression based Survival Prediction for Cancer Patients: A Topic Modeling Approach
Cancer is one of the leading cause of death, worldwide. Many believe that genomic data will enable us to better predict the survival time of these patients, which will lead to better, more personalized treatment options and patient care. As standard survival prediction models have a hard time coping with the high-dimensionality of such gene expression (GE) data, many projects use some dimensionality reduction techniques to overcome this hurdle. We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional GE data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient's cancer, we represent each patient (~document) as a mixture over cancer-topics, where each cancer-topic is a mixture over GE values (~words). This required some extensions to the standard LDA model eg: to accommodate the "real-valued" expression values - leading to our novel "discretized" Latent Dirichlet Allocation (dLDA) procedure. We initially focus on the METABRIC dataset, which describes breast cancer patients using the r=49,576 GE values, from microarrays. Our results show that our approach provides survival estimates that are more accurate than standard models, in terms of the standard Concordance measure. We then validate this approach by running it on the Pan-kidney (KIPAN) dataset, over r=15,529 GE values - here using the mRNAseq modality - and find that it again achieves excellent results. In both cases, we also show that the resulting model is calibrated, using the recent "D-calibrated" measure. These successes, in two different cancer types and expression modalities, demonstrates the generality, and the effectiveness, of this approach.
Machine Learning Methods Economists Should Know About
We discuss the relevance of the recent Machine Learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the machine learning literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, as well as matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics, methods that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, problems that include causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.
From both sides now: the math of linear regression ·
Linear regression is the most basic and the most widely used technique in machine learning; yet for all its simplicity, studying it can unlock some of the most important concepts in statistics. If you have a basic undestanding of linear regression expressed as $ \hat{Y} \theta_0 \theta_1X$, but don't have a background in statistics and find statements like "ridge regression is equivalent to the maximum a posteriori (MAP) estimate with a zero-mean Gaussian prior" bewildering, then this post is for you. With a superficial goal of understanding that somewhat obtuse statement, its main objective is to explore the topic, starting from the standard formulation of linear regression, moving on to the probabilistic approach (maximum likelihood formulation) and from there to Bayesian linear regression. I'll use the $\theta$ character throughout to refer to the coefficients (weights) of a regression model, either explicitly broken out as $\theta_0$ and $\theta_1$ for intercept and slope respectively, or just $\theta$ referring to the vector of coefficients. I'll usually use the expression $\theta Tx_i$ for the prediction a model gives at $x_i$, the assumption being that a 1 has been added to the vector of values at $x_i$. 1 In the single predictor case, we know that the least squares fit is the line that minimizes the sum of the squared distances between observed data and predicted values, i.e. it minimizes the Residual Sum of Squares (RSS): These residuals are pretty important in how we reason about our model.
Semi-Parametric Uncertainty Bounds for Binary Classification
Csáji, Balázs Csanád, Tamás, Ambrus
The paper studies binary classification and aims at estimating the underlying regression function which is the conditional expectation of the class labels given the inputs. The regression function is the key component of the Bayes optimal classifier, moreover, besides providing optimal predictions, it can also assess the risk of misclassification. We aim at building non-asymptotic confidence regions for the regression function and suggest three kernel-based semi-parametric resampling methods. We prove that all of them guarantee regions with exact coverage probabilities and they are strongly consistent.