 Regression


Machine Learning with R – Barbara Fusinska

#artificialintelligence

Barbara started by introducing machine learning (ML), gave a brief overview of R and then discussed three examples: classifying handwritten digits, estimating values in a socio-economic dataset and clustering crimes in Chicago. ML is statistics on steroids. ML uses data to find a pattern, then uses that pattern (the model) to predict results from similar data. Barbara used the example of classifying film genres as either action or romance based on the number of kicks and kisses. Barbara described supervised and unsupervised learning. Unsupervised learning is the "wild, wild west": we can't train the model on labelled examples, and it is much more difficult to understand how effective these methods are. Back to supervised learning: it's important to choose good predicting factors – in the movie example, perhaps the title, actors and script would have been better predictors than the number of kicks and kisses. Then you must choose the algorithm, tune it, and finally make it useful and visible and get it into production. It's a hard job, especially when data scientists and software developers seem to be different tribes.
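
The kicks-and-kisses example can be sketched in a few lines. The talk summary does not say which algorithm was used, so the k-nearest-neighbours classifier below is an assumption, and the training films and their counts are invented purely for illustration.

```python
import math
from collections import Counter

# Hypothetical labelled films: (kicks, kisses) -> genre.
# These numbers are made up for illustration only.
TRAINING = [
    ((3, 104), "romance"),
    ((2, 100), "romance"),
    ((1, 81), "romance"),
    ((101, 10), "action"),
    ((99, 5), "action"),
    ((98, 2), "action"),
]


def classify(kicks, kisses, k=3):
    """Label a film by majority vote among its k nearest training films."""
    by_distance = sorted(
        TRAINING,
        key=lambda item: math.dist(item[0], (kicks, kisses)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(classify(18, 90))  # a film with few kicks and many kisses
print(classify(90, 8))   # a film with many kicks and few kisses
```

Swapping in better predicting factors, as the talk suggests, would only change what goes into each feature tuple; the supervised train-then-predict pattern stays the same.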


Small Moving Window Calibration Models for Soft Sensing Processes with Limited History

arXiv.org Machine Learning

Five simple soft sensor methodologies with two update conditions were compared on two experimentally-obtained datasets and one simulated dataset. The soft sensors investigated were moving window partial least squares regression (and a recursive variant), moving window random forest regression, the mean moving window of y, and a novel random forest partial least squares regression ensemble (RF-PLS), all of which can be used with small sample sizes so that they can be rapidly placed online. It was found that, on two of the datasets studied, small window sizes led to the lowest prediction errors for all of the moving window methods studied. On the majority of datasets studied, the RF-PLS calibration method offered the lowest one-step-ahead prediction errors compared to those of the other methods, and it demonstrated greater predictive stability at larger time delays than moving window PLS alone. It was found that both the random forest and RF-PLS methods most adequately modeled the datasets that did not feature purely monotonic increases in property values, but that both methods performed more poorly than moving window PLS models on one dataset with purely monotonic property values. Other data dependent findings are presented and discussed.

Preprint submitted to Arxiv March 14, 2018

1. Introduction

Soft sensors for regression tasks have found wide utility in process engineering and process analytical chemistry [1, 2, 3]. A soft sensor is effectively a calibration used on time-series data. Here, we consider a soft sensor to be any algorithm that can be used to estimate a property value from several readily available but indirect measurements. The goal of implementing a soft sensor is typically to avoid the use of a physical sensor for variables that may require extensive time or work up to measure [3]. In the context of industrial chemical processes, these algorithms should meet several specifications.
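
Of the methods listed in the abstract, the mean moving window of y is the simplest to illustrate. The sketch below is an assumption about how such a baseline might look, not the authors' implementation: the function names, the toy series, and the choice to refit at every step are all hypothetical, and the paper's two update conditions are not modelled.

```python
def moving_window_mean_forecast(y, window):
    """One-step-ahead forecasts: predict y[t] as the mean of the
    previous `window` observed values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    forecasts = []
    for t in range(window, len(y)):
        recent = y[t - window:t]  # only past values, never y[t] itself
        forecasts.append(sum(recent) / window)
    return forecasts


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# Toy monotonic series, invented for illustration.
y = [1, 2, 3, 4, 5]
for window in (1, 2):
    predicted = moving_window_mean_forecast(y, window)
    print(window, predicted, rmse(y[window:], predicted))
```

On this toy monotonic series the smaller window tracks the signal more closely, which is in the spirit of the abstract's observation about small window sizes, though a five-point example proves nothing about real process data.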


A Multi-Modal Approach to Infer Image Affect

arXiv.org Machine Learning

The group affect or emotion in an image of people can be inferred by extracting features about both the people in the picture and the overall makeup of the scene. The state-of-the-art on this problem investigates a combination of facial features, scene extraction and even audio tonality. This paper combines three additional modalities, namely, human pose, text-based tagging and CNN extracted features / predictions. To the best of our knowledge, this is the first time all of the modalities were extracted using deep neural networks. We evaluate the performance of our approach against baselines and identify insights throughout this paper.


Estimating activity cycles with probabilistic methods I. Bayesian Generalised Lomb-Scargle Periodogram with Trend

arXiv.org Machine Learning

Period estimation is one of the central topics in astronomical time series analysis, where data is often unevenly sampled. Especially challenging are studies of stellar magnetic cycles, as the periods sought there are of the same order of length as the datasets themselves. The datasets often contain trends, the origin of which is either a real long-term cycle or an instrumental effect, but these effects cannot be reliably separated, and they can lead to erroneous period determinations if not properly handled. In this study we aim at developing a method that can handle the trends properly, and by performing an extensive set of tests, we show that this is the optimal procedure when contrasted with methods that do not include the trend directly in the model. The effect of the form of the noise (whether constant or heteroscedastic) on the results is also investigated. We introduce a Bayesian Generalised Lomb-Scargle Periodogram with Trend (BGLST), which is a probabilistic linear regression model using Gaussian priors for the coefficients and a uniform prior for the frequency parameter. We show, using synthetic data, that when there is no prior information on whether and to what extent the true model of the data contains a linear trend, the introduced BGLST method is preferable to the methods which either detrend the data or leave the data undetrended before fitting the periodic model. Whether to use noise with non-constant variance in the model depends on the density of the data sampling as well as on the true noise type of the process.
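
One way to read the model described in this abstract is as a harmonic regression with an added linear trend. The form below is a sketch consistent with that description; the symbol names are assumptions and the paper's exact parameterisation may differ.

```latex
y(t_i) = A\cos(2\pi f t_i) + B\sin(2\pi f t_i) + \alpha t_i + \beta + \varepsilon(t_i),
\qquad \varepsilon(t_i) \sim \mathcal{N}(0, \sigma_i^2)
```

Here $A$, $B$, $\alpha$ and $\beta$ would carry the Gaussian priors and $f$ the uniform prior. Setting every $\sigma_i$ to a common value gives the constant-noise case, while letting $\sigma_i$ vary with $i$ gives the heteroscedastic case.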


Oracle Inequalities for High-dimensional Prediction

arXiv.org Machine Learning

The abundance of high-dimensional data in the modern sciences has generated tremendous interest in penalized estimators such as the lasso, scaled lasso, square-root lasso, elastic net, and many others. In this paper, we establish a general oracle inequality for prediction in high-dimensional linear regression with such methods. Since the proof relies only on convexity and continuity arguments, the result holds irrespective of the design matrix and applies to a wide range of penalized estimators. Overall, the bound demonstrates that generic estimators can provide consistent prediction with any design matrix. From a practical point of view, the bound can help to identify the potential of specific estimators, and it can help to get a sense of the prediction accuracy in a given application.
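
For orientation, three of the penalized estimators named above are commonly written as follows. This is the standard textbook notation rather than the paper's own, and normalisation conventions (for example a factor of $1/n$ or $1/2$ on the loss) vary between references.

```latex
\hat\beta_{\text{lasso}} \in \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}

\hat\beta_{\text{sqrt-lasso}} \in \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \|y - X\beta\|_2 + \lambda \|\beta\|_1 \right\}

\hat\beta_{\text{enet}} \in \arg\min_{\beta \in \mathbb{R}^p}
  \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}
```

In each case $X$ is the design matrix, $y$ the response vector, and the $\lambda$ terms are tuning parameters; every objective is convex in $\beta$, which is the property the abstract says the proof relies on.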


Multi-kernel Regression For Graph Signal Processing

arXiv.org Machine Learning

We develop a multi-kernel based regression method for graph signal processing where the target signal is assumed to be smooth over a graph. In multi-kernel regression, an effective kernel function is expressed as a linear combination of many basis kernel functions. We estimate the linear weights to learn the effective kernel function by appropriate regularization based on graph smoothness. We show that the resulting optimization problem is convex and propose an accelerated projected gradient descent based solution. Simulation results using real-world graph signals show the efficiency of the multi-kernel based approach over a standard kernel based approach.
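
The idea of an effective kernel built as a linear combination of basis kernels can be sketched as follows. Everything here is illustrative: the Gaussian basis kernels, their widths, the scalar inputs and the fixed weights are assumptions, whereas the paper learns the weights through graph-smoothness regularization.

```python
import math


def gaussian_kernel(width):
    """Return a Gaussian (RBF) kernel function on scalars with the given width."""
    def kernel(x, z):
        return math.exp(-((x - z) ** 2) / (2.0 * width ** 2))
    return kernel


def effective_kernel(basis_kernels, weights):
    """Combine basis kernels linearly: k(x, z) = sum_m w_m * k_m(x, z)."""
    if len(basis_kernels) != len(weights):
        raise ValueError("need exactly one weight per basis kernel")

    def kernel(x, z):
        return sum(w * k(x, z) for w, k in zip(weights, basis_kernels))
    return kernel


# Hypothetical basis kernels and hand-picked weights (not learned).
basis = [gaussian_kernel(0.5), gaussian_kernel(1.0), gaussian_kernel(2.0)]
k_eff = effective_kernel(basis, [0.2, 0.3, 0.5])

print(k_eff(0.0, 1.0))
```

With non-negative weights the combination of valid kernels is itself a valid kernel, which is what makes it sensible to optimise over the weights.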


The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R

arXiv.org Machine Learning

Penalized regression models such as the lasso have been extensively applied to analyzing high-dimensional data sets. However, due to memory limitations, existing R packages like glmnet and ncvreg are not capable of fitting lasso-type models for ultrahigh-dimensional, multi-gigabyte data sets that are increasingly seen in many areas such as genetics, genomics, biomedical imaging, and high-frequency finance. In this research, we implement an R package called biglasso that tackles this challenge. biglasso utilizes memory-mapped files to store the massive data on the disk, only reading data into memory when necessary during model fitting, and is thus able to handle out-of-core computation seamlessly. Moreover, it's equipped with newly proposed, more efficient feature screening rules, which substantially accelerate the computation. Benchmarking experiments show that our biglasso package, as compared to existing popular ones like glmnet, is much more memory- and computation-efficient. We further analyze a 31 GB real data set on a laptop with only 16 GB RAM to demonstrate the out-of-core computation capability of biglasso in analyzing massive data sets that cannot be accommodated by existing R packages.


A pathway-based kernel boosting method for sample classification using genomic data

arXiv.org Machine Learning

The analysis of cancer genomic data has long suffered from "the curse of dimensionality". Sample sizes for most cancer genomic studies are a few hundred at most, while there are tens of thousands of genomic features studied. Various methods have been proposed to leverage prior biological knowledge, such as pathways, to more effectively analyze cancer genomic data. Most of the methods focus on testing marginal significance of the associations between pathways and clinical phenotypes. They can identify relevant pathways, but do not involve predictive modeling. In this article, we propose a Pathway-based Kernel Boosting (PKB) method for integrating gene pathway information for sample classification, where we use kernel functions calculated from each pathway as base learners and learn the weights through iterative optimization of the classification loss function. We apply PKB and several competing methods to three cancer studies with pathological and clinical information, including tumor grade, stage, tumor sites, and metastasis status. Our results show that PKB outperforms other methods, and identifies pathways relevant to the outcome variables.


Confidence Intervals for Algorithmic Leveraging in Linear Regression

arXiv.org Machine Learning

The age of big data has produced data sets that are computationally expensive to analyze and store. Algorithmic leveraging proposes that we sample observations from the original data set to generate a representative data set and then perform analysis on the representative data set. In this paper, we present efficient algorithms for constructing finite sample confidence intervals for each algorithmic leveraging estimated regression coefficient, with asymptotic coverage guarantees. In simulations, we confirm empirically that the confidence intervals have the desired coverage probabilities, while bootstrap confidence intervals may not.
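
The sampling step behind algorithmic leveraging can be illustrated for the simplest case of one predictor plus an intercept. This is a minimal sketch under that assumption, with hypothetical function names and toy data; it covers only the leverage-based subsampling and does not attempt the paper's confidence interval construction.

```python
import random


def leverage_scores(x):
    """Leverage of each observation in simple linear regression with an
    intercept: h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def leverage_sample(x, y, size, seed=0):
    """Draw `size` observations with replacement, with probability
    proportional to leverage. Each result carries its sampling probability
    so that a later weighted analysis can correct for the unequal sampling."""
    scores = leverage_scores(x)
    total = sum(scores)
    probs = [h / total for h in scores]
    rng = random.Random(seed)
    chosen = rng.choices(range(len(x)), weights=probs, k=size)
    return [(x[i], y[i], probs[i]) for i in chosen]


# Toy data, invented for illustration; x = 10 is far from the rest.
x = [0.0, 1.0, 2.0, 3.0, 10.0]
y = [1.0, 3.0, 5.0, 7.0, 21.0]
print(leverage_scores(x))
print(leverage_sample(x, y, size=3))
```

The outlying point at x = 10 receives by far the largest leverage, so it is the most likely to appear in the subsample.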


How to cross-validate PCA, clustering, and matrix decomposition models · Its Neuronal

#artificialintelligence

TL;DR I cover how cross-validation is a somewhat tricky problem for matrix factorization models (including PCA & clustering as special cases) and provide some Python code snippets for fitting these models with held out data. Cross-validation is a fundamental paradigm in modern data analysis. However, it is largely applied to supervised settings, such as regression and classification. Here, the procedure is simple: fit your model on, say, 90% of the data (the training set), and evaluate its performance on the remaining 10% (the test set). However, this idea does not easily extend to unsupervised methods, such as dimensionality reduction methods or clustering.
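
The supervised procedure described above (fit on 90%, evaluate on the remaining 10%) can be sketched as follows. This is not code from the post: the helper names and the synthetic data are hypothetical, and a one-variable least squares fit stands in for whatever model is being validated.

```python
import random


def holdout_split(rows, test_fraction=0.1, seed=0):
    """Shuffle a list of rows and split it into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Ordinary least squares for y = a + b * x on (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(rows, intercept, slope):
    """Average squared prediction error on (x, y) pairs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in rows) / len(rows)


# Synthetic, noise-free data for illustration: y = 1 + 2x.
data = [(float(x), 1.0 + 2.0 * x) for x in range(100)]
train, test = holdout_split(data)
intercept, slope = fit_line(train)
print(len(train), len(test), mean_squared_error(test, intercept, slope))
```

The split works here because each held-out row has both an input and a target to score against; in PCA or clustering there is no such target, which is the difficulty the post goes on to address.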