Regression
Predicting Bike Usage for New York City’s Bike Sharing System
Singhvi, Divya (Cornell University) | Singhvi, Somya (Cornell University) | Frazier, Peter I. (Cornell University) | Henderson, Shane G. (Cornell University) | O'Mahony, Eoin (Cornell University) | Shmoys, David B. (Cornell University) | Woodard, Dawn B.
Bike sharing systems consist of a fleet of bikes placed in a network of docking stations. These bikes can be rented and then returned to any of the docking stations after use. Predicting unrealized bike demand at locations currently without bike stations is important for effectively designing and expanding bike sharing systems. We predict pairwise bike demand for New York City’s Citi Bike system. Since the system is driven by daily commuters, we focus only on the weekday morning rush hours between 7:00 AM and 11:00 AM. We use taxi usage, weather and spatial variables as covariates to predict bike demand, and further analyze the influence of precipitation and day of week. We show that aggregating stations into neighborhoods can substantially improve predictions. The presented model can assist planners by predicting bike demand at a macroscopic level, between pairs of neighborhoods.
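The aggregation step described above can be sketched as follows. This is only an illustration of summing station-pair trip counts into neighborhood-pair counts; the station IDs, neighborhood assignments and trip counts are invented for the example and do not come from the Citi Bike data or from the paper's model.

```python
from collections import Counter

# Hypothetical station -> neighborhood mapping (illustrative names only).
STATION_TO_NEIGHBORHOOD = {
    "S1": "Midtown",
    "S2": "Midtown",
    "S3": "Chelsea",
    "S4": "Chelsea",
}


def aggregate_trips(trips, station_to_neighborhood):
    """Sum station-pair trip counts into neighborhood-pair trip counts."""
    totals = Counter()
    for (origin, destination), count in trips.items():
        key = (
            station_to_neighborhood[origin],
            station_to_neighborhood[destination],
        )
        totals[key] += count
    return dict(totals)


# Made-up morning rush-hour trip counts between pairs of stations.
station_trips = {
    ("S1", "S3"): 4,
    ("S2", "S3"): 1,
    ("S2", "S4"): 3,
    ("S3", "S1"): 2,
}

neighborhood_trips = aggregate_trips(station_trips, STATION_TO_NEIGHBORHOOD)
print(neighborhood_trips)
```

Sparse station-to-station counts are pooled into fewer, larger neighborhood-to-neighborhood counts, which is the kind of macroscopic target the abstract refers to.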
Lower Dimensional Representations of City Neighbourhoods
Saeidi, Marzieh (University College London) | Riedel, Sebastian (University College London) | Capra, Licia (University College London)
We aim to profile the characteristics of areas defined at various spatial units across a district, a city or a country. Studying the attributes of areas can be very useful in several situations. In the past, research has focused mainly on studying specific characteristics of areas using a few selected attributes. In this paper we propose an alternative view on neighbourhood profiles. Instead of characterising a neighbourhood through a set of attributes such as those collected by the census, we propose the use of a low-dimensional feature representation, or embedding, created from one or more input sources. The purpose of the embeddings is to provide a generic representation of entities that performs well across several downstream tasks, such as regression for attribute prediction.
Competing with the Empirical Risk Minimizer in a Single Pass
Frostig, Roy, Ge, Rong, Kakade, Sham M., Sidford, Aaron
In many estimation problems, e.g. linear and logistic regression, we wish to minimize an unknown objective given only unbiased samples of the objective function. Furthermore, we aim to achieve this using as few samples as possible. In the absence of computational constraints, the minimizer of a sample average of observed data -- commonly referred to as either the empirical risk minimizer (ERM) or the $M$-estimator -- is widely regarded as the estimation strategy of choice due to its desirable statistical convergence properties. Our goal in this work is to perform as well as the ERM, on every problem, while minimizing the use of computational resources such as running time and space usage. We provide a simple streaming algorithm which, under standard regularity assumptions on the underlying problem, enjoys the following properties:
* The algorithm can be implemented in linear time with a single pass of the observed data, using space linear in the size of a single sample.
* The algorithm achieves the same statistical rate of convergence as the empirical risk minimizer on every problem, even considering constant factors.
* The algorithm's performance depends on the initial error at a rate that decreases super-polynomially.
* The algorithm is easily parallelizable.
Moreover, we quantify the (finite-sample) rate at which the algorithm becomes competitive with the ERM.
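To make the single-pass, constant-memory setting concrete, here is a minimal sketch using plain averaged stochastic gradient descent (Polyak-Ruppert averaging) for least squares. This is a swapped-in textbook technique, not the algorithm of the paper, and it makes no claim of matching the ERM's constants; the data generator, step size and sample count are assumptions chosen for illustration.

```python
import random


def sample_stream(n, true_w, noise_std, seed=0):
    """Yield n synthetic (x, y) pairs one at a time (assumed linear model)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in true_w]
        y = sum(w * xi for w, xi in zip(true_w, x)) + rng.gauss(0.0, noise_std)
        yield x, y


def averaged_sgd(stream, dim, step=0.5):
    """One pass of SGD on squared loss, returning the running average iterate.

    Only the current iterate and its running average are stored, so memory
    is linear in the size of a single sample.
    """
    w = [0.0] * dim
    w_avg = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        # Gradient of 0.5 * (w.x - y)^2 with respect to w is residual * x.
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - step * residual * xi for wi, xi in zip(w, x)]
        # Incremental update of the average of all iterates seen so far.
        w_avg = [a + (wi - a) / t for a, wi in zip(w_avg, w)]
    return w_avg


TRUE_W = [2.0, -1.0]  # hypothetical ground-truth weights
estimate = averaged_sgd(sample_stream(5000, TRUE_W, noise_std=0.1), dim=2)
print(estimate)
```

Each sample is consumed exactly once from the generator and then discarded, which is what distinguishes a streaming estimator from one that revisits the full data set as a batch ERM solver would.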
Using NLP to measure democracy
This paper uses natural language processing to create the first machine-coded democracy index, which I call Automated Democracy Scores (ADS). The ADS are based on 42 million news articles from 6,043 different sources and cover all independent countries in the 1993-2012 period. Unlike the democracy indices we have today, the ADS are replicable and have standard errors small enough to actually distinguish between cases. The ADS are produced with supervised learning. Three approaches are tried: a) a combination of Latent Semantic Analysis and tree-based regression methods; b) a combination of Latent Dirichlet Allocation and tree-based regression methods; and c) the Wordscores algorithm. The Wordscores algorithm outperforms the alternatives, so it is the one on which the ADS are based. There is a web application where anyone can change the training set and see how the results change: democracy-scores.org
Particle Gibbs for Bayesian Additive Regression Trees
Lakshminarayanan, Balaji, Roy, Daniel M., Teh, Yee Whye
Additive regression trees are flexible non-parametric models and popular off-the-shelf tools for real-world non-linear regression. In application domains, such as bioinformatics, where there is also demand for probabilistic predictions with measures of uncertainty, the Bayesian additive regression trees (BART) model, introduced by Chipman et al. (2010), is increasingly popular. As data sets have grown in size, however, the standard Metropolis-Hastings algorithms used to perform inference in BART are proving inadequate. In particular, these Markov chains make local changes to the trees and suffer from slow mixing when the data are high-dimensional or the best fitting trees are more than a few layers deep. We present a novel sampler for BART based on the Particle Gibbs (PG) algorithm (Andrieu et al., 2010) and a top-down particle filtering algorithm for Bayesian decision trees (Lakshminarayanan et al., 2013). Rather than making local changes to individual trees, the PG sampler proposes a complete tree to fit the residual. Experiments show that the PG sampler outperforms existing samplers in many settings.
On the Predictive Properties of Binary Link Functions
This paper provides a theoretical and computational justification of the long-held claim of the similarity of the probit and logit link functions often used in binary classification. Despite the widespread recognition of the strong similarities between these two link functions, very few (if any) researchers have dedicated time to carrying out a formal study aimed at firmly establishing and characterizing all aspects of their similarities and differences. This paper proposes a definition of both structural and predictive equivalence of link-function-based binary regression models, and explores the various ways in which they are either similar or dissimilar. From a predictive analytics perspective, it turns out that not only are probit and logit perfectly predictively concordant, but other link functions such as cauchit and complementary log-log also enjoy a very high percentage of predictive equivalence. Throughout this paper, simulated and real-life examples demonstrate all the equivalence results that we prove theoretically.
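The closeness of the two links can be checked numerically with a short sketch. It compares the standard normal CDF (probit) with a logistic CDF whose argument is rescaled by 1.702, a commonly quoted approximation constant that is an assumption here and not taken from the paper; the grid of evaluation points is likewise arbitrary. This illustrates the shape of the links only and is not the paper's definition of predictive equivalence.

```python
import math


def logistic_cdf(z):
    """Inverse logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_cdf(z):
    """Inverse probit link: the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


SCALE = 1.702  # assumed rescaling constant for matching logit to probit

# Evaluate both links on a grid of linear-predictor values in [-6, 6].
grid = [i / 100.0 for i in range(-600, 601)]

# Largest absolute gap between the probit and the rescaled logit curve.
max_gap = max(abs(probit_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)

# With a 0.5 threshold, both links assign the same label to any given
# linear predictor, because both equal 0.5 exactly at z = 0.
same_label = all(
    (probit_cdf(z) >= 0.5) == (logistic_cdf(z) >= 0.5) for z in grid
)

print(max_gap, same_label)
```

In a fitted model the two links generally yield different coefficient estimates, so agreement of predicted labels on real data also depends on how close the two fitted linear predictors are.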
Polynomial-Chaos-based Kriging
Schoebi, R., Sudret, B., Wiart, J.
Computer simulation has become the standard tool in many engineering fields for designing and optimizing systems, as well as for assessing their reliability. To cope with demanding analyses such as optimization and reliability, surrogate models (a.k.a. meta-models) have been increasingly investigated in the last decade. Polynomial Chaos Expansions (PCE) and Kriging are two popular non-intrusive meta-modelling techniques. PCE surrogates the computational model with a series of orthonormal polynomials in the input variables, where the polynomials are chosen to be consistent with the probability distributions of those input variables. Kriging, on the other hand, assumes that the computer model behaves as a realization of a Gaussian random process whose parameters are estimated from the available computer runs, i.e. input vectors and response values. These two techniques have so far been developed more or less in parallel, with little interaction between the researchers in the two fields. In this paper, PC-Kriging is derived as a new non-intrusive meta-modeling approach combining PCE and Kriging. A sparse set of orthonormal polynomials (PCE) approximates the global behavior of the computational model whereas Kriging manages the local variability of the model output. An adaptive algorithm similar to the least angle regression algorithm determines the optimal sparse set of polynomials. PC-Kriging is validated on various benchmark analytical functions which are easy to sample for reference results. From the numerical investigations it is concluded that PC-Kriging performs at least as well as, and often better than, the two distinct meta-modeling techniques. A larger gain in accuracy is obtained when the experimental design has a limited size, which is an asset when dealing with demanding computational models.
Gaussian Process Models for HRTF based Sound-Source Localization and Active-Learning
Luo, Yuancheng, Zotkin, Dmitry N., Duraiswami, Ramani
From a machine learning perspective, the human ability to localize sounds can be modeled as a non-parametric and non-linear regression problem between binaural spectral features of sound received at the ears (input) and their sound-source directions (output). The input features can be summarized in terms of the individual's head-related transfer functions (HRTFs), which measure the spectral response between the listener's eardrum and an external point in 3D. Based on these viewpoints, two related problems are considered: how can one achieve an optimal sampling of measurements for training sound-source localization (SSL) models, and how can SSL models be used to infer the subject's HRTFs in listening tests. First, we develop a class of binaural SSL models based on Gaussian process regression and solve a \emph{forward selection} problem that finds a subset of input-output samples that best generalizes to all SSL directions. Second, we use an \emph{active-learning} approach that updates an online SSL model for inferring the subject's SSL errors via headphones and a graphical user interface. Experiments show that only a small fraction of HRTFs are required for $5^{\circ}$ localization accuracy and that the learned HRTFs are localized closer to their intended directions than non-individualized HRTFs.
An Aggregation Method for Sparse Logistic Regression
$L_1$ regularized logistic regression has now become a workhorse of data mining and bioinformatics: it is widely used for many classification problems, particularly ones with many features. However, $L_1$ regularization typically selects too many features, and so-called false positives are unavoidable. In this paper, we demonstrate and analyze an aggregation method for sparse logistic regression in high dimensions. This approach linearly combines the estimators from a suitable set of logistic models with different underlying sparsity patterns and can balance predictive ability and model interpretability. The numerical performance of our proposed aggregation method is then investigated using simulation studies. We also analyze a published genome-wide case-control dataset to further evaluate the usefulness of the aggregation method in multilocus association mapping.
Evaluation of modelling approaches for predicting the spatial distribution of soil organic carbon stocks at the national scale
Martin, M. P., Orton, T. G., Lacarce, E., Meersmans, J., Saby, N. P. A., Paroissien, J. B., Jolivet, C., Boulonne, L., Arrouays, D.
Soil organic carbon (SOC) plays a major role in the global carbon budget. It can act as a source or a sink of atmospheric carbon, thereby possibly influencing the course of climate change. Improving the tools that model the spatial distributions of SOC stocks at national scales is a priority, both for monitoring changes in SOC and as an input for global carbon cycle studies. In this paper, we compare and evaluate two recent and promising modelling approaches. First, we considered several increasingly complex boosted regression trees (BRT), a convenient and efficient multiple regression model from the statistical learning field. Further, we considered a robust geostatistical approach coupled to the BRT models. The different approaches were tested on the dataset from the French Soil Monitoring Network, with a consistent cross-validation procedure. We showed that when a limited number of predictors were included in the BRT model, the standalone BRT predictions were significantly improved by robust geostatistical modelling of the residuals. However, when data for several SOC drivers were included, the standalone BRT model predictions were not significantly improved by geostatistical modelling. Therefore, in this latter situation, the BRT predictions might be considered adequate without the need for geostatistical modelling, provided that i) care is exercised in model fitting and validation, and ii) the dataset does not allow for modelling of local spatial autocorrelations, as is the case for many national systematic sampling schemes.