Regression
Retrospective Causal Inference with Machine Learning Ensembles: An Application to Anti-Recidivism Policies in Colombia
Samii, Cyrus, Paler, Laura, Daly, Sarah Zukerman
We present new methods to estimate causal effects retrospectively from micro data with the assistance of a machine learning ensemble. This approach overcomes two important limitations in conventional methods like regression modeling or matching: (i) ambiguity about the pertinent retrospective counterfactuals and (ii) potential misspecification, overfitting, and otherwise bias-prone or inefficient use of a large identifying covariate set in the estimation of causal effects. Our method targets the analysis toward a well-defined "retrospective intervention effect" (RIE) based on hypothetical population interventions and applies a machine learning ensemble that allows data to guide us, in a controlled fashion, on how to use a large identifying covariate set. We illustrate with an analysis of policy options for reducing ex-combatant recidivism in Colombia.
Applying Machine Learning Techniques to Classify Musical Instrument Loudspeakers
Celestion loudspeakers have powered the performances of many noted guitar and bass players, including legends such as Jimi Hendrix. Deciding whether a loudspeaker is good enough for professional musicians is a lengthy and painstaking process. Each speaker has its own unique sound based on a combination of sonic characteristics, such as midrange character and brightness. Evaluating a musical instrument loudspeaker involves subjective judgement about whether it generates a "good" sound. Only engineers with years of experience can reliably make that decision, and then only after repeated listening to a single loudspeaker and comparing the sounds it produces with those produced by a reference speaker.
Bayesian quantile additive regression trees
Kindo, Bereket P., Wang, Hao, Hanson, Timothy, Peña, Edsel A.
Quantile regression gives a comprehensive picture of the relationship between a response variable and a set of predictors. It is particularly appealing when the inferential interest lies in the probabilistic properties of extreme observations conditional on a set of predictors. Such objectives arise in various disciplines: in environmental sciences, Friederichs and Hense (2007) study the probabilistic properties of extreme precipitation events, while Pedersen (2015) models the tail distribution of stock and bond returns. In an epidemiological study, Burgette et al. (2011) use penalized quantile regression to explore covariates that affect the lower tail of the distribution of birth weight of babies. When the distribution of the dependent variable is skewed, the desire for robustness to extreme observations makes quantile regression a preferred approach. Examples include the study of tourist expense patterns in Marrocu et al. (2015) and wage distribution in Buchinsky (1995).
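As a minimal illustration of the idea behind quantile regression, the sketch below uses the standard check (pinball) loss, whose minimizer over constants is a sample quantile. This is not the Bayesian tree method of the paper above; the function names and the toy data are hypothetical and chosen only to show why quantiles are robust to an extreme observation.

```python
def pinball_loss(y, q, tau):
    """Check loss for quantile level tau: under- and over-prediction are weighted asymmetrically."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def best_constant(ys, tau):
    """Return the data point that minimizes the summed check loss (a sample tau-quantile)."""
    # A minimizer of this piecewise-linear loss is always attained at a data point,
    # so searching over the observations themselves is sufficient.
    return min(ys, key=lambda q: sum(pinball_loss(y, q, tau) for y in ys))


# Hypothetical toy sample with one extreme observation.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(ys, 0.5)   # central tendency, barely affected by the outlier
upper_fit = best_constant(ys, 0.9)    # upper-tail quantile, driven by the extreme value
mean_fit = sum(ys) / len(ys)          # the mean, for comparison, is pulled toward the outlier
```

In a full quantile regression the constant q is replaced by a function of the predictors, and the same loss is minimized over that function's parameters.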
Sparse additive Gaussian process with soft interactions
A significant portion of existing variable selection methods are only applicable to linear parametric models. Despite the linearity and additivity assumption, variable selection in linear regression models has been popular since 1970; refer to Akaike information criterion [AIC; Akaike (1973)]; Bayesian information criterion [BIC; Schwarz et al (1978)] and Risk inflation criterion [RIC; Foster and George (1994)]. Popular classical sparse-regression methods such as the least absolute shrinkage and selection operator [LASSO; Tibshirani (1996); Efron et al (2004)], and related penalization methods (Fan and Li, 2001; Zou and Hastie, 2005; Zou, 2006; Zhang, 2010) have gained popularity over the last decade due to their simplicity, computational scalability and efficiency in prediction when the underlying relation between the response and the predictors can be adequately described by parametric models. Bayesian methods (Mitchell and Beauchamp, 1988; George and McCulloch, 1993, 1997) with sparsity-inducing priors offer greater applicability beyond parametric models and are a convenient alternative when the underlying goal is inference and uncertainty quantification. However, there is still a limited amount of literature which seriously considers relaxing the linearity assumption, particularly when the dimension of the predictors is high. Moreover, when the focus is on learning the interactions between the variables, parametric models are often restrictive since they require very many parameters to capture the higher-order interaction terms. Smoothing based non-additive nonparametric regression methods (Lafferty and Wasserman, 2008; Wahba, 1990; Green and Silverman, 1993; Hastie and Tibshirani, 1990) can accommodate a wide range of relationships between predictors and response leading to excellent predictive performance.
Genetic algorithms and symbolic regression
A few months ago, I wrote a post about optimization using gradient descent, which involves searching for a model that best meets certain criteria by repeatedly making adjustments that improve things a little bit at a time. In many situations, this works quite well and will always or almost always find the best solution. But in other cases, it's possible for this approach to fall into a locally optimal solution that isn't the overall best, but is better than any nearby solution. A common way to deal with this sort of situation is to add some randomness into the algorithm, making it possible to jump out of one of these locally optimal solutions into a slightly worse solution that is adjacent to a much better one. In this post, I want to explore one such approach, called a genetic algorithm (or an evolutionary algorithm), which I'll illustrate with a specific type of genetic algorithm called symbolic regression.
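To make the role of randomness concrete, here is a rough sketch of a generic evolutionary loop over real numbers. It is not the symbolic regression setup the post goes on to describe: the bumpy objective, the population size, and the mutation scale are all hypothetical choices made for illustration.

```python
import math
import random


def fitness(x):
    # Hypothetical bumpy objective: many local maxima, global maximum of 3.0 at x = 0.
    return -x * x + 3.0 * math.cos(5.0 * x)


def evolve(pop_size=30, generations=60, sigma=0.5, seed=0):
    """Run a simple select / crossover / mutate loop and track the best fitness per generation."""
    rng = random.Random(seed)
    population = [rng.uniform(-4.0, 4.0) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]      # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = 0.5 * (a + b)                  # crossover: average two parents
            child += rng.gauss(0.0, sigma)         # mutation: random jump, possibly out of a local optimum
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, history


best_x, best_history = evolve()
```

Because the fitter half survives unchanged, the best fitness never gets worse from one generation to the next, while the mutated children keep exploring regions that a purely greedy local search would not reach.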
Convergence rates of Kernel Conjugate Gradient for random design regression
Blanchard, Gilles, Krämer, Nicole
We prove statistical rates of convergence for kernel-based least squares regression from i.i.d. data using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. Following the setting introduced in earlier related literature, we study so-called "fast convergence rates" depending on the regularity of the target regression function (measured by a source condition in terms of the kernel integral operator) and on the effective dimensionality of the data mapped into the kernel space. We obtain upper bounds, essentially matching known minimax lower bounds, for the $\mathcal{L}^2$ (prediction) norm as well as for the stronger Hilbert norm, if the true regression function belongs to the reproducing kernel Hilbert space. If the latter assumption is not fulfilled, we obtain similar convergence rates for appropriate norms, provided additional unlabeled data are available.
rasbt/python-machine-learning-book
Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive). In contrast, we use the (standard) Logistic Regression model in binary classification tasks. Now, let me briefly explain how that works and how softmax regression differs from logistic regression. As the name suggests, in softmax regression (SMR), we replace the sigmoid logistic function by the so-called softmax function: $P(y = j \mid z^{(i)}) = e^{z_j^{(i)}} / \sum_{l=1}^{k} e^{z_l^{(i)}}$. This softmax function computes the probability that the training sample $x^{(i)}$ belongs to class $j$ given the weights and the net input $z^{(i)}$. So, we compute the probability $p(y = j \mid x^{(i)}; w_j)$ for each class label $j = 1, \ldots, k$.
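The softmax step can be sketched in a few lines of plain Python. This is a minimal illustration and not code from the linked repository; the function name and the example net inputs are hypothetical, and the maximum is subtracted before exponentiating purely for numerical stability (it does not change the result).

```python
import math


def softmax(z):
    """Map a list of net inputs (one per class) to class probabilities that sum to 1."""
    m = max(z)                               # shift by the max to avoid overflow in exp
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical net inputs for one training sample and k = 3 classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))   # index of the most probable class
```

The predicted label is simply the class with the largest probability, which is also the class with the largest net input.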
LogisticRegression - mlxtend
Related to the Perceptron and 'Adaline', a Logistic Regression model is a linear model for binary classification. However, instead of minimizing a linear cost function such as the sum of squared errors (SSE) in Adaline, we minimize a sigmoid function, i.e., the logistic function $\phi(z) = 1 / (1 + e^{-z})$, where the net input $z$ is tied to the class probability through the logit function, $\mathrm{logit}(p(y = 1 \mid \mathbf{x})) = z$. Here, $p(y = 1 \mid \mathbf{x})$ is the conditional probability that a particular sample belongs to class 1 given its features $\mathbf{x}$. The logit function takes inputs in the range [0, 1] and transforms them to values over the entire real number range. In contrast, the logistic function takes input values over the entire real number range and transforms them to values in the range [0, 1]. In other words, the logistic function is the inverse of the logit function, and it lets us predict the conditional probability that a certain sample belongs to class 1 (or class 0).
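The inverse relationship between the two functions is easy to check numerically. The sketch below is plain Python and does not use the mlxtend API; the helper names are hypothetical, and the logistic function is written in a branch form only to avoid overflow for large negative inputs.

```python
import math


def logistic(z):
    """Map any real net input z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                 # equivalent form that stays finite for very negative z
    return e / (1.0 + e)


def logit(p):
    """Map a probability strictly between 0 and 1 to the log-odds on the real line."""
    return math.log(p / (1.0 - p))


p = 0.8
z = logit(p)                        # log-odds of class 1
p_back = logistic(z)                # applying the logistic function recovers p
label = 1 if logistic(z) >= 0.5 else 0   # threshold at 0.5 to obtain a class label
```

Thresholding the probability at 0.5 is the same as thresholding the net input at 0, since `logistic(0)` is exactly 0.5.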
Fundamental Parameters of Main-Sequence Stars in an Instant with Machine Learning
Bellinger, Earl P., Angelou, George C., Hekker, Saskia, Basu, Sarbani, Ball, Warrick, Guggenberger, Elisabeth
Owing to the remarkable photometric precision of space observatories like Kepler, stellar and planetary systems beyond our own are now being characterized en masse for the first time. These characterizations are pivotal for endeavors such as searching for Earth-like planets and solar twins, understanding the mechanisms that govern stellar evolution, and tracing the dynamics of our Galaxy. The volume of data that is becoming available, however, brings with it the need to process this information accurately and rapidly. While existing methods can constrain fundamental stellar parameters such as ages, masses, and radii from these observations, they require substantial computational efforts to do so. We develop a method based on machine learning for rapidly estimating fundamental parameters of main-sequence solar-like stars from classical and asteroseismic observations. We first demonstrate this method on a hare-and-hound exercise and then apply it to the Sun, 16 Cyg A & B, and 34 planet-hosting candidates that have been observed by the Kepler spacecraft. We find that our estimates and their associated uncertainties are comparable to the results of other methods, but with the additional benefit of being able to explore many more stellar parameters while using much less computation time. We furthermore use this method to present evidence for an empirical diffusion-mass relation. Our method is open source and freely available for the community to use. The source code for all analyses and for all figures appearing in this manuscript can be found electronically at https://github.com/earlbellinger/asteroseismology
An Application of Network Lasso Optimization For Ride Sharing Prediction
Ghosh, Shaona, Page, Kevin, De Roure, David
Ride sharing has important implications in terms of environmental, social and individual goals by reducing carbon footprints, fostering social interactions and economizing commuter costs. The ride sharing systems that are commonly available lack adaptive and scalable techniques that can simultaneously learn from large-scale data and predict in a real-time, dynamic fashion. In this paper, we study such a problem towards a smart city initiative, where a generic ride sharing system is conceived capable of making predictions about ride share opportunities based on the historically recorded data while satisfying real-time ride requests. Underpinning the system is an application of a powerful machine learning convex optimization framework called Network Lasso that uses the Alternating Direction Method of Multipliers (ADMM) optimization for learning and dynamic prediction. We propose an application of a robust and scalable unified optimization framework within the ride sharing case-study. The application of the Network Lasso framework is capable of jointly optimizing and clustering different rides based on their spatial and model similarity. The prediction from the framework clusters new ride requests, making accurate price prediction based on the clusters, detecting hidden correlations in the data and allowing fast convergence due to the network topology. We provide an empirical evaluation of the application of ADMM network Lasso on real trip record and simulated data, proving their effectiveness since the mean squared error of the algorithm's prediction is minimized on the test rides.