Regression
Sufficient Representations for Categorical Variables
Johannemann, Jonathan, Hadad, Vitor, Athey, Susan, Wager, Stefan
Many learning algorithms require categorical data to be transformed into real vectors before it can be used as input. Often, categorical variables are encoded as one-hot (or dummy) vectors. However, this mode of representation can be wasteful since it adds many low-signal regressors, especially when the number of unique categories is large. In this paper, we investigate simple alternative solutions for universally consistent estimators that rely on lower-dimensional real-valued representations of categorical variables that are "sufficient" in the sense that no predictive information is lost. We then compare preexisting and proposed methods on simulated and observational datasets.
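As a rough illustration of the dimensionality argument, the sketch below contrasts a one-hot encoding with a lower-dimensional encoding that replaces each category by the within-category mean of a numeric covariate. The helper names and toy data are hypothetical, and the group-mean encoding is only a simple stand-in for a compact representation, not necessarily one of the representations the paper proposes.

```python
from collections import defaultdict


def one_hot(categories):
    """Encode each value as a 0/1 vector with one slot per unique category."""
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    return [[1.0 if index[c] == j else 0.0 for j in range(len(levels))]
            for c in categories]


def group_mean_encoding(categories, covariate):
    """Replace each category by the mean of a numeric covariate within it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for c, x in zip(categories, covariate):
        totals[c] += x
        counts[c] += 1
    return [[totals[c] / counts[c]] for c in categories]


# Toy data (hypothetical): one categorical column and one numeric covariate.
city = ["a", "b", "a", "c", "b", "d"]
income = [1.0, 4.0, 3.0, 10.0, 6.0, 7.0]

wide = one_hot(city)                         # one column per unique category
narrow = group_mean_encoding(city, income)   # one column, however many categories
```

The one-hot matrix grows by a column for every new category, while the group-mean encoding keeps a fixed width; whether such a compact encoding loses predictive information depends on the data-generating process.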
Machine learning algorithms to infer trait matching and predict species interactions in ecological networks
Pichler, Maximilian, Boreux, Virginie, Klein, Alexandra-Maria, Schleuning, Matthias, Hartig, Florian
Ecologists have long suspected that species are more likely to interact if their traits match in a particular way. For example, a pollination interaction may be particularly likely if the proportions of a bee's tongue match flower shape in a beneficial way. Empirical evidence for trait matching, however, varies significantly in strength among different types of ecological networks. Here, we show that ambiguity among empirical trait matching studies may have arisen at least in part from using overly simple statistical models. Using simulated and real data, we contrast conventional regression models with Machine Learning (ML) models (Random Forest, Boosted Regression Trees, Deep Neural Networks, Convolutional Neural Networks, Support Vector Machines, naive Bayes, and k-Nearest-Neighbor), testing their ability to predict species interactions based on traits, and infer trait combinations causally responsible for species interactions. We find that the best ML models can successfully predict species interactions in plant-pollinator networks (up to 0.93 AUC) and outperform conventional regression models. Our results also demonstrate that ML models can better identify the causally responsible trait matching combinations than GLMs. In two case studies, the best ML models could successfully predict species interactions in a global plant-pollinator database and infer ecologically plausible trait matching rules for a plant-hummingbird network from Costa Rica, without any prior assumptions about the system. We conclude that flexible ML models offer many advantages over traditional regression models for understanding interaction networks. We anticipate that these results extrapolate to other network types, such as trophic or competitive networks. More generally, our results highlight the potential of ML and artificial intelligence for inference beyond standard tasks such as pattern recognition.
Regression Analysis in One Picture
The basic idea behind regression analysis is to take a set of data and use that data to make predictions. A useful first step is to make a scatter plot to see the rough shape of your data. Then, choose a regression method to find the line of best fit. Which method you choose depends upon the shape the scatter plot reveals (is it a line, a curve, or a parabola?). The following image shows an overview of regression; see below for links to more detail.
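For the simplest case, where the scatter plot looks roughly like a straight line, the line of best fit can be computed with ordinary least squares. The sketch below is a minimal version under that assumption; the function names and the toy data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-movement of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data (hypothetical) that looks roughly linear on a scatter plot.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Use the fitted line to predict y for a new x."""
    return slope * x + intercept
```

If the scatter plot shows a curve instead, the same idea applies after choosing a different functional form, such as a polynomial.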
Understanding Neural Networks within Data Science
Moving forward, let's start with our basic imports: Let's say you want to build a model for either classification or regression. How would you know which model is best and which one you should apply to your data set? To answer this, you need to fully understand the data you're trying to apply data science concepts to. My cybersecurity data science project was an unbalanced classification problem, so I decided to use a classification neural network model on the data.
How Concerned Should You be About Predictor Collinearity? It Depends…
This past Northern Hemisphere summer, I gave several talks (some in the Southern Hemisphere) in which one of the Q&A topics was the problem of collinearity between predictor variables (also known as multicollinearity). My stock response to a question on this topic was (and is) to reply with the clarifying question, "How many rows do you have to develop the model?" If the follow-up response was in the tens of thousands, my counter-response was "Don't worry about collinearity." In contrast, if the audience member's response was a few hundred rows or less, my response was "Very!" While these two different responses may seem contradictory, they actually are not.
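One way to see why the two answers are compatible is the textbook formula for the standard error of an OLS coefficient, in which collinearity enters through the variance inflation factor (VIF) and sample size enters through the number of rows. The sketch below assumes a two-predictor linear model with homoskedastic noise; that setting, the function name, and the example numbers are assumptions made for illustration and are not taken from the post.

```python
import math


def coef_std_error(n, r, sigma=1.0, predictor_sd=1.0):
    """Standard error of one OLS slope when two predictors correlate at r.

    Uses Var(b) = sigma^2 * VIF / ((n - 1) * predictor_sd^2),
    with VIF = 1 / (1 - r^2) for the two-predictor case.
    """
    vif = 1.0 / (1.0 - r ** 2)
    return sigma * math.sqrt(vif / ((n - 1) * predictor_sd ** 2))


# Strong collinearity (r = 0.9) with a few hundred rows versus tens of thousands.
small_sample = coef_std_error(n=300, r=0.9)
large_sample = coef_std_error(n=30_000, r=0.9)

# Baseline: the same small sample with uncorrelated predictors.
small_sample_no_collinearity = coef_std_error(n=300, r=0.0)
```

Under these assumptions, the collinear model with tens of thousands of rows still has a smaller standard error than the uncorrelated model with a few hundred rows, which is the intuition behind asking about row count first.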
Generalizing Psychological Similarity Spaces to Unseen Stimuli
Bechberger, Lucas, Kühnberger, Kai-Uwe
Combining Multidimensional Scaling with Artificial Neural Networks (arXiv:1908.09260v1). Abstract: The cognitive framework of conceptual spaces proposes to represent concepts as regions in psychological similarity spaces. These similarity spaces are typically obtained through multidimensional scaling (MDS), which converts human dissimilarity ratings for a fixed set of stimuli into a spatial representation. One can distinguish metric MDS (which assumes that the dissimilarity ratings are interval or ratio scaled) from nonmetric MDS (which only assumes an ordinal scale). In our first study, we show that despite its additional assumptions, metric MDS does not necessarily yield better solutions than nonmetric MDS. In this chapter, we furthermore propose to learn a mapping from raw stimuli into the similarity space using artificial neural networks (ANNs) in order to generalize the similarity space to unseen inputs. In our second study, we show that a linear regression from the activation vectors of a convolutional ANN to similarity spaces obtained by MDS can be successful and that the results are sensitive to the number of dimensions of the similarity space. 1 Introduction: The cognitive framework of conceptual spaces [Gärdenfors, 2000] proposes a geometric representation of conceptual structures: Instances are represented as points and concepts are represented as regions in psychological similarity spaces. Based on this representation, one can explain a range of cognitive phenomena from oneshot
Footnote: Lucas Bechberger, Institute of Cognitive Science, Osnabrück University, email: lucas.bechberger@. The research presented in this paper is an updated, corrected, and significantly extended version of research reported in [Bechberger and Kypridemou, 2018].
In principle, there are three ways of obtaining the dimensions of a conceptual space: If the domain of interest is well understood, one can manually define the dimensions and thus the overall similarity space. A second approach is based on machine learning algorithms for dimensionality reduction. For instance, unsupervised artificial neural networks (ANNs) such as autoencoders or self-organizing maps can be used to find a compressed representation for a given set of input stimuli. This task is typically solved by optimizing a mathematical error function, which may not be satisfactory from a psychological point of view. A third way of obtaining the dimensions of a conceptual space is based on dissimilarity ratings obtained from human subjects. The technique of "multidimensional scaling" (MDS) takes as input these pairwise dissimilarities as well as the desired number t of dimensions. It then represents each stimulus as a point in a t-dimensional space in such a way that the distances between points in this space reflect the dissimilarities of their corresponding stimuli.
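To make the input and output of MDS concrete, the sketch below implements classical (Torgerson) MDS with NumPy: it takes a matrix of pairwise dissimilarities and a target dimensionality t, and returns one t-dimensional point per stimulus. Classical MDS is used here only because it has a closed-form solution; it is not claimed to be the MDS variant used in the chapter, and the function name and toy dissimilarity matrix are hypothetical.

```python
import numpy as np


def classical_mds(dissimilarities, t):
    """Embed n stimuli in t dimensions from an n x n dissimilarity matrix."""
    d = np.asarray(dissimilarities, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to get a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the t largest eigenpairs; clip tiny negative eigenvalues to zero.
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:t]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Toy dissimilarities (hypothetical): side lengths of a 3-4-5 right triangle.
ratings = [[0.0, 3.0, 4.0],
           [3.0, 0.0, 5.0],
           [4.0, 5.0, 0.0]]

points = classical_mds(ratings, t=2)

# Pairwise Euclidean distances between the embedded points.
recovered = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
```

Because the toy dissimilarities are exact Euclidean distances in the plane, the recovered distances match the input; with real human ratings the match is generally only approximate.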
Bias, Variance, and Regularization in Linear Regression: Lasso, Ridge, and Elastic Net -- Differences and uses
Regression is an incredibly popular and common machine learning technique. Often the starting point in learning machine learning, linear regression is an intuitive algorithm for easy-to-understand problems. Whenever you're trying to predict a continuous variable (a variable that can take any value in some numeric range), linear regression and its relatives are often strong options, and are almost always the best place to start. This blog assumes a functional knowledge of ordinary least squares (OLS) linear regression. You can read more about OLS linear regression here, here, or here.
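As a small illustration of how the three penalties named in the title differ from OLS, the sketch below solves each one in closed form for the simplest possible case: a single predictor with no intercept. The objective scaling (a factor of one half on the squared error), the function names, and the toy data are assumptions made for this example, not definitions from the post.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, stopping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_one_predictor(xs, ys, l1=0.0, l2=0.0):
    """Minimize 0.5 * sum((y - b*x)^2) + l1 * |b| + 0.5 * l2 * b^2 over b.

    l1 = l2 = 0 gives OLS, l2 > 0 alone gives ridge,
    l1 > 0 alone gives lasso, and both together give elastic net.
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return soft_threshold(sxy, l1) / (sxx + l2)


# Toy data (hypothetical) generated by y = 2 * x with no noise.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.0, -2.0, 0.0, 2.0, 4.0]

ols = fit_one_predictor(xs, ys)
ridge = fit_one_predictor(xs, ys, l2=10.0)
lasso = fit_one_predictor(xs, ys, l1=5.0)
lasso_strong = fit_one_predictor(xs, ys, l1=25.0)
elastic_net = fit_one_predictor(xs, ys, l1=5.0, l2=10.0)
```

In this setting ridge scales the coefficient down without ever reaching zero, while a large enough lasso penalty sets it exactly to zero; elastic net combines both effects.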
Using Contextual Information to Improve Blood Glucose Prediction
Akbari, Mohammad, Chunara, Rumi
Blood glucose value prediction is an important task in diabetes management. While it is reported that glucose concentration is sensitive to social context such as mood, physical activity, stress, and diet, alongside the influence of diabetes pathologies, we need more research on data and methodologies to incorporate and evaluate signals about such temporal context in prediction models. Person-generated data sources, such as actively contributed surveys as well as passively mined data from social media, offer an opportunity to capture such context; however, the self-reported nature and sparsity of such data mean that they are noisier and less specific than physiological measures such as blood glucose values themselves. Therefore, here we propose a Gaussian Process model to both address these data challenges and combine blood glucose and latent feature representations of contextual data for a novel multi-signal blood glucose prediction task. We find this approach outperforms common methods for multi-variate data, as well as using the blood glucose values in isolation. Given a robust evaluation across two blood glucose datasets with different forms of contextual information, we conclude that multi-signal Gaussian Processes can improve blood glucose prediction by using contextual information and may provide a significant shift in blood glucose prediction research and practice.
Consistent Classification with Generalized Metrics
Wang, Xiaoyan, Li, Ran, Yan, Bowei, Koyejo, Oluwasanmi
We propose a framework for constructing and analyzing multiclass and multioutput classification metrics, i.e., involving multiple, possibly correlated multiclass labels. Our analysis reveals novel insights on the geometry of feasible confusion tensors -- including necessary and sufficient conditions for the equivalence between optimizing an arbitrary non-decomposable metric and learning a weighted classifier. Further, we analyze averaging methodologies commonly used to compute multioutput metrics and characterize the corresponding Bayes optimal classifiers. We show that the plug-in estimator based on this characterization is consistent and is easily implemented as a post-processing rule. Empirical results on synthetic and benchmark datasets support the theoretical findings.
Wasserstein Distributionally Robust Optimization: Theory and Applications in Machine Learning
Kuhn, Daniel, Esfahani, Peyman Mohajerin, Nguyen, Viet Anh, Shafieezadeh-Abadeh, Soroosh
Many decision problems in science, engineering and economics are affected by uncertain parameters whose distribution is only indirectly observable through samples. The goal of data-driven decision-making is to learn a decision from finitely many training samples that will perform well on unseen test samples. This learning task is difficult even if all training and test samples are drawn from the same distribution---especially if the dimension of the uncertainty is large relative to the training sample size. Wasserstein distributionally robust optimization seeks data-driven decisions that perform well under the most adverse distribution within a certain Wasserstein distance from a nominal distribution constructed from the training samples. In this tutorial we will argue that this approach has many conceptual and computational benefits. Most prominently, the optimal decisions can often be computed by solving tractable convex optimization problems, and they enjoy rigorous out-of-sample and asymptotic consistency guarantees. We will also show that Wasserstein distributionally robust optimization has interesting ramifications for statistical learning and motivates new approaches for fundamental learning tasks such as classification, regression, maximum likelihood estimation or minimum mean square error estimation, among others.