Regression
What is Logistic Regression?
This tutorial is on the basics of applying logistic regression, using a little bit of Python. It is also a continuation of the post "What is Linear Regression?", which can be found here. It is a little counterintuitive, but Logistic Regression is typically used as a classifier. In fact, Logistic Regression is one of the most used and well-known classification methods Data Scientists use. The idea behind this classification method is that the output will be between 0 and 1. Essentially returning the probability that the data you gave to the model, belongs to a certain group or class.
Cluster Regularization via a Hierarchical Feature Regression
Prediction tasks with high-dimensional nonorthogonal predictor sets pose a challenge for least squares based fitting procedures. A large and productive literature exists, discussing various regularized approaches to improving the out-of-sample robustness of parameter estimates. This paper proposes a novel cluster-based regularization -- the hierarchical feature regression (HFR) --, which mobilizes insights from the domains of machine learning and graph theory to estimate parameters along a supervised hierarchical representation of the predictor set, shrinking parameters towards group targets. The method is innovative in its ability to estimate optimal compositions of predictor groups, as well as the group targets endogenously. The HFR can be viewed as a supervised factor regression, with the strength of shrinkage governed by a penalty on the extent of idiosyncratic variation captured in the fitting process. The method demonstrates good predictive accuracy and versatility, outperforming a panel of benchmark regularized estimators across a diverse set of simulated regression tasks, including dense, sparse and grouped data generating processes. An application to the prediction of economic growth is used to illustrate the HFR's effectiveness in an empirical setting, with favorable comparisons to several frequentist and Bayesian alternatives.
Understanding Logistic Regression in Pythonic Way
Let's take an example suppose you see two people Ashutosh & Abhishek and guessed that Abhishek is obese and Ashutosh is non-obese. But If you want to perform same task using machine how would you perform. Here Classification based models come into picture. Based on the data that has been provided to machine, it will predict whether the person is obese or not. Logistic Regression is the combination of Linear Regression and Sigmoid Function.
Accuracy on the Line: On the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization
Miller, John, Taori, Rohan, Raghunathan, Aditi, Sagawa, Shiori, Koh, Pang Wei, Shankar, Vaishaal, Liang, Percy, Carmon, Yair, Schmidt, Ludwig
For machine learning systems to be reliable, we must understand their performance in unseen, out-of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of-distribution performance on variants of CIFAR-10 & ImageNet, a synthetic pose estimation task derived from YCB objects, satellite imagery classification in FMoW-WILDS, and wildlife classification in iWildCam-WILDS. The strong correlations hold across model architectures, hyperparameters, training set size, and training duration, and are more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.
Coastal water quality prediction based on machine learning with feature interpretation and spatio-temporal analysis
Grbčić, Luka, Družeta, Siniša, Mauša, Goran, Lipić, Tomislav, Lušić, Darija Vukić, Alvir, Marta, Lučin, Ivana, Sikirica, Ante, Davidović, Davor, Travaš, Vanja, Kalafatović, Daniela, Pikelj, Kristina, Fajković, Hana, Holjević, Toni, Kranjčević, Lado
Coastal water quality management is a public health concern, as poor coastal water quality can harbor pathogens that are dangerous to human health. Tourism-oriented countries need to actively monitor the condition of coastal water at tourist popular sites during the summer season. In this study, routine monitoring data of $Escherichia\ Coli$ and enterococci across 15 public beaches in the city of Rijeka, Croatia, were used to build machine learning models for predicting their levels based on environmental parameters as well as to investigate their relationships with environmental stressors. Gradient Boosting (Catboost, Xgboost), Random Forests, Support Vector Regression and Artificial Neural Networks were trained with measurements from all sampling sites and used to predict $E.\ Coli$ and enterococci values based on environmental features. The evaluation of stability and generalizability with 10-fold cross validation analysis of the machine learning models, showed that the Catboost algorithm performed best with R$^2$ values of 0.71 and 0.68 for predicting $E.\ Coli$ and enterococci, respectively, compared to other evaluated ML algorithms including Xgboost, Random Forests, Support Vector Regression and Artificial Neural Networks. We also use the SHapley Additive exPlanations technique to identify and interpret which features have the most predictive power. The results show that site salinity measured is the most important feature for forecasting both $E.\ Coli$ and enterococci levels. Finally, the spatial and temporal accuracy of both ML models were examined at sites with the lowest coastal water quality. The spatial $E. Coli$ and enterococci models achieved strong R$^2$ values of 0.85 and 0.83, while the temporal models achieved R$^2$ values of 0.74 and 0.67. The temporal model also achieved moderate R$^2$ values of 0.44 and 0.46 at a site with high coastal water quality.
Use of Variational Inference in Music Emotion Recognition
Deziderio, Nathalie, de Carvalho, Hugo Tremonte
This work was developed aiming to employ Statistical techniques to the field of Music Emotion Recognition, a well-recognized area within the Signal Processing world, but hardly explored from the statistical point of view. Here, we opened several possibilities within the field, applying modern Bayesian Statistics techniques and developing efficient algorithms, focusing on the applicability of the results obtained. Although the motivation for this project was the development of a emotion-based music recommendation system, its main contribution is a highly adaptable multivariate model that can be useful interpreting any database where there is an interest in applying regularization in an efficient manner. Broadly speaking, we will explore what role a sound theoretical statistical analysis can play in the modeling of an algorithm that is able to understand a well-known database and what can be gained with this kind of approach.
A Dataset and Method for Hallux Valgus Angle Estimation Based on Deep Learing
Xu, Ningyuan, Zhuang, Jiayan, Wu, Yaojun, Xiao, Jiangjian
Angular measurements is essential to make a resonable treatment for Hallux valgus (HV), a common forefoot deformity. However, it still depends on manual labeling and measurement, which is time-consuming and sometimes unreliable. Automating this process is a thing of concern. However, it lack of dataset and the keypoints based method which made a great success in pose estimation is not suitable for this field.To solve the problems, we made a dataset and developed an algorithm based on deep learning and linear regression. It shows great fitting ability to the ground truth.
Multivariate functional group sparse regression: functional predictor selection
In this paper, we propose methods for functional predictor selection and the estimation of smooth functional coefficients simultaneously in a scalar-on-function regression problem under high-dimensional multivariate functional data setting. In particular, we develop two methods for functional group-sparse regression under a generic Hilbert space of infinite dimension. We show the convergence of algorithms and the consistency of the estimation and the selection (oracle property) under infinite-dimensional Hilbert spaces. Simulation studies show the effectiveness of the methods in both the selection and the estimation of functional coefficients. The applications to the functional magnetic resonance imaging (fMRI) reveal the regions of the human brain related to ADHD and IQ.
Privacy Concerns in Chatbot Interactions: When to Trust and When to Worry
Saglam, Rahime Belen, Nurse, Jason R. C., Hodges, Duncan
Through advances in their conversational abilities, chatbots have started to request and process an increasing variety of sensitive personal information. The accurate disclosure of sensitive information is essential where it is used to provide advice and support to users in the healthcare and finance sectors. In this study, we explore users' concerns regarding factors associated with the use of sensitive data by chatbot providers. We surveyed a representative sample of 491 British citizens. Our results show that the user concerns focus on deleting personal information and concerns about their data's inappropriate use. We also identified that individuals were concerned about losing control over their data after a conversation with conversational agents. We found no effect from a user's gender or education but did find an effect from the user's age, with those over 45 being more concerned than those under 45. We also considered the factors that engender trust in a chatbot. Our respondents' primary focus was on the chatbot's technical elements, with factors such as the response quality being identified as the most critical factor. We again found no effect from the user's gender or education level; however, when we considered some social factors (e.g. avatars or perceived 'friendliness'), we found those under 45 years old rated these as more important than those over 45. The paper concludes with a discussion of these results within the context of designing inclusive, digital systems that support a wide range of users.
eQTL mapping using allele-specific gene expression
Using information from allele-specific gene expression (ASE) can sub-stantially improve the power to map gene expression quantitative trait loci (eQTLs). However, such practice has been limited, partly due to high computational cost and the requirement to access raw data that can take a large amount of storage space. To address these computational challenges, we have developed a computational framework that uses a statistical method named TReCASE as its computational engine, and it is computationally feasible for large scale analysis. We applied it to map eQTLs in 28 human tissues using the data from the Genotype-Tissue Expression (GTEx) project. Compared with a popular linear regression method that does not use ASE data, TReCASE can double the number of eGenes (i.e., genes with at least one significant eQTL) when sample size is relatively small, e.g., n 200.