Goto

Collaborating Authors

 Regression


Decision Tree Algorithm -A Complete Guide - Analytics Vidhya

#artificialintelligence

Till now we have learned about linear regression, logistic regression, and they were pretty hard to understand. Let's now start with Decision tree's and I assure you this is probably the easiest algorithm in Machine Learning. There's not much mathematics involved here. Since it is very easy to use and interpret it is one of the most widely used and practical methods used in Machine Learning. Root Nodes – It is the node present at the beginning of a decision tree from this node the population starts dividing according to various features.


All the Datasets You Need to Practice Data Science Skills and Make a Great Portfolio

#artificialintelligence

Every time I attempt to do a project for learning a new topic or for a project I spend a significant amount of time finding a suitable dataset for that. That way I have quite a lot of datasets that helped me learn and do some cool projects for my portfolio. I am going to share those datasets in this article so that you have a dataset to practice and make your portfolio. This dataset has information on the Olympic results. Each row contains the data of a country. This dataset will give you a taste of data cleaning to start with.


Math Behind Logistic Regression

#artificialintelligence

Before we understand the bizarre symbols used in Logistic Regression, let's recollect the underlying idea of this technique. Logistic Regression is a machine learning technique that is widely used for classification problems. The definition above indicates that the algorithm is also useful for problems other than classification, regression for example. But this article will be centered around classification only. How does a classification problem look like?


Improve Linear Regression for Time Series Forecasting

#artificialintelligence

Time series forecasting is a very fascinating task. However, build a machine-learning algorithm to predict future data is trickier than expected. The hardest thing to handle is the temporal dependency present in the data. By their nature, time-series data are subject to shifts. This may result in temporal drifts of various kinds which may become our algorithm inaccurate.


Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach

arXiv.org Machine Learning

The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research has become a barrier to translating precision medicine research into practice. Due to heterogeneity across populations, risk prediction models are often found to be underperformed in these underrepresented populations, and therefore may further exacerbate known health disparities. In this paper, we propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions via a federated transfer learning approach. The proposed method can handle the challenging setting where sample sizes from different populations are highly unbalanced. With only a small number of communications across participating sites, the proposed method can achieve performance comparable to the pooled analysis where individual-level data are directly pooled together. We show that the proposed method improves the estimation and prediction accuracy in underrepresented populations, and reduces the gap of model performance across populations. Our theoretical analysis reveals how estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-center study, in which we construct polygenic risk prediction models for Type II diabetes in AA population.


Polynomial Regression in Machine Learning

#artificialintelligence

Polynomial Regression is one of the important parts of Machine Learning. Polynomial Regression is a regression algorithm that frames a relationship between the independent variable(x) and dependent variable(y) as nth degree polynomial. Basically, it brings forth the finest estimation for dependent and independent variables. To convert the multiple linear regression into polynomial regression we need to add some polynomial terms. It acts as a saver when we have to deal with a dataset that is not linearly separable.


Causal Inference in the Wild

#artificialintelligence

Causal inference is a hot topic in machine learning, and there are many excellent primers on the theory of causal inference available [1–4]. But much fewer examples of real-world applications of machine-learning-powered causal inference exist. This article introduces one such example from an industry context, using a (public) real-world dataset. It is aimed at a technical audience with an understanding of the basics of causality. Specifically, I will look at the "ideal" scenario of price elasticity estimation [2, 5].


Collinearity in Regression Model

#artificialintelligence

To make it more clear why collinearity is such a problem, let's take a look at a use case. For the use case, I am going to use the car dataset that you can download easily on Kaggle. Let's imagine we want to predict the price of a car, or price variable in the dataset. To predict it, we will use certain independent variables such as the car's city MPG, highway MPG, horsepower, engine size, stroke, width, peak RPM, and compression ratio. Next, we build a regression model based on these independent variables.


Heavy-tailed Streaming Statistical Estimation

arXiv.org Machine Learning

We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples. This could also be viewed as stochastic optimization under heavy-tailed distributions, with an additional $O(p)$ space complexity constraint. We design a clipped stochastic gradient descent algorithm and provide an improved analysis, under a more nuanced condition on the noise of the stochastic gradients, which we show is critical when analyzing stochastic optimization problems arising from general statistical estimation problems. Our results guarantee convergence not just in expectation but with exponential concentration, and moreover does so using $O(1)$ batch size. We provide consequences of our results for mean estimation and linear regression. Finally, we provide empirical corroboration of our results and algorithms via synthetic experiments for mean estimation and linear regression.


Predicting Census Survey Response Rates via Interpretable Nonparametric Additive Models with Structured Interactions

arXiv.org Machine Learning

Accurate and interpretable prediction of survey response rates is important from an operational standpoint. The US Census Bureau's well-known ROAM application uses principled statistical models trained on the US Census Planning Database data to identify hard-to-survey areas. An earlier crowdsourcing competition revealed that an ensemble of regression trees led to the best performance in predicting survey response rates; however, the corresponding models could not be adopted for the intended application due to limited interpretability. In this paper, we present new interpretable statistical methods to predict, with high accuracy, response rates in surveys. We study sparse nonparametric additive models with pairwise interactions via $\ell_0$-regularization, as well as hierarchically structured variants that provide enhanced interpretability. Despite strong methodological underpinnings, such models can be computationally challenging -- we present new scalable algorithms for learning these models. We also establish novel non-asymptotic error bounds for the proposed estimators. Experiments based on the US Census Planning Database demonstrate that our methods lead to high-quality predictive models that permit actionable interpretability for different segments of the population. Interestingly, our methods provide significant gains in interpretability without losing in predictive performance to state-of-the-art black-box machine learning methods based on gradient boosting and feedforward neural networks. Our code implementation in python is available at https://github.com/ShibalIbrahim/Additive-Models-with-Structured-Interactions.