 Regression


Exact Distribution-Free Hypothesis Tests for the Regression Function of Binary Classification via Conditional Kernel Mean Embeddings

arXiv.org Machine Learning

In this paper we suggest two statistical hypothesis tests for the regression function of binary classification based on conditional kernel mean embeddings. The regression function is a fundamental object in classification as it determines both the Bayes optimal classifier and the misclassification probabilities. A resampling based framework is applied and combined with consistent point estimators for the conditional kernel mean map to construct distribution-free hypothesis tests. These tests are introduced in a flexible manner allowing us to control the exact probability of type I error. We also prove that both proposed techniques are consistent under weak statistical assumptions, i.e., the type II error probabilities pointwise converge to zero.


Logistic Regression for Binary Classification

#artificialintelligence

In previous articles, I talked about deep learning and the functions used to predict results. In this article, we will use logistic regression to perform binary classification. Binary classification is so named because it classifies the data into two outcomes. Simply put, the result will be "yes" (1) or "no" (0). To determine whether the result is "yes" or "no", we will use a probability function. This function gives us a number from 0 to 1 indicating how likely it is that an observation belongs to the class we have designated as "yes".
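The snippet does not show the probability function itself. In logistic regression it is normally the logistic (sigmoid) function applied to a weighted sum of the inputs. The following is a minimal sketch under that assumption; the names `sigmoid`, `predict_proba`, `predict_label`, the example weights, and the 0.5 threshold are illustrative and do not come from the article.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the range 0 to 1."""
    # Branch on the sign of z so math.exp never receives a large positive
    # argument, which would raise OverflowError.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, weights, bias):
    """Probability that observation x belongs to the "yes" (1) class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


def predict_label(x, weights, bias, threshold=0.5):
    """Return 1 ("yes") if the probability reaches the threshold, else 0 ("no")."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0


if __name__ == "__main__":
    # Hypothetical, hand-picked parameters; a real model would learn them from data.
    weights, bias = [1.5, -0.5], 0.25
    print(predict_proba([2.0, 1.0], weights, bias))
    print(predict_label([2.0, 1.0], weights, bias))
```

The threshold of 0.5 is a common default, not a requirement; it can be moved when one kind of error is more costly than the other.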


Simple Linear Regression Tutorial for Machine Learning (ML)

#artificialintelligence

Simple linear regression is a statistical approach that allows us to study and summarize the relationship between two continuous quantitative variables. Simple linear regression is used in machine learning models, mathematics, statistical modeling, forecasting epidemics, and other quantitative fields. Out of the two variables, one variable is called the dependent variable, and the other variable is called the independent variable. Our goal is to predict the dependent variable's value based on the value of the independent variable. A simple linear regression aims to find the best relationship between X (independent variable) and Y (dependent variable).
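One common way to find the "best relationship" between X and Y is ordinary least squares, which picks the line that minimises the sum of squared vertical distances to the data points. The sketch below assumes that criterion (the snippet does not name one); the function name and the sample data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns a (slope, intercept) tuple.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up data lying exactly on the line y = 2x + 1.
    slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```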


Linear Regression over Networks with Communication Guarantees

arXiv.org Machine Learning

A key functionality of emerging connected autonomous systems such as smart cities, smart transportation systems, and the industrial Internet-of-Things, is the ability to process and learn from data collected at different physical locations. This is increasingly attracting attention under the terms of distributed learning and federated learning. However, in connected autonomous systems, data transfer takes place over communication networks with often limited resources. This paper examines algorithms for communication-efficient learning for linear regression tasks by exploiting the informativeness of the data. The developed algorithms enable a tradeoff between communication and learning with theoretical performance guarantees and efficient practical implementations.


The Top 10 Machine Learning Algorithms for ML Beginners

#artificialintelligence

Interest in learning machine learning has skyrocketed in the years since a Harvard Business Review article named 'Data Scientist' the 'Sexiest job of the 21st century'. But if you're just starting out in machine learning, it can be a bit difficult to break into. (This post has been reposted with permission, and was last updated in 2019.) This post is targeted towards beginners. If you've got some experience in data science and machine learning, you may be more interested in this more in-depth tutorial on doing machine learning in Python with scikit-learn, or in our machine learning courses, which start here. If you're not clear yet on the differences between "data science" and "machine learning," this article offers a good explanation: machine learning and data science -- what makes them different? Machine learning algorithms are programs that can learn from data and improve from experience, without human intervention.


Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data

arXiv.org Machine Learning

Datasets play a critical role in shaping the perception of performance and progress in machine learning (ML)--the way we collect, process, and analyze data affects the way we benchmark success and form new research agendas (Paullada et al., 2020; Dotan & Milli, 2020). A growing appreciation of this determinative role of datasets has sparked a concomitant concern that standard datasets used for training and evaluating ML models lack diversity along significant dimensions, for example, geography, gender, and skin type (Shankar et al., 2017; Buolamwini & Gebru, 2018). Lack of diversity in evaluation data can obfuscate disparate performance when evaluating based on aggregate accuracy (Buolamwini & Gebru, 2018). Lack of diversity in training data can limit the extent to which learned models can adequately apply to all portions of a population, a concern highlighted in recent work in the medical domain (Habib et al., 2019; Hofmanninger et al., 2020). Our work aims to develop a general unifying perspective on the way that dataset composition affects outcomes of machine learning systems.


Ensembles of Random SHAPs

arXiv.org Machine Learning

Ensemble-based modifications of the well-known SHapley Additive exPlanations (SHAP) method for the local explanation of a black-box model are proposed. The modifications aim to simplify SHAP which is computationally expensive when there is a large number of features. The main idea behind the proposed modifications is to approximate SHAP by an ensemble of SHAPs with a smaller number of features. According to the first modification, called ER-SHAP, several features are randomly selected many times from the feature set, and Shapley values for the features are computed by means of "small" SHAPs. The explanation results are averaged to get the final Shapley values. According to the second modification, called ERW-SHAP, several points are generated around the explained instance for diversity purposes, and results of their explanation are combined with weights depending on distances between points and the explained instance. The third modification, called ER-SHAP-RF, uses the random forest for preliminary explanation of instances and determining a feature probability distribution which is applied to selection of features in the ensemble-based procedure of ER-SHAP. Many numerical experiments illustrating the proposed modifications demonstrate their efficiency and properties for local explanation.


Calibrated Simplex Mapping Classification

arXiv.org Machine Learning

In many supervised learning applications, it is not sufficient to know the most probable class y for a certain data point x. Instead, a well-calibrated probabilistic prediction p(y | x) is required. For instance, in clinical applications, class probabilities are important for confidence in model predictions (Challis et al., 2015). Some classifiers intrinsically provide such a posterior probability, e.g. logistic regression or Gaussian process classification (GPC) as described in Rasmussen and Williams (2006). There are also various methods to install or improve such a calibration for a given classification approach (Niculescu-Mizil and Caruana, 2005), like Platt scaling (Platt, 2000) or isotonic regression (Zadrozny and Elkan, 2002).


Linear Regression for Dummies

#artificialintelligence

In my previous article, I highlighted four algorithms to start off with in Machine Learning: Linear Regression, Logistic Regression, Decision Trees and Random Forest. Now I am writing a series covering each of them. The simplest form of the regression equation, with one dependent and one independent variable, is y = mx + c, where y is the estimated dependent variable, c is the constant, m is the regression coefficient and x is the independent variable. Let's understand this with an example: say there is a certain relationship between the marks scored by students in an exam (y, the dependent variable) and the hours they studied for the exam (x, the independent variable).
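To make the marks-versus-hours example concrete, here is a small sketch that evaluates the line y = mx + c. The coefficient values (m = 5 marks per hour, c = 40 marks) are invented for illustration only; the article does not give any numbers, and in practice m and c would be estimated from observed data.

```python
def predict_marks(hours_studied: float, m: float = 5.0, c: float = 40.0) -> float:
    """Estimate exam marks (y) from hours studied (x) using y = m * x + c.

    m and c are hypothetical defaults chosen for illustration, not fitted values.
    """
    return m * hours_studied + c


if __name__ == "__main__":
    for hours in (0, 2, 4):
        print(hours, "hours ->", predict_marks(hours), "marks")
```

Here c is the estimate for a student who did not study at all, and m is the change in estimated marks for each additional hour of study.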


Linear Regression and Logistic Regression using R Studio

#artificialintelligence

In this section we will learn what Machine Learning means and what the different terms associated with machine learning are. You will see some examples so that you understand what machine learning actually is. The section also covers the steps involved in building a machine learning model: not just linear models, but any machine learning model.