 Regression


Imbalanced Data Set - Data Science with RiSi

#artificialintelligence

When machine learning algorithms such as linear regression or logistic regression are used, the algorithm predicts the dependent variable from one or more independent variables. The data set should be balanced to avoid incorrect predictions, but real data sets are not always balanced. In cases such as fraud detection or cancer patient records, the data set may be imbalanced. Today we will discuss how to clean an imbalanced data set.
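The excerpt does not say which cleaning method the post uses. As a minimal sketch of one common option, random oversampling of the minority class, the following uses only the Python standard library. The function name `oversample_minority`, the toy 95/5 label split, and the seed are assumptions made for illustration.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that currently belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy imbalanced data: 95 negatives, 5 positives (hypothetical values).
labels = [0] * 95 + [1] * 5
rows = [[float(i)] for i in range(100)]

balanced_rows, balanced_labels = oversample_minority(rows, labels)
balanced_counts = Counter(balanced_labels)
print(Counter(labels), "->", balanced_counts)
```

Oversampling should be applied to the training split only. Duplicating rows before splitting would leak copies of the same record into the test set.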


Modeling Cell Populations Measured By Flow Cytometry With Covariates Using Sparse Mixture of Regressions

arXiv.org Machine Learning

The ocean is filled with microscopic microalgae called phytoplankton, which together are responsible for as much photosynthesis as all plants on land combined. Our ability to predict their response to the warming ocean relies on understanding how the dynamics of phytoplankton populations is influenced by changes in environmental conditions. One powerful technique to study the dynamics of phytoplankton is flow cytometry, which measures the optical properties of thousands of individual cells per second. Today, oceanographers are able to collect flow cytometry data in real-time onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. One of the current challenges is to understand how these small and large scale variations relate to environmental conditions, such as nutrient availability, temperature, light and ocean currents. In this paper, we propose a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. We demonstrate the usefulness and interpretability of the approach using both synthetic data and real observations collected on an oceanographic cruise conducted in the north-east Pacific in the spring of 2017.


Partial Trace Regression and Low-Rank Kraus Decomposition

arXiv.org Machine Learning

The trace regression model, a direct extension of the well-studied linear regression model, allows one to map matrices to real-valued outputs. We here introduce an even more general model, namely the partial-trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. Borrowing tools from quantum information theory, where partial trace operators have been extensively studied, we propose a framework for learning partial trace regression models from data by taking advantage of the so-called low-rank Kraus representation of completely positive maps. We show the relevance of our framework with synthetic and real-world experiments conducted for both i) matrix-to-matrix regression and ii) positive semidefinite matrix completion, two tasks which can be formulated as partial trace regression problems.
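For orientation, the two standard objects named in this abstract can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $B$ is the coefficient matrix, $\varepsilon$ is noise, and $A_1, \dots, A_r$ are the Kraus operators of a completely positive map $\Phi$.

```latex
% Trace regression: matrix-valued input X, real-valued output y
y = \operatorname{tr}\!\left(B^{\top} X\right) + \varepsilon,
\qquad X, B \in \mathbb{R}^{p \times q}

% Kraus representation of a completely positive map (rank r)
\Phi(X) = \sum_{j=1}^{r} A_j \, X \, A_j^{\top}
```

When $X$ is square and diagonal with entries $x_i$, the trace term reduces to $\sum_i B_{ii} x_i$, which is ordinary linear regression. A low-rank Kraus representation corresponds to a small $r$.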


ML Codes

#artificialintelligence

Logistic regression is a machine learning algorithm used for classification problems. Its output follows a sigmoid curve, given by the function 1/(1 + e^(-hypothesis)). What we are going to do here is create a model using logistic regression that classifies whether an event is happening or not. To do so, we define a threshold of 0.5 on the predicted value, which lies between 0 and 1. The algorithm is based on the concept of probability and is used for predictive analysis.
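The sigmoid function and the 0.5 threshold described above can be sketched in a few lines. This is a minimal illustration, not the post's own code: the names `sigmoid` and `classify` are assumptions, and `z` stands for the "hypothesis" value (the linear score) in the excerpt.

```python
import math


def sigmoid(z):
    """Numerically stable 1 / (1 + e^(-z)); the result lies in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula to avoid overflow in exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Return 1 (event happens) if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3), classify(score))
```

A threshold of 0.5 on the probability is equivalent to checking whether the score `z` is non-negative. Other thresholds trade false positives against false negatives.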


Hi-CI: Deep Causal Inference in High Dimensions

arXiv.org Artificial Intelligence

We address the problem of counterfactual regression using causal inference (CI) in observational studies consisting of high-dimensional covariates and high-cardinality treatments. Confounding bias, which leads to inaccurate treatment effect estimation, is attributed to covariates that affect both treatments and outcome. The presence of high-dimensional covariates exacerbates the impact of bias as it is harder to isolate and measure the impact of these confounders. In the presence of high-cardinality treatment variables, CI is rendered ill-posed due to the increase in the number of counterfactual outcomes to be predicted. We propose Hi-CI, a deep neural network (DNN) based framework for estimating causal effects in the presence of a large number of covariates, and high-cardinality and continuous treatment variables. The proposed architecture comprises a decorrelation network and an outcome prediction network. In the decorrelation network, we learn a data representation in lower dimensions as compared to the original covariates and address confounding bias alongside. Subsequently, in the outcome prediction network, we learn an embedding of high-cardinality and continuous treatments, jointly with the data representation. We demonstrate the efficacy of causal effect prediction of the proposed Hi-CI network using synthetic and real-world NEWS datasets.


Defending Regression Learners Against Poisoning Attacks

arXiv.org Machine Learning

Regression models, which are widely used from engineering applications to financial forecasting, are vulnerable to targeted malicious attacks such as training data poisoning, through which adversaries can manipulate their predictions. Previous works that attempt to address this problem rely on assumptions about the nature of the attack/attacker or overestimate the knowledge of the learner, making them impractical. We introduce a novel Local Intrinsic Dimensionality (LID) based measure called N-LID that measures the local deviation of a given data point's LID with respect to its neighbors. We then show that N-LID can distinguish poisoned samples from normal samples and propose an N-LID based defense approach that makes no assumptions of the attacker. Through extensive numerical experiments with benchmark datasets, we show that the proposed defense mechanism outperforms the state of the art defenses in terms of prediction accuracy (up to 76% lower MSE compared to an undefended ridge model) and running time.


5 Mathematical topics to be learned for Machine Learning and Artificial Intelligence

#artificialintelligence

In this post, we go in depth through the list of mathematics topics to learn before starting out in AI or machine learning.

Table of Contents
What is the Impact of Mathematics on Machine Learning?
What is the Approximate Distribution Ratio of Topics in Mathematics?
Detailed List of Mathematical Topics
Good Sources to Learn Mathematics for Machine Learning

1. What is the Impact of Mathematics on Machine Learning or Artificial Intelligence (AI)? Mathematics has an incredible impact on developing machine learning algorithms for real-time problem-solving. In a machine learning algorithm, learning insights from data is an entirely numerical process. From the first algorithm (i.e., linear regression) to the last, all are associated with mathematics and optimization.


The Math Behind Logistic Regression

#artificialintelligence

Have you ever wondered how logistic regression works and how the loss function is minimized by gradient descent? This article is for you. Before starting with logistic regression, it is important to understand what supervised learning is. Supervised learning is training the model on a dataset that contains a target (output) column.
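As a rough illustration of the question raised above, the following sketch minimizes the logistic (log) loss with plain gradient descent on a one-feature toy problem. It is not the article's code: the data, the learning rate, the step count, and the names `log_loss` and `fit` are all assumptions chosen for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(w, b, xs, ys, eps=1e-12):
    """Mean negative log-likelihood of labels ys under the model sigmoid(w*x + b)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(xs)


def fit(xs, ys, lr=0.1, steps=500):
    """Batch gradient descent on the log loss for a single feature plus bias."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log loss: average of (prediction - label) * input.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy supervised dataset: feature column xs and target (output) column ys.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

initial_loss = log_loss(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
final_loss = log_loss(w, b, xs, ys)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

The toy labels are deliberately not perfectly separable, so the loss has a finite minimum and the weights stay bounded.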


HeteGCN: Heterogeneous Graph Convolutional Networks for Text Classification

arXiv.org Machine Learning

We consider the problem of learning efficient and inductive graph convolutional networks for text classification with a large number of examples and features. Existing state-of-the-art graph embedding based methods such as predictive text embedding (PTE) and TextGCN have shortcomings in terms of predictive performance, scalability and inductive capability. To address these limitations, we propose a heterogeneous graph convolutional network (HeteGCN) modeling approach that unites the best aspects of PTE and TextGCN together. The main idea is to learn feature embeddings and derive document embeddings using a HeteGCN architecture with different graphs used across layers. We simplify TextGCN by dissecting it into several HeteGCN models, which (a) helps to study the usefulness of individual models and (b) offers flexibility in fusing learned embeddings from different models. In effect, the number of model parameters is reduced significantly, enabling faster training and improving performance in the small labeled training set scenario. Our detailed experimental studies demonstrate the efficacy of the proposed approach.


Estimating the time-lapse between medical insurance reimbursement with non-parametric regression models

arXiv.org Machine Learning

Nonparametric supervised learning algorithms represent a succinct class of supervised learning algorithms where the learning parameters are highly flexible and whose values are directly dependent on the size of the training data. In this paper, we comparatively study the properties of four nonparametric algorithms: K-Nearest Neighbours (KNNs), Support Vector Machines (SVMs), Decision trees and Random forests. The supervised learning task is a regression estimate of the time lapse in medical insurance reimbursement. Our study is concerned precisely with how well each of the nonparametric regression models fits the training data. We quantify the goodness of fit using the R-squared metric. The results are presented with a focus on the effect of the size of the training data, the feature space dimension and hyperparameter optimization. The findings suggest the k-NN and SVM algorithms as better models in predicting well-defined output labels (i.e,