 Regression


Survey of resampling techniques for improving classification performance in unbalanced datasets

arXiv.org Machine Learning

A number of classification problems need to deal with data imbalance between classes. Often it is desired to have a high recall on the minority class while maintaining a high precision on the majority class. In this paper, we review a number of resampling techniques proposed in literature to handle unbalanced datasets and study their effect on classification performance.
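The abstract above does not say which resampling techniques the survey covers. As a minimal sketch of the general idea, the following uses random oversampling, chosen here only as a common illustrative technique. The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Toy unbalanced dataset (made up): four majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should normally be applied to the training split only, so that duplicated minority samples do not leak into the evaluation data.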


The Gentlest Introduction to Tensorflow – Part 2

#artificialintelligence

Editor's note: You may want to check out part 1 of this tutorial before proceeding. In the previous article, we used Tensorflow (TF) to build and learn a linear regression model with a single feature so that given a feature value (house size/sqm), we can predict the outcome (house price/). In machine learning (ML) literature, we come across the term 'training' very often; let us look at what that literally means in TF. The goal in linear regression is to find W and b such that, given any feature value (x), we can find the prediction (y) by substituting the W, x, and b values into the model. However, to find the W and b that give accurate predictions, we need to 'train' the model using the available data (the multiple pairs of actual feature (x) and actual outcome (y_); note the underscore).
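To make 'training' concrete, here is a minimal sketch in plain Python rather than the TensorFlow API used by the article. It assumes a mean squared error cost and plain gradient descent, neither of which is stated in the excerpt. The function name `train_linear`, the learning rate, the step count, and the data are illustrative assumptions.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y = W * x + b by gradient descent on the mean squared error."""
    W, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Error of the current prediction against each actual outcome y_.
        errs = [(W * x + b) - y_ for x, y_ in zip(xs, ys)]
        # Gradients of the mean squared error with respect to W and b.
        grad_W = 2.0 / n * sum(e * x for e, x in zip(errs, xs))
        grad_b = 2.0 / n * sum(errs)
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b


# Made-up data generated from y_ = 2 * x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
W, b = train_linear(xs, ys)
print(W, b)
```

Each step nudges W and b in the direction that reduces the prediction error on the available (x, y_) pairs, which is the sense of 'training' described above.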


R FUNCTIONS FOR REGRESSION ANALYSIS – Step Up Analytics

#artificialintelligence

Here are some helpful R functions for regression analysis, grouped by their goal. The name of the package is in parentheses. Base has a method for objects inheriting from class "lm" (stats). This is a generic function, but currently only has methods for objects inheriting from classes "lm" and "glm" (stats). AIC: Generic function calculating the Akaike information criterion for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2 for the usual AIC, or k = log(n) (n the number of observations) for the so-called BIC or SBC (Schwarz's Bayesian criterion) (stats). Four plots (selectable by which) are currently provided: a plot of residuals against fitted values, a Scale-Location plot of sqrt(|residuals|) against fitted values, a Normal Q-Q plot, and a plot of Cook's distances versus row labels (stats). Performs Bartlett's test of the null that the variances in each of the groups (samples) are the same (stats). bgtest: Breusch-Godfrey Test (lmtest). bptest: Breusch-Pagan Test (lmtest).
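The information criterion described above is -2 times the log-likelihood plus k times npar, with k = 2 for AIC and k = log(n) for BIC. A small Python sketch of that arithmetic follows. The function name `information_criterion` is hypothetical, and the log-likelihood, parameter count, and sample size are made-up illustrative values, not output from any fitted model.

```python
import math


def information_criterion(log_likelihood, npar, k=2.0):
    """Return -2 * log-likelihood + k * npar."""
    return -2.0 * log_likelihood + k * npar


# Made-up values for illustration only.
loglik, npar, n = -120.5, 3, 50
aic = information_criterion(loglik, npar)                   # k = 2
bic = information_criterion(loglik, npar, k=math.log(n))    # k = log(n)
print(aic, bic)
```

Because log(n) exceeds 2 once n is 8 or more, BIC penalizes each extra parameter more heavily than AIC on all but very small samples.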


Linear Regression Analysis using R – Step Up Analytics

#artificialintelligence

One of the most frequently used techniques in statistics is linear regression, where we investigate the potential relationship between a variable of interest (often called the response variable, but there are many other names in use) and a set of one or more variables (known as the independent variables or some other term). Unsurprisingly, there are flexible facilities in R for fitting a range of linear models, from the simple case of a single variable to more complex relationships. In this post we will consider the case of simple linear regression with one response variable and a single independent variable. The purpose of using this data is to determine whether there is a relationship, described by a simple linear regression model, between the variables. As you can see in the image, I first checked my working directory and then changed it to another directory, because the data files are stored in a different location.
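For the single-variable case, the least squares fit has a closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below shows that calculation in Python rather than R. The function name `simple_linear_regression` and the data are made up for illustration and are not the post's dataset.

```python
def simple_linear_regression(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = simple_linear_regression(x, y)
print(intercept, slope)
```

In R, the same model is fitted with `lm(y ~ x)`.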


Simple Logistic Regression using Keras

#artificialintelligence

This post basically takes the tutorial on Classifying MNIST digits using Logistic Regression which is primarily written for Theano and attempts to port it to Keras. So, what better way to put that claim to the test than to write some code! Keras comes with great documentation. One can really get up and running in a matter of minutes. Everything needed to accomplish the goal can be found on the Guide to Sequential Model page (assuming of course the initial setup and configuration is all taken care of).


High-dimensional Mixed Graphical Models

arXiv.org Machine Learning

High-Dimensional Mixed Graphical Models. Jie Cheng†, Tianxi Li‡, Elizaveta Levina‡, Ji Zhu‡. †Google, Inc., ‡Department of Statistics, University of Michigan. March 22, 2018. arXiv:1304.2810v3. Abstract: While graphical models for continuous data (Gaussian graphical models) and discrete data (Ising models) have been extensively studied, there is little work on graphical models for data sets with both continuous and discrete variables (mixed data), which are common in many scientific applications. We propose a novel graphical model for mixed data, which is simple enough to be suitable for high-dimensional data, yet flexible enough to represent all possible graph structures. We develop a computationally efficient regression-based algorithm for fitting the model by focusing on the conditional log-likelihood of each variable given the rest. The parameters have a natural group structure, and sparsity in the fitted graph is attained by incorporating a group lasso penalty, approximated by a weighted lasso penalty for computational efficiency. We demonstrate the effectiveness of our method through an extensive simulation study and apply it to a music annotation data set (CAL500), obtaining a sparse and interpretable graphical model relating the continuous features of the audio signal to binary variables such as genre, emotions, and usage associated with particular songs. Key Words: Conditional Gaussian density, Graphical model, Group lasso, Mixed variables, Music annotation. 1 Introduction. Graphical models have proven to be a useful tool in representing the conditional dependency structure of multivariate distributions. The undirected graphical model in particular, sometimes also referred to as the Markov network, has drawn a notable amount of attention over the past decade. In an undirected graphical model, nodes in the graph represent the variables, while an edge between a pair of variables indicates that they are dependent conditional on all other variables.
The properties of these models are by now well understood and studied both in the classical and the high-dimensional settings. Both these models can only deal with variables of one kind - either all continuous variables in Gaussian models or all binary variables in the Ising model (extensions of the Ising model to general discrete data, while possible in principle, are rarely used in practice). In many applications, however, data sources are complex and varied, and frequently result in mixed types of data, with both continuous and discrete variables present in the same dataset. In this paper, we will focus on graphical models for this type of mixed data (mixed graphical models).


Solving a Mixture of Many Random Linear Equations by Tensor Decomposition and Alternating Minimization

arXiv.org Machine Learning

We consider the problem of solving mixed random linear equations with $k$ components. This is the noiseless setting of mixed linear regression. The goal is to estimate multiple linear models from mixed samples in the case where the labels (which sample corresponds to which model) are not observed. We give a tractable algorithm for the mixed linear equation problem, and show that under some technical conditions, our algorithm is guaranteed to solve the problem exactly with sample complexity linear in the dimension, and polynomial in $k$, the number of components. Previous approaches have required either exponential dependence on $k$, or super-linear dependence on the dimension. The proposed algorithm is a combination of tensor decomposition and alternating minimization. Our analysis involves proving that the initialization provided by the tensor method allows alternating minimization, which is equivalent to EM in our setting, to converge to the global optimum at a linear rate.
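The alternating minimization step mentioned above can be sketched on a toy problem. This is not the paper's algorithm: it uses one-dimensional models, two components, and a hand-picked rough starting point in place of the tensor decomposition initialization. The function name `alt_min` and all data are hypothetical.

```python
import random


def alt_min(xs, ys, betas, iters=20):
    """Alternate between assigning each sample to its best-fitting model
    and refitting each model by least squares on its assigned samples."""
    betas = list(betas)
    for _ in range(iters):
        groups = [[] for _ in betas]
        for x, y in zip(xs, ys):
            # Assignment step: pick the component with the smallest residual.
            j = min(range(len(betas)), key=lambda c: abs(y - betas[c] * x))
            groups[j].append((x, y))
        for j, pts in enumerate(groups):
            # Refit step: 1-D least squares through the origin.
            sxx = sum(x * x for x, _ in pts)
            if sxx > 0:
                betas[j] = sum(x * y for x, y in pts) / sxx
    return betas


rng = random.Random(0)
true_betas = [2.0, -1.0]
xs = [rng.gauss(0, 1) for _ in range(200)]
# Noiseless mixed samples: the label of each sample is not recorded.
ys = [true_betas[rng.randrange(2)] * x for x in xs]
est = alt_min(xs, ys, betas=[1.5, -0.5])  # rough initial guess (assumed)
print(est)
```

With a starting point this close to the truth, the first assignment step already separates the samples correctly, which illustrates why a good initialization matters for this kind of procedure.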


NCBI-Hackathons/Machine_Learning_Immunogenicity

#artificialintelligence

This project looks into the application of Machine Learning (ML) techniques to the prediction of immunogenicity (categorical: positive or negative) based on a peptide and its associated amino acid properties. This study uses peptide data from the Immune Epitope Database (IEDB). The R package "Peptides" has been used to compute the amino acid properties and merge them with the peptide data to enable the use of ML algorithms for immunogenicity analysis, particularly the algorithms that are more efficient with numeric and categorical data than with string sequences. Tensorflow is an open source software library for ML (open sourced by Google in Nov 2015) that provides linear regression and classification algorithms for multi-dimensional arrays (aka "Tensors"). K-fold cross-validation as well as hold-out of test data were used to train and test the generated models.


Conditional Sparse Linear Regression

arXiv.org Machine Learning

Linear regression, the fitting of linear relationships among variables in a data set, is a standard tool in data analysis. In particular, for the sake of interpretability and utility in further analysis, we desire to find highly sparse linear relationships, i.e., involving only a few variables. Of course, such simple linear relationships often will not hold across an entire population. But, more frequently there will exist conditions - perhaps a range of parameters or a segment of a larger population - under which such sparse models fit the data quite well. For example, Rosenfeld et al. [16] used data mining heuristics to identify small segments of a population in which a few additional risk factors were highly predictive of certain kinds of cancer, whereas these same risk factors were not significant in the overall population. Simple rules for special cases may also hint at the more complex general rules. More generally, we need to develop new techniques to reason about populations in which most members are atypical in some way, which are colloquially (and somewhat abusively) referred to as long-tailed distributions. We are seeking principled alternatives to ad-hoc approaches such as trying a variety of methods for clustering the data and hoping that the identified clusters can be modeled well.


Scalable Modeling of Multivariate Longitudinal Data for Prediction of Chronic Kidney Disease Progression

arXiv.org Machine Learning

Prediction of the future trajectory of a disease is an important challenge for personalized medicine and population health management. However, many complex chronic diseases exhibit large degrees of heterogeneity, and furthermore there is not always a single readily available biomarker to quantify disease severity. Even when such a clinical variable exists, there are often additional related biomarkers routinely measured for patients that may better inform the predictions of their future disease state. To this end, we propose a novel probabilistic generative model for multivariate longitudinal data that captures dependencies between multivariate trajectories. We use a Gaussian process based regression model for each individual trajectory, and build off ideas from latent class models to induce dependence between their mean functions. We fit our method using a scalable variational inference algorithm to a large dataset of longitudinal electronic patient health records, and find that it improves dynamic predictions compared to a recent state of the art method. Our local accountable care organization then uses the model predictions during chart reviews of high risk patients with chronic kidney disease.