 Regression


Constant Size Molecular Descriptors For Use With Machine Learning

arXiv.org Machine Learning

A set of molecular descriptors whose length is independent of molecular size is developed for machine learning models that target thermodynamic and electronic properties of molecules. These features are evaluated by monitoring performance of kernel ridge regression models on well-studied data sets of small organic molecules. The features include connectivity counts, which require only the bonding pattern of the molecule, and encoded distances, which summarize distances between both bonded and non-bonded atoms and so require the full molecular geometry. In addition to having constant size, these features summarize information regarding the local environment of atoms and bonds, such that models can take advantage of similarities resulting from the presence of similar chemical fragments across molecules. Combining these two types of features leads to models whose performance is comparable to or better than the current state of the art. The features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules.
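As a rough illustration of what "constant size" means for connectivity counts, the sketch below counts atoms and bonds by type against a fixed vocabulary, so every molecule maps to a vector of the same length regardless of how many atoms it has. The vocabulary, the function name `connectivity_counts`, and the input format are assumptions made for this example; the abstract does not specify the paper's actual feature definitions.

```python
from collections import Counter

# Hypothetical fixed vocabulary of atom and bond types (an assumption for
# illustration; the paper's real feature set is not given in the abstract).
ATOM_TYPES = ["C", "H", "N", "O"]
BOND_TYPES = [("C", "C"), ("C", "H"), ("C", "N"), ("C", "O"), ("H", "N"), ("H", "O")]


def connectivity_counts(atoms, bonds):
    """Return a fixed-length count vector.

    atoms: list of element symbols, one per atom.
    bonds: list of (i, j) index pairs for bonded atoms.
    Types outside the vocabulary are ignored in this sketch.
    """
    atom_counts = Counter(atoms)
    # Sort each pair so ("H", "C") and ("C", "H") count as the same bond type.
    bond_counts = Counter(tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds)
    return [atom_counts[a] for a in ATOM_TYPES] + [bond_counts[b] for b in BOND_TYPES]


methane = connectivity_counts(
    ["C", "H", "H", "H", "H"],
    [(0, 1), (0, 2), (0, 3), (0, 4)],
)
ethane = connectivity_counts(
    ["C", "C", "H", "H", "H", "H", "H", "H"],
    [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (1, 6), (1, 7)],
)
```

Both vectors have the same length even though ethane has more atoms, which is the property that lets a kernel model compare molecules of different sizes directly.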


3D Morphology Prediction of Progressive Spinal Deformities from Probabilistic Modeling of Discriminant Manifolds

arXiv.org Machine Learning

We introduce a novel approach for predicting the progression of adolescent idiopathic scoliosis from 3D spine models reconstructed from biplanar X-ray images. Recent progress in machine learning has improved classification and prognosis rates, but existing methods lack a probabilistic framework to measure uncertainty in the data. We propose a discriminative probabilistic manifold embedding where locally linear mappings transform data points from high-dimensional space to corresponding low-dimensional coordinates. A discriminant adjacency matrix is constructed to maximize the separation between progressive and non-progressive groups of patients diagnosed with scoliosis, while minimizing the distance in latent variables belonging to the same class. To predict the evolution of deformation, a baseline reconstruction is projected onto the manifold, from which a spatiotemporal regression model is built from parallel transport curves inferred from neighboring exemplars. Rate of progression is modulated by the spine flexibility and curve magnitude of the 3D spine deformation. The method was tested on 745 reconstructions from 133 subjects using longitudinal 3D reconstructions of the spine, with results demonstrating that the discriminatory framework can distinguish between progressive and non-progressive scoliotic patients with a classification rate of 81% and prediction differences of 2.1$^{\circ}$ in main curve angulation, outperforming other manifold learning methods. Our method achieved a higher prediction accuracy and improved the modeling of spatiotemporal morphological changes in highly deformed spines compared to other learning methods.


Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods

arXiv.org Machine Learning

Even for a medical discipline steeped in a tradition of randomized trials, the evidence basis for only a few guidelines is based on randomized trials (Tricoci et al., 2009). In part this is due to continued development of treatments, in part to the enormous expense of clinical trials, and in large part to the hundreds of treatments and their nuances involved in real-world, heterogeneous clinical practice. Thus, many therapeutic decisions are based on observational studies. However, comparative treatment effectiveness studies of observational data suffer from two major problems: only partial overlap of treatments and selection bias. Each treatment is to a degree bounded within constraints of indication and appropriateness. Thus, transplantation is constrained by variables such as age, and a mitral valve procedure is constrained by the presence of mitral valve regurgitation. However, these boundaries overlap widely, and the same patient may be treated differently by different physicians or different hospitals, often without explicit or evident reasons. Thus, a fundamental hurdle in observational studies evaluating comparative effectiveness of treatment options is to address the resulting selection bias or confounding. Naively evaluating differences in outcomes without doing so leads to biased results and flawed scientific conclusions.


Examining correlation

@machinelearnbot

Contingency tables are a good visualization method, with counts and percentiles; in your case, a 5 x 5 mosaic plot and table of counts, etc. Chi-square tests include the likelihood ratio and Pearson tests, for example, but there are numerous options in stat software for analysis of those mosaic plots and their contingency table data. And of course the nominal logistic regression modeling tools have effects tests (Wald, likelihood ratio) for the main effects and interactions of your model. JMP.com or most other stat software tools support this type of data. Pasted below is a list of options for the mosaic plot and its contingency table from the JMP help file (no detail, just names of tests and analysis options for your consideration). This list is property of JMP.com
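If you want to see what the two chi-square statistics mentioned here actually compute, this is a minimal sketch in plain Python. It is not JMP's implementation. The function name and the example counts are made up for illustration, and it assumes every row and column total is nonzero.

```python
import math


def chi_square_statistics(table):
    """Return (Pearson chi-square, likelihood-ratio G, degrees of freedom)
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    pearson = 0.0
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # a zero cell contributes nothing to G
                g += 2.0 * observed * math.log(observed / expected)

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return pearson, g, dof


# Made-up 2 x 2 counts; a 5 x 5 table works the same way.
pearson, g, dof = chi_square_statistics([[10, 20], [30, 40]])
```

Each statistic is compared against a chi-square distribution with `dof` degrees of freedom to get a p-value; stat packages report that for you.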


The best kept secret about linear and logistic regression

@machinelearnbot

All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless. Regression can be performed as accurately without statistical models, including the computation of confidence intervals (for estimates, predicted values or regression parameters). The non-statistical approach is also more robust than the theory described in all statistics textbooks and taught in all statistical courses. It does not require Map-Reduce when data is really big, nor any matrix inversion, maximum likelihood estimation, or mathematical optimization (Newton algorithm). It is indeed incredibly simple, robust, easy to interpret, and easy to code (no statistical libraries required).


Clustering responses to define dependent variable for logistic regression

@machinelearnbot

Some colleagues of mine are working with survey responses, and are attempting to predict behaviors with demographic data. So, the plan is to define a dependent variable from some combination of responses to the survey questions, and then use a regression technique to model this dependent variable using other characteristics of the respondents. We all agree on the 5 or so questions that will define the dependent variable, but we disagree on how to specify the definition. I want to look at the actual questions being answered, and create a "score" as a weighted count of the 'yeses' to the questions (weights based on how "on point" each question is to the behavior we are trying to define). My colleagues thought that this was too imprecise, and particularly criticised the 'intuitive' weight assignment.
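To make the weighted-count idea concrete, here is a minimal sketch. The weights and the example answers are invented placeholders, not values from the actual survey.

```python
def weighted_score(responses, weights):
    """Sum the weights of the questions answered 'yes'.

    responses: list of booleans, True for a 'yes'.
    weights:   list of numbers, one per question (hypothetical values).
    """
    if len(responses) != len(weights):
        raise ValueError("need exactly one weight per question")
    return sum(w for answered_yes, w in zip(responses, weights) if answered_yes)


# Hypothetical weights for five questions, by how "on point" each one is.
weights = [1.0, 0.5, 2.0, 1.0, 0.5]
score = weighted_score([True, False, True, True, False], weights)
```

For a logistic regression the score still has to be turned into a binary outcome, for example by thresholding it, and the threshold is one more judgment call on top of the weights.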


Random Forest Missing Data Algorithms

arXiv.org Machine Learning

Random forest (RF) missing data algorithms are an attractive approach for dealing with missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms but relatively little guidance about their efficacy, which motivated us to study their performance. Using a large, diverse collection of data sets, performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a new promising imputation algorithm called missForest. Performance of algorithms was assessed by ability to impute data accurately. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.


The Discrete Dantzig Selector: Estimating Sparse Linear Models via Mixed Integer Linear Optimization

arXiv.org Machine Learning

We propose a novel high-dimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients subject to a budget on the maximal absolute correlation between the features and residuals. Motivated by the significant advances in integer optimization over the past 10-15 years, we present a Mixed Integer Linear Optimization (MILO) approach to obtain certifiably optimal global solutions to this nonconvex optimization problem. The current state of algorithmics in integer optimization makes our proposal substantially more computationally attractive than the least squares subset selection framework based on integer quadratic optimization, recently proposed in [8] and the continuous nonconvex quadratic optimization framework of [33]. We propose new discrete first-order methods, which when paired with state-of-the-art MILO solvers, lead to good solutions for the Discrete Dantzig Selector problem for a given computational budget. We illustrate that our integrated approach provides globally optimal solutions in significantly shorter computation times, when compared to off-the-shelf MILO solvers. We demonstrate both theoretically and empirically that in a wide range of regimes the statistical properties of the Discrete Dantzig Selector are superior to those of popular $\ell_{1}$-based approaches. We illustrate that our approach can handle problem instances with p = 10,000 features with certifiable optimality making it a highly scalable combinatorial variable selection approach in sparse linear modeling.
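The estimator described in the first sentence can be written as an optimization problem. The notation here is an assumption, since the abstract names no symbols: $X$ is the feature matrix, $y$ the response vector, $\beta$ the coefficient vector, and $\delta$ the budget on the maximal absolute correlation between the features and the residuals $y - X\beta$.

```latex
$$
\min_{\beta} \; \|\beta\|_{0}
\quad \text{subject to} \quad
\|X^{\top}(y - X\beta)\|_{\infty} \le \delta
$$
```

Replacing the $\ell_{0}$ objective with $\|\beta\|_{1}$ gives the original Dantzig Selector, one of the $\ell_{1}$-based approaches the abstract compares against.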


How to Do Linear Regression the Right Way [LIVE]

#artificialintelligence

I'll perform linear regression from scratch in Python using a method called 'Gradient Descent' to determine the relationship between student test scores and the number of hours studied. This will be about 50 lines of code and I'll deep dive into the math behind this. That's what keeps me going.
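A minimal sketch of that idea, not the code from the video: it fits a line `y = m * x + b` by repeatedly stepping against the gradient of the mean squared error. The data, learning rate, and iteration count are made-up values chosen so the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = m * x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Partial derivatives of (1/n) * sum((y - (m*x + b))**2).
        grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys)) / n
        # Step in the direction that lowers the error.
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b


# Made-up data lying exactly on the line score = 2 * hours + 50.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]
m, b = gradient_descent(hours, scores)
```

With real, noisy data the learning rate usually needs tuning: too large and the updates diverge, too small and convergence is slow.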