Regression
Constant Size Molecular Descriptors For Use With Machine Learning
Collins, Christopher R., Gordon, Geoffrey J., von Lilienfeld, O. Anatole, Yaron, David J.
A set of molecular descriptors whose length is independent of molecular size is developed for machine learning models that target thermodynamic and electronic properties of molecules. These features are evaluated by monitoring performance of kernel ridge regression models on well-studied data sets of small organic molecules. The features include connectivity counts, which require only the bonding pattern of the molecule, and encoded distances, which summarize distances between both bonded and non-bonded atoms and so require the full molecular geometry. In addition to having constant size, these features summarize information regarding the local environment of atoms and bonds, such that models can take advantage of similarities resulting from the presence of similar chemical fragments across molecules. Combining these two types of features leads to models whose performance is comparable to or better than the current state of the art. The features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules.
3D Morphology Prediction of Progressive Spinal Deformities from Probabilistic Modeling of Discriminant Manifolds
Kadoury, Samuel, Mandel, William, Roy-Beaudry, Marjolaine, Nault, Marie-Lyne, Parent, Stefan
We introduce a novel approach for predicting the progression of adolescent idiopathic scoliosis from 3D spine models reconstructed from biplanar X-ray images. Recent progress in machine learning has improved classification and prognosis rates, but current methods lack a probabilistic framework to measure uncertainty in the data. We propose a discriminative probabilistic manifold embedding where locally linear mappings transform data points from high-dimensional space to corresponding low-dimensional coordinates. A discriminant adjacency matrix is constructed to maximize the separation between progressive and non-progressive groups of patients diagnosed with scoliosis, while minimizing the distance in latent variables belonging to the same class. To predict the evolution of deformation, a baseline reconstruction is projected onto the manifold, from which a spatiotemporal regression model is built from parallel transport curves inferred from neighboring exemplars. Rate of progression is modulated from the spine flexibility and curve magnitude of the 3D spine deformation. The method was tested on 745 reconstructions from 133 subjects using longitudinal 3D reconstructions of the spine, with results demonstrating that the discriminatory framework can distinguish between progressive and non-progressive scoliotic patients with a classification rate of 81% and prediction differences of 2.1$^{\circ}$ in main curve angulation, outperforming other manifold learning methods. Our method achieved a higher prediction accuracy and improved the modeling of spatiotemporal morphological changes in highly deformed spines compared to other learning methods.
Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods
Lu, Min, Sadiq, Saad, Feaster, Daniel J., Ishwaran, Hemant
Even for a medical discipline steeped in a tradition of randomized trials, the evidence basis for only a few guidelines is based on randomized trials (Tricoci et al., 2009). In part this is due to continued development of treatments, in part to the enormous expense of clinical trials, and in large part to the hundreds of treatments and their nuances involved in real-world, heterogeneous clinical practice. Thus, many therapeutic decisions are based on observational studies. However, comparative treatment effectiveness studies of observational data suffer from two major problems: only partial overlap of treatments and selection bias. Each treatment is to a degree bounded within constraints of indication and appropriateness. Thus, transplantation is constrained by variables such as age, and a mitral valve procedure is constrained by the presence of mitral valve regurgitation. However, these boundaries overlap widely, and the same patient may be treated differently by different physicians or different hospitals, often without explicit or evident reasons. Thus, a fundamental hurdle in observational studies evaluating comparative effectiveness of treatment options is to address the resulting selection bias or confounding. Naively evaluating differences in outcomes without doing so leads to biased results and flawed scientific conclusions.
Examining correlation
Contingency tables are a good visualization method, showing counts and percentages; in your case, that means a 5 x 5 mosaic plot and a table of counts. Chi-square tests include the likelihood ratio and Pearson tests, for example, and stat software offers numerous options for analyzing mosaic plots and their contingency table data. And of course the nominal logistic regression modeling tools have effects tests (Wald, likelihood ratio) for the main effects and interactions of your model. JMP.com or most other stat software tools support this type of data. Pasted below is a list of options for the mosaic plot and its contingency table from the JMP help file (no detail, just names of tests and analysis options for your consideration). This list is the property of JMP.com.
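As a rough illustration of the Pearson and likelihood ratio chi-square tests mentioned above, here is a minimal pure-Python sketch that computes both statistics and the degrees of freedom for a table of counts. The function name `chi_square_stats` and the example table are made up for illustration; this is not how JMP computes or reports these tests, and it assumes every row and column has at least one observation. Turning the statistics into p-values would need a chi-square distribution function from a stats library.

```python
import math


def chi_square_stats(table):
    """Return (pearson_chi2, likelihood_ratio_g2, dof) for a table of counts.

    `table` is a list of rows, each a list of non-negative counts.
    Assumes no row or column total is zero (hypothetical helper, not JMP's API).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    pearson = 0.0
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            # Cells with zero observed count contribute nothing to G^2.
            if observed > 0:
                g2 += 2.0 * observed * math.log(observed / expected)

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return pearson, g2, dof


# Illustrative 2 x 2 table; a 5 x 5 table works the same way with dof = 16.
pearson, g2, dof = chi_square_stats([[10, 20], [20, 10]])
print(pearson, g2, dof)
```

Both statistics are compared against a chi-square distribution with `dof` degrees of freedom; they usually agree closely when the expected counts are not small.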
The best kept secret about linear and logistic regression
All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless. Regression can be performed as accurately without statistical models, including the computation of confidence intervals (for estimates, predicted values or regression parameters). The non-statistical approach is also more robust than theory described in all statistics textbooks and taught in all statistical courses. It does not require Map-Reduce when data is really big, nor any matrix inversion, maximum likelihood estimation, or mathematical optimization (Newton algorithm). It is indeed incredibly simple, robust, easy to interpret, and easy to code (no statistical libraries required).
Clustering responses to define dependent variable for logistic regression
Some colleagues of mine are working with survey responses, and are attempting to predict behaviors with demographic data. So, the plan is to define a dependent variable from some combination of responses to the survey questions, and then use a regression technique to model this dependent variable using other characteristics of the respondents. We all agree on the 5 or so questions that will define the dependent variable, but we disagree on how to specify the definition. I want to look at the actual questions being answered, and create a "score" as a weighted count of the 'yeses' to the questions (weights based on how "on point" each question is to the behavior we are trying to define). My colleagues thought that this was too imprecise, and particularly criticised the 'intuitive' weight assignment.
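The weighted-count "score" described above could be sketched as follows. The question names, the weights, and the cutoff are all hypothetical placeholders, and thresholding the score into a 0/1 outcome is only one assumed way to turn it into a dependent variable for logistic regression.

```python
def weighted_score(responses, weights):
    """Sum the weights of the questions answered 'yes'.

    responses: dict mapping question id -> True (yes) / False (no)
    weights:   dict mapping question id -> how "on point" the question is
    """
    return sum(weights[q] for q, answered_yes in responses.items() if answered_yes)


def to_binary_outcome(score, cutoff):
    """Turn the score into a 0/1 dependent variable (the cutoff is an assumption)."""
    return 1 if score >= cutoff else 0


# Hypothetical weights for three of the survey questions.
weights = {"q1": 1.0, "q2": 0.5, "q3": 2.0}
respondent = {"q1": True, "q2": False, "q3": True}

score = weighted_score(respondent, weights)
print(score, to_binary_outcome(score, cutoff=2.0))
```

Keeping the weights in one explicit table makes the 'intuitive' assignment easy to inspect and to vary in a sensitivity check.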
Random Forest Missing Data Algorithms
Random forest (RF) missing data algorithms are an attractive approach for dealing with missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms but relatively little guidance about their efficacy, which motivated us to study their performance. Using a large, diverse collection of data sets, performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting---the latter class representing a generalization of a new promising imputation algorithm called missForest. Performance of algorithms was assessed by ability to impute data accurately. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.
The Discrete Dantzig Selector: Estimating Sparse Linear Models via Mixed Integer Linear Optimization
Mazumder, Rahul, Radchenko, Peter
We propose a novel high-dimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients subject to a budget on the maximal absolute correlation between the features and residuals. Motivated by the significant advances in integer optimization over the past 10-15 years, we present a Mixed Integer Linear Optimization (MILO) approach to obtain certifiably optimal global solutions to this nonconvex optimization problem. The current state of algorithmics in integer optimization makes our proposal substantially more computationally attractive than the least squares subset selection framework based on integer quadratic optimization, recently proposed in [8] and the continuous nonconvex quadratic optimization framework of [33]. We propose new discrete first-order methods, which when paired with state-of-the-art MILO solvers, lead to good solutions for the Discrete Dantzig Selector problem for a given computational budget. We illustrate that our integrated approach provides globally optimal solutions in significantly shorter computation times, when compared to off-the-shelf MILO solvers. We demonstrate both theoretically and empirically that in a wide range of regimes the statistical properties of the Discrete Dantzig Selector are superior to those of popular $\ell_{1}$-based approaches. We illustrate that our approach can handle problem instances with p = 10,000 features with certifiable optimality making it a highly scalable combinatorial variable selection approach in sparse linear modeling.
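The estimator described in the first sentence of this abstract can be written compactly as an optimization problem. The notation here is assumed for illustration rather than taken from the paper: $X$ is the $n \times p$ feature matrix, $y$ the response vector, $\beta$ the coefficient vector, and $\delta \ge 0$ the budget on the maximal absolute correlation between the features and the residuals.

```latex
\min_{\beta \in \mathbb{R}^{p}} \; \|\beta\|_{0}
\quad \text{subject to} \quad
\bigl\| X^{\top} (y - X\beta) \bigr\|_{\infty} \le \delta
```

Here $\|\beta\|_{0}$ counts the nonzero coefficients. Replacing it with $\|\beta\|_{1}$ gives the familiar convex $\ell_{1}$-based Dantzig selector, which is the kind of approach the abstract compares against.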