Regression
A machine learning classifier trained on cancer transcriptomes detects NF1 inactivation signal in glioblastoma
Background: We have identified molecules that exhibit synthetic lethality in cells with loss of the neurofibromin 1 (NF1) tumor suppressor gene. However, recognizing tumors that have inactivation of the NF1 tumor suppressor function is challenging because the loss may occur via mechanisms that do not involve mutation of the genomic locus. Degradation of the NF1 protein, independent of NF1 mutation status, phenocopies inactivating mutations to drive tumors in human glioma cell lines. NF1 inactivation may alter the transcriptional landscape of a tumor and allow a machine learning classifier to detect which tumors will benefit from synthetic lethal molecules. Results: We developed a strategy to predict tumors with low NF1 activity and hence tumors that may respond to treatments that target cells lacking NF1.
Projected Regression Methods for Inverting Fredholm Integrals: Formalism and Application to Analytical Continuation
Arsenault, Louis-Francois, Neuberg, Richard, Hannah, Lauren A., Millis, Andrew J.
We present a machine learning approach to the inversion of Fredholm integrals of the first kind. The approach provides a natural regularization in cases where the inverse of the Fredholm kernel is ill-conditioned. It also provides an efficient and stable treatment of constraints. The key observation is that the stability of the forward problem permits the construction of a large database of outputs for physically meaningful inputs. We apply machine learning to this database to generate a regression function of controlled complexity, which returns approximate solutions for previously unseen inputs; the approximate solutions are then projected onto the subspace of functions satisfying relevant constraints. We also derive and present uncertainty estimates. We illustrate the approach by applying it to the analytical continuation problem of quantum many-body physics, which involves reconstructing the frequency dependence of physical excitation spectra from data obtained at specific points in the complex frequency plane. Under standard error metrics the method performs as well as or better than the Maximum Entropy method for low input noise and is substantially more robust to increased input noise. We expect the methodology to be similarly effective for any problem involving a formally ill-conditioned inversion, provided that the forward problem can be efficiently solved.
Book: Mastering Python for Data Science
If you are a Python developer who wants to master the world of data science, then this book is for you. Some knowledge of data science is assumed. Derive inferences from the analysis by performing inferential statistics. Evaluate and apply the linear regression technique to estimate the relationships among variables.
Large Scale Decision Forests: Lessons Learned - Sift Science Engineering Blog
We at Sift Science provide fraud detection for hundreds of customers spanning many industries and use cases. To do this, we have devised a specialized modeling stack that is able to adapt to individual customers while simultaneously delivering a great out-of-box experience for new customers, achieved by mixing the output from a "global" model (trained on our entire network of data) with the output from a customer's individualized model. Prior to decision forests, we used a custom-built logistic regression classifier combined with highly specialized feature engineering for our global model. While logistic regression has many great attributes, it is fundamentally limited by its inability to model non-linear interactions between features. At Sift, we tend to think of our modeling stack primarily as an enabler of our feature engineering; more powerful modeling allows us to extract the most insight from our features and can even lead to new classes of features. So when in early 2015 we stopped seeing benefits from feature engineering work, it was clear to us that we needed a major upgrade to our modeling stack.
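The limitation described above can be illustrated with a toy case. The following is a minimal sketch, not Sift Science's classifier: it fits a plain logistic regression by batch gradient descent on the XOR pattern, where the label depends only on the interaction between two features. The function names, step count, and learning rate are assumptions chosen for illustration. With the raw features the model cannot move away from a probability of 0.5; once a hand-engineered product feature is added, the same model separates the classes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, steps=5000, lr=1.0):
    """Fit weights (bias first) by batch gradient descent on the mean log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for row, label in zip(rows, labels):
            features = [1.0] + list(row)  # prepend the constant bias feature
            score = sum(w * f for w, f in zip(weights, features))
            error = sigmoid(score) - label
            for j, f in enumerate(features):
                grad[j] += error * f / len(rows)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


def predict_proba(weights, row):
    features = [1.0] + list(row)
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


# XOR: the label is 1 only when exactly one of the two features is 1.
xor_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]

# Raw features only: no linear decision boundary fits this pattern.
plain = train_logistic(xor_rows, xor_labels)
plain_probs = [predict_proba(plain, r) for r in xor_rows]

# Same model with an engineered interaction feature a * b added by hand.
crossed_rows = [(a, b, a * b) for a, b in xor_rows]
crossed = train_logistic(crossed_rows, xor_labels)
crossed_preds = [int(predict_proba(crossed, r) > 0.5) for r in crossed_rows]
```

In this sketch the interaction has to be supplied by feature engineering, whereas tree-based models can represent such interactions directly through successive splits.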
10 types of regressions. Which one to use?
Linear regression: Oldest type of regression, designed 250 years ago; computations (on small data) could easily be carried out by a human being, by design. Can be used for interpolation, but not suitable for predictive analytics; has many drawbacks when applied to modern data. A better solution is piecewise-linear regression, in particular for time series. Logistic regression: Used extensively in clinical trials, scoring and fraud detection, when the response is binary (chance of succeeding or failing, e.g. for a new tested drug or a credit card transaction). Suffers the same drawbacks as linear regression (not robust, model-dependent), and computing regression coefficients involves a complex, iterative, numerically unstable algorithm.
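To make the point about hand computation concrete, here is a minimal sketch of ordinary least squares for a single predictor, using only sums and means. The function name `fit_line` and the sample numbers are assumptions for illustration, not taken from the article.

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

No iteration is needed here, in contrast to logistic regression, whose coefficients have no closed form and are usually found by an iterative optimizer.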
Improved prediction accuracy for disease risk mapping using Gaussian Process stacked generalisation
Bhatt, Samir, Cameron, Ewan, Flaxman, Seth R, Weiss, Daniel J, Smith, David L, Gething, Peter W
Maps of infectious disease---charting spatial variations in the force of infection, degree of endemicity, and the burden on human health---provide an essential evidence base to support planning towards global health targets. Contemporary disease mapping efforts have embraced statistical modelling approaches to properly acknowledge uncertainties in both the available measurements and their spatial interpolation. The most common such approach is that of Gaussian process regression, a mathematical framework comprised of two components: a mean function harnessing the predictive power of multiple independent variables, and a covariance function yielding spatio-temporal shrinkage against residual variation from the mean. Though many techniques have been developed to improve the flexibility and fitting of the covariance function, models for the mean function have typically been restricted to simple linear terms. For infectious diseases, known to be driven by complex interactions between environmental and socio-economic factors, improved modelling of the mean function can greatly boost predictive power. Here we present an ensemble approach based on stacked generalisation that allows for multiple, non-linear algorithmic mean functions to be jointly embedded within the Gaussian process framework. We apply this method to mapping Plasmodium falciparum prevalence data in Sub-Saharan Africa and show that the generalised ensemble approach markedly out-performs any individual method.
Data Science Dictionary
The idea of cross-validation is to split the data into N subsets, to put one subset aside, to estimate parameters of the model from the remaining N-1 subsets, and to use the retained subset to estimate the error of the model. Such a process is repeated N times, with each of the N subsets being used as the validation set. Then the values of the errors obtained in such N steps are combined to provide the final estimate of the model error. Cross-validation is used in various classification and prediction procedures, such as regression analysis, discriminant analysis, neural networks and classification and regression trees (CART). The goal is to improve the quality of the decision that is made from the outcome of the study on the basis of statistical methods, and to ensure that maximum information is obtained from scarce experimental data.
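The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the subsets are contiguous and unshuffled, the error metric is mean squared error, the per-subset errors are combined by a simple average, and the `fit` and `predict` callables (here a hypothetical model that predicts the training mean) stand in for whatever model is being evaluated.

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Split indices 0..n_samples-1 into n_folds contiguous, near-equal subsets."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, n_folds, fit, predict):
    """Average the held-out mean squared error over all n_folds subsets."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        # Estimate the model from the remaining N-1 subsets.
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        # Use the retained subset to estimate the error.
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(mean(errors))
    return mean(fold_errors)


# Hypothetical baseline model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


cv_error = cross_validate(list(range(6)), [1.0] * 6, 3, fit_mean, predict_mean)
```

In practice the data are usually shuffled before splitting so that each subset is representative; that step is omitted here to keep the sketch deterministic.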
How to Treat Missing Values in Your Data
How do you deal with missing values - ignore or treat them? The answer would depend on the percentage of those missing values in the dataset, the variables affected by missing values, whether those missing values are part of the dependent or the independent variables, etc. Missing value treatment becomes important since the data insights or the performance of your predictive model could be impacted if the missing values are not appropriately handled. The 2 tables above give different insights. The inference from the table on the left with the missing data indicates lower count for Android Mobile users and iOS Tablet users and higher Average Transaction Value compared to the inference from the right table with no missing data. The inference from the data with missing values could adversely impact business decisions. The best scenario is to get the actual value that was missing by going back to the Data Extraction & Collection stage and correcting possible errors during these stages. Generally, that won't be the case and you will still be left with missing values.
High-dimensional regression over disease subgroups
Dondelinger, Frank, Mukherjee, Sach, The Alzheimer's Disease Neuroimaging Initiative
We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where disease subtypes, for example, may differ with respect to underlying regression models, but sample sizes at the subgroup-level may be limited. We focus on the case in which subgroup-specific models may be expected to be similar but not necessarily identical. Our approach is to treat subgroups as related problem instances and jointly estimate subgroup-specific regression coefficients. This is done in a penalized framework, combining an $\ell_1$ term with an additional term that penalizes differences between subgroup-specific coefficients. This gives solutions that are globally sparse but that allow information-sharing between the subgroups. We present algorithms for estimation and empirical results on simulated data and using Alzheimer's disease, amyotrophic lateral sclerosis and cancer datasets. These examples demonstrate the gains our approach can offer in terms of prediction and the ability to estimate subgroup-specific sparsity patterns.
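One way to write down a penalized objective of the kind the abstract describes is sketched below. This is an illustrative form only: the abstract does not give the exact loss scaling, the norm used for the difference penalty, or any weighting between pairs of subgroups, so the squared $\ell_2$ fusion term, the $1/n_k$ scaling, and the symbols $K$, $n_k$, $X_k$, $y_k$, $\lambda$, and $\gamma$ are all assumptions.

```latex
% Assumed notation: K subgroups; subgroup k has n_k observations,
% design matrix X_k, response vector y_k, and coefficient vector \beta_k.
\[
  \hat{\beta}_1, \dots, \hat{\beta}_K
  = \arg\min_{\beta_1, \dots, \beta_K}
    \sum_{k=1}^{K} \frac{1}{n_k} \lVert y_k - X_k \beta_k \rVert_2^2
    + \lambda \sum_{k=1}^{K} \lVert \beta_k \rVert_1
    + \gamma \sum_{k < k'} \lVert \beta_k - \beta_{k'} \rVert_2^2
\]
```

Under this reading, $\lambda$ controls global sparsity, while $\gamma$ controls how strongly the subgroup-specific coefficients are pulled toward one another: $\gamma = 0$ fits each subgroup separately, and a very large $\gamma$ approaches a single pooled model.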