Regression


The GTEx Consortium atlas of genetic regulatory effects across human tissues

Science

The Genotype-Tissue Expression (GTEx) project was established to characterize genetic effects on the transcriptome across human tissues and to link these regulatory mechanisms to trait and disease associations. Here, we present analyses of the version 8 data, examining 15,201 RNA-sequencing samples from 49 tissues of 838 postmortem donors. We comprehensively characterize genetic associations for gene expression and splicing in cis and trans, showing that regulatory associations are found for almost all genes, and describe the underlying molecular mechanisms and their contribution to allelic heterogeneity and pleiotropy of complex traits. Leveraging the large diversity of tissues, we provide insights into the tissue specificity of genetic effects and show that cell type composition is a key factor in understanding gene regulatory mechanisms in human tissues.


A First Step Towards Distribution Invariant Regression Metrics

arXiv.org Machine Learning

Regression evaluation has been performed for decades. Some metrics have been identified as robust against shifting and scaling of the data, but accounting for differing data distributions (the imbalance problem) is much harder, even though it strongly affects the comparability of evaluations on different datasets. In classification, it has been stated repeatedly that performance metrics like the F-Measure and Accuracy are highly dependent on the class distribution and that comparisons between different datasets with different distributions are impossible. We show that the same problem exists in regression. The distribution of odometry parameters in robotic applications can, for example, vary widely between different recording sessions. Here, we need regression algorithms that either perform equally well for all function values or focus on certain boundary regions such as high speed. This has to be reflected in the evaluation metric. We propose modifying established regression metrics by weighting with the inverse distribution of the function values $Y$ or the samples $X$, using an automatically tuned Gaussian kernel density estimator. We show on synthetic and robotic data in reproducible experiments that classical metrics behave wrongly, whereas our new metrics are less sensitive to changing distributions, especially when correcting by the marginal distribution in $X$. Our new evaluation concept enables the comparison of results between different datasets with different distributions. Furthermore, it can reveal overfitting of a regression algorithm to overrepresented target values. As a result, non-overfitting regression algorithms are more likely to be chosen under our corrected metrics.
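The weighting idea in this abstract can be illustrated with a minimal sketch. It weights a mean absolute error by the inverse of a Gaussian kernel density estimate of the target values $Y$. The function names, the toy data, and the Silverman rule-of-thumb bandwidth are assumptions made for illustration; the abstract only says the estimator is "automatically tuned" and does not say how, so this is not the authors' implementation.

```python
import math

def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth (an assumption, not the paper's tuning method)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.06 * std * n ** (-1 / 5)

def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values` at a point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(point):
        return norm * sum(
            math.exp(-0.5 * ((point - v) / bandwidth) ** 2) for v in values
        )

    return density

def plain_mae(y_true, y_pred):
    """Classical mean absolute error: every sample counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def inverse_density_weighted_mae(y_true, y_pred):
    """MAE in which each sample is weighted by 1 / estimated density of its target."""
    density = gaussian_kde(y_true, silverman_bandwidth(y_true))
    weights = [1.0 / density(t) for t in y_true]
    weighted_errors = sum(
        w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred)
    )
    return weighted_errors / sum(weights)

# Hypothetical toy data: 90 low-speed targets, 10 rare high-speed targets.
y_true = [1.0 + 0.01 * i for i in range(90)] + [10.0 + 0.5 * i for i in range(10)]
# Hypothetical model that is exact at low speed and off by 2.0 at high speed.
y_pred = [t if t < 5 else t - 2.0 for t in y_true]

mae = plain_mae(y_true, y_pred)
weighted_mae = inverse_density_weighted_mae(y_true, y_pred)
```

On this toy data the plain MAE is dominated by the well-predicted, overrepresented low targets, while the weighted variant gives the rare high targets more influence and therefore reports a larger error.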


Generalized Multi-Output Gaussian Process Censored Regression

arXiv.org Machine Learning

When modelling censored observations, a typical approach in current regression methods is to use a censored-Gaussian (i.e. Tobit) model to describe the conditional output distribution. In this paper, as in the case of missing data, we argue that exploiting correlations between multiple outputs can enable models to better address the bias introduced by censored data. To do so, we introduce a heteroscedastic multi-output Gaussian process model which combines the non-parametric flexibility of GPs with the ability to leverage information from correlated outputs under input-dependent noise conditions. To address the resulting inference intractability, we further devise a variational bound to the marginal log-likelihood suitable for stochastic optimization. We empirically evaluate our model against other generative models for censored data on both synthetic and real-world tasks and further show how it can be generalized to deal with arbitrary likelihood functions. Results show how the added flexibility allows our model to better estimate the underlying non-censored (i.e. true) process under potentially complex censoring dynamics.
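For readers unfamiliar with the censored-Gaussian (Tobit) likelihood mentioned above, the following is a minimal sketch for a single output with a constant mean and standard deviation. It assumes right-censoring, meaning a censored record only tells us the latent value is at least the recorded threshold; the abstract does not state the censoring direction, and the function names are hypothetical. It does not attempt the multi-output Gaussian process model the paper proposes.

```python
import math

def normal_logpdf(y, mean, std):
    """Log-density of a Gaussian at y."""
    z = (y - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)

def normal_logsf(y, mean, std):
    """Log of P(latent >= y) for a Gaussian, i.e. log(1 - Phi(z))."""
    z = (y - mean) / std
    return math.log(0.5 * math.erfc(z / math.sqrt(2)))

def tobit_loglik(observations, censored, mean, std):
    """Tobit log-likelihood under right-censoring (an assumption of this sketch).

    Uncensored entries contribute the Gaussian density; censored entries
    contribute the probability that the latent value exceeds the threshold.
    """
    total = 0.0
    for y, is_censored in zip(observations, censored):
        if is_censored:
            total += normal_logsf(y, mean, std)
        else:
            total += normal_logpdf(y, mean, std)
    return total

# One exact observation at 0 and one observation censored at 0.
loglik = tobit_loglik([0.0, 0.0], [False, True], mean=0.0, std=1.0)
```

Treating a censored value as if it were exact would use the density term for both records, which is the source of the bias the abstract refers to.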


Simulating normalising constants with referenced thermodynamic integration: application to COVID-19 model selection

arXiv.org Machine Learning

Model selection is a fundamental part of Bayesian statistical inference, a widely used tool in the field of epidemiology. Simple methods such as the Akaike Information Criterion are commonly used, but they do not incorporate the uncertainty of the model's parameters, which can give misleading choices when comparing models with similar fit to the data. A more rigorous approach to model selection, one that uses the full posterior distributions of the models, is to compute the ratio of the normalising constants (or model evidence), known as the Bayes factor. These normalising constants integrate the posterior distribution over all parameters and balance over- and under-fitting. However, normalising constants often come in the form of intractable, high-dimensional integrals, so special probabilistic techniques need to be applied to correctly estimate the Bayes factors. One such method is thermodynamic integration (TI), which can be used to estimate the ratio of two models' evidence by integrating over a continuous path between the two un-normalised densities. In this paper we introduce a variation of the TI method, here referred to as referenced TI, which computes a single model's evidence in an efficient way by using a reference density, such as a multivariate normal, whose normalising constant is known. We show that referenced TI, an asymptotically exact Monte Carlo method of calculating the normalising constant of a single model, in practice converges to the correct result much faster than other competing approaches such as the method of power posteriors. We illustrate the implementation of the algorithm on informative 1- and 2-dimensional examples, apply it to a popular linear regression problem, and use it to select parameters for a model of the COVID-19 epidemic in South Korea.
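The thermodynamic integration identity behind this abstract can be sketched on a 1-dimensional toy problem. With path densities proportional to $q^{\lambda} q_{\mathrm{ref}}^{1-\lambda}$, the log evidence is $\log z = \log z_{\mathrm{ref}} + \int_0^1 \mathbb{E}_{\lambda}[\log q - \log q_{\mathrm{ref}}]\,d\lambda$. The toy target, the standard normal reference, and all function names below are assumptions for illustration. To keep the sketch deterministic, the expectations are computed by grid quadrature instead of the Monte Carlo sampling the paper describes, so this shows the identity only, not the authors' algorithm.

```python
import math

def log_target(theta):
    """Un-normalised log-density of a hypothetical 1-D toy model."""
    return -0.5 * theta ** 2 - 0.25 * theta ** 4

def log_reference(theta):
    """Un-normalised standard normal; its normalising constant is sqrt(2*pi)."""
    return -0.5 * theta ** 2

LOG_Z_REFERENCE = 0.5 * math.log(2 * math.pi)
GRID = [-8.0 + 16.0 * i / 4000 for i in range(4001)]

def expected_log_ratio(lam):
    """E[log q - log q_ref] under the path density q^lam * q_ref^(1 - lam).

    Grid quadrature stands in for Monte Carlo sampling in this sketch.
    """
    numerator = 0.0
    denominator = 0.0
    for theta in GRID:
        log_ratio = log_target(theta) - log_reference(theta)
        weight = math.exp(log_reference(theta) + lam * log_ratio)
        numerator += weight * log_ratio
        denominator += weight
    return numerator / denominator

def referenced_ti_log_evidence(n_lambda=101):
    """Trapezoidal integration of the expected log-ratio over lambda in [0, 1]."""
    step = 1.0 / (n_lambda - 1)
    values = [expected_log_ratio(i * step) for i in range(n_lambda)]
    integral = step * (sum(values) - 0.5 * (values[0] + values[-1]))
    return LOG_Z_REFERENCE + integral

def direct_log_evidence():
    """Brute-force check, feasible only because the toy problem is 1-D."""
    step = GRID[1] - GRID[0]
    return math.log(step * sum(math.exp(log_target(t)) for t in GRID))

ti_estimate = referenced_ti_log_evidence()
direct_estimate = direct_log_evidence()
```

In one dimension the direct integral is trivial, so the comparison serves only as a sanity check; the method is motivated by high-dimensional problems where no such brute-force check exists.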


A Gentle Introduction to Self-Training and Semi-Supervised Learning

#artificialintelligence

When it comes to machine learning classification tasks, the more data available to train algorithms, the better. In supervised learning, this data must be labeled with respect to the target class -- otherwise, these algorithms wouldn't be able to learn the relationships between the independent and target variables. So, what if we only have enough time and money to label some of a large data set, and choose to leave the rest unlabeled? Can this unlabeled data somehow be used in a classification algorithm? This is where semi-supervised learning comes in.


The First Step in Bayesian Time Series-- Linear Regression

#artificialintelligence

Today time series forecasting is ubiquitous, and decision-making processes in companies depend heavily on their ability to predict the future. Through a short series of articles I will present a possible approach to this kind of problem, combining state-space models with Bayesian statistics. In the initial articles, I will take some of the examples from the book An Introduction to State Space Time Series Analysis by Jacques J.F. Commandeur and Siem Jan Koopman [1]. It is a well-known introduction to the subject of state-space modeling applied to the time series domain. In classical regression analysis, a linear relationship is assumed between a dependent variable y and a predictor variable x.
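As a small illustration of the classical assumption, the sketch below fits y = intercept + slope * x by ordinary least squares using the closed-form estimates. The function name and the data are hypothetical, and this is the classical fit only, not the Bayesian state-space approach the article series goes on to develop.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares estimates for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the sample covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical noise-free data generated from y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(x, y)
```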


Working out the mystery of ectasia risk with artificial intelligence

#artificialintelligence

This article was reviewed by Renato Ambrósio, Jr, MD, PhD. Ectasia is an intriguing and mysterious complication of laser-vision-correction (LVC) procedures. The potentially devastating problem underscores the importance of determining the susceptibility of the cornea to developing progressive ectasia, and of going beyond detecting just mild or subclinical keratoconus. The corneal structure as well as the potential impact of LVC should be considered to predict ectasia risk in every patient. "The LVC procedure and eye rubbing are the primary environmental culprits in the development of ectasia in any cornea," said Renato Ambrósio, Jr, MD, PhD. "So, a basic factor for avoiding ectasia is educating the patient not to rub the eye."


Algorithmic Trading Using Logistic Regression - Hands-Off Investing

#artificialintelligence

With the increasing popularity of machine learning, many traders are looking for ways in which they can "teach" a computer to trade for them. This process is called algorithmic trading (sometimes called algo-trading). Algorithmic trading is a hands-off strategy for buying and selling stocks that leverages technical indicators instead of human intuition. To implement an algorithmic trading strategy, though, you first have to narrow down a list of stocks that you want to analyze. This walk-through provides an automated process (using Python and logistic regression) for determining the best stocks to algo-trade.


Build an IoT hub for streaming, storing, and analyzing sensor data in the cloud: Connect an Android device to the IBM Cloud, build a Node-RED dashboard, and build an AI classifier

#artificialintelligence

In this tutorial, we present the high-level steps that are involved in connecting an Android device to the cloud and developing analytics models to analyze sensor data. By the end of this tutorial you should be able to set up your own IoT hub for streaming, storing and processing device data. The following figure shows the architecture of our sample app. This tutorial requires an Android device (smartphone), an internet connection, and an IBM Cloud account. In Step 1 you will create an account on IBM Cloud and install an application on your Android phone.


Empirical Strategy for Stretching Probability Distribution in Neural-network-based Regression

arXiv.org Artificial Intelligence

In regression analysis under artificial neural networks, the prediction performance depends on determining the appropriate weights between layers. As randomly initialized weights are updated during back-propagation using the gradient descent procedure under a given loss function, the loss function structure can affect the performance significantly. In this study, we considered the distribution error, i.e., the inconsistency of two distributions (those of the predicted values and the labels), as the prediction error, and proposed weighted empirical stretching (WES) as a novel loss function to increase the overlap area of the two distributions. The function depends on the distribution of a given label, so it is applicable to any distribution shape. Moreover, it contains a scaling hyperparameter such that the appropriate parameter value maximizes the common section of the two distributions. To test the function capability, we generated ideal distributed curves (unimodal, skewed unimodal, bimodal, and skewed bimodal) as the labels, and used the Fourier-extracted input data from the curves under a feedforward neural network. In general, WES outperformed loss functions in wide use, and the performance was robust to the various noise levels. The improved RMSE results for the extreme domain (i.e., both tail regions of the distribution) are expected to be useful for predicting abnormal events in non-linear complex systems, such as natural disasters and financial crises.