 chip-seq


Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra

Amin, Alan N., Potapczynski, Andres, Wilson, Andrew Gordon

arXiv.org Artificial Intelligence

To understand how genetic variants in human genomes manifest in phenotypes -- traits like height or diseases like asthma -- geneticists have sequenced and measured hundreds of thousands of individuals. Geneticists use this data to build models that predict how a genetic variant impacts phenotype given genomic features of the variant, like DNA accessibility or the presence of nearby DNA-bound proteins. As more data and features become available, one might expect predictive models to improve. Unfortunately, training these models is bottlenecked by the need to solve expensive linear algebra problems because variants in the genome are correlated with nearby variants, requiring inversion of large matrices. Previous methods have therefore been restricted to fitting small models, and fitting simplified summary statistics, rather than the full likelihood of the statistical model. In this paper, we leverage modern fast linear algebra techniques to develop DeepWAS (Deep genome Wide Association Studies), a method to train large and flexible neural network predictive models to optimize likelihood. Notably, we find that larger models only improve performance when using our full likelihood approach; when trained by fitting traditional summary statistics, larger models perform no better than small ones. We find larger models trained on more features make better predictions, potentially improving disease predictions and therapeutic target identification.
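The bottleneck named in the abstract, solving linear systems that involve a large matrix of correlations between nearby variants, can be illustrated with a small sketch. The abstract does not say which algorithms DeepWAS uses, so the following is an assumption made purely for illustration: it uses conjugate gradients, one common iterative solver that needs only matrix-vector products and never forms an explicit inverse. The toy correlation matrix `R`, whose correlation decays with distance between variants, and all function names are hypothetical.

```python
# Illustrative sketch only: conjugate gradients (CG) solves R x = b using
# matrix-vector products, avoiding an explicit inverse of R.
# This is NOT taken from the paper; R, b, and all names are hypothetical.

def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:   # stop once the residual norm is small
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy correlation matrix: correlation between variants i and j decays as
# 0.5 ** |i - j| (an assumed stand-in for correlations among nearby variants).
n = 6
R = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
b = [1.0] * n

x = conjugate_gradient(R, b)
residual = [ri - bi for ri, bi in zip(matvec(R, x), b)]
print(max(abs(v) for v in residual))
```

Each iteration costs one matrix-vector product, which is why solvers of this kind scale better than direct inversion when the matrix is large or structured.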


Nucleosome positioning: resources and tools online

Teif, Vladimir B.

arXiv.org Machine Learning

This is the author's version, which is continuously updated and not synchronised with the journal version. The final printed version will appear in Briefings in Bioinformatics.

Abstract

Nucleosome positioning is an important process required for proper genome packing and its accessibility to execute the genetic program in a cell-specific, timely manner. In recent years hundreds of papers have been devoted to the bioinformatics, physics and biology of nucleosome positioning. The purpose of this review is to cover a practical aspect of this field, namely to provide a guide to the multitude of nucleosome positioning resources available online. These include almost 300 experimental datasets of genome-wide nucleosome occupancy profiles determined in different cell types, and more than 40 computational tools for the analysis of experimental nucleosome positioning data and the prediction of intrinsic nucleosome formation probabilities from the DNA sequence. A manually curated, up-to-date list of these resources will be maintained at http://generegulation.info.

1 Introduction

The nucleosome is the basic unit of chromatin compaction, composed of the histone octamer and 146-147 base pairs (bp) of DNA wrapped around it. Nucleosomes can form at any genomic location, but some DNA sequences have higher affinity for the histone octamer, mostly due to the differential bending properties of the DNA double helix. In addition, nucleosome positioning is cell type-specific, in the sense that cells of the same organism sharing the same genome can have different nucleosome locations depending on the cell type and state. Interested readers are directed to a number of recent publications reviewing the biological, physical and bioinformatics aspects of these phenomena, which are outside the scope of the current work [1-32].

Here we will omit fundamental scientific questions and focus on a very practical aspect of the field: which experimental nucleosome positioning datasets already exist, how to generate your own data, and how to compare these with other experimental datasets and with bioinformatically predicted nucleosome positions in a given genome.

1. Available experimental datasets

Recent high-throughput genome-wide data on nucleosome positioning come from a number of related techniques, which share the idea of cutting DNA between nucleosomes and mapping the protected DNA regions. The most frequently used method is MNase-seq (chromatin digestion by micrococcal nuclease followed by deep sequencing) [11, 33-35].