 genetic variant


GV-Rep: A Large-Scale Dataset for Genetic Variant Representation Learning

Neural Information Processing Systems

The development of deep learning approaches for modeling these multifactorial effects of GVs is still in its nascent stages, primarily due to the lack of comprehensive datasets that capture the intricate relationships between GVs and their downstream effects on complex traits.



Bears in Italy inbreed more, but are less aggressive

Popular Science

Apennine brown bears have been isolated from their European counterparts since the Roman Empire. While bear attacks seem to have become a significant problem in Japan, with the country going as far as deploying the army, new research reveals that an Italian bear species has evolved to be less aggressive. Apennine brown bears have been in close contact with humans for generations. Their small, endangered population exists only in central Italy, and previous research suggests that this population split off from other European brown bears 2,000 to 3,000 years ago.


Evolution of intelligence in our ancestors may have come at a cost

New Scientist

A timeline of genetic changes in millions of years of human evolution shows that variants linked to higher intelligence appeared most rapidly around 500,000 years ago, and were closely followed by mutations that made us more prone to mental illness. The findings suggest a "trade-off" in brain evolution between intelligence and psychiatric issues, says Ilan Libedinsky at the Center for Neurogenomics and Cognitive Research in Amsterdam, the Netherlands. "Mutations related to psychiatric disorders apparently involve part of the genome that also involves intelligence. So there's an overlap there," says Libedinsky. "[The advances in cognition] may have come at the price of making our brains more vulnerable to mental disorders."


Identification and Estimation of the Bi-Directional MR with Some Invalid Instruments

Neural Information Processing Systems

We consider the challenging problem of estimating causal effects from purely observational data in the bi-directional Mendelian randomization (MR), where some invalid instruments, as well as unmeasured confounding, usually exist. To address this problem, most existing methods attempt to find proper valid instrumental variables (IVs) for the target causal effect by expert knowledge or by assuming that the causal model is a one-directional MR model. As such, in this paper, we first theoretically investigate the identification of the bi-directional MR from observational data. In particular, we provide necessary and sufficient conditions under which valid IV sets are correctly identified such that the bi-directional MR model is identifiable, including the causal directions of a pair of phenotypes (i.e., the treatment and outcome). Moreover, based on the identification theory, we develop a cluster fusion-like method to discover valid IV sets and estimate the causal effects of interest. We theoretically demonstrate the correctness of the proposed algorithm. Experimental results show the effectiveness of our method for estimating causal effects in both one-directional and bi-directional MR models.
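For readers unfamiliar with the instrumental-variable idea that MR builds on, the following is a minimal sketch of the textbook Wald ratio estimator with a single valid genetic instrument in a one-directional model. It is not the cluster fusion-like method of the paper; the simulated data, the effect sizes, and the function names (`simulate`, `wald_ratio`, `ols_slope`) are all illustrative assumptions.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)


def simulate(n, true_effect, seed=0):
    """Simulate a one-directional MR model (hypothetical parameters).

    g: genotype coded 0/1/2, a valid instrument by construction
    u: unmeasured confounder affecting both exposure and outcome
    x: exposure, y: outcome
    """
    rng = random.Random(seed)
    g, x, y = [], [], []
    for _ in range(n):
        gi = int(rng.random() < 0.3) + int(rng.random() < 0.3)
        u = rng.gauss(0, 1)
        xi = 1.0 * gi + u + rng.gauss(0, 1)
        yi = true_effect * xi + u + rng.gauss(0, 1)
        g.append(gi)
        x.append(xi)
        y.append(yi)
    return g, x, y


def wald_ratio(g, x, y):
    """IV estimate of the effect of x on y: cov(g, y) / cov(g, x)."""
    return covariance(g, y) / covariance(g, x)


def ols_slope(x, y):
    """Naive regression slope of y on x, biased by the confounder u."""
    return covariance(x, y) / covariance(x, x)


g, x, y = simulate(n=50_000, true_effect=0.5)
iv_estimate = wald_ratio(g, x, y)
ols_estimate = ols_slope(x, y)
print(f"IV estimate:  {iv_estimate:.3f}")
print(f"OLS estimate: {ols_estimate:.3f}")
```

In this simulation the naive slope absorbs the confounder's contribution, while the Wald ratio recovers a value close to the simulated effect because the genotype influences the outcome only through the exposure. If the instrument were invalid (for example, if it affected the outcome directly), the ratio would be biased as well, which is the situation the abstract addresses.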


Incorporating LLM Embeddings for Variation Across the Human Genome

Niu, Hongqian, Bryan, Jordan, Li, Xihao, Li, Didong

arXiv.org Artificial Intelligence

In the past few years, foundation models based on large transformer networks such as Google's BERT (Kenton and Toutanova, 2019) and OpenAI's GPT family (Radford, 2018) have been shown to be invaluable aids for scientific discovery in the analysis of genomic data (Cui et al., 2024; Theodoris et al., 2023; Chen and Zou, 2025). More specifically, foundation models targeted at genomic applications typically comprise those trained on enormous databases of experimental data, such as scGPT (Cui et al., 2024), which was trained on transcriptomes from 33 million human cells from 441 different studies, or the GeneFormer model (Theodoris et al., 2023), which was trained on 29.9 million human single-cell transcriptomes. On the other hand, foundation models pre-trained on internet-scale databases of natural language text may offer distinct advantages, such as capturing niche biological relationships that may be widely documented in the scientific literature but not necessarily represented experimentally in large-scale genomics datasets. For this reason, some recent works have used the embedding outputs of large language models (LLMs) such as ChatGPT (Radford, 2018) to encode the biological information contained in text-based gene descriptions, such as those in the NCBI database (Schoch et al., 2020). Notably, Chen and Zou (2025) show that these text-based gene descriptions can be input to GPT-3.5 to obtain gene embeddings, denoted GenePT, that act as features/covariates for standard prediction algorithms.


Whole-Genome Sequencing Will Change Pregnancy

WIRED

At WIRED Health 2025, Orchid CEO Noor Siddiqui and the genomics pioneer George Church laid out their view of the future of genetic screening. The world of pregnancy is going to radically change, predicts Noor Siddiqui. "I think that the default way people are going to choose to have kids is via IVF and embryo screening," she said at the WIRED Health summit last week. "There's just a massive amount of risk that you can take off of the table." Siddiqui is the founder and CEO of Orchid, a biotech company that offers whole-genome screening of embryos for IVF.


Functional Analysis of Variance for Association Studies

Vsevolozhskaya, Olga A., Zaykin, Dmitri V., Greenwood, Mark C., Wei, Changshuai, Lu, Qing

arXiv.org Artificial Intelligence

While progress has been made in identifying common genetic variants associated with human diseases, for most common complex diseases the identified genetic variants account for only a small proportion of heritability. Challenges remain in finding additional unknown genetic variants predisposing to complex diseases. With the advance in next-generation sequencing technologies, sequencing studies have become commonplace in genetic research. The ongoing exome-sequencing and whole-genome-sequencing studies generate a massive amount of sequencing variants and allow researchers to comprehensively investigate their role in human diseases. The discovery of new disease-associated variants can be enhanced by utilizing powerful and computationally efficient statistical methods. In this paper, we propose a functional analysis of variance (FANOVA) method for testing an association of sequence variants in a genomic region with a qualitative trait. The FANOVA has a number of advantages: (1) it tests for a joint effect of gene variants, including both common and rare; (2) it fully utilizes linkage disequilibrium and genetic position information; and (3) it allows for either protective or risk-increasing causal variants. Through simulations, we show that FANOVA outperforms two popularly used methods, SKAT and a previously proposed method based on functional linear models (FLM), especially if the sample size of a study is small and/or sequence variants have low to moderate effects. We conduct an empirical study by applying three methods (FANOVA, SKAT and FLM) to sequencing data from the Dallas Heart Study. While SKAT and FLM respectively detected ANGPTL4 and ANGPTL3 associated with obesity, FANOVA was able to identify both genes associated with obesity.


A Weighted U Statistic for Genetic Association Analyses of Sequencing Data

Wei, Changshuai, Li, Ming, He, Zihuai, Vsevolozhskaya, Olga, Schaid, Daniel J., Lu, Qing

arXiv.org Artificial Intelligence

With advancements in next-generation sequencing technology, a massive amount of sequencing data is being generated, offering a great opportunity to comprehensively investigate the role of rare variants in the genetic etiology of complex diseases. Nevertheless, this poses a great challenge for the statistical analysis of high-dimensional sequencing data. Association analyses based on traditional statistical methods suffer substantial power loss because of the low frequency of genetic variants and the extremely high dimensionality of the data. We developed a weighted U statistic, referred to as WU-SEQ, for the high-dimensional association analysis of sequencing data. Based on a non-parametric U statistic, WU-SEQ makes no assumption about the underlying disease model and phenotype distribution, and can be applied to a variety of phenotypes. Through simulation studies and an empirical study, we showed that WU-SEQ outperformed a commonly used SKAT method when the underlying assumptions were violated (e.g., the phenotype followed a heavy-tailed distribution). Even when the assumptions were satisfied, WU-SEQ still attained comparable performance to SKAT. Finally, we applied WU-SEQ to sequencing data from the Dallas Heart Study (DHS), and detected an association between ANGPTL4 and very low density lipoprotein cholesterol.
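As a rough illustration of what a weighted U statistic looks like in general, the sketch below sums a phenotype-similarity kernel over all pairs of subjects, weighting each pair by a genetic similarity. It is not the WU-SEQ implementation: the kernel (product of mean-centered phenotypes), the weight (inner product of genotype vectors), and the function name `weighted_u_statistic` are assumptions chosen only to keep the example small.

```python
from itertools import combinations


def weighted_u_statistic(phenotypes, genotypes):
    """Generic weighted U statistic: 2 * sum over pairs i < j of w_ij * h(y_i, y_j).

    h(y_i, y_j): product of mean-centered phenotypes (illustrative kernel).
    w_ij: inner product of the two genotype vectors (illustrative weight).
    Both choices are assumptions, not those of the WU-SEQ paper.
    """
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    centered = [y - mean_y for y in phenotypes]

    total = 0.0
    for i, j in combinations(range(n), 2):
        weight = sum(a * b for a, b in zip(genotypes[i], genotypes[j]))
        kernel = centered[i] * centered[j]
        total += weight * kernel
    return 2.0 * total


# Toy data: three subjects, two variants coded as minor-allele counts.
phenotypes = [1.0, 2.0, 6.0]
genotypes = [[0, 1], [1, 1], [2, 0]]
u_value = weighted_u_statistic(phenotypes, genotypes)
print(u_value)
```

A large positive value indicates that genetically similar pairs also tend to have similar phenotypes. In practice a statistic of this kind would be compared against its null distribution (for instance by permuting phenotypes) to obtain a p-value, a step this sketch omits.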