TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology

Chevalier, Alexis, Ghosh, Soumya, Awasthi, Urvi, Watkins, James, Bieniewska, Julia, Mitrea, Nichita, Kotova, Olga, Shkura, Kirill, Noble, Andrew, Steinbaugh, Michael, Delile, Julien, Meier, Christoph, Zhukov, Leonid, Khalil, Iya, Mukherjee, Srayanta, Mueller, Judith

arXiv.org Artificial Intelligence 

The complexity of cell biology and the mechanisms of disease pathogenesis are driven by an intricate regulatory network of genes [Chatterjee and Ahituv, 2017, Theodoris et al., 2015, 2021]. A better resolution of this interactome would enhance our ability to design drugs that target the causal mechanism of a disease, rather than interventions that aim only to modulate its downstream effects [Ding et al., 2022]. However, accurate inference of gene regulatory networks is challenging. The space of possible genetic interactions is vast [Bunne et al., 2024], and the networks to be inferred are highly context-dependent: different cell types and tissue types exhibit different regulatory networks and show significant variation across donors [Chen and Dahl, 2024]. Moreover, the data required to study gene regulatory networks for a specific disease is usually limited, highly specialized, and often plagued by experimental artifacts [Hicks et al., 2018]. Nevertheless, a confluence of recent technological progress promises to make this problem more tractable. Accurate single-cell sequencing technologies remove the artifacts of bulk cell data, better reflect natural variability, and provide signals at higher resolution. Together with the increasing availability of atlas-scale scRNAseq datasets that span an extensive range of diseases, cell types, tissue types, and donors, these technologies provide an unprecedented opportunity for studying disease mechanisms at scale.