HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling

van Spengler, Max; Moskalev, Artem; Mansi, Tommaso; Prakash, Mangal; Liao, Rui

arXiv.org Artificial Intelligence 

Language models are increasingly applied to biological sequences like proteins and mRNA, yet their default Euclidean geometry may mismatch the hierarchical structures inherent to biological data. While hyperbolic geometry is better suited to hierarchical data, it has yet to find its way into language modeling for mRNA sequences. In this work, we introduce HyperHELM, a framework that implements masked language model pre-training in hyperbolic space for mRNA sequences. Using a hybrid design with hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 9 out of 10 property prediction tasks, with a 10% improvement on average, and excels in out-of-distribution generalization to long and low-GC-content sequences; for antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for hierarchical language modeling of mRNA sequences.

Language models have been increasingly applied to biological sequence data, fueled by the growth of large-scale omics datasets (Lin et al., 2023; Celaj et al., 2023; Brixi et al., 2025). Biological sequences, however, are structured differently from natural language, particularly in their hierarchical organization, where nucleotides or amino acids form motifs that can be nested within larger functional groups (Buhr et al., 2016). In this work, we consider the rapidly expanding therapeutic domain of RNA, where the codon-amino acid hierarchy plays a key role in determining the biophysical properties of mRNA sequences and their expressed proteins (Clancy & Brown, 2008), and we focus on encoding this hierarchy directly into the representation space of a bio-language model by leveraging hyperbolic geometry.
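To make the geometric idea concrete, the following is a minimal sketch of how a Euclidean feature vector could be mapped into hyperbolic space and compared there. It assumes the Poincaré ball model with curvature -c, which the text above does not specify; the function names `expmap0` and `poincare_distance` and the toy vectors are hypothetical and are not taken from HyperHELM itself.

```python
import math

def expmap0(v, c=1.0):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball.

    Assumption: Poincare ball model with curvature -c (not stated in the source).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    def sq_norm(u):
        return sum(a * a for a in u)
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)

# Hypothetical toy features: a parent node near the origin, two children farther out.
parent = expmap0([0.1, 0.0])
child_a = expmap0([1.5, 0.3])
child_b = expmap0([1.5, -0.3])

d_parent_a = poincare_distance(parent, child_a)
d_a_b = poincare_distance(child_a, child_b)
```

In this model, distances grow rapidly toward the boundary of the ball, which is the property that lets tree-like structures such as a codon-amino acid hierarchy be embedded with low distortion. A hybrid design in the sense described above would apply a map like `expmap0` to the output of a Euclidean backbone before the hyperbolic layers.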
