codon
StructuredDNA: A Bio-Physical Framework for Energy-Aware Transformer Routing
The rapid scaling of large computational models has led to a critical increase in energy and compute costs. Inspired by biological systems where structure and function emerge from low-energy configurations, we introduce StructuredDNA, a sparse architecture framework for modular, energy-aware Transformer routing. StructuredDNA replaces dense Mixture-of-Experts routing with a bio-physical, energy-guided routing layer based on semantic energy minimization. Inputs are dynamically grouped into semantic codons, and routing selects a single expert by minimizing a global energy functional that combines cohesion, uncertainty, and computational cost. We validate StructuredDNA on both specialized (BioASQ) and open-domain benchmarks (WikiText-103). On BioASQ (K = 50), we achieve a 97.7% reduction in Energy Utilization Density (EUD) and a Semantic Stability Index (SSI) of 0.998. We further demonstrate a Semantic Scaling Law on WikiText-103, showing that the architecture generalizes to open domains by scaling expert granularity (K = 2048) while maintaining more than 99% energy efficiency. StructuredDNA thus establishes a robust, domain-agnostic paradigm for future sparse computational frameworks. StructuredDNA provides an explicit link between bio-physical principles and sparse expert routing in Transformer architectures, and points toward future energy-aware, modular, and scalable computational systems. We discuss limitations of this proof-of-concept study and outline directions for scaling the approach to larger models, datasets, and hardware platforms. The StructuredDNA implementation is available at https://github.com/InnoDeep-repos/StructuredDNA .
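The single-expert routing step described above can be sketched as an argmin over per-expert energies. This is a minimal illustration, not the StructuredDNA implementation: the form of each term, the weights `alpha`, `beta`, `gamma`, and the expert fields (`centroid`, `uncertainty`, `cost`) are all assumptions, with the cohesion term modeled as a squared distance to a hypothetical expert centroid.

```python
def route(x, experts, alpha=1.0, beta=1.0, gamma=1.0):
    """Return the index of the expert with the lowest energy for input x.

    Assumed energy (hypothetical, for illustration only):
        E_k = alpha * ||x - centroid_k||^2 + beta * uncertainty_k + gamma * cost_k
    """
    def energy(expert):
        # Cohesion penalty: small when x lies close to the expert's centroid.
        dist = sum((xi - ci) ** 2 for xi, ci in zip(x, expert["centroid"]))
        return alpha * dist + beta * expert["uncertainty"] + gamma * expert["cost"]

    return min(range(len(experts)), key=lambda k: energy(experts[k]))


# Two toy experts with made-up parameters.
experts = [
    {"centroid": [0.0, 0.0], "uncertainty": 0.1, "cost": 0.5},
    {"centroid": [1.0, 1.0], "uncertainty": 0.1, "cost": 0.1},
]
chosen = route([0.9, 1.0], experts)
```

Because exactly one expert is selected per input, compute per input does not grow with the number of experts K, apart from the cost of evaluating the energies themselves.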
Curriculum-Augmented GFlowNets For mRNA Sequence Generation
Laajil, Aya, Shtanchaev, Abduragim, Muhammad, Sajan, Moulines, Eric, Lahlou, Salem
Designing mRNA sequences is a major challenge in developing next-generation therapeutics, since it involves exploring a vast space of possible nucleotide combinations while optimizing sequence properties like stability, translation efficiency, and protein expression. While Generative Flow Networks are promising for this task, their training is hindered by sparse, long-horizon rewards and multi-objective trade-offs. We propose Curriculum-Augmented GFlowNets (CAGFN), which integrate curriculum learning with multi-objective GFlowNets to generate de novo mRNA sequences. We also provide a new mRNA design environment for GFlowNets which, given a target protein sequence and a combination of biological objectives, allows for the training of models that generate plausible mRNA candidates. This provides a biologically motivated setting for applying and advancing GFlowNets in therapeutic sequence design. On different mRNA design tasks, CAGFN improves Pareto performance and biological plausibility, while maintaining diversity. Moreover, CAGFN reaches higher-quality solutions faster than a GFlowNet trained with random sequence sampling (no curriculum), and enables generalization to out-of-distribution sequences.
Imagine a molecule that can be designed to instruct human cells to produce a protein of interest. Such is the promise of messenger RNA (mRNA), which has become a cornerstone of modern biotechnology (Pardi et al., 2018; Sahin et al., 2014). Designing de novo mRNA sequences that encode a target protein and achieve optimality on particular properties of interest (Gustafsson et al., 2004; Kane, 1995; Mauger et al., 2019) is therefore of growing practical importance. This task can be framed as generating long, structured sequences under multiple, often competing objectives, which makes search and optimization challenging (Keeney & Raiffa, 1993; Zhang et al., 2023; Angermueller et al., 2020).
Because biological targets are diverse and downstream outcomes are difficult to predict, diversity is a central design criterion (Mullis et al., 2019). This need is amplified by the limited predictive power of inexpensive screening methods, such as in-silico simulations or in vitro assays.
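As a rough illustration of a design environment of the kind described above, the sketch below builds an mRNA one codon at a time, restricting each action to codons that encode the next amino acid of the target protein. The class name, its methods, the partial codon table, and the GC-content objective are all hypothetical choices for this sketch and are not taken from the CAGFN environment.

```python
# Partial synonymous-codon table (standard genetic code, RNA alphabet).
SYNONYMS = {"M": ["AUG"], "K": ["AAA", "AAG"], "E": ["GAA", "GAG"]}


class MRNADesignEnv:
    """Toy environment: each action picks the codon for the next amino acid."""

    def __init__(self, protein):
        self.protein = protein
        self.codons = []

    def done(self):
        return len(self.codons) == len(self.protein)

    def actions(self):
        """Codons allowed at the current position (empty once finished)."""
        if self.done():
            return []
        return SYNONYMS[self.protein[len(self.codons)]]

    def step(self, codon):
        if codon not in self.actions():
            raise ValueError(f"{codon} does not encode the next amino acid")
        self.codons.append(codon)

    def sequence(self):
        return "".join(self.codons)


def gc_content(seq):
    """Fraction of G and C bases; one example of a scalar objective."""
    return sum(base in "GC" for base in seq) / len(seq)


env = MRNADesignEnv("MKE")
for codon in ["AUG", "AAG", "GAG"]:
    env.step(codon)
```

Restricting actions to synonymous codons means every terminal sequence encodes the target protein by construction, so the objectives only need to rank candidates rather than check validity.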
Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences
Jiang, Shiyu, Liu, Xuyin, Wang, Zitong Jerry
Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations in synthetic expression cassettes with little evolutionary precedent. Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.
Equi-mRNA: Protein Translation Equivariant Encoding for mRNA Language Models
Yazdani-Jahromi, Mehdi, Yalabadi, Ali Khodabandeh, Garibay, Ozlem Ozmen
The growing importance of mRNA therapeutics and synthetic biology highlights the need for models that capture the latent structure of synonymous codon (different triplets encoding the same amino acid) usage, which subtly modulates translation efficiency and gene expression. While recent efforts incorporate codon-level inductive biases through auxiliary objectives, they often fall short of explicitly modeling the structured relationships that arise from the genetic code's inherent symmetries. We introduce Equi-mRNA, the first codon-level equivariant mRNA language model that explicitly encodes synonymous codon symmetries as cyclic subgroups of the 2D special orthogonal group SO(2). By combining group-theoretic priors with an auxiliary equivariance loss and symmetry-aware pooling, Equi-mRNA learns biologically grounded representations that outperform vanilla baselines across multiple axes. On downstream property-prediction tasks, including expression, stability, and riboswitch switching, Equi-mRNA delivers up to approximately 10% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to approximately 4x more realistic under Fréchet BioDistance metrics and approximately 28% better at preserving functional properties compared to a vanilla baseline. Interpretability analyses further reveal that learned codon-rotation distributions recapitulate known GC-content biases and tRNA abundance patterns, offering novel insights into codon usage. Equi-mRNA establishes a new biologically principled paradigm for mRNA modeling, with significant implications for the design of next-generation therapeutics.
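To make the group-theoretic idea concrete, the following minimal sketch maps the k-th of n synonymous codons for an amino acid to the rotation by 2*pi*k/n, so that each synonymous set corresponds to a cyclic subgroup of SO(2). The ordering of codons within each set and the function names are assumptions made for illustration; this is not the Equi-mRNA implementation.

```python
import math

# Synonymous codons for two amino acids (standard genetic code).
# The ordering within each list is an arbitrary choice for this sketch.
SYNONYMS = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Lys": ["AAA", "AAG"],
}


def rotation(theta):
    """Return the 2x2 SO(2) rotation matrix for angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def codon_rotation(amino_acid, codon):
    """Map the k-th of n synonymous codons to a rotation by 2*pi*k/n."""
    codons = SYNONYMS[amino_acid]
    k = codons.index(codon)
    return rotation(2 * math.pi * k / len(codons))


def matmul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]
```

Under this assignment, composing the generator rotation n times returns the identity, which is the defining property of a cyclic group of order n.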
A New Deep-learning-Based Approach For mRNA Optimization: High Fidelity, Computation Efficiency, and Multiple Optimization Factors
Gong, Zheng, Jiang, Ziyi, Gao, Weihao, Zhuo, Deng, Ma, Lan
mRNA optimization is critical for therapeutic and biotechnological applications, since sequence features directly govern protein expression levels and efficacy. However, current methods face significant challenges in simultaneously achieving three key objectives: (1) fidelity (preventing unintended amino acid changes), (2) computational efficiency (speed and scalability), and (3) the scope of optimization variables considered (multi-objective capability). Furthermore, existing methods often fall short of comprehensively incorporating the factors related to the mRNA lifecycle and translation process, including intrinsic mRNA sequence properties, secondary structure, translation elongation kinetics, and tRNA availability. To address these limitations, we introduce RNop, a novel deep learning-based method for mRNA optimization. We collect a large-scale dataset containing over 3 million sequences and design four specialized loss functions (GPLoss, CAILoss, tAILoss, and MFELoss) that simultaneously enable explicit control over sequence fidelity while optimizing species-specific codon adaptation, tRNA availability, and desirable mRNA secondary structure features. We then demonstrate RNop's effectiveness through extensive in silico and in vivo experiments. RNop ensures high sequence fidelity, achieves a computational throughput of up to 47.32 sequences/s, and yields optimized mRNA sequences that significantly increase protein expression for functional proteins compared to controls. RNop surpasses current methodologies in both quantitative metrics and experimental validation, pointing toward more efficient and effective mRNA design. Code and models will be available at https://github.com/HudenJear/RPLoss.
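Two of the quantities named above, fidelity and codon adaptation, can be illustrated with the standard Codon Adaptation Index (CAI): the geometric mean over codons of w = count(codon) / max count among its synonymous codons. The sketch below is not RNop's CAILoss or GPLoss; the usage counts are made-up numbers, and the helper names are hypothetical. It assumes sequence lengths that are multiples of three.

```python
import math

# Hypothetical codon usage counts for a reference gene set (not real data).
USAGE = {
    "Lys": {"AAA": 30, "AAG": 70},
    "Glu": {"GAA": 40, "GAG": 60},
}
CODON_TO_AA = {codon: aa for aa, table in USAGE.items() for codon in table}


def split_codons(seq):
    """Split a coding sequence into non-overlapping triplets."""
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq):
    """Amino acid sequence encoded by seq (restricted to the toy table)."""
    return [CODON_TO_AA[codon] for codon in split_codons(seq)]


def cai(seq):
    """Geometric mean of relative adaptiveness w = count / max synonymous count."""
    logs = []
    for codon in split_codons(seq):
        table = USAGE[CODON_TO_AA[codon]]
        logs.append(math.log(table[codon] / max(table.values())))
    return math.exp(sum(logs) / len(logs))


original = "AAAGAA"
optimized = "AAGGAG"
same_protein = translate(original) == translate(optimized)  # fidelity check
```

Here the optimized sequence uses only the most frequent synonymous codons, so its CAI is 1 while the encoded amino acids are unchanged.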
The Misinterpretable Evidence Conveyed by Arbitrary Codes
This essay explores the possibility of making use of Evidence Theory (ET) [51] in order to represent communication between and within living organisms ranging from humans to bacteria. ET, also known as "Dempster-Shafer Theory" or "Belief Functions Theory," is a mathematical theory of uncertain reasoning that takes as prototypical situation a judge evaluating testimonies, or a detective examining cues, rather than a gambler playing dice [48] [52]. This marks a sharp difference with Probability Theory (PT) because, albeit fundamental constructs such as Bayes' Theorem can be obtained from the corresponding expressions of ET as special cases, gamblers know the faces of a die or the numbers on a roulette -- they assume to live in a closed world -- whereas judges and detectives are aware that unexpected clues and testimonies may open up novel possibilities -- they are aware of living in an open world [23]. I submit that ET is more appropriate than PT to represent information transmission through arbitrary codes that multiply the generation of novelties. Furthermore, its paradigmatic situation of judges listening to testimonies is structurally similar to information communication, whereas the paradigmatic situation of gamblers playing games of chance is not [52].
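For readers unfamiliar with Evidence Theory, the judge-and-testimonies situation can be made concrete with Dempster's rule of combination, which merges two mass functions by multiplying masses of intersecting hypotheses and renormalizing by the non-conflicting mass. The sketch below is a generic illustration of that rule; the two "witness" mass functions and their numbers are invented for the example and do not come from the essay.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps frozenset hypotheses to masses that sum to 1.
    """
    joint = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            a = b & c
            if a:
                joint[a] = joint.get(a, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    return {a: mass / (1.0 - conflict) for a, mass in joint.items()}


# Frame of discernment and two made-up testimonies.
FRAME = frozenset({"guilty", "innocent"})
witness_1 = {frozenset({"guilty"}): 0.6, FRAME: 0.4}
witness_2 = {frozenset({"innocent"}): 0.3, FRAME: 0.7}
combined = combine(witness_1, witness_2)
```

The mass left on the whole frame expresses belief that is not committed to any specific hypothesis, something a single probability distribution over the frame cannot represent.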
Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification
Liu, Zicheng, Li, Siyuan, Chen, Zhiyuan, Xin, Lei, Wu, Fang, Yu, Chang, Yang, Qirong, Guo, Yucheng, Yang, Yujie, Li, Stan Z.
The interactions between DNA, RNA, and proteins are fundamental to biological processes, as illustrated by the central dogma of molecular biology. While modern biological pre-trained models have achieved great success in analyzing these macromolecules individually, their interconnected nature remains under-explored. In this paper, we follow the guidance of the central dogma to redesign both the data and model pipeline and offer a comprehensive framework, Life-Code, that spans different biological functions. As for data flow, we propose a unified pipeline to integrate multi-omics data by reverse-transcribing RNA and reverse-translating amino acids into nucleotide-based sequences. As for the model, we design a codon tokenizer and a hybrid long-sequence architecture to encode the interactions of both coding and non-coding regions with masked modeling pre-training. To model the translation and folding process with coding sequences, Life-Code learns protein structures of the corresponding amino acids by knowledge distillation from off-the-shelf protein language models. Such designs enable Life-Code to capture complex interactions within genetic sequences, providing a more comprehensive understanding of multi-omics with the central dogma. Extensive experiments show that Life-Code achieves state-of-the-art performance on various tasks across three omics, highlighting its potential for advancing multi-omics analysis and interpretation.
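The data-unification idea above can be sketched in a few lines: RNA is mapped into the DNA alphabet, amino acids are mapped back to codons, and the resulting nucleotide string is split into codon tokens. This is a simplified illustration under stated assumptions: RNA is mapped to the coding-strand alphabet (U to T) rather than to its complement, each amino acid is assigned a single representative codon from a small hypothetical table, and none of the function names come from Life-Code.

```python
# Hypothetical choice of one representative codon per amino acid (partial table).
REPRESENTATIVE_CODON = {"M": "ATG", "K": "AAA", "E": "GAA", "*": "TAA"}


def reverse_transcribe(rna):
    """Map an RNA string into the DNA alphabet (coding-strand convention)."""
    return rna.upper().replace("U", "T")


def reverse_translate(protein):
    """Map an amino acid string to one nucleotide sequence that encodes it."""
    return "".join(REPRESENTATIVE_CODON[aa] for aa in protein)


def codon_tokenize(dna):
    """Split into non-overlapping triplets; a trailing remainder stays a token."""
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


rna_tokens = codon_tokenize(reverse_transcribe("AUGAAAGAAUAA"))
protein_tokens = codon_tokenize(reverse_translate("MKE*"))
```

With both inputs expressed as codon tokens over one alphabet, a single model vocabulary can cover sequences that originated as DNA, RNA, or protein.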
HELM: Hierarchical Encoding for mRNA Language Modeling
Yazdani-Jahromi, Mehdi, Prakash, Mangal, Mansi, Tommaso, Moskalev, Artem, Liao, Rui
Messenger RNA (mRNA) plays a crucial role in protein synthesis, with its codon structure directly impacting biological properties. While Language Models (LMs) have shown promise in analyzing biological sequences, existing approaches fail to account for the hierarchical nature of mRNA's codon structure. We introduce Hierarchical Encoding for mRNA Language Modeling (HELM), a novel pre-training strategy that incorporates codon-level hierarchical structure into language model training. HELM modulates the loss function based on codon synonymity, aligning the model's learning process with the biological reality of mRNA sequences. We evaluate HELM on diverse mRNA datasets and tasks, demonstrating that HELM outperforms standard language model pre-training as well as existing foundation model baselines on six diverse downstream property prediction tasks and an antibody region annotation task by around 8% on average. Additionally, HELM enhances the generative capabilities of the language model, producing diverse mRNA sequences that better align with the underlying true data distribution compared to non-hierarchical baselines.
RNA analysis is becoming increasingly important in molecular biology (Liu et al., 2023; Fu, 2014). Messenger RNA (mRNA) is of particular interest due to its unique role in protein synthesis (Sahin et al., 2014). Language Models (LMs) have emerged as powerful tools for analyzing biological sequences, with notable successes in protein (Elnaggar et al., 2021; Ferruz et al., 2022; Lin et al., 2023; Hie et al., 2024) and DNA (Nguyen et al., 2024a; Zhou et al., 2023) research. Despite the importance of mRNA, the field still lacks specialized LMs tailored for its analysis. Existing RNA LMs (Li et al., 2023; Chen et al., 2023) focus on non-coding sequences and do not properly account for codon hierarchy (Figure 1, right), which, as we demonstrate, falls short when dealing with mRNA tasks.
In this work, we aim to address this gap in mRNA language modeling by focusing specifically on the unique challenges presented by mRNA sequences. To address the limitations of existing bio-language modeling methods, we introduce Hierarchical Encoding for mRNA Language Modeling (HELM), a novel pre-training strategy for mRNA sequences. The tree diagram illustrates the codon hierarchy used in the HELM approach, categorizing codons into Start, Coding (grouped by amino acids), and Stop. This hierarchy informs the loss calculation.
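One simple way to make a loss depend on codon synonymity is to blend a codon-level negative log-likelihood with an amino-acid-level one, where the amino-acid probability is the summed probability of all its synonymous codons. The sketch below illustrates that general idea only; it is not HELM's loss, and the blending weight, the partial codon table, and the function name are assumptions.

```python
import math

# Partial codon-to-amino-acid map (standard genetic code).
CODON_TO_AA = {"AAA": "Lys", "AAG": "Lys", "GAA": "Glu", "GAG": "Glu"}


def hierarchical_nll(probs, target, weight=0.5):
    """Blend amino-acid-level and codon-level negative log-likelihoods.

    probs maps each codon to its predicted probability; target is the true codon.
    """
    amino_acid = CODON_TO_AA[target]
    p_codon = probs[target]
    # Probability of the correct amino acid: sum over its synonymous codons.
    p_aa = sum(p for codon, p in probs.items() if CODON_TO_AA[codon] == amino_acid)
    return -(weight * math.log(p_aa) + (1 - weight) * math.log(p_codon))


# Two made-up predictions for the true codon AAA (Lys).
synonymous_miss = {"AAA": 0.1, "AAG": 0.7, "GAA": 0.1, "GAG": 0.1}
nonsynonymous_miss = {"AAA": 0.1, "AAG": 0.1, "GAA": 0.7, "GAG": 0.1}
loss_syn = hierarchical_nll(synonymous_miss, "AAA")
loss_non = hierarchical_nll(nonsynonymous_miss, "AAA")
```

Both predictions assign the same probability to the true codon, but the one that favors a synonymous codon receives the lower loss, since it still predicts the correct amino acid.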
Self-Replicating Mechanical Universal Turing Machine
This paper presents the implementation of a self-replicating finite-state machine (FSM) and a self-replicating Turing Machine (TM) using bio-inspired mechanisms. Building on previous work that introduced self-replicating structures capable of sorting, copying, and reading information, this study demonstrates the computational power of these mechanisms by explicitly constructing a functioning FSM and TM. This study demonstrates the universality of the system by emulating the UTM(5,5) of Neary and Woods.