DNA Foundation Models
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
Modeling genomic sequences faces two unsolved challenges: information density varies widely across different regions, and there is no clearly defined minimum vocabulary unit. Relying on either the four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. Architecturally, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token-merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
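To make the token-merging idea concrete, below is a minimal, illustrative sketch (PyTorch) of one local-window merging step that averages the most similar adjacent token pairs. It is not the authors' MergeDNA block: the real module is differentiable and stacked hierarchically, whereas this sketch uses a hard merge, and `merge_ratio` is a placeholder hyperparameter.

```python
# Illustrative sketch of a local-window token-merging step; not MergeDNA's exact block.
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(x: torch.Tensor, merge_ratio: float = 0.5) -> torch.Tensor:
    """Merge the most similar adjacent token pairs by averaging.

    x: (batch, seq_len, dim) base- or word-level embeddings.
    merge_ratio: fraction of adjacent pairs to merge in this step (placeholder).
    """
    b, n, d = x.shape
    # Cosine similarity between each token and its right neighbour
    # (a 2-wide local window; wider windows are possible).
    left, right = x[:, :-1], x[:, 1:]
    sim = F.cosine_similarity(left, right, dim=-1)            # (b, n-1)
    num_merge = int((n - 1) * merge_ratio)
    idx = sim.topk(num_merge, dim=-1).indices                 # most similar pairs per sequence

    merged = []
    for bi in range(b):
        keep = torch.ones(n, dtype=torch.bool)
        toks = x[bi].clone()
        for i in idx[bi].sort().values.tolist():
            if keep[i] and keep[i + 1]:
                toks[i] = 0.5 * (toks[i] + toks[i + 1])       # average the pair
                keep[i + 1] = False                            # drop the absorbed token
        merged.append(toks[keep])
    # Sequences may now differ in length; pad back to a common length.
    return torch.nn.utils.rnn.pad_sequence(merged, batch_first=True)
```

Stacking several such steps would coarsen individual bases into progressively longer word-level units, which is the hierarchical behaviour the abstract describes.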
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
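As a rough illustration of the DNA-LLM coupling described above, the sketch below projects embeddings from a frozen DNA encoder into an LLM's embedding space and prepends them to the text prompt as soft tokens. The class name and wiring are assumptions for illustration, not BioReason's actual implementation.

```python
# Illustrative DNA-to-LLM bridge via a learned projection; not BioReason's exact architecture.
import torch
import torch.nn as nn

class DNALLMBridge(nn.Module):
    def __init__(self, dna_dim: int, llm_dim: int):
        super().__init__()
        # Maps DNA token embeddings into the LLM's embedding space.
        self.proj = nn.Linear(dna_dim, llm_dim)

    def forward(self, dna_embeddings: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        """dna_embeddings: (batch, dna_len, dna_dim) from a frozen DNA encoder.
        text_embeddings: (batch, txt_len, llm_dim) from the LLM's input embedding table.
        Returns the concatenated soft-token sequence fed to the LLM's transformer stack.
        """
        dna_as_tokens = self.proj(dna_embeddings)              # (batch, dna_len, llm_dim)
        return torch.cat([dna_as_tokens, text_embeddings], dim=1)
```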
Beating Transformers using Synthetic Cognition
Alfredo Ibias, Miguel Rodriguez-Galindo, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart
The road to Artificial General Intelligence goes through the generation of context-aware reactive behaviors, where the Transformer architecture has proven to be the state of the art. However, Transformers still fail to develop reasoning. Recently, a novel approach to developing cognitive architectures, called Synthetic Cognition, was proposed and implemented to produce instantaneous reactive behavior. In this study, we explore the use of Synthetic Cognition to develop context-aware reactive behaviors. We propose a mechanism for handling sequences in the recent implementation of Synthetic Cognition and test it against DNA foundation models on DNA sequence classification tasks. In our experiments, our proposal clearly outperforms the DNA foundation models, obtaining the best score on more benchmark tasks than the alternatives. We thus achieve two goals: extending Synthetic Cognition to handle sequences, and beating the Transformer architecture on sequence classification.
Enhancing DNA Foundation Models to Address Masking Inefficiencies
Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor
Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] token is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work relevant only to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.
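The masked-autoencoder idea behind this design can be sketched as follows: the encoder processes only the visible tokens (so no [MASK] tokens ever reach it), and a lightweight decoder fills in learned mask embeddings to reconstruct the hidden positions. This is a generic PyTorch illustration with placeholder sizes, not the authors' exact BERT-based architecture.

```python
# Minimal sketch of a masked-autoencoder for token sequences; sizes are placeholders.
import torch
import torch.nn as nn

class SeqMAE(nn.Module):
    def __init__(self, vocab_size=8, dim=128, dec_dim=64, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, vocab_size)

    def forward(self, tokens, mask_ratio=0.3):
        b, n = tokens.shape
        x = self.embed(tokens) + self.pos[:, :n]
        # Randomly choose positions to hide; the encoder only processes the rest.
        num_keep = int(n * (1 - mask_ratio))
        perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)
        keep_idx = perm[:, :num_keep]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        enc = self.encoder(visible)                            # no [MASK] tokens inside
        # Decoder input: encoded visible tokens scattered back, mask embeddings elsewhere.
        dec_in = self.mask_token.expand(b, n, -1).clone()
        dec_in.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)),
                        self.enc_to_dec(enc))
        # Reconstruction logits over the vocabulary; the loss would typically be
        # computed only on the hidden (non-kept) positions.
        return self.head(self.decoder(dec_in))                 # (b, n, vocab_size)
```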
Efficient and Scalable Fine-Tuning of Language Models for Genome Understanding
Huixin Zhan, Ying Nian Wu, Zijun Zhang
Although DNA foundation models have advanced the understanding of genomes, they still face significant challenges in the limited scale and diversity of genomic data. This limitation contrasts starkly with the success of natural language foundation models, which thrive at substantially larger scales. Furthermore, genome understanding involves numerous downstream genome annotation tasks with inherent data heterogeneity, necessitating more efficient and robust fine-tuning methods tailored for genomics. To address this, this work introduces Lingo, which adapts language models to genome understanding via genomic-specific adapters. Lingo accommodates numerous, heterogeneous downstream fine-tuning tasks through an adaptive rank sampling method that prunes and stochastically reintroduces pruned singular vectors within small computational budgets. Adaptive rank sampling outperformed existing fine-tuning methods on all 14 benchmarked genome understanding tasks, while requiring fewer than 2% of trainable parameters as genomic-specific adapters. Impressively, applying these adapters to natural language foundation models matched or even exceeded the performance of DNA foundation models. Lingo presents a new paradigm of efficient and scalable genome understanding via genomic-specific adapters on language models.

DNA foundation models, such as DNABERT [1], DNABERT-2 [2], and Nucleotide Transformer (NT) [3], have made significant progress in decoding the linguistic intricacies of the genome. An important paradigm for utilizing such DNA foundation models is "pre-training + fine-tuning", i.e., pre-training on unlabeled genomic sequences, followed by adaptation to a particular genome understanding task. A critical aspect of genome annotation and downstream tasks is their considerable number and diversity. For example, state-of-the-art deep learning models in epigenetics alone can encompass nearly 22,000 individual tasks [4].
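A rough sketch of what an adaptive-rank-sampling adapter could look like is given below: a LoRA-style low-rank adapter whose rank components are scored, pruned, and occasionally reintroduced at random. The scoring rule, gating mechanism, and names here are placeholders for illustration, not Lingo's actual procedure.

```python
# Illustrative LoRA-style adapter with rank pruning and stochastic reintroduction;
# the scoring rule and gating are placeholders, not Lingo's exact method.
import torch
import torch.nn as nn

class RankSampledAdapter(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))          # up-projection
        self.register_buffer("gate", torch.ones(rank))             # active rank components

    @torch.no_grad()
    def resample_ranks(self, keep: int, reintro_prob: float = 0.1):
        # Importance proxy: magnitude of each rank component's contribution.
        score = self.A.norm(dim=1) * self.B.norm(dim=0)            # (rank,)
        active = torch.zeros_like(self.gate)
        active[score.topk(keep).indices] = 1.0                     # prune low-score ranks
        # Stochastically reintroduce a few pruned components.
        revive = (torch.rand_like(active) < reintro_prob) & (active == 0)
        self.gate.copy_(active + revive.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gating zeroes pruned rank directions without changing parameter shapes,
        # so pruned components can later be revived; the result would be added to
        # the frozen base layer's output elsewhere.
        return (x @ self.A.t()) * self.gate @ self.B.t()
```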