DNA Foundation Models
MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
Modeling genomic sequences faces two unsolved challenges: information density varies widely across different regions, and there is no clearly defined minimum vocabulary unit. Relying on either the four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. Architecturally, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of differentiable token-merging blocks with local-window constraints; a Latent Encoder then captures the global context of these merged words with full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative content. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
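To make the token-merging idea concrete, below is a minimal, illustrative sketch (PyTorch) of one local-window merging step that averages the most similar adjacent token pairs. It is not the authors' MergeDNA block: the real module is differentiable and stacked hierarchically, whereas this sketch uses a hard merge, and `merge_ratio` is a placeholder hyperparameter.

```python
# Illustrative sketch of a local-window token-merging step; not MergeDNA's exact block.
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(x: torch.Tensor, merge_ratio: float = 0.5) -> torch.Tensor:
    """Merge the most similar adjacent token pairs by averaging.

    x: (batch, seq_len, dim) base- or word-level embeddings.
    merge_ratio: fraction of adjacent pairs to merge in this step (placeholder).
    """
    b, n, d = x.shape
    # Cosine similarity between each token and its right neighbour
    # (a 2-wide local window; wider windows are possible).
    left, right = x[:, :-1], x[:, 1:]
    sim = F.cosine_similarity(left, right, dim=-1)            # (b, n-1)
    num_merge = int((n - 1) * merge_ratio)
    idx = sim.topk(num_merge, dim=-1).indices                 # most similar pairs per sequence

    merged = []
    for bi in range(b):
        keep = torch.ones(n, dtype=torch.bool)
        toks = x[bi].clone()
        for i in idx[bi].sort().values.tolist():
            if keep[i] and keep[i + 1]:
                toks[i] = 0.5 * (toks[i] + toks[i + 1])       # average the pair
                keep[i + 1] = False                            # drop the absorbed token
        merged.append(toks[keep])
    # Sequences may now differ in length; pad back to a common length.
    return torch.nn.utils.rnn.pad_sequence(merged, batch_first=True)
```

Stacking several such steps would coarsen individual bases into progressively longer word-level units, which is the hierarchical behaviour the abstract describes.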
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
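As a rough illustration of the DNA-LLM coupling described above, the sketch below projects embeddings from a frozen DNA encoder into an LLM's embedding space and prepends them to the text prompt as soft tokens. The class name and wiring are assumptions for illustration, not BioReason's actual implementation.

```python
# Illustrative DNA-to-LLM bridge via a learned projection; not BioReason's exact architecture.
import torch
import torch.nn as nn

class DNALLMBridge(nn.Module):
    def __init__(self, dna_dim: int, llm_dim: int):
        super().__init__()
        # Maps DNA token embeddings into the LLM's embedding space.
        self.proj = nn.Linear(dna_dim, llm_dim)

    def forward(self, dna_embeddings: torch.Tensor,
                text_embeddings: torch.Tensor) -> torch.Tensor:
        """dna_embeddings: (batch, dna_len, dna_dim) from a frozen DNA encoder.
        text_embeddings: (batch, txt_len, llm_dim) from the LLM's input embedding table.
        Returns the concatenated soft-token sequence fed to the LLM's transformer stack.
        """
        dna_as_tokens = self.proj(dna_embeddings)              # (batch, dna_len, llm_dim)
        return torch.cat([dna_as_tokens, text_embeddings], dim=1)
```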
Beating Transformers using Synthetic Cognition
Alfredo Ibias, Miguel Rodriguez-Galindo, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart
The road to Artificial General Intelligence goes through the generation of context-aware reactive behaviors, where the Transformer architecture has proven to be the state of the art. However, Transformers still fail to develop reasoning. Recently, a novel approach to developing cognitive architectures, called Synthetic Cognition, was proposed and implemented to produce instantaneous reactive behavior. In this study, we explore the use of Synthetic Cognition to develop context-aware reactive behaviors. We propose a mechanism for handling sequences in the recent implementation of Synthetic Cognition and test it against DNA foundation models on DNA sequence classification tasks. In our experiments, our proposal clearly outperforms the DNA foundation models, obtaining the best score on more benchmark tasks than the alternatives. We thus achieve two goals: extending Synthetic Cognition to handle sequences, and beating the Transformer architecture on sequence classification.
Enhancing DNA Foundation Models to Address Masking Inefficiencies
Monireh Safari, Pablo Millan Arias, Scott C. Lowe, Lila Kari, Angel X. Chang, Graham W. Taylor
Masked language modelling (MLM) as a pretraining objective has been widely adopted in genomic sequence modelling. While pretrained models can successfully serve as encoders for various downstream tasks, the distribution shift between pretraining and inference detrimentally impacts performance, as the pretraining task is to map [MASK] tokens to predictions, yet the [MASK] token is absent during downstream applications. This means the encoder does not prioritize its encodings of non-[MASK] tokens, and expends parameters and compute on work relevant only to the MLM task, despite this being irrelevant at deployment time. In this work, we propose a modified encoder-decoder architecture based on the masked autoencoder framework, designed to address this inefficiency within a BERT-based transformer. We empirically show that the resulting mismatch is particularly detrimental in genomic pipelines where models are often used for feature extraction without fine-tuning. We evaluate our approach on the BIOSCAN-5M dataset, comprising over 2 million unique DNA barcodes. We achieve substantial performance gains in both closed-world and open-world classification tasks when compared against causal models and bidirectional architectures pretrained with MLM tasks.
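The masked-autoencoder idea behind this design can be sketched as follows: the encoder processes only the visible tokens (so no [MASK] tokens ever reach it), and a lightweight decoder fills in learned mask embeddings to reconstruct the hidden positions. This is a generic PyTorch illustration with placeholder sizes, not the authors' exact BERT-based architecture.

```python
# Minimal sketch of a masked-autoencoder for token sequences; sizes are placeholders.
import torch
import torch.nn as nn

class SeqMAE(nn.Module):
    def __init__(self, vocab_size=8, dim=128, dec_dim=64, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, vocab_size)

    def forward(self, tokens, mask_ratio=0.3):
        b, n = tokens.shape
        x = self.embed(tokens) + self.pos[:, :n]
        # Randomly choose positions to hide; the encoder only processes the rest.
        num_keep = int(n * (1 - mask_ratio))
        perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)
        keep_idx = perm[:, :num_keep]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        enc = self.encoder(visible)                            # no [MASK] tokens inside
        # Decoder input: encoded visible tokens scattered back, mask embeddings elsewhere.
        dec_in = self.mask_token.expand(b, n, -1).clone()
        dec_in.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)),
                        self.enc_to_dec(enc))
        # Reconstruction logits over the vocabulary; the loss would typically be
        # computed only on the hidden (non-kept) positions.
        return self.head(self.decoder(dec_in))                 # (b, n, vocab_size)
```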
Efficient and Scalable Fine-Tuning of Language Models for Genome Understanding
Huixin Zhan, Ying Nian Wu, Zijun Zhang
Although DNA foundation models have advanced the understanding of genomes, they still face significant challenges in the limited scale and diversity of genomic data. This limitation contrasts starkly with the success of natural language foundation models, which thrive at substantially larger scales. Furthermore, genome understanding involves numerous downstream genome annotation tasks with inherent data heterogeneity, necessitating more efficient and robust fine-tuning methods tailored for genomics. To address this, this work introduces Lingo, which adapts language models to genome understanding via genomic-specific adapters. Lingo accommodates numerous, heterogeneous downstream fine-tuning tasks through an adaptive rank sampling method that prunes and stochastically reintroduces pruned singular vectors within small computational budgets. Adaptive rank sampling outperformed existing fine-tuning methods on all 14 benchmarked genome understanding tasks, while requiring fewer than 2% of trainable parameters as genomic-specific adapters. Impressively, applying these adapters to natural language foundation models matched or even exceeded the performance of DNA foundation models. Lingo presents a new paradigm of efficient and scalable genome understanding via genomic-specific adapters on language models.

DNA foundation models, such as DNABERT [1], DNABERT-2 [2], and Nucleotide Transformer (NT) [3], have made significant progress in decoding the linguistic intricacies of the genome. An important paradigm for utilizing such DNA foundation models is "pre-training + fine-tuning", i.e., pre-training on unlabeled genomic sequences, followed by adaptation to a particular genome understanding task. A critical aspect of genome annotation and downstream tasks is their considerable number and diversity. For example, state-of-the-art deep learning models in epigenetics alone can encompass nearly 22,000 individual tasks [4].
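A rough sketch of what an adaptive-rank-sampling adapter could look like is given below: a LoRA-style low-rank adapter whose rank components are scored, pruned, and occasionally reintroduced at random. The scoring rule, gating mechanism, and names here are placeholders for illustration, not Lingo's actual procedure.

```python
# Illustrative LoRA-style adapter with rank pruning and stochastic reintroduction;
# the scoring rule and gating are placeholders, not Lingo's exact method.
import torch
import torch.nn as nn

class RankSampledAdapter(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, rank: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))          # up-projection
        self.register_buffer("gate", torch.ones(rank))             # active rank components

    @torch.no_grad()
    def resample_ranks(self, keep: int, reintro_prob: float = 0.1):
        # Importance proxy: magnitude of each rank component's contribution.
        score = self.A.norm(dim=1) * self.B.norm(dim=0)            # (rank,)
        active = torch.zeros_like(self.gate)
        active[score.topk(keep).indices] = 1.0                     # prune low-score ranks
        # Stochastically reintroduce a few pruned components.
        revive = (torch.rand_like(active) < reintro_prob) & (active == 0)
        self.gate.copy_(active + revive.float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gating zeroes pruned rank directions without changing parameter shapes,
        # so pruned components can later be revived; the result would be added to
        # the frozen base layer's output elsewhere.
        return (x @ self.A.t()) * self.gate @ self.B.t()
```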