Goto

Collaborating Authors

RNA


Multi-modal Transfer Learning between Biological Foundation Models

Neural Information Processing Systems

Modeling these sequences is key to understanding disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks, but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e.
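The fusion idea described in this abstract — frozen, modality-specific encoders feeding a shared joint representation — can be sketched as follows. The `encode` stand-in, the dimensions, and the projection matrices are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Hedged sketch: each frozen modality-specific encoder produces an
# embedding; a per-modality projection maps it into a shared space,
# and the projections are concatenated into one fused representation.
rng = np.random.default_rng(0)

def encode(seq, dim=8):
    # Stand-in for a frozen pre-trained encoder: a deterministic
    # pseudo-embedding derived from sequence content.
    h = np.zeros(dim)
    for i, ch in enumerate(seq):
        h[i % dim] += ord(ch)
    return h / max(len(seq), 1)

def fuse(dna, rna, protein, proj):
    # Project each modality's embedding separately, then concatenate.
    parts = [proj[m] @ encode(s)
             for m, s in (("dna", dna), ("rna", rna), ("protein", protein))]
    return np.concatenate(parts)

proj = {m: rng.standard_normal((4, 8)) for m in ("dna", "rna", "protein")}
z = fuse("ATGCATGC", "AUGCAUGC", "MKVLT", proj)
print(z.shape)  # (12,) -- three modalities, four fused dims each
```

A downstream task head (e.g. isoform-expression prediction) would then operate on the fused vector `z`.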


De-extinction of the woolly mammoth takes a MAJOR step forward: Scientists extract the RNA from a creature that lived 40,000 years ago - and it could allow them to resurrect the lost species

Daily Mail - Science & tech

The world's oldest RNA - an essential nucleic acid present in all living cells - has been extracted from the extinct woolly mammoth, a new study reveals.


Oldest known RNA found in 40,000-year-old woolly mammoth leg

Popular Science

Cave lions likely killed 'Yuka' when she was around 8 years old. The 40,000-year-old juvenile woolly mammoth is remarkable not only because she was uncovered nearly intact, or for her grisly cause of death: her muscles provided paleogeneticists with the oldest known RNA sequences ever recovered. Detailed in a study published on November 14, the samples contradict previous assumptions about the genetic material's resilience while furthering our understanding of the famous extinct megafauna.


A Comparative Review of RNA Language Models

Wang, He, Zhang, Yikun, Chen, Jie, Zhan, Jian, Zhou, Yaoqi

arXiv.org Artificial Intelligence

Given the usefulness of protein language models (LMs) for structure and function inference, RNA LMs have received increasing attention in recent years. However, these RNA models are often not compared against the same standard. Here, we divided RNA LMs into three classes (those pretrained on multiple RNA types, especially noncoding RNAs; those for specific-purpose RNAs; and those that unify RNA with DNA, proteins, or both) and compared 13 RNA LMs, along with 3 DNA LMs and 1 protein LM as controls, on zero-shot prediction of RNA secondary structure and functional classification. Results show that models doing well on secondary structure prediction often perform worse on function classification, and vice versa, suggesting that more balanced unsupervised training is needed.
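For context on what zero-shot secondary-structure predictions are typically scored against, here is a minimal sketch of base-pair F1 between a predicted and a reference dot-bracket structure. The parsing helper and the toy structures are illustrative, not the review's evaluation code:

```python
# Minimal sketch: score a predicted secondary structure against a
# reference via base-pair F1. Dot-bracket notation pairs each "("
# with its matching ")"; the structures here are toy examples.
def pairs_from_dotbracket(db):
    stack, pairs = [], set()
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs

def base_pair_f1(pred_db, ref_db):
    pred, ref = pairs_from_dotbracket(pred_db), pairs_from_dotbracket(ref_db)
    tp = len(pred & ref)  # correctly predicted base pairs
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# The prediction recovers one of the reference's two base pairs:
print(round(base_pair_f1("(.....)", "((...))"), 3))  # 0.667
```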


Biological Sequence with Language Model Prompting: A Survey

Jiang, Jiyue, Wang, Zikang, Shan, Yuheng, Chai, Heyan, Li, Jiayi, Ma, Zixian, Zhang, Xinrui, Li, Yu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. Notably, recent studies have demonstrated that large language models significantly enhance the efficiency of biomolecular analysis and synthesis, attracting widespread attention from academia and medicine. In this paper, we systematically investigate the application of prompt-based methods with LLMs to biological sequences, including DNA, RNA, proteins, and drug discovery tasks. Specifically, we focus on how prompt engineering enables LLMs to tackle domain-specific problems, such as promoter sequence prediction, protein structure modeling, and drug-target binding affinity prediction, often with limited labeled data. Furthermore, our discussion highlights the transformative potential of prompting in bioinformatics while addressing key challenges such as data scarcity, multimodal fusion, and computational resource limitations. Our aim is for this paper to function both as a foundational primer for newcomers and a catalyst for continued innovation within this dynamic field of study.


Large Language Models in Bioinformatics: A Survey

Wang, Zhenyu, Wang, Zikang, Jiang, Jiyue, Chen, Pengan, Shi, Xiangyu, Li, Yu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.


Exploring Multi-Modality Dynamics: Insights and Challenges in Multimodal Fusion for Biomedical Tasks

Wenderoth, Laura

arXiv.org Artificial Intelligence

This paper investigates the MM dynamics approach proposed by Han et al. (2022) for multi-modal fusion in biomedical classification tasks. The MM dynamics algorithm integrates feature-level and modality-level informativeness to dynamically fuse modalities for improved classification performance. However, our analysis reveals several limitations and challenges in replicating and extending the results of MM dynamics. We found that feature informativeness improves performance and explainability, while modality informativeness does not provide significant advantages and can lead to performance degradation. Based on these results, we have extended feature informativeness to image data, resulting in the development of Image MM dynamics. Although this approach showed promising qualitative results, it did not outperform baseline methods quantitatively.
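The feature-level informativeness mechanism this paper found helpful can be sketched as a per-feature gate applied before fusion. A fixed sigmoid stands in for the learned informativeness scores; all values are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of feature-level informativeness: each feature is
# scaled by a gate in [0, 1] before fusion, so uninformative
# features are suppressed. Scores here are fixed, not learned.
def gate_features(x, informativeness):
    g = 1.0 / (1.0 + np.exp(-informativeness))  # sigmoid gate
    return x * g

x_rna = np.array([2.0, -1.0, 0.5])
scores = np.array([4.0, -4.0, 0.0])  # high / low / neutral informativeness
gated = gate_features(x_rna, scores)
print(gated.round(3))  # informative feature kept, uninformative one damped
```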


RNA-GPT: Multimodal Generative System for RNA Sequence Understanding

Xiao, Yijia, Sun, Edward, Jin, Yiqiao, Wang, Wei

arXiv.org Artificial Intelligence

RNAs are essential molecules that carry genetic information vital for life, with profound implications for drug development and biotechnology. Despite this importance, RNA research is often hindered by the vast literature available on the topic. To streamline this process, we introduce RNA-GPT, a multi-modal RNA chat model designed to simplify RNA discovery by leveraging extensive RNA literature. RNA-GPT integrates RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment, enabling it to process user-uploaded RNA sequences and deliver concise, accurate responses. Built on a scalable training pipeline, RNA-GPT utilizes RNA-QA, an automated system that gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to efficiently handle large datasets and generate instruction-tuning samples. Our experiments indicate that RNA-GPT effectively addresses complex RNA queries, thereby facilitating RNA research. Additionally, we present RNA-QA, a dataset of 407,616 RNA samples for modality alignment and instruction tuning, further advancing the potential of RNA research tools.
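The "linear projection" alignment step described above can be sketched as follows. The dimensions, matrices, and prefixing scheme are illustrative assumptions, not RNA-GPT's released code:

```python
import numpy as np

# Hedged sketch: a linear projection maps RNA-encoder embeddings into
# the LLM's token-embedding space, so the projected RNA "tokens" can
# be prepended to the embedded text prompt.
rng = np.random.default_rng(1)
rna_dim, llm_dim, n_rna_tokens, n_text_tokens = 16, 32, 5, 7

rna_embeddings = rng.standard_normal((n_rna_tokens, rna_dim))    # encoder out
W_proj = rng.standard_normal((rna_dim, llm_dim)) * 0.02          # projection
text_embeddings = rng.standard_normal((n_text_tokens, llm_dim))  # prompt

aligned = rna_embeddings @ W_proj                  # now in LLM space
llm_input = np.vstack([aligned, text_embeddings])  # RNA prefix + prompt
print(llm_input.shape)  # (12, 32)
```

In training, only the projection (and optionally the LLM) would be updated so the RNA prefix becomes meaningful to the language model.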


scFusionTTT: Single-cell transcriptomics and proteomics fusion with Test-Time Training layers

Meng, Dian, Xing, Bohao, Huang, Xinlei, Liu, Yanran, Zhou, Yijun, Xiao, Yongjun, Yu, Zitong, Zheng, Xubin

arXiv.org Artificial Intelligence

Single-cell multi-omics (scMulti-omics) refers to paired multimodal data, such as Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), where each cell is measured across different modalities, i.e., genes and proteins. scMulti-omics can reveal heterogeneity inside tumors and illuminate the distinct genetic properties of diverse cell types, which is crucial to targeted therapy. Currently, deep learning methods based on attention structures in the bioinformatics area face two challenges. The first challenge is the vast number of genes in a single cell: traditional attention-based modules struggle to effectively leverage all gene information due to their limited capacity for long-context learning and their high computational complexity. The second challenge is that genes in the human genome are ordered and influence each other's expression; most methods ignore this sequential information. The recently introduced Test-Time Training (TTT) layer is a novel sequence modeling approach particularly suitable for handling long contexts like genomics data, because the TTT layer is a linear-complexity sequence-modeling structure and is better suited to data with sequential relationships. In this paper, we propose scFusionTTT, a novel method for Single-Cell multimodal omics Fusion with a TTT-based masked autoencoder. Of note, we combine the order information of genes and proteins in the human genome with the TTT layer, fuse multimodal omics, and enhance unimodal omics analysis. Finally, the model employs a three-stage training strategy, which yielded the best performance across most metrics in four multimodal omics datasets and four unimodal omics datasets, demonstrating the superior performance of our model. The dataset and code will be available at https://github.com/DM0815/scFusionTTT.
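The masked-autoencoder stage over ordered gene features can be sketched as follows. The masking helper, ratio, and zero-fill convention are illustrative assumptions, not the scFusionTTT code:

```python
import random

# Hedged sketch: hide a fraction of ordered gene features; the model
# must then reconstruct the hidden values from the visible, ordered
# context. Ordering is preserved so positional relationships survive.
def mask_genes(expr, mask_ratio=0.25, seed=0):
    rng = random.Random(seed)
    masked_idx = set(rng.sample(range(len(expr)), int(len(expr) * mask_ratio)))
    visible = [0.0 if i in masked_idx else v for i, v in enumerate(expr)]
    return visible, sorted(masked_idx)

expr = [0.1, 2.3, 0.0, 1.7, 0.4, 3.1, 0.9, 0.2]  # ordered gene expressions
visible, masked = mask_genes(expr)
print(len(masked))  # 2 of 8 features hidden for reconstruction
```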