cell type annotation
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Beijing > Beijing (0.04)
A Brain Cell Type Resource Created by Large Language Models and a Multi-Agent AI System for Collaborative Community Annotation
Li, Rongbin, Chen, Wenbo, Li, Zhao, Munoz-Castaneda, Rodrigo, Li, Jinbo, Maurya, Neha S., Solanki, Arnav, He, Huan, Xing, Hanwen, Ramlakhan, Meaghan, Wise, Zachary, Johansen, Nelson, Wu, Zhuhao, Xu, Hua, Hawrylycz, Michael, Zheng, W. Jim
Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures, especially those involving poorly characterized genes, remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Texas > Harris County > Houston (0.04)
- North America > United States > New York (0.04)
- North America > United States > Connecticut > New Haven County > New Haven (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.46)
scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
Xu, Ping, Wang, Zaitian, Wang, Zhirui, Li, Pengjiang, Zhang, Ran, Li, Gaoyang, Xie, Hanyu, Wang, Jiajia, Zhou, Yuanchun, Wang, Pengfei
Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.
- Europe > Netherlands > South Holland > Leiden (0.07)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
- Asia > China > Beijing > Beijing (0.04)
ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
Jiao, Yifeng, Liu, Yuchen, Zhang, Yu, Guo, Xin, Wu, Yushuai, Jiang, Chen, Li, Jiyang, Zhang, Hongwei, Han, Limei, Gao, Xin, Qi, Yuan, Cheng, Yuan
The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.93)
Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
Kharouiche, Oussama, Markogiannakis, Aris, Fei, Xiao, Chatzianastasis, Michail, Vazirgiannis, Michalis
Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
- Europe > Switzerland > Basel-City > Basel (0.04)
- Europe > Ireland (0.04)
- North America > United States (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Beijing > Beijing (0.04)
GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation
Hu, Tianxiang, Zhou, Chenyi, Liu, Jiaxiang, Wang, Jiongxin, Chen, Ruizhe, Xia, Haoxiang, Wang, Gaoang, Wu, Jian, Liu, Zuozhu
Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. In practice, human experts often rely on the structure revealed by principal component analysis (PCA) followed by $k$-nearest neighbor ($k$-NN) graph construction to guide annotation. While effective, this process is labor-intensive and does not scale to large datasets. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. By aligning scRNA-seq profiles with natural language descriptions, models like LangCell enable zero-shot annotation. While LangCell demonstrates decent zero-shot performance, its predictions remain suboptimal, particularly in achieving consistent accuracy across all cell types. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework. By enforcing local consistency over the task-specific PCA-based $k$-NN graph, our method combines the scalability of the pre-trained models with the structural robustness relied upon in expert annotation. We evaluate our approach on 14 annotated human scRNA-seq datasets from 4 distinct studies, spanning 11 organs and over 200,000 single cells. Our method consistently improves zero-shot annotation accuracy, achieving accuracy gains of up to 10%. Further analysis showcases the mechanism by which GRIT effectively propagates correct signals through the graph, pulling back mislabeled cells toward more accurate predictions. The method is training-free, model-agnostic, and serves as a simple yet effective plug-in for enhancing automated cell type annotation in practice.
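As a rough illustration of the graph-regularization idea in the abstract above, the sketch below smooths per-cell logits over a k-NN graph by minimizing a generic Laplacian-regularized objective, ||Z - Z0||^2 + lam * (sum over graph edges of ||z_i - z_j||^2). The objective, the Jacobi-style update, the function names, and the toy data are all assumptions made for illustration and are not taken from the GRIT paper; the 2-D points stand in for precomputed PCA coordinates.

```python
import math


def knn_graph(points, k):
    """Build a symmetric k-NN graph (list of neighbor sets) from Euclidean distances."""
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # symmetrize: an edge in either direction counts
    return neighbors


def refine_logits(logits, neighbors, lam=1.0, n_iter=50):
    """Jacobi iterations for the stationarity condition of the assumed objective:
    (1 + lam * deg_i) * z_i = z0_i + lam * sum_{j in N(i)} z_j
    """
    z = [row[:] for row in logits]
    n_classes = len(logits[0])
    for _ in range(n_iter):
        updated = []
        for i, nbrs in enumerate(neighbors):
            denom = 1.0 + lam * len(nbrs)
            updated.append([
                (logits[i][c] + lam * sum(z[j][c] for j in nbrs)) / denom
                for c in range(n_classes)
            ])
        z = updated
    return z


def predict(logits):
    """Index of the largest logit for each cell."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy data (made up): two well-separated groups of four cells each.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
          (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1)]
# Cell 3 sits in the first group but its initial logits favor class 1.
initial_logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.4, 0.6],
                  [0.0, 2.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]

graph = knn_graph(points, k=2)
refined = refine_logits(initial_logits, graph, lam=1.0)

print(predict(initial_logits))
print(predict(refined))
```

In this toy case the neighbors of cell 3 all carry confident class-0 logits, so the smoothing pulls its prediction back to class 0 while the second group is left unchanged. Larger values of `lam` weight neighborhood agreement more heavily relative to the original logits.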
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Italy > Sicily (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning
Fang, Yin, Jin, Qiao, Xiong, Guangzhi, Jin, Bowen, Zhong, Xianrui, Ouyang, Siru, Zhang, Aidong, Han, Jiawei, Lu, Zhiyong
Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
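To make the difference between per-cell and batch-level scoring concrete, here is a minimal sketch. It assumes that a batch counts as correct only when every cell in it receives the right label; that definition, the function names, and the example labels are assumptions for illustration, not the benchmark's published evaluation code.

```python
def cell_level_accuracy(predictions, references):
    """Fraction of individual cells labeled correctly, pooled over all batches."""
    total = sum(len(ref) for ref in references)
    correct = sum(
        p == r
        for pred, ref in zip(predictions, references)
        for p, r in zip(pred, ref)
    )
    return correct / total


def batch_level_accuracy(predictions, references):
    """Fraction of batches in which every cell is labeled correctly (assumed definition)."""
    exact = sum(pred == ref for pred, ref in zip(predictions, references))
    return exact / len(references)


def has_unique_labels(batch):
    """True if no cell type is assigned to more than one cell in the batch."""
    return len(set(batch)) == len(batch)


# Hypothetical predictions and ground truth for three batches.
preds = [
    ["T cell", "B cell", "NK cell"],   # fully correct
    ["T cell", "B cell", "B cell"],    # one error, and a repeated label
    ["monocyte", "T cell"],            # labels swapped
]
refs = [
    ["T cell", "B cell", "NK cell"],
    ["T cell", "B cell", "NK cell"],
    ["T cell", "monocyte"],
]

print(cell_level_accuracy(preds, refs))
print(batch_level_accuracy(preds, refs))
print([has_unique_labels(batch) for batch in preds])
```

Under this assumed definition, a single wrong cell makes the whole batch count as incorrect, so batch-level accuracy can sit far below cell-level accuracy on the same predictions.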
- Europe > Netherlands > South Holland > Leiden (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
CellTypeAgent: Trustworthy cell type annotation with Large Language Models
Chen, Jiawen, Zhang, Jianghao, Yao, Huaxiu, Li, Yun
Furthermore, the subsequent verification based on average gene expression correctly identifies pericyte cells, in contrast to GPTCellType's misclassification as fibroblasts. We further assessed various factors that could influence CellTypeAgent's performance. First, we examined how the number of initial candidate cell types in the inference step affected accuracy (Figure 2c). Although performance remained relatively stable, using the top three candidates yielded slightly higher performance. We also evaluated the impact of the number of marker genes, finding that including more genes generally enhanced annotation quality (Figure 2d). Moreover, we tested CellTypeAgent's ability to handle mixtures of different cell types (Figure 2e). When explicitly prompted that multiple cell types might be present, the agent successfully identified one or more components within the mixed sample. While performance declined compared to annotating pure cell types, CellTypeAgent still demonstrated the capacity to accurately detect multiple cell types.
ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation
Han, Dezheng, Jia, Yibin, Chen, Ruxiao, Han, Wenjie, Guo, Shuaishuai, Wang, Jianbo
These authors contributed equally to this work.
Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature-marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while more closely aligning with the cognitive logic of manual annotation.
Keywords: Cell type annotation, Graph RAG, Large language models, Graph data curation, Multi-task workflow, scRNA-seq
In single-cell RNA sequencing analysis, achieving precise cell type annotation through manual labeling typically requires two key steps: annotators retrieve relevant marker genes and integrate this information with their domain expertise to make informed decisions.
- Asia > China (0.05)
- North America > United States (0.04)