Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability

Jiang, Douglas, Dai, Zilin, Zhang, Luxuan, Yu, Qiyi, Sun, Haoqi, Tian, Feng

May-14-2025–arXiv.org Artificial Intelligence

Understanding cell identity and function from single-cell sequencing data remains a key challenge in computational biology. We present a novel framework that leverages gene-specific textual annotations from the NCBI Gene database to generate biologically contextualized cell embeddings. For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embeddings using large language models (LLMs). The models used include OpenAI text-embedding-ada-002, text-embedding-3-small, and text-embedding-3-large (Jan 2024), as well as the domain-specific models BioBERT and SciBERT. Each cell embedding is computed as an expression-weighted average of the description embeddings of the top N most highly expressed genes in that cell, providing a compact, semantically rich representation. This multimodal strategy bridges structured biological data with state-of-the-art language modeling, enabling more interpretable downstream applications such as cell-type clustering, cell vulnerability dissection, and trajectory inference.
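The expression-weighted averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cell_embedding`, the normalization of weights to sum to 1, the zero-vector fallback for cells with no expression, and the toy 2-dimensional vectors (standing in for LLM embeddings of NCBI Gene descriptions) are all assumptions made for illustration.

```python
import numpy as np

def cell_embedding(expression, gene_embeddings, top_n=3):
    """Expression-weighted average over the top_n most expressed genes.

    expression:      per-gene expression values for one cell, shape (G,)
    gene_embeddings: one embedding per gene, shape (G, D); in practice these
                     would come from an LLM applied to gene descriptions
    """
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)

    # Rank genes by expression and keep the top_n, highest first.
    top = np.argsort(expression)[::-1][:top_n]
    weights = expression[top]
    total = weights.sum()
    if total == 0:
        # Assumption: a cell with no expressed genes maps to the zero vector.
        return np.zeros(gene_embeddings.shape[1])

    # Assumption: weights are normalized so they sum to 1.
    return (weights / total) @ gene_embeddings[top]

# Toy data: 4 genes with hypothetical 2-dimensional embeddings.
expression = [0.0, 3.0, 1.0, 0.5]
gene_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
emb = cell_embedding(expression, gene_embeddings, top_n=2)
```

With `top_n=2`, only the two most expressed genes contribute, weighted 0.75 and 0.25 after normalization. Whether the original work normalizes weights this way, or applies a transform such as log scaling first, is not stated in the abstract.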

large language model, machine learning, natural language


Genre:
- Research Report > New Finding (0.68)

Industry:
- Health & Medicine
  - Pharmaceuticals & Biotechnology (1.00)
  - Therapeutic Area
    - Musculoskeletal (0.93)
    - Neurology > Amyotrophic Lateral Sclerosis (ALS) (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Statistical Learning (0.94)
    - Neural Networks > Deep Learning
      - Generative AI (0.34)
