CapsDT: Diffusion-Transformer for Capsule Robot Manipulation

He, Xiting, Su, Mingwu, Jiang, Xinqi, Bai, Long, Lai, Jiewen, Ren, Hongliang

Jun-23-2025–arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models have emerged as a prominent research area, showing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly in capsule endoscopy robots that act within the digestive system, remains unexplored. Integrating VLA models into endoscopy robots allows more intuitive and efficient interaction between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. By processing interleaved visual inputs and textual instructions, CapsDT, the Diffusion-Transformer model for capsule robot manipulation introduced here, infers the corresponding robotic control signals to carry out endoscopy tasks. In addition, we developed a capsule endoscopy robot system, in which a capsule robot is controlled by a magnet held by a robotic arm, covering four endoscopy tasks at different levels, and created corresponding capsule robot datasets within a stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance across the different levels of endoscopy tasks and a 26.25% success rate in real-world simulation manipulation.

I. INTRODUCTION

Endoscopy, for both diagnostic and therapeutic interventions, provides direct visualization and treatment capabilities within the gastrointestinal (GI) tract [1], [2], [3].

artificial intelligence, large language model, natural language, (17 more...)

Country:
- Asia
  - Macao (0.04)
  - China
    - Hong Kong (0.05)
    - Guangdong Province > Shenzhen (0.05)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine
  - Therapeutic Area > Gastroenterology (1.00)
  - Diagnostic Medicine > Imaging (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Robots (1.00)
  - Natural Language > Large Language Model (0.94)
