KRAST: Knowledge-Augmented Robotic Action Recognition with Structured Text for Vision-Language Models
Son Hai Nguyen, Diwei Wang, Jinhyeok Jang, Hyewon Seo
arXiv.org Artificial Intelligence
Accurate vision-based action recognition is crucial for developing autonomous robots that can operate safely and reliably in complex, real-world environments. In this work, we advance video-based recognition of indoor daily actions for robotic perception by leveraging vision-language models (VLMs) enriched with domain-specific knowledge. We adapt a prompt-learning framework in which class-level textual descriptions of each action are embedded as learnable prompts into a frozen pre-trained VLM backbone. Several strategies for structuring and encoding these textual descriptions are designed and evaluated. Experiments on the ETRI-Activity3D dataset demonstrate that our method, using only RGB video inputs at test time, achieves over 95% accuracy and outperforms state-of-the-art approaches. These results highlight the effectiveness of knowledge-augmented prompts in enabling robust action recognition with minimal supervision.
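The prompt-learning mechanism the abstract describes — learnable prompt vectors prepended to class-level text descriptions, passed through a frozen text encoder, and matched against video features by similarity — can be sketched as follows. This is a minimal illustration assuming a CoOp-style setup; the encoder is mocked as a fixed random projection, and all names (`ctx`, `encode_text`, dimensions) are hypothetical, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of CoOp-style prompt learning (assumed mechanism, not KRAST's code).
# The frozen "text encoder" is mocked as mean-pooling plus a fixed projection;
# in a real setup only the learnable context vectors `ctx` receive gradients.

rng = np.random.default_rng(0)
embed_dim, ctx_len, n_classes, feat_dim = 16, 4, 3, 8

W_frozen = rng.standard_normal((embed_dim, feat_dim))  # frozen projection

def encode_text(tokens):
    """Mock frozen text encoder: tokens (seq_len, embed_dim) -> (feat_dim,)."""
    return tokens.mean(axis=0) @ W_frozen

# Learnable prompt context shared across classes (the trainable part).
ctx = rng.standard_normal((ctx_len, embed_dim)) * 0.02

# Per-class token embeddings of the structured action descriptions (random here).
class_tokens = rng.standard_normal((n_classes, 2, embed_dim))

# Class text features: learnable context prepended to each class description.
text_feats = np.stack(
    [encode_text(np.concatenate([ctx, class_tokens[c]])) for c in range(n_classes)]
)
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

# A video feature from the (frozen) visual branch, random for illustration.
video_feat = rng.standard_normal(feat_dim)
video_feat /= np.linalg.norm(video_feat)

# Classification by cosine similarity between video and class text features.
logits = text_feats @ video_feat
pred = int(np.argmax(logits))
print(pred, logits.shape)
```

In training, gradients from a cross-entropy loss over these similarity logits would update only `ctx` (and any description-encoding parameters), leaving both VLM encoders frozen.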
Sep-23-2025