SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters
Traore, Abdarahmane; Hervet, Éric; Couturier, Andy
arXiv.org Artificial Intelligence
Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes them difficult to deploy in resource-constrained settings such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively aligns visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives.
Sep-22-2025
- Country:
- North America > Canada > New Brunswick > Westmorland County > Moncton (0.04)
- Genre:
- Research Report > New Finding (0.46)