DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

May-29-2025, 20:16:37 GMT–Neural Information Processing Systems

Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory, whereas MLLM inference requires storing billions of parameters and performing a large amount of computation, which imposes significant hardware demands. In this paper, we seek to address this challenge by leveraging an intriguing observation: relatively easy situations make up the bulk of the procedure of controlling robots to fulfill diverse tasks, and they generally require far smaller models to obtain the correct robotic actions.

Country:
- Asia (0.14)

Genre:
- Research Report
  - Experimental Study (0.93)
  - New Finding (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.67)
    - Neural Networks > Deep Learning (0.94)
  - Natural Language > Large Language Model (1.00)
  - Robots (1.00)