Visual Language Models as Operator Agents in the Space Domain
Carrasco, Alejandro, Nedungadi, Marco, Zucchelli, Enrico M., Jain, Amit, Rodriguez-Fernandez, Victor, Linares, Richard
Since the emergence of the LLM trend, initiated with the first release of ChatGPT [1], these systems have undergone continuous development and have evolved into multimodal architectures. Multimodal models such as GPT-4o [2], LLaMA 3.2 [3], and Claude, with its latest 3.5 Sonnet model [4], integrate language understanding with non-language capabilities, including vision and audio processing. This progression unlocks new opportunities for developing intelligent agents that recognize and interpret patterns not only at the semantic level but also through components that incorporate other types of unstructured data into prompts, significantly expanding their potential applications and impact. Extending these capabilities, Vision-Language Models (VLMs) build on multimodal principles by integrating visual reasoning into the LLM framework. By introducing new tokens into the prompt to process image frames, VLMs enable simultaneous semantic and visual reasoning. This enhancement is particularly valuable in dynamic applications such as robotics, where combining vision and language reasoning allows systems to generate environment-responsive actions. Such actions, often described as descriptive policies, translate reasoning into meaningful, executable commands. Language models able to generate such commands are usually referred to as "agentic".
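To make the "agentic" usage described above concrete, the sketch below shows one way a VLM could be prompted with a rendered environment frame and a task description, then asked to return a single executable command. This is a minimal illustration assuming the OpenAI Python client and the GPT-4o chat endpoint; the action vocabulary and the helper name `propose_action` are illustrative and not taken from the paper.

```python
# Minimal sketch: send an environment frame plus a task description to a VLM
# and ask for exactly one executable action (a simple "descriptive policy").
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def propose_action(frame_path: str, task: str) -> str:
    """Return a single discrete action string suggested by the VLM."""
    with open(frame_path, "rb") as f:
        frame_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    # Textual task description and a constrained action vocabulary
                    {"type": "text",
                     "text": f"Task: {task}\n"
                             "Reply with exactly one action from: "
                             "thrust_x+, thrust_x-, thrust_y+, thrust_y-, hold."},
                    # The image frame is passed as a base64-encoded data URL
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{frame_b64}"}},
                ],
            }
        ],
    )
    return response.choices[0].message.content.strip()


# Example usage (hypothetical file and task):
# action = propose_action("frame_0042.png", "Rendezvous with the target spacecraft.")
```

Constraining the reply to a fixed action set is one simple way to keep the model's output directly executable by a downstream controller; the paper's actual interface may differ.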
arXiv.org Artificial Intelligence
Jan-13-2025