Any2Policy: Learning Visuomotor Policy with Any-Modality
Humans can communicate and perceive media in different modalities, such as text, sound, and images. For robots to become more generalizable embodied agents, they should be able to follow instructions and perceive the world across equally diverse modalities. Current robot-learning methods typically focus on single-modality task specification and observation, limiting their ability to exploit rich multi-modal information. To address this limitation, we present an end-to-end, general-purpose multi-modal system named Any-to-Policy Embodied Agents. This system enables robots to handle tasks specified and observed in various modalities, whether in combinations such as text-image, audio-image, and text-point cloud, or in isolation.
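The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of one plausible any-modality-to-policy design: per-modality encoders project inputs into a shared embedding space, an attention module fuses whatever modalities are present, and a policy head decodes an action. All module names, feature dimensions, and the fusion scheme are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class AnyModalityPolicy(nn.Module):
    """Illustrative sketch (not the paper's model): each instruction or
    observation modality is encoded into a shared embedding space, fused,
    and decoded into an action."""

    def __init__(self, embed_dim=256, action_dim=7):
        super().__init__()
        # Per-modality encoders projecting features into a shared space.
        # A real system would use pretrained encoders (e.g. a ViT for images,
        # wav2vec for audio, PointNet for point clouds, a language model for
        # text); linear stubs keep the sketch self-contained and runnable.
        self.encoders = nn.ModuleDict({
            "text":        nn.Linear(768, embed_dim),
            "image":       nn.Linear(1024, embed_dim),
            "audio":       nn.Linear(512, embed_dim),
            "point_cloud": nn.Linear(1024, embed_dim),
        })
        # Fuse however many modality tokens are present with self-attention.
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                       batch_first=True),
            num_layers=2,
        )
        self.policy_head = nn.Linear(embed_dim, action_dim)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        # Encode only the modalities provided; absent ones are skipped, so
        # text-image, audio-image, text-point cloud, or a single modality
        # all flow through the same code path.
        tokens = [self.encoders[name](feat) for name, feat in inputs.items()]
        fused = self.fuser(torch.stack(tokens, dim=1))  # (B, n_modalities, D)
        return self.policy_head(fused.mean(dim=1))      # (B, action_dim)

policy = AnyModalityPolicy()
action = policy({
    "text":  torch.randn(1, 768),   # e.g. a pooled language-model embedding
    "image": torch.randn(1, 1024),  # e.g. a pooled ViT feature
})
print(action.shape)  # torch.Size([1, 7])
```

Because the forward pass simply iterates over whichever modalities appear in the input dictionary, the same network accepts modality combinations or a single modality in isolation, which is the flexibility the abstract claims.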
Neural Information Processing Systems
Mar-27-2025, 14:37:22 GMT
- Genre:
  - Research Report > Experimental Study (0.93)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks > Deep Learning (0.93)
    - Natural Language > Large Language Model (1.00)
    - Representation & Reasoning (1.00)
    - Robots (1.00)
    - Vision (1.00)