Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation
Ziyin Xiong, Yinghan Chen, Puhao Li, Yixin Zhu, Tengyu Liu, Siyuan Huang
–arXiv.org Artificial Intelligence
Figure 1: Ag2x2 enables zero-shot acquisition of bimanual manipulation skills without relying on expert demonstrations or engineered rewards. The framework operates in two key stages: (left) learning coordination-aware visual representations directly from human manipulation videos (shown as sequential frames of cooking, with hands highlighted), preserving critical hand-position information despite domain differences; and (right) leveraging these representations to autonomously acquire diverse bimanual manipulation skills in simulation, demonstrated by multiple Franka robot arms performing sequential steps of tasks including cabinet opening (top row), door manipulation (middle row), and rope handling (bottom row).

Abstract -- Bimanual manipulation, fundamental to human daily activities, remains challenging due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook agent-specific information crucial for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while remaining agent-agnostic. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct. This performance outperforms baseline methods and even surpasses the success rate of policies trained with expert-engineered rewards. Furthermore, we show that representations learned by Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision.

Ziyin Xiong and Yinghan Chen contributed equally to this work.
Jul-29-2025