Large Pre-Trained Models for Bimanual Manipulation in 3D

Yurchyk, Hanna, Chang, Wei-Di, Dudek, Gregory, Meger, David

Nov-13-2025–arXiv.org Artificial Intelligence

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

Nov-13-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.46)

Genre:
- Research Report (0.65)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning (1.00)
  - Robots > Robot Planning & Action (0.46)
  - Representation & Reasoning
    - Spatial Reasoning (0.46)
    - Agents (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found