ManiTrend: Bridging Future Generation and Action Prediction with 3D Flow for Robotic Manipulation

Feb-14-2025–arXiv.org Artificial Intelligence

Language-conditioned manipulation is a vital but challenging robotic task due to the high-level abstraction of language. To address this, researchers have sought improved goal representations derived from natural language. In this paper, we highlight 3D flow - representing the motion trend of 3D particles within a scene - as an effective bridge between language-based future image generation and fine-grained action prediction. To this end, we develop ManiTrend, a unified framework that models the dynamics of 3D particles, vision observations and manipulation actions with a causal transformer. Within this framework, features for 3D flow prediction serve as additional conditions for future image generation and action prediction, alleviating the complexity of pixel-wise spatiotemporal modeling and providing seamless action guidance. Furthermore, 3D flow can substitute missing or heterogeneous action labels during large-scale pretraining on cross-embodiment demonstrations. Experiments on two comprehensive benchmarks demonstrate that our method achieves state-of-the-art performance with high efficiency. Our code and model checkpoints will be available upon acceptance.

artificial intelligence, machine learning, manitrend, (16 more...)

arXiv.org Artificial Intelligence

Feb-14-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.50)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Robots (1.00)
  - Machine Learning > Neural Networks (1.00)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found