VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Martins, Stephane Da Silva, Aldea, Emanuel, Hégarat-Mascle, Sylvie Le

arXiv.org Artificial Intelligence 

Multi-agent trajectory prediction is a key task in computer vision for autonomous systems, particularly in dense and interactive environments. Existing methods often struggle to jointly model goal-driven behavior and complex social dynamics, which leads to unrealistic predictions. In this paper, we introduce VISTA, a recursive goal-conditioned transformer architecture that features (1) a cross-attention fusion mechanism to integrate long-term goals with past trajectories, (2) a social-token attention module enabling fine-grained interaction modeling across agents, and (3) pairwise attention maps to show social influence patterns during inference. Our model enhances the single-agent goal-conditioned approach into a cohesive multi-agent forecasting framework. In addition to the standard evaluation metrics, we also consider trajectory collision rates, which capture the realism of the joint predictions. Evaluated on the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy with improved interaction modeling. On MADRAS, our approach reduces the average collision rate of strong baselines from 2.14% to 0.03%, and on SDD, it achieves a 0% collision rate while outperforming SOTA models in terms of ADE/FDE and minFDE. These results highlight the model's ability to generate socially compliant, goal-aware, and interpretable trajectory predictions, making it well-suited for deployment in safety-critical autonomous systems.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found