VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction
Martins, Stephane Da Silva, Aldea, Emanuel, Hégarat-Mascle, Sylvie Le
–arXiv.org Artificial Intelligence
Multi-agent trajectory prediction is a key task in computer vision for autonomous systems, particularly in dense and interactive environments. Existing methods often struggle to jointly model goal-driven behavior and complex social dynamics, which leads to unrealistic predictions. In this paper, we introduce VISTA, a recursive goal-conditioned transformer architecture that features (1) a cross-attention fusion mechanism to integrate long-term goals with past trajectories, (2) a social-token attention module enabling fine-grained interaction modeling across agents, and (3) pairwise attention maps to show social influence patterns during inference. Our model enhances the single-agent goal-conditioned approach into a cohesive multi-agent forecasting framework. In addition to the standard evaluation metrics, we also consider trajectory collision rates, which capture the realism of the joint predictions. Evaluated on the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy with improved interaction modeling. On MADRAS, our approach reduces the average collision rate of strong baselines from 2.14% to 0.03%, and on SDD, it achieves a 0% collision rate while outperforming SOTA models in terms of ADE/FDE and minFDE. These results highlight the model's ability to generate socially compliant, goal-aware, and interpretable trajectory predictions, making it well-suited for deployment in safety-critical autonomous systems.
arXiv.org Artificial Intelligence
Nov-14-2025
- Country:
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.04)
- Genre:
- Research Report (0.82)
- Technology: