Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning

Muleilan Pei, Shaoshuai Shi, Shaojie Shen

arXiv.org Artificial Intelligence 

Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models that better aligns agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment, and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Results on the Waymo Open Sim Agents Challenge (WOSAC) show that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.

Simulating multi-agent traffic behavior plays a pivotal role in ensuring the safety and reliability of autonomous driving systems. However, modeling realistic and scalable traffic behavior remains highly challenging due to the inherent uncertainty and multi-modality of human driving. Traditional simulators that simply replay logged data lack reactive capability, while rule-based methods such as the Intelligent Driver Model (IDM) (Treiber et al., 2000) depend on handcrafted heuristics and fail to capture the diversity and realism of human behavior.
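To make the rule-based baseline concrete, the IDM referenced above (Treiber et al., 2000) computes a vehicle's longitudinal acceleration from its own speed, the leader's speed, and the gap, using a handful of handcrafted parameters. A minimal sketch, with typical parameter values chosen for illustration (not taken from this paper):

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0, T=1.5, a_max=1.0, b=1.5, s0=2.0, delta=4):
    """Intelligent Driver Model (Treiber et al., 2000).

    v      -- ego speed (m/s)
    v_lead -- leading vehicle speed (m/s)
    gap    -- bumper-to-bumper distance to the leader (m)
    v0, T, a_max, b, s0, delta -- desired speed, time headway,
        max acceleration, comfortable deceleration, minimum gap,
        and acceleration exponent (illustrative defaults).
    """
    dv = v - v_lead  # closing rate toward the leader
    # Desired dynamic gap s*: minimum gap + headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * dv / (2 * math.sqrt(a_max * b)))
    # Acceleration: free-road term minus interaction (gap) term.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

Because every behavior mode is baked into these few parameters, IDM cannot express the multi-modality of real drivers, which is precisely the limitation that motivates learned, data-driven simulators.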
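The abstract does not spell out the metric-oriented policy optimization algorithm, but R1-style reinforcement fine-tuning is commonly built on group-relative advantage estimation (as in GRPO): several rollouts are sampled per scenario, scored by the evaluation metric, and each rollout's advantage is its score normalized against the group. A hypothetical sketch of that scoring step, with `group_relative_advantages` being an illustrative name rather than anything from the paper:

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages (hypothetical sketch, not the paper's code).

    rewards -- metric scores (e.g., realism scores) for a group of
               rollouts sampled from the same traffic scenario.
    Returns one advantage per rollout: its score standardized against
    the group mean and standard deviation, so above-average rollouts
    are reinforced and below-average ones are penalized.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) + 1e-8
    return [(r - mean) / std for r in rewards]
```

In such a setup the advantages would weight the next-token log-probabilities of each rollout during the RFT phase, steering the simulator toward trajectories that score well on the target metric.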