Qu, Ao
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning
Tang, Yihong, Qu, Ao, Wang, Zhaokai, Zhuang, Dingyi, Wu, Zhaofeng, Ma, Wei, Wang, Shenhao, Zheng, Yunhan, Zhao, Zhan, Zhao, Jinhua
Vision language models (VLMs) have demonstrated impressive performance across a wide range of downstream tasks. However, their proficiency in spatial reasoning remains limited, despite its crucial role in tasks involving navigation and interaction with physical environments. Many such tasks rely on core spatial reasoning capabilities in two-dimensional (2D) space, and our evaluation reveals that state-of-the-art VLMs frequently generate implausible and incorrect responses to composite spatial reasoning problems, including simple pathfinding tasks that humans can solve effortlessly at a glance. To address this, we explore an effective approach to enhance 2D spatial reasoning within VLMs by training the model solely on basic spatial capabilities. We begin by disentangling the key components of 2D spatial reasoning: direction comprehension, distance estimation, and localization. Our central hypothesis is that mastering these basic spatial capabilities can significantly enhance a model's performance on composite spatial tasks requiring advanced spatial understanding and combinatorial problem-solving, with generalized improvements across visual-spatial tasks. To investigate this hypothesis, we introduce Sparkle, a framework that fine-tunes VLMs on these three basic spatial capabilities, using synthetic data generation and targeted supervision to build an instruction dataset for each capability. Our experiments demonstrate that VLMs fine-tuned with Sparkle achieve significant performance gains, not only on the basic tasks themselves but also in generalizing to composite and out-of-distribution spatial reasoning tasks. These findings underscore the effectiveness of mastering basic spatial capabilities in enhancing composite spatial problem-solving, offering insights into systematic strategies for improving VLMs' spatial reasoning capabilities.
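The decomposition into direction, distance, and localization lends itself to programmatic data generation. Below is a minimal, hypothetical sketch of how synthetic question-answer pairs for the three basic capabilities could be produced on a 2D canvas; the question templates, coordinate ranges, and answer formats are illustrative assumptions, not the actual Sparkle instruction data.

import random

def make_example(capability, rng=random):
    """Generate one toy QA pair for a basic 2D spatial capability.

    Templates and ranges are assumptions for illustration only.
    """
    # Two labeled points on a 100x100 canvas.
    (ax, ay), (bx, by) = [(rng.randint(0, 100), rng.randint(0, 100)) for _ in range(2)]
    if capability == "direction":
        horiz = "right of" if bx > ax else "left of"
        vert = "above" if by > ay else "below"
        q = f"A is at ({ax},{ay}) and B is at ({bx},{by}). Where is B relative to A?"
        a = f"B is {horiz} and {vert} A."
    elif capability == "distance":
        d = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
        q = f"How far apart are A ({ax},{ay}) and B ({bx},{by})?"
        a = f"Roughly {d:.1f} units."
    else:  # localization
        q = f"An object sits at ({ax},{ay}) on a 100x100 canvas. Which quadrant is it in?"
        a = ("left" if ax < 50 else "right") + "-" + ("bottom" if ay < 50 else "top")
    return {"capability": capability, "question": q, "answer": a}

dataset = [make_example(c) for c in ("direction", "distance", "localization") for _ in range(3)]
print(dataset[0])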
IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning
Jayawardana, Vindula, Freydt, Baptiste, Qu, Ao, Hickert, Cameron, Yan, Zhongxia, Wu, Cathy
Despite the popularity of multi-agent reinforcement learning (RL) in simulated and two-player applications, its success in messy real-world applications has been limited. A key challenge lies in its generalizability across problem variations, a common requirement in many real-world problems. Contextual reinforcement learning (CRL) formalizes learning policies that generalize across problem variations. However, the lack of standardized benchmarks for multi-agent CRL has hindered progress in the field. Ideally, such benchmarks should be grounded in real-world applications so that they naturally capture the many open challenges of real-world problems that affect generalization. To bridge this gap, we propose IntersectionZoo, a comprehensive benchmark suite for multi-agent CRL built on the real-world application of cooperative eco-driving in urban road networks. The task of cooperative eco-driving is to control a fleet of vehicles to reduce fleet-level vehicular emissions. By grounding IntersectionZoo in a real-world application, we naturally capture real-world problem characteristics, such as partial observability and multiple competing objectives. IntersectionZoo is built on data-informed simulations of 16,334 signalized intersections derived from 10 major US cities, modeled in an open-source, industry-grade microscopic traffic simulator. By modeling factors affecting vehicular exhaust emissions (e.g., temperature, road conditions, travel demand), IntersectionZoo provides one million data-driven traffic scenarios. Using these traffic scenarios, we benchmark popular multi-agent RL and human-like driving algorithms and demonstrate that popular multi-agent RL algorithms struggle to generalize in CRL settings.
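To make the contextual-RL framing concrete, the sketch below samples problem variations (contexts) and evaluates a fixed policy across them; a policy that generalizes keeps average emissions low on contexts it never trained on. The context fields, the toy environment, and the evaluation loop are simplified assumptions, not IntersectionZoo's actual API.

import random
from dataclasses import dataclass

@dataclass
class Context:
    intersection_id: int
    temperature_c: float        # affects emission modeling
    demand_veh_per_hr: int      # travel demand

def sample_context(rng=random):
    return Context(rng.randrange(16_334), rng.uniform(-10, 35), rng.randint(200, 1200))

class ToyEcoDrivingEnv:
    """Stand-in environment: each step returns emissions to be minimized."""
    def __init__(self, ctx):
        self.ctx, self.t = ctx, 0
    def reset(self):
        self.t = 0
        return [self.ctx.demand_veh_per_hr / 1200.0]     # crude observation
    def step(self, accel):
        self.t += 1
        emissions = abs(accel) * (1.0 + self.ctx.demand_veh_per_hr / 1000.0)
        return [self.ctx.demand_veh_per_hr / 1200.0], emissions, self.t >= 50

def evaluate(policy, n_contexts=5):
    """Average emissions over sampled contexts (problem variations)."""
    totals = []
    for _ in range(n_contexts):
        env = ToyEcoDrivingEnv(sample_context())
        obs, total, done = env.reset(), 0.0, False
        while not done:
            obs, e, done = env.step(policy(obs))
            total += e
        totals.append(total)
    return sum(totals) / n_contexts

print(evaluate(lambda obs: 0.1 * obs[0]))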
Synergizing Spatial Optimization with Large Language Models for Open-Domain Urban Itinerary Planning
Tang, Yihong, Wang, Zhaokai, Qu, Ao, Yan, Yihao, Hou, Kebing, Zhuang, Dingyi, Guo, Xiaotong, Zhao, Jinhua, Zhao, Zhan, Ma, Wei
In this paper, we for the first time propose the task of Open-domain Urban Itinerary Planning (OUIP) for citywalk, which directly generates itineraries based on users' requests described in natural language. OUIP is different from conventional itinerary planning, which limits users from expressing more detailed needs and hinders true personalization. Recently, large language models (LLMs) have shown potential in handling diverse tasks. However, due to non-real-time information, incomplete knowledge, and insufficient spatial awareness, they are unable to independently deliver a satisfactory user experience in OUIP. Given this, we present ItiNera, an OUIP system that synergizes spatial optimization with Large Language Models (LLMs) to provide services that customize urban itineraries based on users' needs. Specifically, we develop an LLM-based pipeline for extracting and updating POI features to create a user-owned personalized POI database. For each user request, we leverage LLM in cooperation with an embedding-based module for retrieving candidate POIs from the user's POI database. Then, a spatial optimization module is used to order these POIs, followed by LLM crafting a personalized, spatially coherent itinerary. To the best of our knowledge, this study marks the first integration of LLMs to innovate itinerary planning solutions. Extensive experiments on offline datasets and online subjective evaluation have demonstrated the capacities of our system to deliver more responsive and spatially coherent itineraries than current LLM-based solutions. Our system has been deployed in production at the TuTu online travel service and has attracted thousands of users for their urban travel planning.
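The retrieve-then-order idea can be illustrated with a small sketch: embedding-based retrieval of candidate POIs followed by a simple spatial ordering so the itinerary stays coherent on the map. The POI records, toy embeddings, and greedy nearest-neighbor ordering below are assumptions for illustration; the deployed system uses an LLM-based pipeline and a dedicated spatial optimization module instead.

import math

# Hypothetical POI records: (name, lat, lon, embedding).
POIS = [
    ("Old Town Cafe", 31.2304, 121.4737, [0.9, 0.1]),
    ("Riverside Park", 31.2400, 121.4900, [0.2, 0.8]),
    ("History Museum", 31.2250, 121.4800, [0.7, 0.3]),
    ("Night Market",   31.2350, 121.4600, [0.4, 0.6]),
]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def retrieve(query_emb, k=3):
    """Embedding-based retrieval of candidate POIs for a user request."""
    return sorted(POIS, key=lambda p: -cosine(query_emb, p[3]))[:k]

def order_spatially(pois):
    """Greedy nearest-neighbor ordering, a stand-in for the spatial optimization module."""
    route, rest = [pois[0]], list(pois[1:])
    while rest:
        last = route[-1]
        rest.sort(key=lambda p: (p[1] - last[1]) ** 2 + (p[2] - last[2]) ** 2)
        route.append(rest.pop(0))
    return route

candidates = retrieve([0.6, 0.4])   # e.g., a "relaxed cultural afternoon" request embedding
print([name for name, *_ in order_spatially(candidates)])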
SEIP: Simulation-based Design and Evaluation of Infrastructure-based Collective Perception
Qu, Ao, Huang, Xuhuan, Suo, Dajiang
Recent advances in sensing and communication have paved the way for collective perception in traffic management, with real-time data sharing among multiple entities. While vehicle-based collective perception has gained traction, infrastructure-based approaches, which entail the real-time sharing and merging of sensing data from different roadside sensors for object detection, grapple with challenges in placement strategy and high ex-post evaluation costs. Despite anecdotal evidence of their effectiveness, many current deployments rely on engineering heuristics and face budget constraints that limit post-deployment adjustments. This paper introduces polynomial-time heuristic algorithms and a simulation tool for the ex-ante evaluation of infrastructure sensor deployment. By modeling sensor deployment as an integer programming problem, we guide decisions on sensor locations, heights, and configurations to balance cost, installation constraints, and coverage. Our simulation engine, integrated with open-source urban driving simulators, enables us to evaluate the effectiveness of each sensor deployment solution through the lens of object detection. A case study with infrastructure LiDARs revealed that the incremental benefit derived from integrating additional low-resolution LiDARs could surpass that of incorporating more high-resolution ones. The results reinforce the necessity of investigating the cost-performance tradeoff prior to deployment. The code for our simulation experiments can be found at https://github.com/dajiangsuo/SEIP.
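The following sketch shows one polynomial-time heuristic in the spirit of the integer-programming view above: greedily pick the (site, height) option with the best marginal coverage per unit cost under a budget. The candidate sites, costs, and covered road cells are toy assumptions, and the exact formulation in the paper may differ.

# Toy greedy heuristic for budgeted sensor placement.
CANDIDATES = {
    # (site, height_m): (cost, set of covered road cells)
    ("NE_corner", 4): (3.0, {1, 2, 3, 4}),
    ("NE_corner", 6): (4.5, {1, 2, 3, 4, 5, 6}),
    ("SW_corner", 4): (3.0, {5, 6, 7}),
    ("SW_corner", 6): (4.5, {4, 5, 6, 7, 8}),
    ("median",    4): (2.0, {3, 4, 5}),
}

def greedy_placement(candidates, budget):
    """Repeatedly pick the option with the best marginal coverage per cost.

    Runs in polynomial time; for submodular coverage objectives this kind of
    greedy rule carries a constant-factor guarantee.
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(candidates)
    while remaining:
        def gain(item):
            key, (cost, cells) = item
            return len(cells - covered) / cost if cost > 0 else 0.0
        key, (cost, cells) = max(remaining.items(), key=gain)
        if spent + cost > budget or not (cells - covered):
            break
        chosen.append(key)
        covered |= cells
        spent += cost
        del remaining[key]
        # At most one sensor per physical site in this toy model.
        remaining = {k: v for k, v in remaining.items() if k[0] != key[0]}
    return chosen, covered, spent

print(greedy_placement(CANDIDATES, budget=8.0))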
Domain Adversarial Spatial-Temporal Network: A Transferable Framework for Short-term Traffic Forecasting across Cities
Tang, Yihong, Qu, Ao, Chow, Andy H. F., Lam, William H. K., Wong, S. C., Ma, Wei
Accurate real-time traffic forecasting is critical for intelligent transportation systems (ITS) and serves as the cornerstone of various smart mobility applications. Though this research area is dominated by deep learning, recent studies indicate that accuracy gains from developing new model structures are becoming marginal. Instead, we envision that the improvement can be achieved by transferring "forecasting-related knowledge" across cities with different data distributions and network topologies. To this end, this paper proposes a novel transferable traffic forecasting framework: Domain Adversarial Spatial-Temporal Network (DASTNet). DASTNet is pre-trained on multiple source networks and fine-tuned with the target network's traffic data. Specifically, we leverage graph representation learning and adversarial domain adaptation techniques to learn domain-invariant node embeddings, which are further incorporated to model the temporal traffic data. To the best of our knowledge, we are the first to employ adversarial multi-domain adaptation for network-wide traffic forecasting problems. DASTNet consistently outperforms all state-of-the-art baseline methods on three benchmark datasets. The trained DASTNet is applied to Hong Kong's new traffic detectors, and accurate traffic predictions can be delivered within one day of a detector becoming available. Overall, this study suggests an alternative way to enhance traffic forecasting methods and provides practical implications for cities lacking historical traffic data.
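The adversarial ingredient can be sketched with a gradient reversal layer: the forecasting loss trains the encoder normally, while reversed gradients from a domain classifier push node embeddings toward domain invariance. This is a minimal PyTorch sketch; the encoder, dimensions, and training loop are simplified assumptions, whereas DASTNet itself uses graph representation learning and a temporal model.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity forward; flips (and scales) gradients backward, so the
    encoder learns to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Simplified stand-ins; dimensions are arbitrary toy values.
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 32))
domain_clf = nn.Linear(32, 2)            # source vs. target domain
forecaster = nn.Linear(32, 1)            # traffic prediction head

x = torch.randn(8, 16)                   # toy node features
domain_labels = torch.randint(0, 2, (8,))
traffic_targets = torch.randn(8, 1)

z = encoder(x)
loss = nn.functional.mse_loss(forecaster(z), traffic_targets) \
     + nn.functional.cross_entropy(domain_clf(grad_reverse(z)), domain_labels)
loss.backward()                          # encoder receives reversed domain gradients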
Attacking Deep Reinforcement Learning-Based Traffic Signal Control Systems with Colluding Vehicles
Qu, Ao, Tang, Yihong, Ma, Wei
The rapid advancements of the Internet of Things (IoT) and artificial intelligence (AI) have catalyzed the development of adaptive traffic signal control systems (ATCS) for smart cities. In particular, deep reinforcement learning (DRL) methods produce state-of-the-art performance and hold great potential for practical applications. In existing DRL-based ATCS, the controlled signals collect traffic state information from nearby vehicles, and optimal actions (e.g., switching phases) are then determined based on the collected information. The DRL models fully "trust" that vehicles send truthful information to the signals, making the ATCS vulnerable to adversarial attacks with falsified information. In view of this, this paper formulates, for the first time, a novel task in which a group of vehicles cooperatively sends falsified information to "cheat" DRL-based ATCS in order to save their total travel time. To solve the proposed task, we develop CollusionVeh, a generic and effective vehicle-colluding framework composed of a road situation encoder, a vehicle interpreter, and a communication mechanism. We employ our method to attack established DRL-based ATCS and demonstrate that the total travel time for the colluding vehicles can be significantly reduced with a reasonable number of learning episodes, and that the colluding effect decreases as the number of colluding vehicles increases. Additionally, insights and suggestions for the real-world deployment of DRL-based ATCS are provided. The research outcomes could help improve the reliability and robustness of ATCS and better protect smart mobility systems.
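The attack surface can be illustrated without the learned framework itself: when the controller's state is built from vehicle-reported positions and speeds, a colluding vehicle can bias that state by reporting falsified values. The observation format and perturbation below are toy assumptions, not the CollusionVeh method.

def build_signal_state(reports):
    """Controller-side state: vehicle count and mean speed per approach."""
    state = {}
    for approach, pos, speed in reports:
        cnt, spd_sum = state.get(approach, (0, 0.0))
        state[approach] = (cnt + 1, spd_sum + speed)
    return {a: (cnt, spd_sum / cnt) for a, (cnt, spd_sum) in state.items()}

def falsify(report, colluding_ids, exaggeration=3):
    """A colluding vehicle duplicates itself and under-reports speed, making
    its approach look congested so the controller favors it."""
    vid, approach, pos, speed = report
    if vid not in colluding_ids:
        return [(approach, pos, speed)]
    return [(approach, pos + i * 2.0, 0.5) for i in range(exaggeration)]

honest = [(0, "north", 20.0, 10.0), (1, "north", 35.0, 9.0), (2, "east", 15.0, 11.0)]
colluders = {2}
reports = [r for rep in honest for r in falsify(rep, colluders)]
print(build_signal_state(reports))   # "east" now looks like a slow three-vehicle queue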
Graph Convolutional Networks for traffic anomaly
Event detection is an important task in transportation; its goal is to detect points in time when large events disrupt a substantial portion of the urban traffic network. Origin-Destination (OD) matrix data provided by map service vendors has great potential to reveal historical patterns and distinguish anomalies. However, fully capturing spatial and temporal traffic patterns remains a challenge, yet it plays a crucial role in effective anomaly detection. Meanwhile, existing anomaly detection methods have not adequately addressed the extreme data sparsity and high dimensionality common in OD matrix datasets. To tackle these challenges, we formulate the problem in a novel way: detecting anomalies in a set of directed, weighted graphs representing the traffic conditions at each time interval. We further propose the Context augmented Graph Autoencoder (Con-GAE), which leverages graph embedding and context embedding techniques to capture spatial traffic network patterns while working around the data sparsity and high-dimensionality issues. Con-GAE adopts an autoencoder framework and detects anomalies via semi-supervised learning. Extensive experiments show that our method achieves a 0.1-0.4 improvement in area under the curve (AUC) score over state-of-the-art anomaly detection baselines when applied to several real-world, large-scale OD matrix datasets.
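The scoring idea behind autoencoder-based detection can be sketched briefly: train a reconstruction model on (mostly) normal OD snapshots and flag hours with high reconstruction error as anomalous. The toy model below operates on flattened OD matrices only; Con-GAE additionally uses graph and context (time) embeddings, so the architecture and data here are simplified assumptions.

import torch
import torch.nn as nn

N_ZONES = 10
model = nn.Sequential(
    nn.Linear(N_ZONES * N_ZONES, 32), nn.ReLU(),
    nn.Linear(32, N_ZONES * N_ZONES),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sparse synthetic OD snapshots standing in for normal hours.
normal_od = torch.rand(200, N_ZONES * N_ZONES) * (torch.rand(200, N_ZONES * N_ZONES) < 0.2)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(normal_od), normal_od)
    loss.backward()
    opt.step()

def anomaly_score(od_snapshot):
    """Higher reconstruction error -> more anomalous traffic pattern."""
    with torch.no_grad():
        return nn.functional.mse_loss(model(od_snapshot), od_snapshot).item()

disrupted = normal_od[:1] * 5.0            # simulate a large disruption
print(anomaly_score(normal_od[:1]), anomaly_score(disrupted))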