Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding
Aaron Lohner, Francesco Compagno, Jonathan Francis, Alessandro Oltramari
arXiv.org Artificial Intelligence
Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. Accidents take many forms, and understanding what type of accident is taking place may help prevent it from recurring. This work focuses on classifying a traffic scene as a specific type of accident. We approach the problem by treating a traffic scene as a graph, in which objects such as cars are represented as nodes, and the relative distances and directions between them as edges. This representation, referred to as a scene graph, is used as input to an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with representations from vision and language. We introduce a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with the vision and language modalities for accident classification. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, an increase of close to 5 percentage points over the case where scene graph information is not taken into account.
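To make the graph representation described in the abstract concrete, the following is a minimal sketch of how detected objects in one frame might be turned into a scene graph whose nodes are objects and whose edges carry relative distance and direction. It is an illustration only, not the authors' pipeline: the `Detection` class, the `build_scene_graph` function, the four direction labels, and the `max_distance` threshold are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object in a frame (hypothetical structure for this sketch)."""
    obj_id: int
    label: str
    x: float  # center x in pixels
    y: float  # center y in pixels (image coordinates, y grows downward)


def direction(dx: float, dy: float) -> str:
    """Coarse direction of the target relative to the source (assumed label set)."""
    if abs(dx) >= abs(dy):
        return "right_of" if dx > 0 else "left_of"
    return "below" if dy > 0 else "above"


def build_scene_graph(detections, max_distance=200.0):
    """Return (nodes, edges) for one frame.

    nodes: dict mapping object id to class label.
    edges: list of (source_id, target_id, attributes) for every ordered pair
           of distinct objects closer than max_distance (an assumed cutoff).
    """
    nodes = {d.obj_id: d.label for d in detections}
    edges = []
    for src in detections:
        for dst in detections:
            if src.obj_id == dst.obj_id:
                continue
            dx, dy = dst.x - src.x, dst.y - src.y
            dist = math.hypot(dx, dy)
            if dist <= max_distance:
                attrs = {"distance": dist, "direction": direction(dx, dy)}
                edges.append((src.obj_id, dst.obj_id, attrs))
    return nodes, edges


# Example frame: two nearby cars and one distant pedestrian.
frame = [
    Detection(0, "car", 100.0, 100.0),
    Detection(1, "car", 130.0, 140.0),
    Detection(2, "pedestrian", 600.0, 100.0),
]
nodes, edges = build_scene_graph(frame)
for edge in edges:
    print(edge)
```

In this example only the two cars are connected, since the pedestrian lies beyond the assumed distance cutoff. A classifier along the lines the abstract describes would consume a sequence of such per-frame graphs, typically after encoding them with a graph neural network, but that stage is outside the scope of this sketch.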
Jul-8-2024
- Country:
  - Europe > Italy (0.14)
  - North America > United States (0.14)
- Genre:
  - Research Report (0.83)
- Industry:
  - Automobiles & Trucks (0.34)
  - Information Technology (0.48)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning (1.00)
    - Natural Language (1.00)
    - Representation & Reasoning (1.00)
    - Robots > Autonomous Vehicles (0.34)
    - Vision (1.00)