Schmid, Stefan
Predicting the Road Ahead: A Knowledge Graph based Foundation Model for Scene Understanding in Autonomous Driving
Zhou, Hongkuan, Schmid, Stefan, Li, Yicong, Halilaj, Lavdim, Yao, Xiangtong, cao, Wei
The autonomous driving field has seen remarkable advancements in various topics, such as object recognition, trajectory prediction, and motion planning. However, current approaches face limitations in effectively comprehending the complex evolutions of driving scenes over time. This paper proposes FM4SU, a novel methodology for training a symbolic foundation model (FM) for scene understanding in autonomous driving. It leverages knowledge graphs (KGs) to capture sensory observation along with domain knowledge such as road topology, traffic rules, or complex interactions between traffic participants. A bird's eye view (BEV) symbolic representation is extracted from the KG for each driving scene, including the spatio-temporal information among the objects across the scenes. The BEV representation is serialized into a sequence of tokens and given to pre-trained language models (PLMs) for learning an inherent understanding of the co-occurrence among driving scene elements and generating predictions on the next scenes. We conducted a number of experiments using the nuScenes dataset and KG in various scenarios. The results demonstrate that fine-tuned models achieve significantly higher accuracy in all tasks. The fine-tuned T5 model achieved a next scene prediction accuracy of 86.7%. This paper concludes that FM4SU offers a promising foundation for developing more comprehensive models for scene understanding in autonomous driving.
Visual Representation Learning Guided By Multi-modal Prior Knowledge
Zhou, Hongkuan, Halilaj, Lavdim, Monka, Sebastian, Schmid, Stefan, Zhu, Yuqicheng, Xiong, Bo, Staab, Steffen
Despite the remarkable success of deep neural networks (DNNs) in computer vision, they fail to remain high-performing when facing distribution shifts between training and testing data. In this paper, we propose Knowledge-Guided Visual representation learning (KGV), a distribution-based learning approach leveraging multi-modal prior knowledge, to improve generalization under distribution shift. We use prior knowledge from two distinct modalities: 1) a knowledge graph (KG) with hierarchical and association relationships; and 2) generated synthetic images of visual elements semantically represented in the KG. The respective embeddings are generated from the given modalities in a common latent space, i.e., visual embeddings from original and synthetic images as well as knowledge graph embeddings (KGEs). These embeddings are aligned via a novel variant of translation-based KGE methods, where the node and relation embeddings of the KG are modeled as Gaussian distributions and translations respectively. We claim that incorporating multi-model prior knowledge enables more regularized learning of image representations. Thus, the models are able to better generalize across different data distributions. We evaluate KGV on different image classification tasks with major or minor distribution shifts, namely road sign classification across datasets from Germany, China, and Russia, image classification with the mini-ImageNet dataset and its variants, as well as the DVM-CAR dataset. The results demonstrate that KGV consistently exhibits higher accuracy and data efficiency than the baselines across all experiments.
Credence: Augmenting Datacenter Switch Buffer Sharing with ML Predictions
Addanki, Vamsi, Pacut, Maciej, Schmid, Stefan
Packet buffers in datacenter switches are shared across all the switch ports in order to improve the overall throughput. The trend of shrinking buffer sizes in datacenter switches makes buffer sharing extremely challenging and a critical performance issue. Literature suggests that push-out buffer sharing algorithms have significantly better performance guarantees compared to drop-tail algorithms. Unfortunately, switches are unable to benefit from these algorithms due to lack of support for push-out operations in hardware. Our key observation is that drop-tail buffers can emulate push-out buffers if the future packet arrivals are known ahead of time. This suggests that augmenting drop-tail algorithms with predictions about the future arrivals has the potential to significantly improve performance. This paper is the first research attempt in this direction. We propose Credence, a drop-tail buffer sharing algorithm augmented with machine-learned predictions. Credence can unlock the performance only attainable by push-out algorithms so far. Its performance hinges on the accuracy of predictions. Specifically, Credence achieves near-optimal performance of the best known push-out algorithm LQD (Longest Queue Drop) with perfect predictions, but gracefully degrades to the performance of the simplest drop-tail algorithm Complete Sharing when the prediction error gets arbitrarily worse. Our evaluations show that Credence improves throughput by $1.5$x compared to traditional approaches. In terms of flow completion times, we show that Credence improves upon the state-of-the-art approaches by up to $95\%$ using off-the-shelf machine learning techniques that are also practical in today's hardware. We believe this work opens several interesting future work opportunities both in systems and theory that we discuss at the end of this paper.
Heterogeneous Graph-based Trajectory Prediction using Local Map Context and Social Interactions
Grimm, Daniel, Zipfl, Maximilian, Hertlein, Felix, Naumann, Alexander, Lรผttin, Jรผrgen, Thoma, Steffen, Schmid, Stefan, Halilaj, Lavdim, Rettinger, Achim, Zรถllner, J. Marius
Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently shown to achieve among the best performances on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but don't distinguish between relation-type and attributes like their distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph, that accounts for the nature and important features of their relation. Second, we extract agent-centric image-based map features to model the local map context. Finally, we generate anchor paths to enforce the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
Relation-based Motion Prediction using Traffic Scene Graphs
Zipfl, Maximilian, Hertlein, Felix, Rettinger, Achim, Thoma, Steffen, Halilaj, Lavdim, Luettin, Juergen, Schmid, Stefan, Henson, Cory
Representing relevant information of a traffic scene and understanding its environment is crucial for the success of autonomous driving. Modeling the surrounding of an autonomous car using semantic relations, i.e., how different traffic participants relate in the context of traffic rule based behaviors, is hardly been considered in previous work. This stems from the fact that these relations are hard to extract from real-world traffic scenes. In this work, we model traffic scenes in a form of spatial semantic scene graphs for various different predictions about the traffic participants, e.g., acceleration and deceleration. Our learning and inference approach uses Graph Neural Networks (GNNs) and shows that incorporating explicit information about the spatial semantic relations between traffic participants improves the predicdtion results. Specifically, the acceleration prediction of traffic participants is improved by up to 12% compared to the baselines, which do not exploit this explicit information. Furthermore, by including additional information about previous scenes, we achieve 73% improvements.
ConTraKG: Contrastive-based Transfer Learning for Visual Object Recognition using Knowledge Graphs
Monka, Sebastian, Halilaj, Lavdim, Schmid, Stefan, Rettinger, Achim
Deep learning techniques achieve high accuracy in computer vision tasks. However, their accuracy suffers considerably when they face a domain change, i.e., as soon as they are used in a domain that differs from their training domain. For example, a road sign recognition model trained to recognize road signs in Germany performs poorly in countries with different road sign standards like China. We propose ConTraKG, a neuro-symbolic approach that enables cross-domain transfer learning based on prior knowledge about the domain or context. A knowledge graph serves as a medium for encoding such prior knowledge, which is then transformed into a dense vector representation via embedding methods. Using a five-phase training pipeline, we train the deep neural network to adjust its visual embedding space according to the domain-invariant embedding space of the knowledge graph based on a contrastive loss function. This allows the neural network to incorporate training data from different target domains that are already represented in the knowledge graph. We conduct a series of empirical evaluations to determine the accuracy of our approach. The results show that ConTraKG is significantly more accurate than the conventional approach for dealing with domain changes. In a transfer learning setup, where the network is trained on both domains, ConTraKG achieves 21% higher accuracy when tested on the source domain and 15% when tested on the target domain compared to the standard approach. Moreover, with only 10% of the target data for training, it achieves the same accuracy as the cross-entropy-based model trained on the full target data.