CtRL-Sim: Reactive and Controllable Driving Agents with Offline Reinforcement Learning
Rowe, Luke, Girgis, Roger, Gosselin, Anthony, Carrez, Bruno, Golemo, Florian, Heide, Felix, Paull, Liam, Pal, Christopher
Evaluating autonomous vehicle (AV) stacks in simulation typically involves replaying driving logs from real-world recorded traffic. However, agents replayed from offline data are not reactive and are hard to control intuitively. Existing approaches address these challenges with methods that rely on heuristics or on generative models of real-world data, but these approaches either lack realism or require costly iterative sampling procedures to control the generated behaviours. In this work, we take an alternative approach and propose CtRL-Sim, a method that leverages return-conditioned offline reinforcement learning to efficiently generate reactive and controllable traffic agents. Specifically, we process real-world driving data through a physics-enhanced Nocturne simulator to generate a diverse offline reinforcement learning dataset annotated with various reward terms. With this dataset, we train a return-conditioned multi-agent behaviour model that allows fine-grained manipulation of agent behaviours by modifying the desired returns for the various reward components. This capability enables the generation of a wide range of driving behaviours beyond the scope of the initial dataset, including adversarial behaviours. We demonstrate that CtRL-Sim can generate diverse and realistic safety-critical scenarios while providing fine-grained control over agent behaviours.
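The idea of steering behaviour by changing a desired return can be illustrated with a toy sketch (a hand-built lookup, not CtRL-Sim's learned multi-agent behaviour model; all action names and per-component return estimates below are invented for illustration):

```python
# Toy sketch of return-conditioned action selection. Each action carries
# (invented) return estimates for two reward components; the agent picks
# the action whose estimates best match the *desired* returns, so lowering
# the "collision_avoid" target steers it toward adversarial behaviour.
ACTION_RETURNS = {
    "yield":  {"progress": 0.2, "collision_avoid": 1.0},
    "cruise": {"progress": 0.6, "collision_avoid": 0.8},
    "cut_in": {"progress": 0.9, "collision_avoid": 0.1},
}

def pick_action(desired):
    """Choose the action whose return estimates best match the desired returns."""
    def mismatch(action):
        est = ACTION_RETURNS[action]
        return sum((est[k] - desired[k]) ** 2 for k in desired)
    return min(ACTION_RETURNS, key=mismatch)

# High collision-avoidance target -> cooperative driving.
print(pick_action({"progress": 0.5, "collision_avoid": 1.0}))  # cruise
# Low collision-avoidance target -> adversarial cut-in.
print(pick_action({"progress": 0.9, "collision_avoid": 0.0}))  # cut_in
```

The same knob-per-reward-component interface is what lets a single trained model produce both nominal and safety-critical scenarios.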
Inverse Neural Rendering for Explainable Multi-Object Tracking
Ost, Julian, Banerjee, Tanushree, Bijelic, Mario, Heide, Felix
Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is especially true when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem: we optimize, via a differentiable rendering pipeline, over the latent space of pre-trained 3D object representations to retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are entirely unseen by our method, and no fine-tuning is required. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.
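The inverse-rendering loop can be sketched in miniature (the real method optimizes latents of a pre-trained 3D generative object model through a differentiable renderer; here the "renderer" is a hypothetical linear decoder so the whole loop fits in a few lines):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))            # stand-in decoder: latent -> "image"

def render(z):
    return W @ z

z_true = rng.normal(size=8)             # latents behind the observed frame
observed = render(z_true)

# Gradient descent on an image-space loss over the latent space.
z = np.zeros(8)
lr = 1.0 / np.linalg.norm(W, ord=2) ** 2  # step size below the stability limit
for _ in range(500):
    residual = render(z) - observed     # image-space error
    z -= lr * (W.T @ residual)          # analytic gradient of 0.5 * ||r||^2

# The recovered latent explains the observation and can itself be inspected,
# unlike an opaque feed-forward feature vector.
print(np.allclose(render(z), observed, atol=1e-3))   # True
```

Because the output is an explicit latent rather than a feature map, failure cases can be examined by simply rendering the recovered latent and comparing it to the image.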
Robust Depth Enhancement via Polarization Prompt Fusion Tuning
Ikemura, Kei, Huang, Yiming, Heide, Felix, Zhang, Zhaoxiang, Chen, Qifeng, Lei, Chenyang
Existing depth sensors are imperfect and may provide inaccurate depth values in challenging scenarios, such as in the presence of transparent or reflective objects. In this work, we present a general framework that leverages polarization imaging to improve inaccurate depth measurements from various depth sensors. Previous polarization-based depth enhancement methods focus on pure physics-based formulas for a single sensor. In contrast, our method first adopts a learning-based strategy in which a neural network is trained to estimate a dense and complete depth map from polarization data and a sensor depth map from different sensors. To further improve performance, we propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively utilize RGB-based models pre-trained on large-scale datasets, as the polarization dataset is too small to train a strong model from scratch. We conducted extensive experiments on a public dataset, and the results demonstrate that the proposed method performs favorably compared to existing depth enhancement baselines. Code and demos are available at https://lastbasket.github.io/PPFT/.
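The parameter split behind prompt-fusion tuning can be mirrored in a toy sketch (hypothetical shapes and names; PPFT's real prompts live inside a pretrained RGB depth-completion network, not a single matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

backbone = rng.normal(size=(32, 32))   # frozen, pretrained on large RGB data
prompt = np.zeros((32, 4))             # small trainable polarization prompt

def forward(sensor_depth_feat, polar_feat):
    # Polarization cues enter through the trainable prompt branch; the
    # large pretrained backbone is applied but never updated.
    fused = sensor_depth_feat + prompt @ polar_feat
    return backbone @ fused

trainable = prompt.size
frozen = backbone.size
print(trainable, frozen)   # 128 vs 1024: only a small fraction is tuned
```

Tuning only the small prompt branch is what makes a limited polarization dataset sufficient, since the pretrained backbone supplies the bulk of the representation.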
Instance Segmentation in the Dark
Chen, Linwei, Fu, Ying, Wei, Kaixuan, Zheng, Dezhi, Heide, Felix
Existing instance segmentation techniques are primarily tailored for high-visibility inputs, but their performance significantly deteriorates in extremely low-light environments. In this work, we introduce several techniques that substantially boost low-light inference accuracy. The proposed method is motivated by the observation that scene details in low-light images are "buried" by severe noise caused by the limited photon count. To suppress this "feature noise", we propose a novel learning method that relies on an adaptive weighted downsampling layer, a smooth-oriented convolutional block, and disturbance suppression learning. The adaptive weighted downsampling layer aggregates local features adaptively and suppresses the high-frequency disturbance caused by noise while keeping the details in deep features; the smooth-oriented convolutional block enhances ordinary convolutional layers by adding a smooth-oriented convolution branch. Together, these components substantially improve the capability of models to learn noise-resistant features and thus boost low-light segmentation accuracy appreciably, while remaining model-agnostic and lightweight or even cost-free. Furthermore, we discover that high-bit-depth RAW images can better preserve richer scene information in low-light conditions compared to typical camera sRGB outputs, thus supporting the use of RAW-input algorithms; we notice that high bit depth can be critical for low-light instance segmentation. To mitigate the scarcity of annotated RAW datasets, we leverage synthetically generated low-light RAW data. In addition, to facilitate further research in this direction, we capture a real-world low-light instance segmentation dataset comprising over two thousand paired low/normal-light images with instance-level pixel-wise annotations. The proposed method achieves strong performance in very low light (4% AP higher than state-of-the-art competitors), meanwhile opening new opportunities for future research. Our code and dataset are publicly available.
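The intuition behind noise-aware downsampling can be shown with a minimal sketch (simplified: the paper's adaptive weighted downsampling layer learns its weights, whereas here they are supplied by hand):

```python
import numpy as np

def weighted_downsample(x, w, k=2):
    """Average each k x k block of x, weighting every pixel by w."""
    h = x.shape[0] // k * k
    ww = x.shape[1] // k * k
    xb = x[:h, :ww].reshape(h // k, k, ww // k, k)
    wb = w[:h, :ww].reshape(h // k, k, ww // k, k)
    return (xb * wb).sum(axis=(1, 3)) / wb.sum(axis=(1, 3))

# A 2x2 patch whose bottom-right pixel is a noise spike; giving it a low
# weight keeps the downsampled value close to the clean signal (1.0).
x = np.array([[1.0, 1.0], [1.0, 9.0]])
w_uniform = np.ones((2, 2))
w_noise_aware = np.array([[1.0, 1.0], [1.0, 0.1]])
print(weighted_downsample(x, w_uniform))      # [[3.0]]   plain average
print(weighted_downsample(x, w_noise_aware))  # [[~1.26]] spike suppressed
```

A learned version of such weights lets the network discount noisy pixels during downsampling instead of propagating them into deep features.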
Kissing to Find a Match: Efficient Low-Rank Permutation Representation
Dröge, Hannah, Lähner, Zorah, Bahat, Yuval, Martorell, Onofre, Heide, Felix, Möller, Michael
Permutation matrices play a key role in matching and assignment problems across many fields, especially in computer vision and robotics. However, the memory required to explicitly represent a permutation matrix grows quadratically with the problem size, prohibiting large problem instances. In this work, we propose to tackle the curse of dimensionality of large permutation matrices by approximating them using a low-rank matrix factorization followed by a nonlinearity. To this end, we rely on kissing number theory to infer the minimal rank required to represent a permutation matrix of a given size, which is significantly smaller than the problem size. This leads to a drastic reduction in computation and memory costs, e.g., up to $3$ orders of magnitude less memory for a problem of size $n=20000$, represented using $8.4\times10^5$ elements in two small matrices instead of a single huge matrix with $4\times 10^8$ elements. The proposed representation allows for accurate approximations of large permutation matrices, which in turn enables handling problems that would otherwise be infeasible. We demonstrate the applicability and merits of the proposed approach through a series of experiments on a range of problems that involve predicting permutation matrices, from linear and quadratic assignment to shape matching.
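The factorized representation can be demonstrated at small scale (toy recovery rule: the paper derives the minimal rank from kissing-number bounds and applies a nonlinearity to the low-rank product, whereas here a plain argmax suffices; $d = 21$ matches the abstract's $8.4\times10^5 = 2 \times 20000 \times 21$ element count):

```python
import numpy as np

n, d = 1000, 21                     # d << n
rng = np.random.default_rng(2)

U = rng.normal(size=(n, d))         # one unit direction per column "slot"
U /= np.linalg.norm(U, axis=1, keepdims=True)

perm = rng.permutation(n)
V = U[perm]                         # store only U and V: 2 * n * d numbers

# The implicit n x n permutation matrix is nonlinearity(V @ U.T); row i
# matches the column whose direction has the largest inner product, which
# is its own direction (inner product 1, all others strictly smaller).
recovered = np.argmax(V @ U.T, axis=1)
print(np.array_equal(recovered, perm))   # exact recovery
print(U.size + V.size, "vs", n * n)      # 42000 vs 1000000 entries
```

Because only $U$ and $V$ are ever materialized, the memory cost scales as $O(nd)$ rather than $O(n^2)$, which is what makes problem sizes like $n=20000$ tractable.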
Biologically Inspired Dynamic Thresholds for Spiking Neural Networks
Ding, Jianchuan, Dong, Bo, Heide, Felix, Ding, Yufei, Zhou, Yunduo, Yin, Baocai, Yang, Xin
The dynamic membrane potential threshold, as one of the essential properties of a biological neuron, is a spontaneous regulation mechanism that maintains neuronal homeostasis, i.e., the constant overall spiking firing rate of a neuron. As such, the neuron firing rate is regulated by a dynamic spiking threshold, which has been extensively studied in biology. Existing work in the machine learning community does not employ bioinspired spiking threshold schemes. This work aims at bridging this gap by introducing a novel bioinspired dynamic energy-temporal threshold (BDETT) scheme for spiking neural networks (SNNs). The proposed BDETT scheme mirrors two bioplausible observations: a dynamic threshold has 1) a positive correlation with the average membrane potential and 2) a negative correlation with the preceding rate of depolarization. We validate the effectiveness of the proposed BDETT on robot obstacle avoidance and continuous control tasks under both normal conditions and various degraded conditions, including noisy observations, weights, and dynamic environments. We find that the BDETT outperforms existing static and heuristic threshold approaches by significant margins in all tested conditions, and we confirm that the proposed bioinspired dynamic threshold scheme offers homeostasis to SNNs in complex real-world tasks.
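The two correlations can be mirrored in a toy leaky integrate-and-fire neuron (the coefficients 0.5 and 0.3 and the 10-step window below are illustrative stand-ins, not the paper's derived energy and temporal terms):

```python
def run(inputs, window_len=10, a=0.5, b=0.3):
    """LIF neuron whose threshold rises with the recent average membrane
    potential (+ correlation) and falls with the preceding rate of
    depolarization (- correlation)."""
    v, hist, spikes = 0.0, [], 0
    for x in inputs:
        v_prev = v
        v = 0.9 * v + x                       # leaky integration
        hist.append(v)
        window = hist[-window_len:]
        mean_v = sum(window) / len(window)    # energy component (+)
        depol = v - v_prev                    # temporal component (-)
        theta = 1.0 + a * mean_v - b * depol  # dynamic threshold
        if v >= theta:
            spikes += 1
            v = 0.0                           # reset after spiking
    return spikes

weak = run([0.3] * 100)
strong = run([0.9] * 100)
print(weak, strong)   # stronger drive fires more, but the threshold adapts up
```

With constant weak input the threshold drifts up as the membrane charges, delaying the spike; with strong input the higher running average raises the threshold further, which is the homeostatic pressure the BDETT scheme formalizes.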
Latent Variable Nested Set Transformers & AutoBots
Girgis, Roger, Golemo, Florian, Codevilla, Felipe, D'Souza, Jim Aldon, Kahou, Samira Ebrahimi, Heide, Felix, Pal, Christopher
Humans have the innate ability to attend to the most relevant actors in their vicinity and can forecast how they may behave in the future. This ability will be crucial for the deployment of safety-critical agents such as robots or vehicles which interact with humans. We propose a theoretical framework for this problem setting based on autoregressively modelling sequences of nested sets, using latent variables to better capture multimodal distributions over future sets of sets. We present a new model architecture, the Nested Set Transformer, which employs multi-head self-attention blocks over sets of sets that serve as a form of social attention between the elements of the sets at every timestep. Our approach can produce a distribution over future trajectories for all agents under consideration, or focus upon the trajectory of an ego-agent. We validate the Nested Set Transformer in autonomous driving settings, where we refer to it as "AutoBot": we model the trajectory of an ego-agent based on sequential observations of key attributes of multiple agents in a scene. AutoBot outperforms state-of-the-art published prior work on the challenging nuScenes vehicle trajectory modeling benchmark. We also examine the multi-agent prediction version of our model, which jointly forecasts an ego-agent's future trajectory along with those of the other agents in the scene. Finally, we validate the behavior of the proposed Nested Set Transformer for scene-level forecasting on a pedestrian trajectory dataset.
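The social-attention building block can be sketched minimally (toy dimensions and random weights; the full model interleaves such blocks over both the agent axis and the time axis, with latent variables on top):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def social_attention(agents, Wq, Wk, Wv):
    # agents: (num_agents, feat) -- one row per agent at this timestep.
    # Each agent attends over the whole set, aggregating the others'
    # features in proportion to learned agent-to-agent scores.
    Q, K, V = agents @ Wq, agents @ Wk, agents @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))
    return scores @ V

rng = np.random.default_rng(3)
agents = rng.normal(size=(5, 16))        # 5 agents in the scene
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = social_attention(agents, Wq, Wk, Wv)
print(out.shape)   # (5, 16): updated, socially-aware agent features
```

Because attention is over a set, the block is permutation-equivariant in the agents, which is what makes it natural for modelling scenes with a varying cast of actors.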