4D-Former: Multimodal 4D Panoptic Segmentation
Athar, Ali, Li, Enxu, Casas, Sergio, Urtasun, Raquel
Perception systems employed in self-driving vehicles (SDVs) aim to understand the scene both spatially and temporally. Recently, 4D panoptic segmentation has emerged as an important task which involves assigning a semantic label to each observation, as well as an instance ID representing each unique object consistently over time, thus combining semantic segmentation, instance segmentation and object tracking into a single, comprehensive task. Potential applications of this task include building semantic maps, auto-labelling object trajectories, and onboard perception. The task is, however, challenging due to the sparsity of the point-cloud observations, and the computational complexity of 4D spatio-temporal reasoning. Traditionally, researchers have tackled the constituent tasks in isolation, i.e., segmenting classes [1, 2, 3, 4], identifying individual objects [5, 6], and tracking them over time [7, 8]. However, combining multiple networks into a single perception system makes the overall system error-prone, potentially slow, and cumbersome to train.
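To make the task definition concrete, the sketch below (illustrative Python/NumPy, not the 4D-Former implementation; all function and variable names are assumptions) shows one common way to represent a 4D panoptic prediction: every LiDAR point in a multi-scan sequence carries a semantic class and a temporally consistent instance ID, which can be packed into a single panoptic label.

import numpy as np

def merge_scan_predictions(semantic_per_scan, instance_per_scan):
    """Concatenate per-scan predictions into one 4D (space + time) labelling.

    semantic_per_scan: list of (N_t,) int arrays, one semantic class per point.
    instance_per_scan: list of (N_t,) int arrays, instance IDs that are already
                       consistent across scans (0 = "stuff" / no instance).
    Returns a scan index per point (its timestamp), plus the per-point labels.
    """
    scan_idx = np.concatenate([
        np.full(len(s), t, dtype=np.int32) for t, s in enumerate(semantic_per_scan)
    ])
    semantics = np.concatenate(semantic_per_scan)
    instances = np.concatenate(instance_per_scan)
    # One illustrative convention: pack both fields as class_id * OFFSET + instance_id.
    OFFSET = 1 << 16
    panoptic = semantics.astype(np.int64) * OFFSET + instances.astype(np.int64)
    return scan_idx, semantics, instances, panoptic

# Toy usage: two scans with three and two points; "car 7" reappears in the second scan.
sem = [np.array([1, 1, 9]), np.array([1, 9])]   # e.g. 1 = car, 9 = road
ins = [np.array([7, 8, 0]), np.array([7, 0])]
print(merge_scan_predictions(sem, ins)[3])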
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Athar, Ali, Mahadevan, Sabarinath, Ošep, Aljoša, Leal-Taixé, Laura, Leibe, Bastian
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
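As a rough illustration of the clustering idea (a minimal PyTorch sketch under assumed shapes and names, not the released STEm-Seg code), per-pixel embeddings of an entire clip can be compared against an instance's centre embedding with a Gaussian kernel, yielding one soft spatio-temporal mask per object:

import torch

def soft_instance_mask(embeddings, center, bandwidth):
    """embeddings: (T, H, W, D) per-pixel embeddings for a video clip.
    center:     (D,) embedding of one instance (e.g. the mean over its pixels).
    bandwidth:  (D,) per-dimension clustering bandwidth (sigma).
    Returns a (T, H, W) soft mask in [0, 1] for that instance over the clip."""
    diff = embeddings - center                       # (T, H, W, D)
    dist = (diff ** 2 / (2 * bandwidth ** 2)).sum(dim=-1)
    return torch.exp(-dist)

# Toy usage: one instance in a 4-frame clip with 8-dimensional embeddings.
emb = torch.randn(4, 32, 32, 8)
mask = soft_instance_mask(emb, center=emb[0, 16, 16], bandwidth=torch.ones(8))
print(mask.shape)  # torch.Size([4, 32, 32])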
TarViS: A Unified Approach for Target-based Video Segmentation
Athar, Ali, Hermans, Alexander, Luiten, Jonathon, Ramanan, Deva, Leibe, Bastian
The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS
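The query-based decoding can be sketched as follows (illustrative PyTorch, not the TarViS code; names and shapes are assumptions): each target is an abstract query vector, and mask logits are obtained from an inner product between the queries and per-frame pixel features, regardless of which task defined the targets.

import torch

def predict_target_masks(queries, pixel_features):
    """queries:        (Q, D) one abstract query per target (object, class, ...).
    pixel_features: (T, D, H, W) per-frame feature maps for a video clip.
    Returns (Q, T, H, W) mask logits, one spatio-temporal mask per target."""
    return torch.einsum('qd,tdhw->qthw', queries, pixel_features)

# Toy usage: 5 targets on a 3-frame clip with 64-dimensional features.
logits = predict_target_masks(torch.randn(5, 64), torch.randn(3, 64, 16, 16))
masks = logits.sigmoid() > 0.5
print(masks.shape)  # torch.Size([5, 3, 16, 16])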
HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images
Athar, Ali, Luiten, Jonathon, Hermans, Alexander, Ramanan, Deva, Leibe, Bastian
Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video. This requires a large amount of densely annotated video data, which is costly to annotate, and largely redundant since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations.
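A rough sketch of the descriptor idea (assumed PyTorch shapes and names, not the HODOR implementation): an object descriptor is obtained by mask-pooling the features of an annotated frame, and re-segmentation scores the features of a different frame against that descriptor.

import torch

def encode_descriptor(features, mask):
    """features: (D, H, W) frame features; mask: (H, W) binary object mask.
    Returns a (D,) descriptor: the average feature inside the mask."""
    weights = mask.float().flatten()                         # (H*W,)
    feats = features.flatten(1)                              # (D, H*W)
    return (feats * weights).sum(dim=1) / weights.sum().clamp(min=1.0)

def resegment(descriptor, features):
    """Score every pixel of a new frame against the object descriptor."""
    return torch.einsum('d,dhw->hw', descriptor, features)   # (H, W) logits

# Toy usage: encode the object from frame 0, re-segment it in frame 1.
f0, f1 = torch.randn(64, 24, 24), torch.randn(64, 24, 24)
mask0 = torch.zeros(24, 24)
mask0[8:16, 8:16] = 1
logits1 = resegment(encode_descriptor(f0, mask0), f1)
print(logits1.shape)  # torch.Size([24, 24])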