Goto

Collaborating Authors

 tracking


Tracking Any Point in Persistent 3D Geometry

Neural Information Processing Systems

We introduce TAPIP3D, a novel approach for long-term 3D point tracking in monocular RGB and RGB-D videos. TAPIP3D represents videos as camerastabilized spatio-temporal feature clouds, leveraging depth and camera motion information to lift 2D video features into a 3D world space where camera movement is effectively canceled out. Within this stabilized 3D representation, TAPIP3D iteratively refines multi-frame motion estimates, enabling robust point tracking over long time horizons.


Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm

Neural Information Processing Systems

Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.


Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

Neural Information Processing Systems

The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking FrameEvent Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency.


LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers

Neural Information Processing Systems

Transformer-based algorithms, such as LoRAT, have significantly enhanced objecttracking performance. However, these approaches rely on a standard attention mechanism, which incurs quadratic token complexity, making real-time inference computationally expensive. In this paper, we introduce LoRATv2, a novel tracking framework that addresses these limitations with three main contributions. First, LoRATv2 integrates frame-wise causal attention, which ensures full selfattention within each frame while enabling causal dependencies across frames, significantly reducing computational overhead. Moreover, key-value (KV) caching is employed to efficiently reuse past embeddings for further speedup.


The Best Fitness Trackers of 2026: Garmin, Google Fitbit, and More

WIRED

Find the right wearable for your lifestyle, workouts, and goals. Like every piece of gear you wear on your body day in and day out, fitness trackers are incredibly personal. The right tracker for you should be comfortable, accurate, and tailored to your lifestyle, including your preferred workouts and health goals. Do you bike, row, or strength train? Do you run on trails for hours at a time, or do you just want a reminder to stand up every hour? Do you want to wear it on your wrist or your finger, or tuck it into your sports bra? No matter what your needs are, there's never been a better time to find a powerful, sophisticated tool to help optimize your workouts or jump-start your routine. We test dozens of fitness trackers every year while running, climbing, hiking, or just doing workout videos on our iPads at night, to bring you these picks. For more wearables, check out our guides to the Best Smartwatches, Best Smart Rings, and Best Sleep Trackers . Garmin makes some of the most accurate fitness trackers on the market, and the Vivoactive 6 is the best midrange option for most people.


HiFTTCTrack AVTrack Ours Ground Truth

Neural Information Processing Systems

UAV tracking can be widely applied in scenarios such as disaster rescue, environmental monitoring, and logistics transportation. However, existing UAV tracking methods predominantly emphasize speed and lack exploration in semantic awareness, which hinders the search region from extracting accurate localization information from the template.



Video Diffusion Models Excel at Tracking Similar Looking Objects Without Supervision

Neural Information Processing Systems

Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pretrained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.


Dual-Path Temporal Decoder for End-to-End Multi-Object Tracking

Neural Information Processing Systems

We present a novel end-to-end transformer-based framework for Multiple Object Tracking (MOT) that advances temporal modeling and identity preservation. Despite recent progress in transformer-based MOT, existing methods still struggle to maintain consistent object identities across frames, especially under occlusions, appearance changes, or detection failures. We propose a dual-path temporal decoder that explicitly separates appearance adaptation and identity preservation. The appearance-adaptive decoder dynamically updates query features using current frame information, while the identity-preserving decoder freezes query features and reuses historical sampling offsets to maintain long-term temporal consistency. To further enhance stability, we introduce a confidence-guided update suppression strategy that retains previously reliable features when predictions are unreliable. Extensive experiments on MOT benchmarks demonstrate that our approach achieves state-of-the-art performance across major tracking metrics, with significant gains in association accuracy and identity consistency. Our results demonstrate the importance of decoupling dynamic appearance modeling from static identity cues, and provide a scalable foundation for robust tracking in complex scenarios.


LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers

Neural Information Processing Systems

Transformer-based algorithms, such as LoRAT, have significantly enhanced object-tracking performance. However, these approaches rely on a standard attention mechanism, which incurs quadratic token complexity, making real-time inference computationally expensive. In this paper, we introduce LoRATv2, a novel tracking framework that addresses these limitations with three main contributions. First, LoRATv2 integrates frame-wise causal attention, which ensures full self-attention within each frame while enabling causal dependencies across frames, significantly reducing computational overhead. Moreover, key-value (KV) caching is employed to efficiently reuse past embeddings for further speedup.