AITopics | temporal correspondence

Emergent Temporal Correspondences from Video Diffusion Transformers

Neural Information Processing Systems | Jun-13-2026, 22:42:05 GMT

Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated video with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific (but not all) layers play a critical role in temporal matching, and that this matching becomes increasingly prominent throughout denoising. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.

artificial intelligence, machine learning, proceedings, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)
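
The DiffTrack abstract above reports that query-key similarities in certain attention layers carry the temporal matching signal. As a rough illustration only, and not the authors' implementation, the sketch below treats the tokens of one frame as queries and the tokens of another frame as keys, then reads a correspondence off the highest attention weight. The function names `attention_weights` and `match_tokens` and the toy features are assumptions made for this example.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products between query and key vectors."""
    dim = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def match_tokens(queries, keys):
    """For each query token, return the index of the key token it attends to most."""
    rows = attention_weights(queries, keys)
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

# Toy data (hypothetical): frame B holds the same three token features as
# frame A, cyclically shifted by one position.
frame_a_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_b_keys = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

matches = match_tokens(frame_a_queries, frame_b_keys)
print(matches)  # index in frame B matched to each token of frame A
```

In a real video DiT the queries and keys would come from a chosen layer and denoising timestep, which is exactly the choice the abstract says matters; the toy vectors here only show the matching step.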

Joint-task Self-supervised Learning for Temporal Correspondence

Neural Information Processing Systems | Dec-25-2025, 01:21:15 GMT

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region- and pixel-levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on ImageNet.

joint-task self-supervised learning, name change, temporal correspondence, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.58)
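
The abstract above centers on an inter-frame affinity matrix that models transitions between frames. The following is a minimal sketch of that general idea, assuming a softmax over feature dot products and one-hot labels; it is not the paper's joint region-and-pixel formulation, and `affinity`, `propagate`, the `temperature` parameter, and the toy features are all hypothetical names and values chosen for illustration.

```python
import math

def affinity(src_feats, tgt_feats, temperature=1.0):
    """Return rows[j][i]: weight with which target location j draws from source location i."""
    rows = []
    for t in tgt_feats:
        scores = [sum(a * b for a, b in zip(s, t)) / temperature for s in src_feats]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def propagate(aff, src_labels):
    """Carry per-location label vectors from the source frame to the target frame."""
    n_classes = len(src_labels[0])
    return [
        [sum(w * lab[c] for w, lab in zip(row, src_labels)) for c in range(n_classes)]
        for row in aff
    ]

# Toy data (hypothetical): two locations whose features swap places between frames.
src_feats = [[5.0, 0.0], [0.0, 5.0]]
tgt_feats = [[0.0, 5.0], [5.0, 0.0]]
src_labels = [[1.0, 0.0], [0.0, 1.0]]  # one-hot labels: class 0, class 1

aff = affinity(src_feats, tgt_feats)
tgt_labels = propagate(aff, src_labels)
predicted = [max(range(len(lab)), key=lab.__getitem__) for lab in tgt_labels]
print(predicted)  # class assigned to each target location
```

The same propagation step applies whether the labels are segmentation masks, keypoints, or part labels, which is why a single affinity matrix can serve the several correspondence tasks the abstract lists.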

Reviews: Joint-task Self-supervised Learning for Temporal Correspondence

Neural Information Processing Systems | Jan-21-2025, 20:18:52 GMT

The work does not include original ideas. It is exclusively a collection of previous ideas combined in a rather classical way. Major remarks: Equation (6) makes the loss non-smooth and non-differentiable. The authors do not discuss how they handle this. I assume they use the typical approach of selecting the right 'case' in the forward step and then back-propagating through the fixed smooth function.

joint-task self-supervised learning, temporal correspondence

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.40)

Reviews: Joint-task Self-supervised Learning for Temporal Correspondence

Neural Information Processing Systems | Jan-21-2025, 20:18:41 GMT

The paper presents a new approach to tracking and pixel-level correspondence using self-supervised learning in video. It goes in the direction of multi-task learning. The results are also solid. The reviewers initially gave scores of 5, 6, and 7; after the rebuttal, the more skeptical reviewer was also convinced to raise their rating.

joint-task self-supervised learning, reviewer, temporal correspondence

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.73)

Joint-task Self-supervised Learning for Temporal Correspondence

Neural Information Processing Systems | Oct-9-2024, 14:00:27 GMT

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region- and pixel-levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking.

joint-task self-supervised learning, temporal correspondence, video frame

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.40)

Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation Learning of Vision-based Autonomous Driving

Xie, Yichen, Chen, Hongge, Meyer, Gregory P., Lee, Yong Jae, Wolff, Eric M., Tomizuka, Masayoshi, Zhan, Wei, Chai, Yuning, Huang, Xin

arXiv.org Artificial Intelligence | Feb-23-2024

Due to the lack of depth cues in images, multi-frame inputs are important for the success of vision-based perception, prediction, and planning in autonomous driving. Observations from different angles enable the recovery of 3D object states from 2D image inputs if we can identify the same instance in different input frames. However, the dynamic nature of autonomous driving scenes leads to significant changes in the appearance and shape of each instance captured by the camera at different time steps. To this end, we propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence robust to the change in distance and perspective. The learned representation aids in instance-level correspondence across multiple input frames in downstream tasks. In the pretraining stage, the raw point clouds from LiDAR sensors are utilized to construct the long-term temporal correspondence for each instance, which serves as guidance for the extraction of instance-level representation from the vision-based bird's eye-view (BEV) feature map. Cohere3D encourages a consistent representation for the same instance at different frames but distinguishes between representations of different instances. We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks. Results show a notable improvement in both data efficiency and task performance.

cohere3d, correspondence, representation, (12 more...)

arXiv.org Artificial Intelligence

2402.15583

Country:

South America > Brazil (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report (0.70)

Industry:

Transportation > Ground > Road (0.92)
Information Technology > Robotics & Automation (0.82)
Automobiles & Trucks (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
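
The Cohere3D abstract above describes a contrastive objective that pulls together representations of the same instance across frames and pushes apart those of different instances. A generic InfoNCE-style loss is one common way to express that idea; the sketch below uses it as a stand-in and should not be read as the paper's actual loss. The names `cosine` and `info_nce`, the `temperature` value, and the toy embeddings are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] embed instance i in two frames."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        top = max(logits)  # log-sum-exp with the max subtracted for stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy data (hypothetical): two instances embedded at two different time steps.
frame_t = [[1.0, 0.0], [0.0, 1.0]]
frame_t_plus_k_consistent = [[0.9, 0.1], [0.1, 0.9]]  # same instance stays similar
frame_t_plus_k_swapped = [[0.1, 0.9], [0.9, 0.1]]     # instances confused

consistent_loss = info_nce(frame_t, frame_t_plus_k_consistent)
swapped_loss = info_nce(frame_t, frame_t_plus_k_swapped)
print(consistent_loss < swapped_loss)
```

The loss is lower when each instance's embedding stays close to its own embedding in the other frame, which is the behavior the abstract asks of the learned representation. How the positive pairs are found (the abstract says LiDAR point clouds provide the long-term correspondence during pretraining) is outside this sketch.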

Joint-task Self-supervised Learning for Temporal Correspondence

Li, Xueting, Liu, Sifei, De Mello, Shalini, Wang, Xiaolong, Kautz, Jan, Yang, Ming-Hsuan

Neural Information Processing Systems | Mar-18-2020, 20:30:45 GMT

This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region- and pixel-levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions, fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking.

joint-task self-supervised learning, temporal correspondence, video frame

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.40)
