Learning State-Aware Visual Representations from Audible Interactions

Neural Information Processing Systems 

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from these videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment. However, current successful multi-modal learning frameworks encourage representation invariance over time.
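To make the invariance issue concrete, the sketch below shows a typical audio-visual contrastive (InfoNCE) objective, written as a minimal PyTorch example; the function name, tensor shapes, and temperature are illustrative assumptions, not the method proposed in this paper. When positive pairs are drawn from different moments of the same video, as is common practice, such an objective rewards features that remain constant over time, which conflicts with sensitivity to environment-state changes.

```python
# Minimal sketch (not the paper's method) of a standard audio-visual
# contrastive objective, assuming PyTorch. Names and shapes are illustrative.
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """InfoNCE loss pairing each video clip with its corresponding audio.

    video_emb, audio_emb: (batch, dim) embeddings of clips. When positives
    are sampled from different timestamps of the same video (a common
    temporal augmentation), the encoder is rewarded for producing
    time-invariant features -- the issue raised above.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    # Symmetric cross-entropy over video-to-audio and audio-to-video matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```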