Learning State-Aware Visual Representations from Audible Interactions

Jan-18-2025, 00:25:33 GMT–Neural Information Processing Systems

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on moments in time when interactions take place.

audible interaction, egocentric dataset, learning state-aware visual representation, (2 more...)

Technology:
- Information Technology > Artificial Intelligence > Vision (0.82)