SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos

May-27-2025, 20:47:42 GMT–Neural Information Processing Systems

The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.

depth signal, end-to-end object-centric learning, real-world video, (2 more...)

Technology:
- Information Technology > Artificial Intelligence > Vision (0.42)