SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
–Neural Information Processing Systems
The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation.
May-27-2025, 20:47:42 GMT