Statistical Learning
Self-supervised Object-Centric Learning for Videos Gรถrkay Aydemir
From these temporally-aware slots, the training objective is to reconstruct the middle frame in a high-level semantic feature space. We propose a masking strategy by dropping a significant portion of tokens in the feature space for efficiency and regularization. Additionally, we address over-clustering by merging slots based on similarity.