Evaluation Protocol: The most ambitious aim of self-supervised learning is to create universal visual representations
–Neural Information Processing Systems
We thank the reviewers for valuable feedback. Before addressing individual comments, we clarify common concerns. Moreover, "image-level" vs "pixel-level" training has no bearing on the validity of evaluating with Any method that uses a CNN learns more than just "image-level" representations; for Our task is to learn pixel-wise semantic-aware embeddings from scratch. We will update the final version to reflect the full 200 training epochs. We first sample regions, then a fixed number of pixels within chosen regions.
Neural Information Processing Systems
Aug-16-2025, 05:14:57 GMT