RevColV2: Exploring Disentangled Representations in Masked Image Modeling
Neural Information Processing Systems
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning that can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. This design gives our architecture a desirable property: low-level and semantic information remain disentangled at the end of the network during MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation, and object detection.
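The reversible propagation between columns described above can be illustrated with an additive-coupling sketch. This is a hypothetical, minimal illustration (not the authors' implementation): the function `f`, the feature dimension, and the variable names are all assumptions, chosen only to show why an additive update between columns is exactly invertible.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1  # stand-in weights for a column block

def f(x):
    # Simple nonlinear transform standing in for a column's sub-network.
    return np.tanh(x @ W)

def forward(x_prev_col, x_lower_level):
    # Additive coupling: the next column's feature depends on the previous
    # column's feature plus a transform of a lower-level feature.
    return x_prev_col + f(x_lower_level)

def inverse(y, x_lower_level):
    # Because the update is additive, the previous column's feature can be
    # recovered exactly -- no information is lost between columns.
    return y - f(x_lower_level)

x_prev = rng.standard_normal(8)
x_lower = rng.standard_normal(8)
y = forward(x_prev, x_lower)
assert np.allclose(inverse(y, x_lower), x_prev)
```

In a RevNet-style design, this invertibility also allows intermediate activations to be recomputed during the backward pass rather than stored, which is one practical benefit of reversible connections.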
Dec-25-2025, 13:05:59 GMT
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Vision (0.97)