MCMAE: Masked Convolution Meets Masked Autoencoders

Jan-19-2025, 04:30:16 GMT–Neural Information Processing Systems

Vision Transformers (ViT) become widely-adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation. In this paper, our MCMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme. However, directly using the original masking strategy leads to the heavy computational cost and pretraining-finetuning discrepancy. To tackle the issue, we adopt the masked convolution to prevent information leakage in the convolution blocks.

architecture, masked convolution meet masked autoencoder, multi-scale feature, (1 more...)

Neural Information Processing Systems

Jan-19-2025, 04:30:16 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Machine Learning > Neural Networks (0.74)