AITopics | masked image modeling

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.97)

Neural Information Processing SystemsDec-24-2025, 14:13:32 GMT

Green Hierarchical Vision Transformer for Masked Image Modeling

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks.

green hierarchical vision transformer, masked image modeling, name change, (5 more...)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Miyazaki, Asahi, Okita, Tsuyoshi

Thoughts on Objectives of Sparse and Hierarchical Masked Image Model

arXiv.org Artificial IntelligenceMay-15-2025

Masked image modeling is one of the most poplular objectives of training. Recently, the SparK model has been proposed with superior performance among self-supervised learning models. This paper proposes a new mask pattern for this SparK model, proposing it as the Mesh Mask-ed SparK model. We report the effect of the mask pattern used for image masking in pre-training on performance.

artificial intelligence, experiment, machine learning, (16 more...)

2505.08819

Country: Asia > Japan > Kyūshū & Okinawa > Kyūshū (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Neural Information Processing SystemsJan-18-2025, 16:09:11 GMT

RevColV2: Exploring Disentangled Representations in Masked Image Modeling

Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applica- tions, resulting in inconsistent representations between pre-training and fine-tuning and can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoen- coder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. Such design enables our architecture with the nice property: maintaining disentangled low-level and semantic information at the end of the network in MIM pre-training.

exploring disentangled representation, masked image modeling, revcolv2, (7 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsJan-15-2025, 11:42:41 GMT

Green Hierarchical Vision Transformer for Masked Image Modeling

green hierarchical vision transformer, hierarchical vit, masked image modeling, (2 more...)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

arXiv.org Artificial IntelligenceDec-26-2024

Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models

Choi, Hyesong, Kim, Daeun, Cha, Sungmin, Yi, Kwang Moo, Min, Dongbo

In this work, we dive deep into the impact of additive noise in pre-training deep networks. While various methods have attempted to use additive noise inspired by the success of latent denoising diffusion models, when used in combination with masked image modeling, their gains have been marginal when it comes to recognition tasks. We thus investigate why this would be the case, in an attempt to find effective ways to combine the two ideas. Specifically, we find three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and an explicit disentanglement between noised and masked tokens is necessary. By implementing these findings, we demonstrate improved pre-training performance for a wide range of recognition tasks, including those that require fine-grained, high-frequency information to solve.

artificial intelligence, machine learning, recognition task, (12 more...)

2412.19104

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > New York (0.04)
North America > Canada > British Columbia (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Przewięźlikowski, Marcin, Balestriero, Randall, Jasiński, Wojciech, Śmieja, Marek, Zieliński, Bartosz

Beyond [cls]: Exploring the true potential of Masked Image Modeling representations

arXiv.org Artificial IntelligenceDec-4-2024

Masked Image Modeling (MIM) has emerged as a popular method for Self-Supervised Learning (SSL) of visual representations. However, for high-level perception tasks, MIM-pretrained models offer lower out-of-the-box representation quality than the Joint-Embedding Architectures (JEA) - another prominent SSL paradigm. To understand this performance gap, we analyze the information flow in Vision Transformers (ViT) learned by both approaches. We reveal that whereas JEAs construct their representation on a selected set of relevant image fragments, MIM models aggregate nearly whole image content. Moreover, we demonstrate that MIM-trained ViTs retain valuable information within their patch tokens, which is not effectively captured by the global [cls] token representations. Therefore, selective aggregation of relevant patch tokens, without any fine-tuning, results in consistently higher-quality of MIM representations. To our knowledge, we are the first to highlight the lack of effective representation aggregation as an emergent issue of MIM and propose directions to address it, contributing to future advances in Self-Supervised Learning.

mae, patch token, representation, (15 more...)

2412.03215

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)

Nguyen, Khanh-Binh, Park, Chae Jung

Symmetric masking strategy enhances the performance of Masked Image Modeling

arXiv.org Artificial IntelligenceAug-22-2024

Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images by estimating the missing pixels in randomly masked sections. It has proven to be a powerful tool for the preliminary training of Vision Transformers (ViTs), yielding impressive results across various tasks. Nevertheless, most MIM methods heavily depend on the random masking strategy to formulate the pretext task. This strategy necessitates numerous trials to ascertain the optimal dropping ratio, which can be resource-intensive, requiring the model to be pre-trained for anywhere between 800 to 1600 epochs. Furthermore, this approach may not be suitable for all datasets. In this work, we propose a new masking strategy that effectively helps the model capture global and local features. Based on this masking strategy, SymMIM, our proposed training pipeline for MIM is introduced. SymMIM achieves a new SOTA accuracy of 85.9\% on ImageNet using ViT-Large and surpasses previous SOTA across downstream tasks such as image classification, semantic segmentation, object detection, instance segmentation tasks, and so on.

arxiv preprint arxiv, proceedings, representation, (10 more...)

2408.12772

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > South Korea (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (0.66)

Williams, David S. W., Gadd, Matthew, Newman, Paul, De Martini, Daniele

Masked Gamma-SSL: Learning Uncertainty Estimation via Masked Image Modeling

arXiv.org Artificial IntelligenceFeb-27-2024

This work proposes a semantic segmentation network that produces high-quality uncertainty estimates in a single forward pass. We exploit general representations from foundation models and unlabelled datasets through a Masked Image Modeling (MIM) approach, which is robust to augmentation hyper-parameters and simpler than previous techniques. For neural networks used in safety-critical applications, bias in the training data can lead to errors; therefore it is crucial to understand a network's limitations at run time and act accordingly. To this end, we test our proposed method on a number of test domains including the SAX Segmentation benchmark, which includes labelled test data from dense urban, rural and off-road driving domains. The proposed method consistently outperforms uncertainty estimation and Out-of-Distribution (OoD) techniques on this difficult benchmark.

learning uncertainty estimation, masked gamma-ssl, masked image modeling

2402.17622

Country: Europe > United Kingdom (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.53)

arXiv.org Artificial IntelligenceAug-28-2023

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Liu, Jihao, Wang, Tai, Liu, Boxiao, Zhang, Qihang, Liu, Yu, Li, Hongsheng

Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry branch reconstructing dense perspective-view depth maps. The depth branch is designed to be camera-aware by inputting the camera's parameters for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation. Code and pretrained models are available at https://github.com/Sense-X/GeoMIM.

artificial intelligence, detection, machine learning, (17 more...)

2303.11325

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.05)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Transfer Learning (0.34)