4M: Massively Multimodal Masked Modeling

Dec-26-2025, 14:51:18 GMT – Neural Information Processing Systems

Current machine learning models for vision are often highly specialized and limited to a single modality and task. In contrast, recent large language models exhibit a wide range of capabilities, hinting at the possibility of similarly versatile models in computer vision. In this paper, we take a step in this direction and propose a multimodal training scheme called 4M. It consists of training a single unified Transformer encoder-decoder using a masked modeling objective across a wide range of input/output modalities, including text, images, geometric and semantic modalities, and neural network feature maps.
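
The masked modeling objective described above can be illustrated with a minimal sketch. It assumes that every modality has already been converted into a sequence of discrete tokens, which the abstract does not state, and it shows only the data-sampling step: a random subset of tokens pooled across all modalities becomes the visible input, and a disjoint random subset becomes the prediction target. The function name `sample_masked_example`, its parameters, and the toy data are hypothetical and are not taken from the paper or its code.

```python
import random


def sample_masked_example(tokens_by_modality, num_inputs, num_targets, seed=None):
    """Split tokens pooled from several modalities into visible inputs and masked targets.

    Illustrative assumption: each modality is a list of discrete tokens.
    """
    rng = random.Random(seed)
    # Pool every token together with its modality name and position, so a
    # model could later tell which modality and location each token came from.
    pool = [
        (modality, position, token)
        for modality, tokens in tokens_by_modality.items()
        for position, token in enumerate(tokens)
    ]
    if num_inputs + num_targets > len(pool):
        raise ValueError("not enough tokens to sample from")
    rng.shuffle(pool)
    # Disjoint slices of the shuffled pool: the first is visible to the
    # encoder, the second is what the decoder would be asked to predict.
    inputs = pool[:num_inputs]
    targets = pool[num_inputs:num_inputs + num_targets]
    return inputs, targets


# Toy data with three hypothetical modalities.
example = {
    "text": ["a", "red", "car"],
    "image": [17, 4, 92, 8],
    "depth": [3, 3, 5, 1],
}
inputs, targets = sample_masked_example(example, num_inputs=4, num_targets=3, seed=0)
```

Because inputs and targets are drawn from one shared pool, any modality can appear on either side of a training example, which is one way to read "input/output modalities" in the abstract. A real training loop would feed `inputs` to the encoder and compute a loss on the decoder's predictions for `targets`.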

massively multimodal masked modeling, modality, name change

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language (0.76)