Learning 1D Causal Visual Representation with De-focus Attention Networks
Neural Information Processing Systems
Differences between modalities have led to heterogeneous architectures for vision and language models. Images are typically modeled with 2D non-causal modeling, whereas text uses 1D causal modeling. This distinction poses significant challenges for constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, where attention concentrates excessively on a small proportion of visual tokens.
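To make the terms in the abstract concrete, the following is a minimal illustrative sketch, not the paper's method or code. It assumes image patches are flattened in raster order into a 1D sequence, applies a causal mask so that each token attends only to itself and earlier tokens, and reports how much attention mass lands on the few largest entries of a row. The function names (`flatten_patches`, `causal_attention_weights`, `top_k_mass`), the random scores, and the top-k mass measure are all hypothetical choices made for this illustration; the paper may quantify "over-focus" differently.

```python
import math
import random


def flatten_patches(grid):
    """Raster-scan a 2D grid of patch tokens into a 1D sequence (assumed ordering)."""
    return [tok for row in grid for tok in row]


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to j <= i (causal mask)."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]            # token i may attend only to tokens 0..i
        m = max(visible)                  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions are masked out and receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


def top_k_mass(weight_row, k):
    """Fraction of one row's attention mass carried by its k largest entries."""
    return sum(sorted(weight_row, reverse=True)[:k])


# Hypothetical setup: a 4x4 grid of patches becomes a 16-token sequence.
rng = random.Random(0)
n_side = 4
grid = [[(r, c) for c in range(n_side)] for r in range(n_side)]
tokens = flatten_patches(grid)
n = len(tokens)

# Random attention logits stand in for query-key scores (an assumption).
base = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]

# Larger logit scales give sharper softmax rows, i.e. more concentrated attention.
for scale in (1.0, 8.0):
    weights = causal_attention_weights([[scale * s for s in row] for row in base])
    last = weights[-1]                    # the last token sees the whole sequence
    print(f"scale={scale}: top-2 mass on last row = {top_k_mass(last, 2):.3f}")
```

In this toy setting, a top-2 mass close to 1 on a 16-token row would mean that nearly all attention falls on two tokens, which is the kind of concentration the abstract describes as "over-focus".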
May-28-2025, 22:59:32 GMT