



Unveiling Encoder-Free Vision-Language Models
Xiaotong Li, Yueze Wang

Neural Information Processing Systems

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs) for visual-language tasks. However, the vision encoders impose a strong inductive bias in abstracting visual representations, e.g., resolution, aspect ratio, and semantic priors, which can impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) bridging vision-language representation inside one unified decoder; (2) enhancing visual recognition capability via extra supervision.
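To make the recipe concrete, below is a minimal, illustrative sketch of what an encoder-free forward pass could look like: raw pixels are patch-embedded directly into the decoder's token space, visual and text tokens are processed by one unified decoder, and an auxiliary head provides extra visual supervision on the visual token positions. All module names, layer sizes, and the auxiliary-head design here are assumptions for illustration, not the paper's actual implementation (the causal attention mask is also omitted for brevity).

```python
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    """Sketch of an encoder-free VLM: no vision encoder, one unified decoder."""

    def __init__(self, vocab_size=32000, dim=1024, patch=14, layers=4, heads=8):
        super().__init__()
        # A lightweight patch embedding replaces the vision encoder:
        # pixels are projected directly into the decoder's embedding space.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.tok_embed = nn.Embedding(vocab_size, dim)
        # One unified decoder processes interleaved visual and text tokens.
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)
        # Hypothetical auxiliary head supplying extra visual supervision
        # (e.g., regressing features of a frozen teacher) on visual positions.
        self.vision_head = nn.Linear(dim, dim)

    def forward(self, pixels, text_ids):
        v = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, Nv, dim)
        t = self.tok_embed(text_ids)                             # (B, Nt, dim)
        h = self.decoder(torch.cat([v, t], dim=1))
        n_v = v.shape[1]
        return self.vision_head(h[:, :n_v]), self.lm_head(h[:, n_v:])

if __name__ == "__main__":
    model = EncoderFreeVLM()
    vis_feat, logits = model(torch.randn(1, 3, 224, 224),
                             torch.randint(0, 32000, (1, 16)))
    print(vis_feat.shape, logits.shape)  # (1, 256, 1024), (1, 16, 32000)
```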




A Appendix
A.1 More Ablations and Visualizations
Effect of Blocking Gradient of f(s).


As mentioned in Section 3.2, we compare the performance of different detectors with and without blocking the gradient of f(s). We attribute the observed difference to unstable training caused by the gradient flowing through the denominator, so this gradient is blocked by default in our experiments. Figure 1 visualizes the searched parameterized functions for different detectors on the COCO benchmark [5]. The dots on each line mark the control points of each parameterized function. The searched loss functions differ noticeably across detectors, which we attribute to the detectors' intrinsic differences.
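A hedged sketch of what "blocking the gradient of f(s)" can mean in practice is shown below: the denominator term is detached from the computation graph so that it acts as a constant during backpropagation. The exact form of f(s) and of the loss here are assumptions for illustration only; the point is the use of .detach() to stop the potentially unstable gradient contribution from the denominator.

```python
import torch

def normalized_loss(scores, weights, block_denominator_grad=True):
    """Toy normalized loss with an optional gradient block on the denominator."""
    # Numerator: weighted sum of per-sample scores.
    numerator = (weights * scores).sum()
    # Denominator f(s): a normalizer computed from the same scores
    # (a placeholder choice; the real f(s) is defined in the paper).
    denominator = scores.sum().clamp(min=1e-6)
    if block_denominator_grad:
        # Treat f(s) as a constant during backprop, as done by default
        # in the experiments, to avoid unstable denominator gradients.
        denominator = denominator.detach()
    return numerator / denominator

scores = torch.rand(8, requires_grad=True)
weights = torch.rand(8)
loss = normalized_loss(scores, weights)
loss.backward()
print(scores.grad)  # gradients flow only through the numerator
```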