


Unveiling Encoder-Free Vision-Language Models
Xiaotong Li, Yueze Wang

Neural Information Processing Systems

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features, followed by large language models (LLMs), for vision-language tasks. However, vision encoders impose a strong inductive bias in abstracting visual representations, e.g., resolution, aspect ratio, and semantic priors, which can limit the flexibility and efficiency of VLMs. Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) bridging vision-language representation inside one unified decoder; (2) enhancing visual recognition capability via extra supervision.
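The core of the recipe is feeding raw visual input and text into a single unified decoder. The snippet below is a minimal, hypothetical sketch of that idea in Python/PyTorch: the class name EncoderFreeVLM, the layer sizes, and the patchification details are illustrative assumptions, not the paper's implementation, and the causal attention mask and the extra visual supervision mentioned in point (2) are omitted for brevity.

# Minimal sketch of the encoder-free idea: raw image patches are linearly
# embedded and processed together with text tokens by one unified decoder.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderFreeVLM(nn.Module):
    def __init__(self, vocab_size=32000, dim=512, patch=16, layers=4, heads=8):
        super().__init__()
        self.patch = patch
        self.text_embed = nn.Embedding(vocab_size, dim)
        # A single linear projection of raw pixel patches replaces the vision encoder.
        self.patch_embed = nn.Linear(3 * patch * patch, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, pixels, text_ids):
        # pixels: (B, 3, H, W) with H, W divisible by the patch size; text_ids: (B, T)
        B, C, H, W = pixels.shape
        p = self.patch
        patches = pixels.unfold(2, p, p).unfold(3, p, p)             # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        vis = self.patch_embed(patches)                              # (B, N, dim)
        txt = self.text_embed(text_ids)                              # (B, T, dim)
        seq = torch.cat([vis, txt], dim=1)                           # one unified token sequence
        hidden = self.decoder(seq)                                   # shared decoder over both modalities
        return self.lm_head(hidden[:, vis.size(1):])                 # logits for the text positions

# Example: a 224x224 image yields 196 visual tokens that share the decoder with 16 text tokens.
# model = EncoderFreeVLM()
# logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 32000, (2, 16)))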




A Appendix
A.1 More Ablations and Visualizations
Effect of Blocking Gradient of f(s)

Neural Information Processing Systems

As mentioned in Section 3.2, we compare the performance of different detectors with and without blocking the gradient of f(s). We attribute the inferior results without blocking to unstable training caused by the gradient flowing through the denominator, so these gradients are blocked by default in the experiments. Figure 1 visualizes the searched parameterized functions for different detectors on the COCO benchmark [5]; the dots on each line mark the control points of each parameterized function. The searched loss functions clearly differ across detectors, suggesting that the intrinsic differences between detectors lead to distinct loss functions.
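To make the blocking operation concrete, the sketch below shows one way a stop-gradient on a denominator term can be expressed in PyTorch. It is a hedged illustration only: the function contrastive_style_loss and the sigmoid stand-in for f(s) are assumptions for demonstration, not the paper's searched parameterized loss.

# Illustrative only: an assumed ranking-style loss in which f(s) appears in a
# denominator. The key point is the .detach() call, which blocks gradients
# from flowing back through the denominator.
import torch

def contrastive_style_loss(scores_pos, scores_neg, f, block_denominator_grad=True):
    num = f(scores_pos).sum()
    denom = f(scores_pos).sum() + f(scores_neg).sum()
    if block_denominator_grad:
        # Stop-gradient: the denominator still affects the loss value,
        # but contributes no gradient to the parameters of f.
        denom = denom.detach()
    return -torch.log(num / (denom + 1e-12))

# Example usage with a simple parameterized stand-in for f(s):
# w = torch.nn.Parameter(torch.ones(1))
# f = lambda s: torch.sigmoid(w * s)
# loss = contrastive_style_loss(torch.rand(5), torch.rand(20), f)
# loss.backward()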





A Label Noise: Effect of Identical Patches

Neural Information Processing Systems

Here, we show that false negatives that are identical to the positive (for example, patches of the sky) do not change the sign of the gradient associated with the positive. Let q be the query, u be the positive, and V be the set of negatives. It is easy to see that the contribution of negatives identical to the positive does not reverse the sign of the positive gradient; a sketch of the argument is given below.

The proposed method outperforms many supervised methods for video object segmentation, despite relying on a simple label propagation algorithm, not being trained for object segmentation, and not training on the DAVIS dataset. We also show comparisons to pretrained feature baselines with larger networks.
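Returning to the label-noise argument above: the equation that originally followed it is not reproduced in this excerpt, so the following is a hedged reconstruction assuming the standard InfoNCE objective (temperature omitted), not the paper's exact notation.

L(q) = -\log \frac{\exp(q^{\top}u)}{\exp(q^{\top}u) + \sum_{v \in V}\exp(q^{\top}v)},
\qquad
p_x = \frac{\exp(q^{\top}x)}{\exp(q^{\top}u) + \sum_{v \in V}\exp(q^{\top}v)},

\nabla_q L = -(1 - p_u)\,u + \sum_{v \in V} p_v\, v.

If a subset V' \subseteq V of negatives is identical to u, their terms merge with the positive term, giving a coefficient of -\bigl(1 - (|V'|+1)\,p_u\bigr) on u. Because (|V'|+1)\exp(q^{\top}u) never exceeds the full denominator, (|V'|+1)\,p_u \le 1, so the coefficient of u remains non-positive: identical false negatives shrink the positive gradient but cannot reverse its sign.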