Zhang, Jialing
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data
Gu, Shuhao; Zhang, Jialing; Zhou, Siyuan; Yu, Kevin; Xing, Zhaohu; Wang, Liangdong; Cao, Zhou; Jia, Jintao; Zhang, Zhuoyi; Wang, Yixuan; Hu, Zhenchong; Zhang, Bo-Wen; Li, Jijie; Liang, Dong; Zhao, Yingli; Wang, Songjing; Ao, Yulong; Ju, Yiming; Ma, Huanhuan; Li, Xiaotong; Diao, Haiwen; Cui, Yufeng; Wang, Xinlong; Liu, Yaoqi; Feng, Fangxiang; Liu, Guang
Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data is the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, the limited scale and quality of open-source instruction data hinder the performance of VLMs trained on them, leaving a significant gap relative to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available open-source multimodal instruction datasets and applied unified preprocessing, yielding a dataset of over 40 million samples while ensuring diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and the continuous acquisition of high-quality data, we propose a synthetic instruction generation method based on a tagging system and open-source VLMs. By establishing correspondences between image types and suitable instruction types, the tagging system provides essential guidance during data synthesis. Leveraging this high-quality data, we trained a 2-billion-parameter Vision-Language Model, Aquila-VL-2B, which achieves state-of-the-art (SOTA) performance among models of similar scale. The data is available at: https://huggingface.co/datasets/BAAI/Infinity-MM.
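The abstract above describes a tagging-guided synthesis pipeline only at a high level; the Python sketch below illustrates the general idea under stated assumptions. The tag-to-instruction mapping, `tag_image`, and `vlm_generate` are hypothetical placeholders, not the paper's actual components.

# Minimal sketch of tagging-guided instruction synthesis (assumptions:
# the tag vocabulary, `tag_image`, and `vlm_generate` are illustrative stand-ins).
import json
import random

# Hypothetical mapping from image type (tag) to instruction types suited to it.
TAG_TO_INSTRUCTION_TYPES = {
    "chart": ["data extraction", "trend analysis"],
    "natural_photo": ["detailed caption", "object counting"],
    "document": ["OCR", "summarization"],
}

def tag_image(image_path: str) -> str:
    """Placeholder tagger: a real system would classify the image into a type."""
    return random.choice(list(TAG_TO_INSTRUCTION_TYPES))

def vlm_generate(image_path: str, prompt: str) -> str:
    """Placeholder for prompting an open-source VLM to synthesize a response."""
    return f"<generated answer for: {prompt}>"

def synthesize(image_paths):
    samples = []
    for path in image_paths:
        tag = tag_image(path)
        # The tag constrains which instruction types are requested for this image,
        # which is the guidance role the abstract attributes to the tagging system.
        for inst_type in TAG_TO_INSTRUCTION_TYPES[tag]:
            prompt = f"Write a {inst_type} instruction and its answer for this image."
            samples.append({
                "image": path,
                "image_type": tag,
                "instruction_type": inst_type,
                "response": vlm_generate(path, prompt),
            })
    return samples

if __name__ == "__main__":
    print(json.dumps(synthesize(["img_0001.jpg"]), indent=2))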
Back to the Source: Diffusion-Driven Test-Time Adaptation
Gao, Jin; Zhang, Jialing; Liu, Xihui; Darrell, Trevor; Shelhamer, Evan; Wang, Dequan
Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Existing methods update the source model by (re-)training on each target domain. While effective, re-training is sensitive to the amount and order of the data and the hyperparameters for optimization. We instead update the target data, by projecting all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation method, DDA, shares its models for classification and generation across all domains. Both models are trained on the source domain, then fixed during testing. We augment diffusion with image guidance and self-ensembling to automatically decide how much to adapt. Input adaptation by DDA is more robust than prior model adaptation approaches across a variety of corruptions, architectures, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data in small batches, dependent data in non-uniform order, or mixed data with multiple corruptions.
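As a rough illustration of the input-adaptation flow described above, the following Python sketch projects each test input with a frozen source-trained diffusion model and self-ensembles the classifier's predictions on the original and projected images. `diffusion_project`, `classifier`, and the simple averaging rule are placeholders, not the paper's exact image-guidance or ensembling procedure.

# Minimal sketch of DDA-style input adaptation at test time (assumptions:
# `diffusion_project` and `classifier` are placeholder stand-ins for the
# frozen source-trained generative and discriminative models).
import torch

def diffusion_project(x_target: torch.Tensor) -> torch.Tensor:
    """Placeholder: a source-trained diffusion model would noise the test input
    and denoise it back toward the source domain, with image guidance keeping
    the result close to the input's content."""
    return x_target  # identity stand-in for the generative projection

def classifier(x: torch.Tensor) -> torch.Tensor:
    """Placeholder source classifier returning logits over 1000 classes."""
    return torch.randn(x.shape[0], 1000)

def dda_predict(x_target: torch.Tensor) -> torch.Tensor:
    # 1) Project the shifted test input toward the source domain (input adaptation);
    #    both the diffusion model and the classifier stay fixed during testing.
    x_projected = diffusion_project(x_target)

    # 2) Classify both the original and the projected inputs.
    p_orig = torch.softmax(classifier(x_target), dim=-1)
    p_proj = torch.softmax(classifier(x_projected), dim=-1)

    # 3) Self-ensemble the two predictions so the method decides how much to
    #    rely on the adapted input (simple averaging shown here).
    p_final = 0.5 * (p_orig + p_proj)
    return p_final.argmax(dim=-1)

if __name__ == "__main__":
    batch = torch.randn(4, 3, 224, 224)  # stand-in for corrupted test images
    print(dda_predict(batch))

Because the updates are per input rather than per model, this flow has no optimization state to misbehave under small, non-uniform, or mixed-corruption batches, which is the robustness argument made in the abstract.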