Not enough data to create a plot.
Try a different view from the menu above.
Cui, Erfei
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Li, Qingyun, Chen, Zhe, Wang, Weiyun, Wang, Wenhai, Ye, Shenglong, Jin, Zhenjiang, Chen, Guanzhou, He, Yinan, Gao, Zhangwei, Cui, Erfei, Yu, Jiashuo, Tian, Hao, Zhou, Jiasheng, Xu, Chao, Wang, Bin, Wei, Xingjian, Li, Wei, Zhang, Wenjian, Zhang, Bo, Cai, Pinlong, Wen, Licheng, Yan, Xiangchao, Li, Zhenxiang, Chu, Pei, Wang, Yi, Dou, Min, Tian, Changyao, Zhu, Xizhou, Lu, Lewei, Chen, Yushi, He, Junjun, Tu, Zhongying, Lu, Tong, Wang, Yali, Wang, Limin, Lin, Dahua, Qiao, Yu, Shi, Botian, He, Conghui, Dai, Jifeng
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-level image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research.
Teaching MLP More Graph Information: A Three-stage Multitask Knowledge Distillation Framework
Li, Junxian, Shi, Bin, Cui, Erfei, Wei, Hua, Zheng, Qinghua
We study the challenging problem for inference tasks on large-scale graph datasets of Graph Neural Networks: huge time and memory consumption, and try to overcome it by reducing reliance on graph structure. Even though distilling graph knowledge to student MLP is an excellent idea, it faces two major problems of positional information loss and low generalization. To solve the problems, we propose a new three-stage multitask distillation framework. In detail, we use Positional Encoding to capture positional information. Also, we introduce Neural Heat Kernels responsible for graph data processing in GNN and utilize hidden layer outputs matching for better performance of student MLP's hidden layers. To the best of our knowledge, it is the first work to include hidden layer distillation for student MLP on graphs and to combine graph Positional Encoding with MLP. We test its performance and robustness with several settings and draw the conclusion that our work can outperform well with good stability.