TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training
Vision-Language Pre-training (VLP) has been shown to be an effective way to improve model performance on a range of vision-and-language downstream tasks. Substantial studies have shown that neural networks can learn general rules about language and visual concepts from a large-scale, weakly labeled image-text dataset. However, most public cross-modal datasets containing more than 100M image-text pairs are in English; there is a lack of large-scale, high-quality Chinese VLP datasets. In this work, we propose a new framework for automatic dataset acquisition and cleaning, with which we construct a new large-scale, high-quality cross-modal dataset named TaiSu, containing 166 million images and 219 million Chinese captions. Compared with the recently released Wukong dataset, our dataset is built under much stricter restrictions on the semantic correlation of image-text pairs. We also propose to combine texts collected from the web with texts generated by a pre-trained image-captioning model.
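The cleaning step described above, keeping only pairs whose image and caption are semantically correlated, can be sketched as a similarity-threshold filter. This is a minimal illustration, not the paper's actual pipeline: the toy embeddings stand in for real image/text encoder outputs, and the threshold value and helper names are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, threshold=0.3):
    # Keep only image-text pairs whose embedding similarity
    # clears the threshold (hypothetical cleaning step).
    return [p for p in pairs
            if cosine_similarity(p["img_emb"], p["txt_emb"]) >= threshold]

# Toy 2-D embeddings standing in for real encoder outputs.
pairs = [
    {"caption": "a cat",  "img_emb": np.array([1.0, 0.0]),
                          "txt_emb": np.array([0.9, 0.1])},   # well aligned
    {"caption": "noise",  "img_emb": np.array([1.0, 0.0]),
                          "txt_emb": np.array([0.0, 1.0])},   # unrelated
]
kept = filter_pairs(pairs)
print([p["caption"] for p in kept])  # → ['a cat']
```

In a real pipeline the embeddings would come from a pre-trained bi-encoder, and the threshold would be tuned on a held-out validation set.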
Appendix A The necessity of constructing a large-scale Chinese cross-modal dataset
CLIP's models are trained on 400M English image-text pairs and have shown strong generalization; however, they cannot directly process Chinese captions. A straightforward workaround is to translate Chinese text into English with a machine translator before feeding it to CLIP, but even though CLIP's models are trained with much more data, they perform significantly worse in this setting. We believe this is due to the limited capacity of the translator. Therefore, simply attaching a machine translator to a model pretrained on a large-scale English corpus does not yield the results we expect. Hence, it is necessary to construct a large-scale Chinese cross-modal dataset. We show the hyperparameters of our pretrained models in Table 7. The text prompts are from MSCOCO.