TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training
Vision-Language Pre-training (VLP) has been shown to be an effective way to improve model performance on a range of vision-and-language downstream tasks. Substantial studies have shown that neural networks can learn general rules about language and visual concepts from a large-scale, weakly labeled image-text dataset. However, most public cross-modal datasets containing more than 100M image-text pairs are in English; there is a lack of large-scale, high-quality Chinese VLP datasets. In this work, we propose a new framework for automatic dataset acquisition and cleaning, with which we construct a new large-scale, high-quality cross-modal dataset named TaiSu, containing 166 million images and 219 million Chinese captions. Compared with the recently released Wukong dataset, our dataset is built under much stricter restrictions on the semantic correlation of image-text pairs. We also propose to combine texts collected from the web with texts generated by a pre-trained image-captioning model.
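The cleaning step described above, keeping only pairs whose image and caption are semantically correlated, can be sketched as a similarity-threshold filter. This is a minimal illustration, not the paper's actual pipeline: the toy embeddings stand in for real image/text encoder outputs, and the threshold value and helper names are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_pairs(pairs, threshold=0.3):
    # Keep only image-text pairs whose embedding similarity
    # clears the threshold (hypothetical cleaning step).
    return [p for p in pairs
            if cosine_similarity(p["img_emb"], p["txt_emb"]) >= threshold]

# Toy 2-D embeddings standing in for real encoder outputs.
pairs = [
    {"caption": "a cat",  "img_emb": np.array([1.0, 0.0]),
                          "txt_emb": np.array([0.9, 0.1])},   # well aligned
    {"caption": "noise",  "img_emb": np.array([1.0, 0.0]),
                          "txt_emb": np.array([0.0, 1.0])},   # unrelated
]
kept = filter_pairs(pairs)
print([p["caption"] for p in kept])  # → ['a cat']
```

In a real pipeline the embeddings would come from a pre-trained bi-encoder, and the threshold would be tuned on a held-out validation set.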
Appendix A The necessity of constructing a large-scale Chinese cross-modal dataset
CLIP's models are trained on 400M English image-text pairs and have shown strong generalization; however, they cannot directly process Chinese captions. A straightforward workaround is to translate Chinese text into English with a machine translator before feeding it to CLIP, but even though CLIP's models are trained with much more data, they perform significantly worse in this setting. We believe this is due to the limited capacity of the translator. Therefore, simply attaching a machine translator to a model pretrained on a large-scale English corpus does not yield the results we expect. Hence, it is necessary to construct a large-scale Chinese cross-modal dataset. We show the hyperparameters of our pretrained models in Table 7. The text prompts are from MSCOCO.