Appendix A  The necessity of constructing a large-scale Chinese cross-modal dataset
Neural Information Processing Systems
CLIP models are trained on 400M English image-text pairs and have shown strong generalization; however, they cannot directly process Chinese captions. One workaround is to first translate Chinese captions into English with a machine translator and then feed them to CLIP. Even though CLIP models are trained on much more data, this pipeline shows significantly poorer performance. We believe that this is due to the limited capacity of the translator. Therefore, simply attaching a machine translator to a model pretrained on a large-scale English corpus does not yield the results we expect. Hence, it is necessary to construct a large-scale Chinese cross-modal dataset.

We show the hyperparameters of our pretrained models in Table 7. The text prompts are from MSCOCO.
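The translate-then-encode pipeline discussed above can be sketched as follows. This is a minimal illustration, not the actual system: `translate_zh_to_en` and `encode_text` are hypothetical stand-ins for a real machine translator and CLIP's text encoder, and the toy embedding bears no relation to CLIP's learned representation.

```python
def translate_zh_to_en(caption_zh: str) -> str:
    # Hypothetical translator stub; in the real pipeline, any translation
    # error introduced here propagates into the text embedding.
    lookup = {"一只猫坐在垫子上": "a cat sitting on a mat"}
    return lookup.get(caption_zh, caption_zh)

def encode_text(caption_en: str) -> list[float]:
    # Stand-in for CLIP's text encoder: a toy L2-normalized
    # bag-of-characters embedding of fixed dimension 8.
    vec = [0.0] * 8
    for ch in caption_en:
        vec[ord(ch) % 8] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]

def encode_chinese_caption(caption_zh: str) -> list[float]:
    # An English-only model can only see Chinese text through the
    # translator, so embedding quality is bounded by translator quality.
    return encode_text(translate_zh_to_en(caption_zh))
```

The point of the sketch is the bottleneck: the Chinese caption never reaches the encoder directly, which is why translator capacity limits the whole pipeline.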