Appendix A  The necessity of constructing a large-scale Chinese cross-modal dataset
Neural Information Processing Systems
CLIP models are trained on 400M English image-text pairs and have shown strong generalization; however, they cannot directly process Chinese captions. One workaround is to first translate Chinese captions into English with a machine translator and then feed them to CLIP. Even though CLIP models are trained on much more data, this pipeline shows significantly poorer performance. We believe that this is due to the limited capacity of the translator. Therefore, simply attaching a machine translator to a model pretrained on a large-scale English corpus does not yield the results we expect. Hence, it is necessary to construct a large-scale Chinese cross-modal dataset.

We show the hyperparameters of our pretrained models in Table 7. The text prompts are from MSCOCO.
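The translate-then-encode pipeline discussed above can be sketched as follows. This is a minimal illustration, not the actual system: `translate_zh_to_en` and `encode_text` are hypothetical stand-ins for a real machine translator and CLIP's text encoder, and the toy embedding bears no relation to CLIP's learned representation.

```python
def translate_zh_to_en(caption_zh: str) -> str:
    # Hypothetical translator stub; in the real pipeline, any translation
    # error introduced here propagates into the text embedding.
    lookup = {"一只猫坐在垫子上": "a cat sitting on a mat"}
    return lookup.get(caption_zh, caption_zh)

def encode_text(caption_en: str) -> list[float]:
    # Stand-in for CLIP's text encoder: a toy L2-normalized
    # bag-of-characters embedding of fixed dimension 8.
    vec = [0.0] * 8
    for ch in caption_en:
        vec[ord(ch) % 8] += 1.0
    norm = sum(v * v for v in vec) ** 0.5
    return [v / norm for v in vec]

def encode_chinese_caption(caption_zh: str) -> list[float]:
    # An English-only model can only see Chinese text through the
    # translator, so embedding quality is bounded by translator quality.
    return encode_text(translate_zh_to_en(caption_zh))
```

The point of the sketch is the bottleneck: the Chinese caption never reaches the encoder directly, which is why translator capacity limits the whole pipeline.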