Appendix A The necessity of constructing a large-scale Chinese cross-modal dataset

Neural Information Processing Systems 

CLIP's models are trained with 400M English image-text pairs and have shown great generalization ability. However, they cannot directly process Chinese captions. Even though CLIP's models are trained with much more data, they show significantly poorer performance when Chinese captions are first machine-translated into English. We believe that this is due to the limited capacity of the translator. Therefore, simply attaching a machine translator to a model pretrained on a large-scale English corpus does not yield the results we expect. Hence, it is necessary to construct a large-scale Chinese cross-modal dataset. We show the hyperparameters of our pretrained models in Table 7. The text prompts are from MSCOCO.
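The translate-then-match pipeline criticized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate_zh_to_en`, `encode_text`, and `encode_image` are hypothetical stand-ins for a machine translator and CLIP's two encoder towers, while the scoring step mirrors how CLIP compares an image and a caption via cosine similarity of L2-normalized embeddings.

```python
import numpy as np

def translate_zh_to_en(caption_zh: str) -> str:
    # Placeholder: a real pipeline would call a zh->en machine
    # translation model here; translator errors propagate downstream.
    return "a dog running on the grass"

def encode_text(caption_en: str, dim: int = 8) -> np.ndarray:
    # Deterministic toy embedding so the sketch is self-contained;
    # a real pipeline would use CLIP's pretrained text encoder.
    rng = np.random.default_rng(abs(hash(caption_en)) % (2**32))
    return rng.standard_normal(dim)

def encode_image(image_id: str, dim: int = 8) -> np.ndarray:
    # Toy stand-in for CLIP's pretrained image encoder.
    rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
    return rng.standard_normal(dim)

def clip_similarity(image_id: str, caption_zh: str) -> float:
    """Translate the Chinese caption, embed both sides, and score
    them with cosine similarity, as CLIP does at inference time."""
    caption_en = translate_zh_to_en(caption_zh)
    t = encode_text(caption_en)
    v = encode_image(image_id)
    t = t / np.linalg.norm(t)  # CLIP L2-normalizes both towers
    v = v / np.linalg.norm(v)
    return float(t @ v)        # cosine similarity in [-1, 1]

score = clip_similarity("img_001", "一只狗在草地上奔跑")
```

The key point of the appendix is that the quality of `translate_zh_to_en` caps the quality of the whole score, which is why pretraining directly on Chinese image-text pairs is preferable.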
