Top in Chinese Data Processing: English Code Models
Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, Hongwei Chen
Recent advances in natural language processing (NLP) have produced increasingly sophisticated models capable of understanding and generating human language with considerable proficiency. Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020), and fine-tuning large models for diverse scenarios has become consensus practice in the community. Traditionally, language models and code-based models (Rozière et al., 2023; Feng et al., 2020) have been treated as distinct categories according to their domains of expertise, with the former excelling at general linguistic tasks and the latter at programming-related scenarios. However, our experiments on Chinese text data generation tasks yielded an interesting observation: one would intuitively expect such tasks to be dominated by Chinese-domain language models, yet code-based models trained on English datasets have in fact exhibited superior performance. This unexpected finding challenges the traditional view that pre-trained models are domain-specific and calls for a deeper examination of their capabilities beyond their primary training language or format.
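The kind of head-to-head comparison the abstract describes could be run, in its simplest form, by prompting an English code model and a Chinese general language model with the same Chinese data-generation instruction. The sketch below is illustrative only: the checkpoints (CodeLlama as the code model from Rozière et al., 2023, and a Chinese base model as the counterpart), the prompt, and the greedy decoding settings are assumptions for demonstration, not the paper's actual experimental setup.

```python
# Minimal sketch: compare an English code model and a Chinese language model
# on the same Chinese structured-data-generation prompt.
# Model names and the prompt are illustrative stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = [
    "codellama/CodeLlama-7b-hf",       # English code model (Rozière et al., 2023)
    "baichuan-inc/Baichuan2-7B-Base",  # Chinese general language model (assumed baseline)
]

# A Chinese text-data-generation prompt: "Generate one user record in JSON
# format with the fields: name, age, city."
prompt = "请根据以下字段生成一条JSON格式的用户记录：姓名、年龄、城市。"

for name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding so both models are compared under identical settings.
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"=== {name} ===\n{completion}\n")
```

In the paper's framing, outputs like these would then be scored on the target task; the surprising result is that the code-trained model's completions can come out ahead despite its English-centric pre-training.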
arXiv.org Artificial Intelligence
Jan-25-2024
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Information Technology > Software (0.40)