Top in Chinese Data Processing: English Code Models
Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, Hongwei Chen
Recent advances in natural language processing (NLP) have produced increasingly sophisticated models capable of understanding and generating human language with considerable proficiency. Scaling up the size of language models has been shown to confer a range of benefits, such as improved performance and sample efficiency (Kaplan et al., 2020), and fine-tuning large models for diverse scenarios has become consensus practice in the community. Traditionally, language models and code-based models (Rozière et al., 2023; Feng et al., 2020) have been treated as distinct categories according to their domains of expertise, with the former excelling at general linguistic tasks and the latter at programming-related scenarios. However, our experiments on Chinese text data generation tasks yielded an interesting observation: one would intuitively expect such tasks to be dominated by Chinese-domain language models, yet code-based models trained on English datasets have in fact exhibited superior performance. This unexpected finding challenges the traditional view that pre-trained models are domain-specific and calls for a deeper examination of their capabilities beyond their primary training language or format.
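The kind of head-to-head comparison the abstract describes could be run, in its simplest form, by prompting an English code model and a Chinese general language model with the same Chinese data-generation instruction. The sketch below is illustrative only: the checkpoints (CodeLlama as the code model from Rozière et al., 2023, and a Chinese base model as the counterpart), the prompt, and the greedy decoding settings are assumptions for demonstration, not the paper's actual experimental setup.

```python
# Minimal sketch: compare an English code model and a Chinese language model
# on the same Chinese structured-data-generation prompt.
# Model names and the prompt are illustrative stand-ins, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = [
    "codellama/CodeLlama-7b-hf",       # English code model (Rozière et al., 2023)
    "baichuan-inc/Baichuan2-7B-Base",  # Chinese general language model (assumed baseline)
]

# A Chinese text-data-generation prompt: "Generate one user record in JSON
# format with the fields: name, age, city."
prompt = "请根据以下字段生成一条JSON格式的用户记录：姓名、年龄、城市。"

for name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding so both models are compared under identical settings.
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"=== {name} ===\n{completion}\n")
```

In the paper's framing, outputs like these would then be scored on the target task; the surprising result is that the code-trained model's completions can come out ahead despite its English-centric pre-training.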
arXiv.org Artificial Intelligence
Jan-25-2024
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Information Technology > Software (0.40)