Chinese Open Instruction Generalist: A Preliminary Release
Ge Zhang, Yemin Shi, Ruibo Liu, Ruibin Yuan, Yizhi Li, Siwei Dong, Yu Shu, Zhaoqun Li, Zekun Wang, Chenghua Lin, Wenhao Huang, Jie Fu
Pre-trained large language models (LLMs) have shown revolutionary performance in many downstream tasks (Guo et al., 2023; Wei et al., 2021). One crucial ability of LLMs is instruction following: the model completes the task described by an instruction given as input. This ability is acquired through a specialized training stage called instruction tuning. Compared to the unlabeled data used for pre-training, instruction tuning data is typically more goal-oriented, and it should explicitly demonstrate how a response follows its corresponding instruction for a given input. There are many instruction tuning datasets in English. For example, the FLAN collection (Longpre et al., 2023) contains 15M examples covering 1,836 tasks, and OPT-IML (Iyer et al., 2022b) claims to have 18M examples for more than 2,000 tasks (although it is still not publicly available). In contrast, existing data resources for Chinese instruction tuning are either small in scale or of questionable quality. For example, Ziang Leng and Li (2023) directly translate English instruction tuning data into Chinese but do not consider mitigating translation errors or potential cultural gaps, e.g.
arXiv.org Artificial Intelligence
Apr-24-2023
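
To make the instruction/input/response structure described in the abstract concrete, below is a minimal Python sketch of a generic instruction tuning record and how it might be rendered into a supervised training string. The field names and prompt layout are illustrative assumptions, not the schema used by COIG, FLAN, or OPT-IML.

# A minimal sketch of a single instruction tuning example: an instruction,
# an optional input, and a response that demonstrably follows the instruction.
# (Field names and formatting are hypothetical, not the COIG schema.)
from dataclasses import dataclass

@dataclass
class InstructionExample:
    instruction: str   # what the model is asked to do
    input: str         # optional context the instruction operates on
    output: str        # the target response the model should produce

    def to_prompt(self) -> str:
        """Render the example as a single supervised training string."""
        parts = [f"Instruction: {self.instruction}"]
        if self.input:
            parts.append(f"Input: {self.input}")
        parts.append(f"Response: {self.output}")
        return "\n".join(parts)

# Hypothetical example (shown in English for readability).
example = InstructionExample(
    instruction="Summarize the following passage in one sentence.",
    input="Pre-trained language models can follow natural-language instructions after instruction tuning.",
    output="Instruction tuning teaches pre-trained models to follow task descriptions.",
)
print(example.to_prompt())

In this reading, the "goal-oriented" property the abstract mentions corresponds to the output field being an explicit demonstration of the instruction applied to the input, rather than unlabeled raw text.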