Lei, Zhikai
Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences
Chen, Kedi, Lei, Zhikai, Zhang, Fan, Zhang, Yinqi, Chen, Qin, Zhou, Jie, He, Liang, Guo, Qipeng, Chen, Kai, Zhang, Wei
Large language models have made remarkable progress in reasoning. Existing work focuses mainly on deductive reasoning tasks (e.g., code and math), while inductive reasoning, a mode that better aligns with human learning, remains understudied. We attribute this to the difficulty of obtaining high-quality process supervision data for inductive reasoning. To this end, we employ number sequences as a source of inductive reasoning data: each sequence is packaged as an algorithmic problem of finding its general term through a code solution. In this way, we can verify whether a code solution holds for every term of the sequence and inject case-based supervision signals via code unit tests. We build a synthetic data pipeline over such sequences and form the training dataset CodeSeq. Experiments show that models tuned with CodeSeq improve on both code and comprehensive reasoning benchmarks.
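As a concrete illustration of the case-based supervision signal described above, a candidate general-term program can be unit-tested against every observed term of a sequence. The sketch below is a minimal Python rendering of that idea, with an invented example sequence rather than data from CodeSeq:

```python
# Minimal sketch of case-based verification for an inductive-reasoning task:
# a candidate "general term" program is accepted only if it reproduces every
# observed term of the sequence. The sequence and function names are
# illustrative, not taken from the CodeSeq paper.

def candidate_general_term(n: int) -> int:
    """Hypothesized closed form for the sequence 1, 3, 7, 15, ... (a_n = 2^n - 1)."""
    return 2 ** n - 1

def verify(candidate, observed_terms) -> bool:
    """Unit-test the candidate against all observed terms (1-indexed)."""
    return all(candidate(i) == t for i, t in enumerate(observed_terms, start=1))

observed = [1, 3, 7, 15, 31]
print(verify(candidate_general_term, observed))  # True -> usable as a supervision signal
```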
GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
Lei, Zhikai, Liang, Tianyi, Hu, Hanglei, Zhang, Jin, Zhou, Yunhua, Shao, Yunfan, Li, Linyang, Li, Chenchui, Wang, Changbo, Yan, Hang, Guo, Qipeng
Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks, under the premise that higher scores implicitly reflect stronger human-like performance. However, there is growing concern that LLMs may "game" these benchmarks due to data leakage, achieving high scores while struggling with tasks that are simple for humans. To substantively address the problem, we create GAOKAO-Eval, a comprehensive benchmark based on China's National College Entrance Examination (Gaokao), and conduct "closed-book" evaluations of representative models released prior to Gaokao. Contrary to prevailing consensus, even after addressing data leakage and ensuring comprehensiveness, GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities. To better understand this mismatch, we introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: 1) anomalously consistent performance across questions of varying difficulty, and 2) high variance in performance on questions of similar difficulty. In addition, we identify inconsistent grading of LLM-generated answers among teachers and recurring mistake patterns, and we find that these phenomena are well grounded in the motivations behind OpenAI o1, whose reasoning-as-difficulties approach can mitigate the mismatch. These results show that GAOKAO-Eval can reveal limitations in LLM capabilities not captured by current benchmarks and highlight the need for more LLM-aligned difficulty analysis.
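The Rasch model invoked above is a one-parameter logistic item-response model: the probability that a test-taker of ability θ answers an item of difficulty b correctly is P = exp(θ − b) / (1 + exp(θ − b)). A minimal sketch of this standard formula (illustrative, not the paper's analysis code):

```python
import math

def rasch_p_correct(theta: float, b: float) -> float:
    """Rasch (1PL) model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# For a human-like test-taker, P(correct) falls smoothly as item difficulty b
# rises past ability theta; discrepancy 1) above corresponds to a response
# curve that stays flat across difficulties instead.
for b in [-2, -1, 0, 1, 2]:
    print(b, round(rasch_p_correct(theta=0.0, b=b), 3))
```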
LongSafetyBench: Long-Context LLMs Struggle with Safety Issues
Huang, Mianqiu, Liu, Xiaoran, Zhou, Shaojun, Zhang, Mozhi, Tan, Chenkun, Wang, Pengyu, Guo, Qipeng, Xu, Zhe, Li, Linyang, Lei, Zhikai, Li, Linlin, Liu, Qun, Zhou, Yaqian, Qiu, Xipeng, Huang, Xuanjing
WARNING: This paper contains unsafe content. With the development of large language models (LLMs), the sequence length of these models continues to increase, drawing significant attention to long-context language models. However, the evaluation of these models has been primarily limited to their capabilities, with a lack of research focusing on their safety. Existing work, such as ManyShotJailbreak, has to some extent demonstrated that long-context language models can exhibit safety concerns, but the methods used are limited and lack comprehensiveness. In response, we introduce LongSafetyBench, the first benchmark designed to objectively and comprehensively evaluate the safety of long-context models. LongSafetyBench consists of 10 task categories, with an average length of 41,889 words. After testing eight long-context language models on LongSafetyBench, we found that existing models generally exhibit insufficient safety capabilities: the proportion of safe responses from most mainstream long-context LLMs is below 50%. Moreover, models' safety performance in long-context scenarios does not always align with that in short-context scenarios. Further investigation revealed that long-context models tend to overlook harmful content within lengthy texts. We also propose a simple yet effective solution that allows open-source models to achieve performance comparable to that of top-tier closed-source models. We believe that LongSafetyBench can serve as a valuable benchmark for evaluating the safety capabilities of long-context language models, and we hope our work encourages the broader community to pay attention to the safety of long-context models and to contribute solutions that improve it.

Recently, thanks to more advanced model architectures (Xiao et al., 2024b;a; Liu et al., 2024a) and expanded position encoding techniques (Su et al., 2023; Liu et al., 2024b), the context length of language models has been extended significantly (Achiam et al., 2023; Reid et al., 2024). In the foreseeable future, as language models continue to evolve and tackle increasingly complex problems, the demand for handling longer contexts is expected to grow accordingly, and we anticipate that long-context language models will become mainstream. Previous research on long-context language models, such as LongBench (Bai et al., 2024), L-Eval (An et al., 2023), and RULER (Hsieh et al., 2024), has typically focused on their capabilities while neglecting their safety; in short-context scenarios, the safety issues of language models have already been extensively studied.
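The headline statistic above (proportion of safe responses, per category and overall) is straightforward to compute from judged model outputs; a minimal sketch with invented category names and judgments (the real benchmark has 10 task categories):

```python
# Sketch of the headline metric: per-category and overall safe-response rate.
# The categories and judgments below are invented for illustration.
from collections import defaultdict

# (category, was_response_safe) pairs, e.g. produced by a safety judge
results = [("harmful_extraction", True), ("harmful_extraction", False),
           ("misinformation", False), ("misinformation", True)]

per_cat = defaultdict(lambda: [0, 0])  # category -> [safe_count, total_count]
for cat, safe in results:
    per_cat[cat][0] += int(safe)
    per_cat[cat][1] += 1

for cat, (safe, total) in per_cat.items():
    print(f"{cat}: {safe / total:.0%} safe")
overall = sum(s for s, _ in per_cat.values()) / sum(t for _, t in per_cat.values())
print(f"overall: {overall:.0%} safe")  # the paper reports <50% for most models
```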
Boosting Conversational Question Answering with Fine-Grained Retrieval-Augmentation and Self-Check
Ye, Linhao, Lei, Zhikai, Yin, Jianghao, Chen, Qin, Zhou, Jie, He, Liang
Retrieval-Augmented Generation (RAG) aims to generate more reliable and accurate responses by augmenting large language models (LLMs) with vast and dynamic external knowledge. Most previous work focuses on using RAG for single-round question answering, while how to adapt RAG to the complex conversational setting, wherein the question is interdependent on the preceding context, is not well studied. In this paper, we propose a conversation-level RAG (ConvRAG) approach, which incorporates fine-grained retrieval augmentation and self-check for conversational question answering (CQA). In particular, our approach consists of three components, namely a conversational question refiner, a fine-grained retriever, and a self-check based response generator, which work collaboratively for question understanding and relevant information acquisition in conversational settings. Extensive experiments demonstrate the effectiveness of our approach.

Conversational Question Answering (CQA) has attracted great attention in both academia and industry in recent years, as it provides more natural human-computer interactions by extending single-turn question answering (QA) to conversational settings [23, 33]. In CQA, users usually ask multiple follow-up questions using anaphora that refers to certain concepts in the previous conversation history, or ellipsis that omits them. As shown in Figure 1, the 'battle' in the current question refers to 'Hunayn' in the first turn, making it more challenging than single-turn QA. One key challenge in CQA is how to explicitly represent the questions based on the interdependent context. Previous work focuses on question rewriting methods for better question understanding. Elgohary et al. [11] first released a dataset with human rewrites of questions and analysed the writing quality.
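Read as a pipeline, the three ConvRAG components run in sequence: refine the question against the conversation history, retrieve fine-grained evidence, then generate with a self-check. The sketch below is schematic; every helper body is a trivial stand-in, not the authors' implementation:

```python
# Schematic of the three-stage ConvRAG flow described above. All helper
# bodies are trivial stand-ins, not the authors' implementation.

def refine_question(question: str, history: list[str]) -> str:
    # Stand-in: a real refiner would use an LLM to resolve anaphora and
    # ellipsis against the conversation history.
    return f"{history[-1]} {question}".strip() if history else question

def retrieve_passages(query: str) -> list[str]:
    # Stand-in for a fine-grained (e.g., sentence-level) retriever.
    return [f"passage relevant to: {query}"]

def generate(query: str, passages: list[str]) -> str:
    return f"answer to '{query}' grounded in {len(passages)} passage(s)"

def passes_self_check(answer: str, passages: list[str]) -> bool:
    # Stand-in: a real self-check asks whether the retrieved evidence
    # actually supports the draft answer.
    return bool(answer and passages)

def conv_rag_answer(question: str, history: list[str]) -> str:
    refined = refine_question(question, history)   # 1) question refiner
    passages = retrieve_passages(refined)          # 2) fine-grained retriever
    answer = generate(refined, passages)           # 3) response generator
    if not passes_self_check(answer, passages):    #    with self-check
        passages = retrieve_passages(question)     # fall back to the raw question
        answer = generate(question, passages)
    return answer

history = ["Tell me about the Battle of Hunayn."]
print(conv_rag_answer("Who won the battle?", history))
```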
InternLM2 Technical Report
Cai, Zheng, Cao, Maosong, Chen, Haojiong, Chen, Kai, Chen, Keyu, Chen, Xin, Chen, Xun, Chen, Zehui, Chen, Zhi, Chu, Pei, Dong, Xiaoyi, Duan, Haodong, Fan, Qi, Fei, Zhaoye, Gao, Yang, Ge, Jiaye, Gu, Chenya, Gu, Yuzhe, Gui, Tao, Guo, Aijia, Guo, Qipeng, He, Conghui, Hu, Yingfan, Huang, Ting, Jiang, Tao, Jiao, Penglong, Jin, Zhenjiang, Lei, Zhikai, Li, Jiaxing, Li, Jingwen, Li, Linyang, Li, Shuaibin, Li, Wei, Li, Yining, Liu, Hongwei, Liu, Jiangning, Hong, Jiawei, Liu, Kaiwen, Liu, Kuikun, Liu, Xiaoran, Lv, Chengqi, Lv, Haijun, Lv, Kai, Ma, Li, Ma, Runyuan, Ma, Zerun, Ning, Wenchang, Ouyang, Linke, Qiu, Jiantao, Qu, Yuan, Shang, Fukai, Shao, Yunfan, Song, Demin, Song, Zifan, Sui, Zhihao, Sun, Peng, Sun, Yu, Tang, Huanze, Wang, Bin, Wang, Guoteng, Wang, Jiaqi, Wang, Jiayu, Wang, Rui, Wang, Yudong, Wang, Ziyi, Wei, Xingjian, Weng, Qizhen, Wu, Fan, Xiong, Yingtong, Xu, Chao, Xu, Ruiliang, Yan, Hang, Yan, Yirong, Yang, Xiaogui, Ye, Haochen, Ying, Huaiyuan, Yu, Jia, Yu, Jing, Zang, Yuhang, Zhang, Chuyu, Zhang, Li, Zhang, Pan, Zhang, Peng, Zhang, Ruijie, Zhang, Shuo, Zhang, Songyang, Zhang, Wenjian, Zhang, Wenwei, Zhang, Xingcheng, Zhang, Xinyue, Zhao, Hui, Zhao, Qian, Zhao, Xiaomeng, Zhou, Fengzhe, Zhou, Zaida, Zhuo, Jingming, Zou, Yicheng, Qiu, Xipeng, Qiao, Yu, Lin, Dahua
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
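The "Needle-in-a-Haystack" test mentioned above hides a short fact at a controlled depth inside long filler text and asks the model to retrieve it. A minimal sketch of how such a prompt can be constructed (filler text and needle invented for illustration):

```python
# Minimal construction of a needle-in-a-haystack prompt: a short "needle"
# fact is buried at a chosen depth in long filler text, and the model is
# asked to retrieve it. Filler and needle are invented for illustration.

def build_haystack_prompt(needle: str, filler: str, total_chars: int, depth: float) -> str:
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(total_chars * depth)  # depth=0.0 -> start, 1.0 -> end
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

needle = "The magic number is 42."
prompt = build_haystack_prompt(needle, "The sky was clear that day. ", 200_000, depth=0.5)
question = "What is the magic number mentioned in the text?"
# prompt + question would then be fed to the long-context model under test.
```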
WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
Qiu, Jiantao, Lv, Haijun, Jin, Zhenjiang, Wang, Rui, Ning, Wenchang, Yu, Jia, Zhang, ChaoBin, Li, Zhenxiang, Chu, Pei, Qu, Yuan, Shi, Jin, Lu, Lindong, Peng, Runyu, Zeng, Zhiyuan, Tang, Huanze, Lei, Zhikai, Hong, Jiawei, Chen, Keyu, Fei, Zhaoye, Xu, Ruiliang, Li, Wei, Tu, Zhongying, Lin, Dahua, Qiao, Yu, Yan, Hang, He, Conghui
This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T tokens of safe data and selected 1.0T tokens of high-quality data as part of WanJuan-CC. We have open-sourced 100B tokens from this dataset. The paper also provides statistical information related to data quality, enabling users to select appropriate data according to their needs. To evaluate the quality and utility of the dataset, we trained 1B-parameter and 3B-parameter models using WanJuan-CC and another dataset, RefinedWeb. Results show that the models trained on WanJuan-CC perform better on validation datasets and downstream tasks.
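Of the pipeline stages listed above, fuzzy deduplication is the most algorithmic; a common approach is MinHash signatures, where documents whose signatures agree on many hash slots are likely near-duplicates. The sketch below shows the standard technique, not WanJuan-CC's exact implementation:

```python
# Minimal MinHash sketch for a fuzzy-deduplication stage: documents whose
# signatures agree on many hash slots are likely near-duplicates. This is
# the standard technique, not WanJuan-CC's exact implementation.
import hashlib

def shingles(text: str, n: int = 5) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    # One min-hash per seeded hash function over the document's shingles.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(a: str, b: str) -> float:
    sa, sb = minhash_signature(a), minhash_signature(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

# Document pairs above a similarity threshold (e.g. 0.8) would be deduplicated.
```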
AI-Generated Content Enhanced Computer-Aided Diagnosis Model for Thyroid Nodules: A ChatGPT-Style Assistant
Yao, Jincao, Wang, Yunpeng, Lei, Zhikai, Wang, Kai, Li, Xiaoxian, Zhou, Jianhua, Hao, Xiang, Shen, Jiafei, Wang, Zhenping, Ru, Rongrong, Chen, Yaqing, Zhou, Yahan, Chen, Chen, Zhang, Yanming, Liang, Ping, Xu, Dong
An artificial intelligence-generated content-enhanced computer-aided diagnosis (AIGC-CAD) model, designated as ThyGPT, has been developed. This model, inspired by the architecture of ChatGPT, could assist radiologists in assessing the risk of thyroid nodules through semantic-level human-machine interaction. A dataset comprising 19,165 thyroid nodule ultrasound cases from Zhejiang Cancer Hospital was assembled to facilitate the training and validation of the model. After training, ThyGPT could automatically evaluate thyroid nodules and communicate effectively with physicians through human-computer interaction. The performance of ThyGPT was rigorously quantified using established metrics such as the receiver operating characteristic (ROC) curve, area under the curve (AUC), sensitivity, and specificity. The empirical findings revealed that radiologists supplemented with ThyGPT markedly surpassed both the diagnostic performance of peers using traditional methods and that of the model operating alone. These findings suggest that AIGC-CAD systems, exemplified by ThyGPT, hold the promise to fundamentally transform the diagnostic workflows of radiologists in forthcoming years.
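The evaluation metrics named above are standard; the sketch below shows how sensitivity, specificity, and AUC are computed from scored predictions (labels and scores are invented, not ThyGPT data):

```python
# Standard computation of the metrics named above (sensitivity, specificity,
# AUC). Labels and scores are invented; this is not the ThyGPT evaluation code.

def sensitivity_specificity(labels, preds):
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    return tp / (tp + fn), tn / (tn + fp)

def auc(labels, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as 0.5 (the Mann-Whitney U formulation).
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.5]
print(auc(labels, scores))                                   # ~0.833
print(sensitivity_specificity(labels, [s > 0.5 for s in scores]))
```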
EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education
Dan, Yuhao, Lei, Zhikai, Gu, Yiyang, Li, Yong, Yin, Jianghao, Lin, Jiaju, Ye, Linhao, Tie, Zhiyan, Zhou, Yougen, Wang, Yilei, Zhou, Aimin, Zhou, Ze, Chen, Qin, Zhou, Jie, He, Liang, Qiu, Xipeng
EduChat (https://www.educhat.top/) is a large-scale language model (LLM)-based chatbot system for the education domain. Its goal is to support personalized, fair, and compassionate intelligent education, serving teachers, students, and parents. Guided by theories from psychology and education, it strengthens educational functions such as open question answering, essay assessment, Socratic teaching, and emotional support on top of existing base LLMs. In particular, we learn domain-specific knowledge by pre-training on an educational corpus and stimulate various skills with tool use by fine-tuning on designed system prompts and instructions. EduChat is available online as an open-source project, with its code, data, and model parameters released on platforms (e.g., GitHub https://github.com/icalk-nlp/EduChat, Hugging Face https://huggingface.co/ecnu-icalk). We also provide an online demonstration of its capabilities (https://vimeo.com/851004454). This initiative aims to promote research on and applications of LLMs for intelligent education.