Liu, Yantao
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given a math problem, the Pairwise RM evaluates the correctness of two candidate solutions simultaneously. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, the Pairwise RM conducts pairwise comparisons between candidate solutions and iteratively eliminates the incorrect ones. We construct a large-scale dataset of 443K pairwise comparisons derived from NuminaMath, annotated using gemini-1.5-flash, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models, with a 40% to 60% relative improvement on the top 50% most challenging problems.
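A minimal sketch of how such a knockout tournament could be run on top of a pairwise judge. The `pairwise_judge` callable below is a hypothetical stand-in for the trained Pairwise RM, not the authors' released interface.

```python
from typing import Callable, List

def knockout_best_of_n(
    problem: str,
    candidates: List[str],
    pairwise_judge: Callable[[str, str, str], int],
) -> str:
    """Pick one candidate via a knockout tournament of pairwise comparisons.

    `pairwise_judge(problem, sol_a, sol_b)` is assumed to return 0 if sol_a
    looks more likely to be correct and 1 otherwise (a stand-in for the
    Pairwise RM's judgment).
    """
    pool = list(candidates)
    while len(pool) > 1:
        winners = []
        # Compare candidates in pairs; an odd one out gets a bye to the next round.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            winners.append(a if pairwise_judge(problem, a, b) == 0 else b)
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]
```

With N candidates, such a tournament needs N - 1 comparisons in total, each of which sees two solutions side by side rather than scoring them in isolation.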
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. In RLHF, reward models serve as proxies for human values, providing feedback on generated text that helps align language models (policy models) during training (Ouyang et al., 2022; Dong et al., 2024). In Inference Scaling Laws, reward models are used to select the best response from a set of candidates based on predicted rewards (Wu et al., 2024; Snell et al., 2024). Despite their significance, benchmarks for reward models remain under-explored compared to the rapid advances in evaluating aligned language models, i.e., policy models (Hendrycks et al., 2020; bench authors, 2023; Chiang et al., 2024; Hendrycks et al., 2021). To conduct a faithful and systematic evaluation, an ideal benchmark for reward models should adhere to three key principles: 1) Assessing sensitivity to subtle changes: a faithful reward model should distinguish subtle content changes and assign a higher reward to the correct response. For example, in Table 1, Response 1 and Response 2 differ by only one word yet express completely different meanings, requiring the reward model to focus on content quality. 2) Resisting style biases: a reliable reward model should not be swayed by superficial attributes such as length or formatting. For example, in Table 1, Response 3 is factually incorrect but longer than Response 1, which could mislead the reward model into assigning it a higher reward. 3) Correlating with policy models: a good reward model benchmark should correlate highly with the performance of the aligned language model (the policy model), making it a reliable proxy for selecting the best reward model for alignment. Recent efforts (Lambert et al., 2024; Zhu et al., 2023; Jiang et al., 2023) have made progress by constructing benchmarks from existing preference datasets.
Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models
Liu, Yantao, Yao, Zijun, Lv, Xin, Fan, Yuchen, Cao, Shulin, Yu, Jifan, Hou, Lei, Li, Juanzi
Providing knowledge documents for large language models (LLMs) has emerged as a promising solution to update the static knowledge inherent in their parameters. However, knowledge in the document may conflict with the memory of LLMs due to outdated or incorrect knowledge in the LLMs' parameters. This makes it necessary to examine the capability of LLMs to assimilate supplemental external knowledge that conflicts with their memory. While previous studies have explored to what extent LLMs extract conflicting knowledge from the provided text, they neglect the necessity to reason with conflicting knowledge. Furthermore, there is a lack of detailed analysis on strategies to enable LLMs to resolve conflicting knowledge via prompting, decoding strategies, and supervised fine-tuning. To address these limitations, we construct a new dataset, dubbed KNOT, for examining knowledge conflict resolution in the form of question answering. KNOT facilitates in-depth analysis by dividing reasoning with conflicting knowledge into three levels: (1) Direct Extraction, which directly extracts conflicting knowledge to answer questions; (2) Explicit Reasoning, which reasons with conflicting knowledge when the reasoning path is explicitly provided in the question; and (3) Implicit Reasoning, where reasoning with conflicting knowledge requires LLMs to infer the reasoning path independently. We also conduct extensive experiments on KNOT to establish empirical guidelines for LLMs to utilize conflicting knowledge in complex circumstances. The dataset and associated code can be accessed at https://github.com/THU-KEG/KNOT .
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction
Fan, Yuchen, Liu, Yantao, Yao, Zijun, Yu, Jifan, Hou, Lei, Li, Juanzi
Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks that require sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed: these models underperform on seemingly elementary tasks like relation extraction and event extraction, due to two issues in conventional evaluation. (1) The imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and the ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method utilizes LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches the golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics. Using SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at https://github.com/THU-KEG/SQC-Score.
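An illustrative sketch of a scoring flow in the spirit of SQC-Score, assuming a fine-tuned LLM matcher (`llm_match`) and an NLI entailment check (`nli_entails`); both helpers and the precision-style formula are hypothetical, not the paper's exact procedure.

```python
from typing import Callable, List

def sqc_style_score(
    predictions: List[str],
    gold_labels: List[str],
    source_text: str,
    llm_match: Callable[[str, str], bool],    # hypothetical fine-tuned LLM matcher
    nli_entails: Callable[[str, str], bool],  # hypothetical NLI entailment check
) -> float:
    """Count a prediction as correct if the LLM matcher aligns it with some
    gold label, or if the NLI model judges it entailed by the source text
    (a "correct yet previously omitted" answer). Sketch only."""
    if not predictions:
        return 0.0
    correct = 0
    for pred in predictions:
        matched = any(llm_match(pred, gold) for gold in gold_labels)
        if matched or nli_entails(source_text, pred):
            correct += 1
    return correct / len(predictions)
```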
KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction
Li, Zixuan, Zeng, Yutao, Zuo, Yuxin, Ren, Weicheng, Liu, Wenxuan, Su, Miao, Guo, Yucan, Liu, Yantao, Li, Xiang, Hu, Zhilei, Bai, Long, Li, Wei, Liu, Yidan, Yang, Pan, Jin, Xiaolong, Guo, Jiafeng, Cheng, Xueqi
In this paper, we propose KnowCoder, a Large Language Model (LLM) for Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, with which complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over 30,000 types of knowledge, which, to the best of our knowledge, is the largest for UIE. To ease the learning process of LLMs, KnowCoder uses a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a relative improvement of 49.8% F1 over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization on unseen schemas and achieves relative improvements of up to 12.5% and 21.9% over SOTA baselines under the zero-shot and low-resource settings, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be utilized simultaneously to refine KnowCoder, yielding significant improvements of up to 7.5% under the supervised setting.
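A hedged illustration of what a code-style schema representation might look like; the class names and argument fields below are invented for exposition and are not taken from KnowCoder's released schema library.

```python
# Illustrative code-style schemas: entity and event types rendered as Python
# classes so an LLM can read type constraints directly from the code.
from dataclasses import dataclass

@dataclass
class Entity:
    """Base class for all entity types; `span` is the mention in the text."""
    span: str

@dataclass
class Person(Entity):
    """A person mentioned in the text."""

@dataclass
class Organization(Entity):
    """A company, institution, or other organization."""

@dataclass
class EmploymentEvent:
    """An employment event; typed arguments constrain what may be extracted."""
    employee: Person
    employer: Organization

# Extraction then amounts to emitting instantiations such as:
# EmploymentEvent(employee=Person(span="Alice"),
#                 employer=Organization(span="Acme Corp"))
```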
Retrieval-Augmented Code Generation for Universal Information Extraction
Guo, Yucan, Li, Zixuan, Jin, Xiaolong, Liu, Yantao, Zeng, Yutao, Liu, Wenxuan, Li, Xiang, Yang, Pan, Bai, Long, Guo, Jiafeng, Cheng, Xueqi
Information Extraction (IE) aims to extract structured knowledge (e.g., entities, relations, events) from natural language texts, which poses challenges for existing methods due to task-specific schemas and complex text expressions. Code, as a typical kind of formalized language, can describe structured knowledge under various schemas in a universal way. On the other hand, Large Language Models (LLMs) trained on both code and text have demonstrated powerful capabilities for transforming text into code, which provides a feasible solution to IE tasks. Therefore, in this paper, we propose a universal retrieval-augmented code generation framework based on LLMs, called Code4UIE, for IE tasks. Specifically, Code4UIE adopts Python classes to define task-specific schemas of various structured knowledge in a universal way. In this way, extracting knowledge under these schemas can be transformed into generating code that instantiates the predefined Python classes with the information in the text. To generate such code more precisely, Code4UIE adopts the in-context learning mechanism to instruct LLMs with examples. To obtain appropriate examples for different tasks, Code4UIE explores several example retrieval strategies that retrieve examples semantically similar to the given text. Extensive experiments on five representative IE tasks across nine datasets demonstrate the effectiveness of the Code4UIE framework.
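A minimal sketch of retrieval-augmented prompt construction in this spirit: retrieve the demonstrations most similar to the input text and prepend them, together with the schema code, to the extraction prompt. The embedding function, example pool, and prompt layout are hypothetical placeholders, not Code4UIE's actual components.

```python
from typing import Callable, List, Sequence, Tuple
import numpy as np

def retrieve_examples(
    text: str,
    example_pool: Sequence[Tuple[str, str]],  # (input text, code-style extraction)
    embed: Callable[[str], np.ndarray],       # hypothetical sentence embedder
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Return the k pool examples whose inputs are most similar to `text`."""
    query = embed(text)

    def cosine(v: np.ndarray) -> float:
        return float(np.dot(query, v) /
                     (np.linalg.norm(query) * np.linalg.norm(v) + 1e-8))

    ranked = sorted(example_pool, key=lambda ex: cosine(embed(ex[0])), reverse=True)
    return list(ranked[:k])

def build_prompt(text: str, schema_code: str, demos: List[Tuple[str, str]]) -> str:
    """Assemble schema definitions, retrieved demonstrations, and the query text."""
    demo_block = "\n\n".join(f"Text: {t}\nExtraction:\n{code}" for t, code in demos)
    return f"{schema_code}\n\n{demo_block}\n\nText: {text}\nExtraction:\n"
```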
An In-Context Schema Understanding Method for Knowledge Base Question Answering
Liu, Yantao, Li, Zixuan, Jin, Xiaolong, Bai, Long, Guan, Saiping, Guo, Jiafeng, Cheng, Xueqi
The Knowledge Base Question Answering (KBQA) task aims to answer natural language questions over a given knowledge base. A common class of methods for this task, semantic parsing-based methods, first convert natural language questions into logical forms (e.g., SPARQL queries) and then execute them against the knowledge base to obtain answers. Recently, Large Language Models (LLMs) have shown strong language understanding abilities and can be adopted as semantic parsers in such methods. However, a great challenge for LLMs in doing so is understanding the schema of the knowledge base. Therefore, in this paper, we propose an In-Context Schema Understanding (ICSU) method to facilitate the use of LLMs as semantic parsers for KBQA. Specifically, ICSU adopts the in-context learning mechanism to instruct LLMs to generate SPARQL queries from examples. To retrieve appropriate examples from annotated question-query pairs, which contain comprehensive schema information related to the questions, ICSU explores four different retrieval strategies. Experimental results on the largest KBQA benchmark, KQA Pro, show that ICSU with any of these strategies significantly outperforms ICSU with a random retrieval strategy (improving accuracy from 12% to 78.76%).
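A minimal sketch of one possible example-retrieval strategy for this kind of in-context prompting: pick annotated (question, SPARQL) pairs that share the most content words with the input question, then assemble the few-shot prompt. The pair store, the word-overlap criterion, and the prompt layout are illustrative assumptions; ICSU itself compares four retrieval strategies.

```python
from typing import List, Sequence, Tuple

def word_overlap_retrieve(
    question: str,
    annotated_pairs: Sequence[Tuple[str, str]],  # (question, SPARQL query)
    k: int = 4,
) -> List[Tuple[str, str]]:
    """Return the k annotated pairs whose questions share the most words."""
    q_tokens = set(question.lower().split())

    def overlap(pair: Tuple[str, str]) -> int:
        return len(q_tokens & set(pair[0].lower().split()))

    return sorted(annotated_pairs, key=overlap, reverse=True)[:k]

def build_sparql_prompt(question: str, demos: List[Tuple[str, str]]) -> str:
    """Few-shot prompt: retrieved question/SPARQL pairs, then the new question."""
    demo_block = "\n\n".join(f"Question: {q}\nSPARQL: {s}" for q, s in demos)
    return f"{demo_block}\n\nQuestion: {question}\nSPARQL:"
```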
KoLA: Carefully Benchmarking World Knowledge of Large Language Models
Yu, Jifan, Wang, Xiaozhi, Tu, Shangqing, Cao, Shulin, Zhang-Li, Daniel, Lv, Xin, Peng, Hao, Yao, Zijun, Zhang, Xiaohan, Li, Hanming, Li, Chunyang, Zhang, Zheyuan, Bai, Yushi, Liu, Yantao, Xin, Amy, Lin, Nianyi, Yun, Kaifeng, Gong, Linlu, Chen, Jianhui, Wu, Zhili, Qi, Yunjia, Li, Weikai, Guan, Yong, Zeng, Kaisheng, Qi, Ji, Jin, Hailong, Liu, Jinxin, Gu, Yu, Yao, Yuan, Ding, Ning, Hou, Lei, Liu, Zhiyuan, Xu, Bin, Tang, Jie, Li, Juanzi
The unprecedented performance of large language models (LLMs) necessitates improvements in evaluation. Rather than merely exploring the breadth of LLM abilities, we believe meticulous and thoughtful designs are essential for thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering 19 tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate 21 open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
KoRC: Knowledge oriented Reading Comprehension Benchmark for Deep Text Understanding
Yao, Zijun, Liu, Yantao, Lv, Xin, Cao, Shulin, Yu, Jifan, Hou, Lei, Li, Juanzi
Deep text understanding, which requires connecting a given document with prior knowledge beyond its text, has been highlighted by many benchmarks in recent years. However, these benchmarks have two major limitations. On the one hand, most of them require human annotation of knowledge, which leads to limited knowledge coverage. On the other hand, they usually use choices or spans in the texts as answers, which results in a narrow answer space. To overcome these limitations, we build a new challenging benchmark named KoRC. Compared with previous benchmarks, KoRC has two advantages, i.e., broad knowledge coverage and flexible answer format. Specifically, we utilize massive knowledge bases to guide annotators or large language models (LLMs) to construct knowledgeable questions. Moreover, we use labels in knowledge bases, rather than spans or choices, as the final answers. We test state-of-the-art models on KoRC, and the experimental results show that the strongest baseline only achieves 68.3% and 30.0% F1 on the in-distribution and out-of-distribution test sets, respectively. These results indicate that deep text understanding is still an unsolved challenge. The benchmark dataset, leaderboard, and baseline methods are released at https://github.com/THU-KEG/KoRC.