Chen, Jiangjie
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction
Yuan, Siyu, Song, Kaitao, Chen, Jiangjie, Tan, Xu, Shen, Yongliang, Kan, Ren, Li, Dongsheng, Yang, Deqing
To address intricate real-world tasks, there has been rising interest in tool utilization for applications of large language models (LLMs). Developing LLM-based agents usually requires LLMs to understand many tool functions from different tool documentation. However, such documentation can be diverse, redundant, or incomplete, which greatly limits the ability of LLMs to use tools. To address this, we introduce EasyTool, a framework that transforms diverse and lengthy tool documentation into unified and concise tool instructions for easier tool usage. EasyTool distills essential information from extensive tool documentation from different sources and provides a unified interface (i.e., tool instruction) that offers standardized tool descriptions and functionalities to LLM-based agents. Extensive experiments on multiple tasks demonstrate that EasyTool significantly reduces token consumption and improves the performance of tool utilization in real-world scenarios. Our code will be available at \url{https://github.com/microsoft/JARVIS/}.
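As a rough illustration of the idea of condensing verbose tool documentation into a unified instruction, the Python sketch below maps a raw tool document onto a minimal, standardized schema. The field names and JSON layout are hypothetical assumptions, not EasyTool's actual format.

```python
import json

def to_concise_instruction(raw_doc: dict) -> dict:
    """Condense a verbose tool document into a minimal, unified instruction.

    The input/output schemas here are illustrative placeholders only.
    """
    return {
        "name": raw_doc["name"],
        # Keep only the first sentence of a possibly long description.
        "description": raw_doc.get("description", "").split(". ")[0].strip(),
        "parameters": [
            {
                "name": p["name"],
                "type": p.get("type", "string"),
                "required": p.get("required", False),
            }
            for p in raw_doc.get("parameters", [])
        ],
        # A single worked example helps an agent call the tool correctly.
        "example_call": raw_doc.get("examples", [None])[0],
    }

raw = {
    "name": "get_weather",
    "description": "Returns current weather. Supports many units. Legacy notes: ...",
    "parameters": [{"name": "city", "type": "string", "required": True}],
    "examples": ['get_weather(city="Paris")'],
}
print(json.dumps(to_concise_instruction(raw), indent=2))
```

In practice the condensation would be driven by an LLM rather than string heuristics; the sketch only conveys the shape of a unified interface.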
Translate Meanings, Not Just Words: IdiomKB's Role in Optimizing Idiomatic Translation with Language Models
Li, Shuang, Chen, Jiangjie, Yuan, Siyu, Wu, Xinyi, Yang, Hao, Tao, Shimin, Xiao, Yanghua
To translate well, machine translation (MT) systems and general-purpose language models (LMs) need a deep understanding of both source and target languages and cultures. In particular, idioms, with their non-compositional nature, pose challenges for Transformer-based systems, as literal translations often miss the intended meaning. Traditional methods, which replace idioms using existing knowledge bases (KBs), often lack scale and context awareness. Addressing these challenges, our approach prioritizes context awareness and scalability, allowing for offline storage of idioms in a manageable KB size. This ensures efficient serving with smaller models and provides a more comprehensive understanding of idiomatic expressions. To this end, we introduce a multilingual idiom KB (IdiomKB) developed using large LMs. This KB facilitates better translation by smaller models, such as BLOOMZ (7.1B), Alpaca (7B), and InstructGPT (6.7B), by retrieving idioms' figurative meanings. We also present a novel, GPT-4-powered metric for human-aligned evaluation, demonstrating that IdiomKB considerably boosts model performance. Human evaluations further validate our KB's quality.
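The snippet below is a minimal sketch of the retrieval-augmented setup described here: look up an idiom's figurative meaning in a small in-memory KB and prepend it to the translation prompt. The dictionary entries, prompt wording, and function names are illustrative placeholders, not IdiomKB's actual contents or interface.

```python
# A toy idiom KB: surface form -> figurative meaning (entries are illustrative).
IDIOM_KB = {
    "raining cats and dogs": "raining very heavily",
    "break the ice": "ease initial social tension",
}

def augment_prompt(source_sentence: str, target_lang: str) -> str:
    """Attach retrieved figurative meanings to a translation prompt."""
    hints = [
        f'The idiom "{idiom}" means "{meaning}".'
        for idiom, meaning in IDIOM_KB.items()
        if idiom in source_sentence.lower()
    ]
    hint_block = "\n".join(hints) if hints else "No idiom hints found."
    return (
        f"{hint_block}\n"
        f"Translate the following sentence into {target_lang}, "
        f"preserving the figurative meaning:\n{source_sentence}"
    )

print(augment_prompt("It was raining cats and dogs all night.", "Chinese"))
```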
Beneath Surface Similarity: Large Language Models Make Reasonable Scientific Analogies after Structure Abduction
Yuan, Siyu, Chen, Jiangjie, Ge, Xuyang, Xiao, Yanghua, Yang, Deqing
The vital role of analogical reasoning in human cognition allows us to grasp novel concepts by linking them with familiar ones through shared relational structures. Despite the attention previous research has given to word analogies, this work suggests that Large Language Models (LLMs) often overlook the structures that underpin these analogies, raising questions about the efficacy of word analogies as a measure of analogical reasoning skills akin to human cognition. In response to this, our paper introduces a task of analogical structure abduction, grounded in cognitive psychology, designed to abduce structures that form an analogy between two systems. In support of this task, we establish a benchmark called SCAR, containing 400 scientific analogies from 13 distinct fields, tailored for evaluating analogical reasoning with structure abduction. The empirical evidence underlines the continued challenges faced by LLMs, including ChatGPT and GPT-4, in mastering this task, signifying the need for future exploration to enhance their abilities.
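To make the structure abduction setting concrete, the toy example below represents an analogy between two systems as a concept-to-concept mapping and scores a predicted mapping against a gold one. The Rutherford solar-system/atom example and the simple accuracy metric are illustrative choices, not items or metrics taken from SCAR.

```python
# Map concepts of a base system onto a target system and score the prediction.
gold_mapping = {   # solar system -> atom (the classic Rutherford analogy)
    "sun": "nucleus",
    "planet": "electron",
    "gravity": "electromagnetic force",
}

predicted_mapping = {
    "sun": "nucleus",
    "planet": "electron",
    "gravity": "nucleus",   # an incorrect correspondence
}

def mapping_accuracy(pred: dict, gold: dict) -> float:
    """Fraction of base concepts mapped to the correct target concept."""
    correct = sum(pred.get(base) == target for base, target in gold.items())
    return correct / len(gold)

print(f"mapping accuracy: {mapping_accuracy(predicted_mapping, gold_mapping):.2f}")
```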
Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena
Chen, Jiangjie, Yuan, Siyu, Ye, Rong, Majumder, Bodhisattwa Prasad, Richardson, Kyle
Can Large Language Models (LLMs) simulate human behavior in complex environments? LLMs have recently been shown to exhibit advanced reasoning skills, but much of NLP evaluation still relies on static benchmarks. Answering this question requires evaluation environments that probe strategic reasoning in competitive, dynamic scenarios involving long-term planning. We conduct several controlled simulations using state-of-the-art LLMs as bidding agents in auctions. We find that, through simple prompting, LLMs do indeed demonstrate many of the skills needed to engage effectively in auctions (e.g., managing a budget, adhering to long-term goals and priorities), and that these skills can be sharpened by explicitly encouraging models to be adaptive and to observe strategies in past auctions. These results are significant as they show the potential of using LLM agents to model intricate social dynamics, especially in competitive settings. However, we also observe considerable variability in the capabilities of individual LLMs. Notably, even the most advanced model we test (GPT-4) is occasionally surpassed by heuristic baselines and human agents, highlighting the potential for further improvements in the design of LLM agents and the important role that our simulation environment can play in testing and refining agent architectures.
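For intuition, here is a bare-bones sequential auction loop with budget-constrained bidders. The bidding rule is a simple stand-in heuristic; in the setting described above, each bid would instead come from prompting an LLM with the current auction state. The class names and the fixed 10% price-increment rule are assumptions made for illustration.

```python
class Bidder:
    def __init__(self, name: str, budget: float):
        self.name, self.budget, self.items = name, budget, []

    def bid(self, item_value: float, current_price: float) -> float:
        """Heuristic stand-in policy: raise the price by 10% if the new bid stays
        within both the remaining budget and the item's estimated value."""
        proposal = current_price * 1.1
        return proposal if proposal <= min(self.budget, item_value) else 0.0


def run_auction(items: list[float], bidders: list[Bidder]) -> None:
    for value in items:
        price, winner = 1.0, None
        while True:
            offers = [(b, b.bid(value, price)) for b in bidders]
            offers = [(b, amount) for b, amount in offers if amount > price]
            if not offers:
                break  # no bidder is willing to top the current price
            winner, price = max(offers, key=lambda offer: offer[1])
        if winner is not None:
            winner.budget -= price
            winner.items.append(value)
            print(f"item worth {value} sold to {winner.name} at {price:.1f}")


run_auction(items=[50, 80, 120], bidders=[Bidder("A", budget=100), Bidder("B", budget=150)])
```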
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
Xie, Jian, Zhang, Kai, Chen, Jiangjie, Lou, Renze, Su, Yu
By providing external information to large language models (LLMs), tool augmentation (including retrieval augmentation) has emerged as a promising solution for addressing the limitations of LLMs' static parametric memory. However, how receptive are LLMs to such external evidence, especially when it conflicts with their parametric memory? We present the first comprehensive and controlled investigation into the behavior of LLMs when encountering knowledge conflicts. We propose a systematic framework to elicit high-quality parametric memory from LLMs and construct the corresponding counter-memory, which enables us to conduct a series of controlled experiments. Our investigation reveals seemingly contradictory behaviors of LLMs. On the one hand, contrary to prior wisdom, we find that LLMs can be highly receptive to external evidence, even when it conflicts with their parametric memory, provided that the external evidence is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information consistent with their parametric memory, despite being presented with conflicting evidence at the same time. These results have important implications that warrant careful consideration in the further development and deployment of tool- and retrieval-augmented LLMs.
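The elicit-then-conflict protocol can be sketched schematically in code: first obtain the model's closed-book (parametric) answer, then ask the same question framed with contradicting evidence and check whether the answer changes. The `query_llm` stub, prompt wording, and example evidence are placeholders, not the paper's actual framework.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an API request); returns canned text."""
    return "stub answer"


def knowledge_conflict_probe(question: str, counter_evidence: str) -> dict:
    # Step 1: a closed-book answer reflects the model's parametric memory.
    parametric = query_llm(f"Answer from your own knowledge: {question}")
    # Step 2: the same question, now framed with contradicting external evidence.
    evidence_based = query_llm(
        f"Evidence: {counter_evidence}\n"
        f"Considering the evidence above, answer: {question}"
    )
    return {
        "parametric_answer": parametric,
        "evidence_based_answer": evidence_based,
        "updated": parametric.strip().lower() != evidence_based.strip().lower(),
    }


print(knowledge_conflict_probe(
    "What is the tallest mountain on Earth?",
    counter_evidence="A purported survey claims that K2 is taller than Mount Everest.",
))
```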
Distilling Script Knowledge from Large Language Models for Constrained Language Planning
Yuan, Siyu, Chen, Jiangjie, Fu, Ziquan, Ge, Xuyang, Shah, Soham, Jankowski, Charles Robert, Xiao, Yanghua, Yang, Deqing
In everyday life, humans often plan their actions by following step-by-step instructions in the form of goal-oriented scripts. Previous work has exploited language models (LMs) to plan for abstract goals of stereotypical activities (e.g., "make a cake"), but has left more specific goals with multi-faceted constraints understudied (e.g., "make a cake for diabetics"). In this paper, we define the task of constrained language planning for the first time. We propose an overgenerate-then-filter approach to improve large language models (LLMs) on this task and use it to distill a novel constrained language planning dataset, CoScript, which consists of 55,000 scripts. Empirical results demonstrate that our method significantly improves the constrained language planning ability of LLMs, especially in constraint faithfulness. Furthermore, CoScript proves quite effective in endowing smaller LMs with constrained language planning ability.
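The core overgenerate-then-filter loop can be sketched in a few lines: sample several candidate scripts for a constrained goal, then keep only those that pass a faithfulness check. Both the generator and the keyword-based filter below are simplified placeholders (a real pipeline would sample from an LLM and use a stronger verifier), so this shows the shape of the approach rather than the paper's implementation.

```python
def generate_candidates(goal: str, k: int = 5) -> list[list[str]]:
    """Stand-in for sampling k step-by-step scripts from an LLM."""
    return [[f"Step 1 for {goal}", "Step 2 ...", "Step 3 ..."] for _ in range(k)]


def satisfies_constraint(script: list[str], constraint: str) -> bool:
    """Crude keyword check; a real filter could use an NLI- or LLM-based verifier."""
    return any(constraint.lower() in step.lower() for step in script)


def overgenerate_then_filter(goal: str, constraint: str) -> list[list[str]]:
    candidates = generate_candidates(f"{goal} ({constraint})")
    return [script for script in candidates if satisfies_constraint(script, constraint)]


scripts = overgenerate_then_filter("make a cake", constraint="for diabetics")
print(f"{len(scripts)} candidate scripts kept after filtering")
```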
Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge
Chen, Jiangjie, Shi, Wei, Fu, Ziquan, Cheng, Sijie, Li, Lei, Xiao, Yanghua
Large language models (LLMs) have been widely studied for their ability to store and utilize positive knowledge. However, negative knowledge, such as "lions don't live in the ocean", is also ubiquitous in the world but rarely mentioned explicitly in text. What do LLMs know about negative knowledge? This work examines the ability of LLMs to handle negative commonsense knowledge. We design a constrained keywords-to-sentence generation task (CG) and a Boolean question-answering task (QA) to probe LLMs. Our experiments reveal that LLMs frequently fail to generate valid sentences grounded in negative commonsense knowledge, yet they can correctly answer polar yes-or-no questions. We term this phenomenon the belief conflict of LLMs. Our further analysis shows that statistical shortcuts and reporting bias on negation from language-model pretraining cause this conflict.
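The two probes contrast generation with recognition, and their prompt shapes can be sketched as follows. The templates and example keywords are hypothetical illustrations, not the exact prompts used in the paper.

```python
def cg_prompt(keywords: list[str]) -> str:
    """Constrained keywords-to-sentence generation (CG): compose a true sentence
    that uses every keyword."""
    return "Write one true sentence that uses all of these keywords: " + ", ".join(keywords)


def qa_prompt(statement: str) -> str:
    """Boolean question answering (QA): the model only has to answer yes or no."""
    return f"Answer yes or no: {statement}"


# Probing the same piece of negative knowledge in the two formats.
print(cg_prompt(["lion", "live in", "ocean"]))
print(qa_prompt("Do lions live in the ocean?"))
```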
ANALOGYKB: Unlocking Analogical Reasoning of Language Models with A Million-scale Knowledge Base
Yuan, Siyu, Chen, Jiangjie, Sun, Changzhi, Liang, Jiaqing, Xiao, Yanghua, Yang, Deqing
Analogical reasoning is a fundamental cognitive ability of humans. However, current language models (LMs) still struggle to achieve human-like performance in analogical reasoning tasks due to a lack of resources for model training. In this work, we address this gap by proposing ANALOGYKB, a million-scale analogy knowledge base (KB) derived from existing knowledge graphs (KGs). ANALOGYKB identifies two types of analogies from the KGs: 1) analogies of the same relations, which can be directly extracted from the KGs, and 2) analogies of analogous relations, which are identified with a selection and filtering pipeline enabled by large LMs (InstructGPT), followed by minor human effort for data quality control. Evaluations on a series of datasets for two analogical reasoning tasks (analogy recognition and generation) demonstrate that ANALOGYKB successfully enables LMs to achieve much better results than previous state-of-the-art methods.
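The first category, same-relation analogies, is the easier of the two to picture: any two subject-object pairs sharing a KG relation form a candidate analogy. The toy triples and output format below are illustrative, not entries from ANALOGYKB.

```python
from itertools import combinations

# Illustrative KG triples (head, relation, tail).
triples = [
    ("Paris", "capital_of", "France"),
    ("Tokyo", "capital_of", "Japan"),
    ("Mercury", "orbits", "Sun"),
    ("Moon", "orbits", "Earth"),
]

def same_relation_analogies(triples):
    """Yield 'A is to B as C is to D' analogies from pairs sharing a relation."""
    by_relation = {}
    for head, rel, tail in triples:
        by_relation.setdefault(rel, []).append((head, tail))
    for rel, pairs in by_relation.items():
        for (h1, t1), (h2, t2) in combinations(pairs, 2):
            yield f"{h1} is to {t1} as {h2} is to {t2}  [{rel}]"

for analogy in same_relation_analogies(triples):
    print(analogy)
```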
Converge to the Truth: Factual Error Correction via Iterative Constrained Editing
Chen, Jiangjie, Xu, Rui, Zeng, Wenxuan, Sun, Changzhi, Li, Lei, Xiao, Yanghua
Given a possibly false claim, how can we automatically correct it with minimal editing? Existing methods either require a large number of pairs of false and corrected claims for supervised training or handle errors spanning multiple tokens within an utterance poorly. In this paper, we propose VENCE, a novel method for factual error correction (FEC) with minimal edits. VENCE formulates the FEC problem as iteratively sampling editing actions with respect to a target density function. We carefully design the target function with predicted truthfulness scores from an offline-trained fact verification model. VENCE samples the most probable editing positions based on back-calculated gradients of the truthfulness score with respect to input tokens, and samples the editing actions using a distantly-supervised language model (T5). Experiments on a public dataset show that VENCE improves the widely adopted SARI metric by 5.3 points (a relative improvement of 11.8%) over the previous best distantly-supervised methods.
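To give a feel for score-guided iterative editing, the sketch below proposes random token-level edits and accepts them with a Metropolis-style rule based on a target score. The toy scorer, proposal distribution, and hyperparameters are stand-ins for the trained verification and language models that VENCE actually uses, so this only illustrates the sampling loop, not the method itself.

```python
import math
import random

def truthfulness_score(tokens: list[str]) -> float:
    """Toy scorer: prefers claims that contain a negation."""
    return 1.0 if "not" in tokens else 0.1

def propose_edit(tokens: list[str]) -> list[str]:
    """Randomly insert, delete, or keep a token (placeholder proposal model)."""
    new = tokens.copy()
    pos = random.randrange(len(new))
    action = random.choice(["insert", "delete", "keep"])
    if action == "insert":
        new.insert(pos, random.choice(["not", "very", "the"]))
    elif action == "delete" and len(new) > 1:
        del new[pos]
    return new

def iterative_correction(claim: str, steps: int = 200, temperature: float = 0.1) -> str:
    tokens = claim.split()
    score = truthfulness_score(tokens)
    for _ in range(steps):
        candidate = propose_edit(tokens)
        cand_score = truthfulness_score(candidate)
        # Accept better edits always; worse edits with a Boltzmann probability.
        if cand_score >= score or random.random() < math.exp((cand_score - score) / temperature):
            tokens, score = candidate, cand_score
    return " ".join(tokens)

random.seed(0)
print(iterative_correction("lions do live in the ocean"))
```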
Harnessing Knowledge and Reasoning for Human-Like Natural Language Generation: A Brief Review
Chen, Jiangjie, Xiao, Yanghua
The rapid development and application of natural language generation (NLG) techniques have revolutionized the field of automatic text production. However, these techniques are still limited in their ability to produce human-like text that is truly reasonable and informative. In this paper, we explore the importance of guiding NLG with knowledge in order to convey human-like reasoning through language generation. We propose ten goals for intelligent NLG systems to pursue and briefly review the achievements of NLG techniques guided by knowledge and reasoning. We conclude by envisioning future directions and challenges in the pursuit of these goals.