Goto

Collaborating Authors

 zero-shot-cot


8bb0d291acd4acf06ef112099c16f326-Supplemental-Conference.pdf

Neural Information Processing Systems

LastLetters F 500 15.0 - CoinFlip Y 500 37.0 - A.2.2 Datasetcreation Regarding "Last Letter Concatenation" and "Coin Flip", datasets are not publicly available sowe created the datasets following Wei et al. [2022] with a minor rephrasing of the question template. Asfor Coin Flip, we use the following template. A.5 PromptsForAnswerExtraction Table 9 and Table 10 summarizes a list of answer extraction prompts used for the experiments at Table1. Number Pick up the first number encounteredinthetext. MultipleChoice Pick up the first large letter encountered in the text. YesorNo Pickupthefirst"yes" or "no" encountered in the text after removing unnecessaryletters. Table 13 lists example texts generated by Zero-shot-CoT for each reasoning extraction template(SeeTable4). Dataset Question Answer SingleEq Q: A spaceship traveled 0.5 of a light-year from Earth to Planet X and 0.1 of a lightyearfromPlanetXtoPlanetY. A: Let's think step by step. So the total distance the spaceship traveled is 0.5 + 0.1 + 0.1 = 0.7 light-years. Therefore, the answer (arabic numerals) is: 0.7 light-years Q:Whilemaking desserts for abakesale,Victorused0.625 of a scoop of brown sugar as well as 0.25 of a scoop of whitesugar.Howmuchmore brownsugardidVictoruse? A: Let's think step by step.


LargeLanguageModelsareZero-ShotReasoners

Neural Information Processing Systems

Notably,chainofthought(CoT)prompting, a recent technique for eliciting complex multi-step reasoning through step-bystep answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficultsystem-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability forfew-shot learning, weshowthatLLMs aredecentzero-shotreasoners by simply adding "Let's think step by step" before each answer.




Effectiveness of Zero-shot-CoT in Japanese Prompts

Takayama, Shusuke, Frank, Ian

arXiv.org Artificial Intelligence

We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using ChatGPT-3.5 and 4o-mini. The technique of zero-shot CoT, which involves appending a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to offer LLM performance improvements in mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Multi-task Language Understanding Benchmark (JMMLU) and the Multi-task Language Understanding Benchmark (MMLU). Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, its impact in GPT-4o-mini is associated with significant performance declines. However, for Japanese prompts there remain certain categories, such as college mathematics and abstract algebra, that still exhibit improvements, despite the broader trend of diminishing effectiveness in more advanced models.


CauseJudger: Identifying the Cause with LLMs for Abductive Logical Reasoning

He, Jinwei, Lu, Feng

arXiv.org Artificial Intelligence

Large language models (LLMs) have been utilized in solving diverse reasoning tasks, encompassing common sense, arithmetic and deduction tasks. However, with difficulties of reversing thinking patterns and irrelevant premises, how to determine the authenticity of the cause in abductive logical reasoning remains underexplored. Inspired by hypothesis and verification method and identification of irrelevant information in human thinking process, we propose a new framework for LLMs abductive logical reasoning called CauseJudger (CJ), which identifies the authenticity of possible cause by transforming thinking from reverse to forward and removing irrelevant information. In addition, we construct an abductive logical reasoning dataset for decision task called CauseLogics, which contains 200,000 tasks of varying reasoning lengths. Our experiments show the efficiency of CJ with overall experiments and ablation experiments as well as case studies on our dataset and reconstructed public dataset. Notably, CJ's implementation is efficient, requiring only two calls to LLM. Its impact is profound: when using gpt-3.5, CJ achieves a maximum correctness improvement of 41% compared to Zero-Shot-CoT. Moreover, with gpt-4, CJ attains an accuracy exceeding 90% across all datasets.


Divide and Conquer for Large Language Models Reasoning

Meng, Zijie, Zhang, Yan, Feng, Zhaopeng, Feng, Yang, Wang, Gaoang, Zhou, Joey Tianyi, Wu, Jian, Liu, Zuozhu

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown impressive performance in various reasoning benchmarks with the emergence of Chain-of-Thought (CoT) and its derivative methods, particularly in tasks involving multi-choice questions (MCQs). However, current works all process data uniformly without considering the problem-solving difficulty, which means an excessive focus on simple questions while insufficient to intricate ones. To address this challenge, we inspired by humans using heuristic strategies to categorize tasks and handle them individually, propose to apply the Divide and Conquer to LLMs reasoning. First, we divide questions into different subsets based on the statistical confidence score ($\mathcal{CS}$), then fix nearly resolved sets and conquer demanding nuanced process ones with elaborately designed methods, including Prior Knowledge based Reasoning (PKR) and Filter Choices based Reasoning (FCR), as well as their integration variants. Our experiments demonstrate that this proposed strategy significantly boosts the models' reasoning abilities across nine datasets involving arithmetic, commonsense, and logic tasks. For instance, compared to baseline, we make a striking improvement on low confidence subsets of 8.72\% for AQuA, 15.07\% for ARC Challenge and 7.71\% for RiddleSense. In addition, through extensive analysis on length of rationale and number of options, we verify that longer reasoning paths in PKR could prevent models from referring infer-harmful shortcuts, and also find that removing irrelevant choices in FCR would substantially avoid models' confusion. The code is at \url{https://github.com/AiMijie/Divide-and-Conquer}


In-Context Learning with Iterative Demonstration Selection

Qin, Chengwei, Zhang, Aston, Dagar, Anirudh, Ye, Wenming

arXiv.org Artificial Intelligence

Spurred by advancements in scale, large language models (LLMs) have demonstrated strong few-shot learning ability via in-context learning (ICL). However, the performance of ICL has been shown to be highly sensitive to the selection of few-shot demonstrations. Selecting the most suitable examples as context remains an ongoing challenge and an open problem. Existing literature has highlighted the importance of selecting examples that are diverse or semantically similar to the test sample while ignoring the fact that the optimal selection dimension, i.e., diversity or similarity, is task-specific. Leveraging the merits of both dimensions, we propose Iterative Demonstration Selection (IDS). Using zero-shot chain-of-thought reasoning (Zero-shot-CoT), IDS iteratively selects examples that are diverse but still strongly correlated with the test sample as ICL demonstrations. Specifically, IDS applies Zero-shot-CoT to the test sample before demonstration selection. The output reasoning path is then used to choose demonstrations that are prepended to the test sample for inference. The generated answer is accompanied by its corresponding reasoning path for extracting a new set of demonstrations in the next iteration. After several iterations, IDS adopts majority voting to obtain the final result. Through extensive experiments on tasks including commonsense reasoning, question answering, topic classification, and sentiment analysis, we demonstrate that IDS can consistently outperform existing ICL demonstration selection methods.


Better Zero-Shot Reasoning with Role-Play Prompting

Kong, Aobo, Zhao, Shiwan, Chen, Hao, Li, Qicheng, Qin, Yong, Sun, Ruiqi, Zhou, Xin

arXiv.org Artificial Intelligence

Modern large language models (LLMs), such as ChatGPT, exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities like a Linux terminal. This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs' reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks, encompassing arithmetic, commonsense reasoning, symbolic reasoning, and more. Leveraging models such as ChatGPT and Llama 2, our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Beyond enhancing contextual understanding, we posit that role-play prompting serves as an implicit Chain-of-Thought (CoT) trigger, thereby improving the quality of reasoning. By comparing our approach with the Zero-Shot-CoT technique, which prompts the model to "think step by step", we further demonstrate that role-play prompting can generate a more effective CoT. This highlights its potential to augment the reasoning capabilities of LLMs.


Chain of Thought Prompting Elicits Knowledge Augmentation

Wu, Dingjun, Zhang, Jing, Huang, Xinmei

arXiv.org Artificial Intelligence

The knowledge-augmented deep learning paradigm refers to a paradigm in which domain knowledge is identified and integrated into deep models. Conventional methods typically employ task-specific approaches to gather external knowledge from various sources. In contrast, large language models are extensively pre-trained and can serve as a comprehensive source of external knowledge. In this paper, we propose CoT-KA, a Chain-of-Thought-based method that augments knowledge for deep learning. CoT-KA avoids the need for additional knowledge retrieval or knowledge reasoning models, as required in conventional augmentation methods. Our results demonstrate that CoT-KA outperforms both pure CoT-based methods and the non-augmented method across the majority of eleven publicly available benchmarks for various reasoning tasks.