
Collaborating Authors

 Chen, Jiuhai


How Many Demonstrations Do You Need for In-context Learning?

arXiv.org Artificial Intelligence

Large language models (LLMs) can perform complex reasoning via in-context learning (ICL) when provided with a few input-output demonstrations (demos), and become more powerful when the demos' intermediate reasoning steps ("chain of thought", CoT) are given. Are multiple demos necessary for ICL? In this paper, we study ICL using fewer demos for each test query on the tasks of Wei et al. (2022). Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query we categorize demos into "correct demos" leading to the correct answer and "wrong demos" resulting in wrong answers. Our analysis reveals an inherent bias in these widely studied datasets: most demos are correct for a majority of test queries, which explains the good performance of one random demo. Moreover, ICL (with and without CoT) using only one correct demo significantly outperforms the all-demo ICL adopted by most previous works, indicating a weakness of LLMs in finding correct demo(s) for input queries, which is difficult to evaluate on the biased datasets. Furthermore, we observe a counterintuitive behavior of multi-demo ICL: its accuracy degrades (improves) when given more correct (wrong) demos. This implies that ICL can be easily misguided by interference among demos and their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLM training, ICL, and benchmark design.
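To make the compared setups concrete, here is a minimal Python sketch of how all-demo vs. one-random-demo ICL prompts are typically assembled; the `demos` triples, `build_icl_prompt` helper, and prompt template are illustrative assumptions, not the paper's actual code:

```python
import random

# Illustrative demo pool: (question, CoT rationale, answer) triples.
demos = [
    ("What is 2 + 3?", "2 plus 3 equals 5.", "5"),
    ("What is 7 - 4?", "7 minus 4 equals 3.", "3"),
]
test_query = "What is 6 + 9?"

def build_icl_prompt(demos, test_query, use_cot=True):
    """Concatenate k demos (optionally with their CoT rationales) before the query."""
    parts = []
    for question, rationale, answer in demos:
        steps = f"{rationale} " if use_cot else ""
        parts.append(f"Q: {question}\nA: {steps}The answer is {answer}.")
    parts.append(f"Q: {test_query}\nA:")  # the model completes this last slot
    return "\n\n".join(parts)

# All-demo ICL (the setup adopted by most prior work) ...
prompt_all = build_icl_prompt(demos, test_query)
# ... versus the one-random-demo ICL studied in the paper.
prompt_one = build_icl_prompt([random.choice(demos)], test_query)
```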


When do you need Chain-of-Thought Prompting for ChatGPT?

arXiv.org Artificial Intelligence

Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). For example, simply adding the CoT instruction "Let's think step-by-step" to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while remaining effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and generates CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and thus has memorized the instruction, so it implicitly follows such an instruction when applied to the same queries, even without CoT. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced in IFT, which has become more common in training LLMs. In addition, it indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and offer novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
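For reference, the two-stage zero-shot CoT recipe behind that trigger phrase (Kojima et al., 2022) can be sketched in a few lines of Python; `query_llm` is a placeholder for whatever completion API is in use, not a real library call:

```python
COT_TRIGGER = "Let's think step-by-step."

def query_llm(prompt: str) -> str:
    """Placeholder: route `prompt` to an LLM completion endpoint of your choice."""
    raise NotImplementedError

def zero_shot_cot(query: str) -> str:
    # Stage 1: append the CoT instruction to elicit a reasoning chain.
    rationale = query_llm(f"Q: {query}\nA: {COT_TRIGGER}")
    # Stage 2: extract the final answer conditioned on the generated rationale.
    return query_llm(
        f"Q: {query}\nA: {COT_TRIGGER} {rationale}\nTherefore, the answer is"
    )
```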


Particle-based Energetic Variational Inference

arXiv.org Machine Learning

We introduce a new variational inference (VI) framework, called energetic variational inference (EVI). It minimizes the VI objective function based on a prescribed energy-dissipation law. Using the EVI framework, we can derive many existing particle-based variational inference (ParVI) methods, including the popular Stein Variational Gradient Descent (SVGD) approach. More importantly, many new ParVI schemes can be created under this framework. For illustration, we propose a new particle-based EVI scheme, which performs the particle-based approximation of the density first and then uses the approximated density in the variational procedure, or "Approximation-then-Variation" for short. Thanks to this order of approximation and variation, the new scheme can maintain the variational structure at the particle level and can significantly decrease the KL-divergence in each iteration. Numerical experiments show that the proposed method outperforms some existing ParVI methods in terms of fidelity to the target distribution.
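As a concrete point of comparison, below is a minimal NumPy sketch of the standard SVGD update (Liu and Wang, 2016) that the abstract says EVI recovers as a special case; it uses the usual RBF kernel with the median-bandwidth heuristic and is not the paper's Approximation-then-Variation scheme:

```python
import numpy as np

def svgd_update(X, grad_log_p, stepsize=0.1):
    """One SVGD step. X: (n, d) particles; grad_log_p: maps (n, d) -> (n, d) scores."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]            # diffs[j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)           # pairwise squared distances
    h = np.median(sq_dists) / np.log(n + 1) + 1e-8   # median bandwidth heuristic
    K = np.exp(-sq_dists / h)                        # K[j, i] = k(x_j, x_i)
    grads = grad_log_p(X)                            # score at each particle
    # Driving term: sum_j k(x_j, x_i) * grad log p(x_j) pulls particles to high density.
    drive = K.T @ grads
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) keeps particles spread apart.
    repulse = (2.0 / h) * (X * K.sum(axis=0)[:, None] - K.T @ X)
    return X + stepsize * (drive + repulse) / n

# Example: push 100 particles toward a 2-D standard Gaussian target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(100, 2))
for _ in range(500):
    X = svgd_update(X, lambda x: -x)   # score of N(0, I) is -x
print(X.mean(axis=0), X.std(axis=0))   # should approach ~[0, 0] and ~[1, 1]
```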