Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar
arXiv.org Artificial Intelligence
LLMs are typically trained to answer user questions or follow instructions similarly to how human experts respond. However, in the standard alignment framework they lack the basic ability to think explicitly before answering. Thinking is important for complex questions that require reasoning and planning, but it can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following without the use of additional human data. We achieve this via an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored by a judge model that evaluates their responses only, and the model is then trained via preference optimization. We show that this procedure leads to superior performance on AlpacaEval and Arena-Hard, and yields gains from thinking on non-reasoning categories such as marketing, health, and general knowledge, in addition to more traditional reasoning and problem-solving tasks.

Large Language Models (LLMs) are based on the Transformer architecture (Vaswani et al., 2017), which predicts the next token at each step. Each token takes the same amount of compute, so when LLMs are prompted with a user instruction, they have a fixed compute budget to generate the first response token regardless of the instruction's complexity. One way to increase the compute budget for harder instructions is to allow LLMs to think internally before outputting a response. This is similar to humans, who take more time to think before answering complex questions. One approach is to generate thoughts as text, which takes advantage of the natural language capabilities of LLMs. LLMs are pre-trained on text containing human-written thoughts, which are hence encoded into the model. Chain-of-Thought (CoT) (Wei et al., 2022) is a widely used prompting technique that elicits such behavior by asking the model to write down its reasoning steps. However, the usage of CoT has been mostly limited to math and reasoning tasks; a meta-analysis by Sprague et al. (2024) found CoT methods to be unhelpful on tasks that do not involve math or logic.

In this paper, we focus on general instruction following rather than on math or logic tasks. We argue that "thinking" should have broad utility. For example, in a creative writing task, internal thoughts can be used to plan the overall structure and characters; in other tasks, internal thoughts can help the model understand the user instruction better. Of course, simpler tasks likely require less thinking and more complex ones require more. In general, we hypothesize that such Thinking LLMs will have an advantage on all sufficiently complex tasks.
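To make the training loop described above concrete, here is a minimal Python sketch of one iteration under stated assumptions: sample several thought-plus-response candidates per instruction, score only the response part with a judge model, and form preference pairs from the best- and worst-scoring candidates for preference optimization (e.g., DPO). The names `generate`, `judge_score`, and `dpo_update`, as well as the `<response>` separator in the prompt format, are hypothetical placeholders, not the authors' actual implementation.

```python
# Hedged sketch of one iteration of thought-candidate search plus
# preference optimization. `generate`, `judge_score`, and `dpo_update`
# are hypothetical stand-ins for a sampling call, a judge model, and a
# preference-optimization step; they are not defined here.

THOUGHT_PROMPT = (
    "Respond to the user instruction. First write your internal thoughts, "
    "then write the final response after a <response> marker.\n"
    "Instruction: {instruction}\n"
)

def split_thought_and_response(output: str) -> tuple[str, str]:
    """Assume the model separates its thought and response with <response>."""
    thought, _, response = output.partition("<response>")
    return thought.strip(), response.strip()

def build_preference_pairs(instructions, num_candidates: int = 8):
    """Collect thought+response candidates and keep best/worst per instruction."""
    pairs = []
    for instruction in instructions:
        candidates = []
        for _ in range(num_candidates):
            output = generate(THOUGHT_PROMPT.format(instruction=instruction))
            thought, response = split_thought_and_response(output)
            # Key point from the paper: the judge sees only the response,
            # so thoughts are optimized indirectly through their effect
            # on response quality.
            score = judge_score(instruction, response)
            candidates.append((score, output))
        candidates.sort(key=lambda c: c[0])
        worst, best = candidates[0][1], candidates[-1][1]
        pairs.append({"prompt": instruction, "chosen": best, "rejected": worst})
    return pairs

# One training iteration (usage sketch): collect pairs, then run
# preference optimization on the full thought+response outputs.
# pairs = build_preference_pairs(train_instructions)
# dpo_update(model, pairs)
```

The design choice worth noting is that the thought text never receives a direct reward; it is kept or discarded only according to how the judge rates the response it led to, which is what allows training without human-written thought data.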
Oct-14-2024