8bb0d291acd4acf06ef112099c16f326-Supplemental-Conference.pdf

Neural Information Processing Systems

Dataset        Answer format   # samples   Avg. words per question   License
Last Letters   F               500         15.0                       -
Coin Flip      Y               500         37.0                       -

A.2.2 Dataset creation

Regarding "Last Letter Concatenation" and "Coin Flip", the datasets are not publicly available, so we created them following Wei et al. [2022] with a minor rephrasing of the question template. For Coin Flip, we use the following template.

A.5 Prompts for Answer Extraction

Table 9 and Table 10 summarize the answer extraction prompts used for the experiments in Table 1.

Number: pick up the first number encountered in the text.
Multiple choice: pick up the first capital letter encountered in the text.
Yes or no: pick up the first "yes" or "no" encountered in the text, after removing unnecessary letters.

Table 13 lists example texts generated by Zero-shot-CoT for each reasoning extraction template (see Table 4).

Dataset: SingleEq
Q: A spaceship traveled 0.5 of a light-year from Earth to Planet X and 0.1 of a light-year from Planet X to Planet Y. It then traveled 0.1 of a light-year from Planet Y back to Earth. How many light-years did the spaceship travel in all?
A: Let's think step by step. So the total distance the spaceship traveled is 0.5 + 0.1 + 0.1 = 0.7 light-years. Therefore, the answer (arabic numerals) is: 0.7 light-years.

Q: While making desserts for a bake sale, Victor used 0.625 of a scoop of brown sugar as well as 0.25 of a scoop of white sugar. How much more brown sugar did Victor use?
A: Let's think step by step.
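These extraction rules are simple enough to express as regular expressions. The sketch below is one plausible rendering; the function names and the A-F choice range are illustrative assumptions, not the paper's exact cleansing code.

```python
import re

def extract_number(text: str) -> str | None:
    """Pick up the first number encountered in the text."""
    m = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return m.group(0) if m else None

def extract_choice(text: str) -> str | None:
    """Pick up the first capital letter (assumed multiple-choice labels A-F)."""
    m = re.search(r"[A-F]", text)
    return m.group(0) if m else None

def extract_yes_no(text: str) -> str | None:
    """Pick up the first "yes" or "no" after stripping unnecessary letters."""
    cleaned = re.sub(r"[^a-z ]", "", text.lower())
    m = re.search(r"\b(yes|no)\b", cleaned)
    return m.group(1) if m else None

print(extract_number("Therefore, the answer (arabic numerals) is: 0.7 light-years"))  # 0.7
print(extract_yes_no("So the answer is: Yes."))  # yes
```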


Large Language Models are Zero-Shot Reasoners

Neural Information Processing Systems

Notably, chain-of-thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved state-of-the-art performance in arithmetic and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
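The method behind this claim is a two-stage prompt: a first call appends the trigger phrase to elicit a rationale, and a second call appends an answer-extraction phrase to read the final answer off that rationale. A minimal sketch, where `llm` stands in for any prompt-to-completion function rather than a specific API:

```python
# Minimal two-stage Zero-shot-CoT sketch. `llm` is a placeholder callable
# (prompt string -> completion string); the extraction phrase shown is the
# numeric-answer variant that appears in the paper's examples.
def zero_shot_cot(llm, question: str) -> str:
    # Stage 1: reasoning extraction - the trigger elicits a step-by-step rationale.
    prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = llm(prompt)
    # Stage 2: answer extraction - feed the rationale back and ask for the answer.
    return llm(f"{prompt} {rationale}\nTherefore, the answer (arabic numerals) is")
```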


Let's nitpick about the physics of Stranger Things, not its ending

New Scientist

Feedback has seen all the fuss about the finale of Stranger Things, but would like to point out that if we're going to dissect the plot, we have bigger things to worry about. In common, it seems, with a substantial fraction of the human species, Feedback spent part of our holiday watching the final episodes of Stranger Things. We laughed, we cried, we wondered if it would have even more endings than The Return of the King (it did). As is almost inevitable these days, a group of fans vocally disliked the finale, and went so far as to create a conspiracy theory about it. According to "Conformity Gate" (don't blame us, we didn't name it), the finale wasn't the real finale - despite lasting more than 2 hours, costing an enormous amount of money and being shown in cinemas. No, a super-secret final episode was going to air in January, which would reveal the true ending.


Where does an LLM begin computing an instruction?

Pola, Aditya, Balasubramanian, Vineeth N.

arXiv.org Artificial Intelligence

Following an instruction involves distinct sub-processes, such as reading the content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins: the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two-hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term the onset: interventions that change predictions before this point become largely ineffective after it. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
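As a rough illustration of the flip-rate measurement, the sketch below patches the residual-stream output of one decoder layer from a contrast prompt into a clean prompt and checks whether the next-token prediction changes. The model name, layer index, toy prompts, and whole-sequence patching granularity are assumptions for the sketch, not the authors' exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"  # illustrative choice of Llama-family model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def residual_at(prompt: str, layer: int) -> torch.Tensor:
    """Run `prompt` and cache the residual-stream output of decoder `layer`."""
    cache = {}
    def grab(module, inputs, output):
        cache["h"] = output[0].detach()  # decoder layers return a tuple
    handle = model.model.layers[layer].register_forward_hook(grab)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

def predict(prompt: str, layer: int | None = None, donor=None) -> int:
    """Next-token prediction, optionally with `donor` activations patched in."""
    handle = None
    if donor is not None:
        def patch(module, inputs, output):
            n = min(output[0].shape[1], donor.shape[1])
            output[0][:, :n] = donor[:, :n]  # overwrite residual activations in place
        handle = model.model.layers[layer].register_forward_hook(patch)
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits
    if handle is not None:
        handle.remove()
    return int(logits[0, -1].argmax())

# Flip rate at one layer: fraction of minimal-contrast pairs whose predicted
# answer changes when the contrast prompt's activations are substituted in.
pairs = [("key a=1 b=2; value of a?", "key a=1 b=2; value of b?")]  # toy pair
layer = 8
flips = sum(
    predict(clean, layer, residual_at(contrast, layer)) != predict(clean)
    for clean, contrast in pairs
)
print(f"flip rate at layer {layer}: {flips / len(pairs):.2f}")
```

Sweeping `layer` over the stack and plotting the flip rate per layer is what would expose the onset: the layer past which patches stop changing the answer.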



Training-free LLM Verification via Recycling Few-shot Examples

Lee, Dongseok, Hong, Jimyung, Kim, Dongyoung, Kim, Jaehyung

arXiv.org Artificial Intelligence

Although LLMs have achieved remarkable performance, the inherent stochasticity of their reasoning process, which can yield varying conclusions, presents significant challenges. Majority voting and Best-of-N with external verification models have been explored to find the most promising solution among multiple LLM outputs. However, these approaches have certain limitations, such as limited applicability or the cost of an additional training step. To address this problem, we propose a novel and effective framework that Recycles Few-shot examples to verify LLM outputs (ReFeri). Our key idea is to additionally use the given few-shot examples to evaluate the candidate outputs for the target query, rather than only using them to generate outputs as in the conventional few-shot prompting setup. Specifically, ReFeri evaluates the generated outputs by combining two scores motivated by Bayes' rule, and then selects the candidate that is both confidently determined and contextually coherent, using a few additional LLM inferences. Experiments with three different LLMs across seven diverse tasks demonstrate that our framework significantly improves the accuracy of LLMs, achieving an average gain of 4.8% through effective response selection, without additional training.
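The abstract does not spell out the two scores, but a plausible reading is one term for how confidently the few-shot prompt produces the candidate, and one for how coherently the candidate, recycled as context, predicts the held-out few-shot answers. A hedged sketch under that assumption, with `logprob(context, continuation)` as a placeholder returning the model's average token log-probability of the continuation:

```python
# Speculative ReFeri-style selection sketch; the two scores below are one
# reading of the abstract, not the paper's exact Bayes-rule formulation.
def select(logprob, shots, query, candidates):
    few_shot = "\n".join(f"Q: {q}\nA: {a}" for q, a in shots)

    def confidence(c):
        # Score 1: how confidently the model produces the candidate given
        # the standard few-shot prompt for the target query.
        return logprob(f"{few_shot}\nQ: {query}\nA:", c)

    def coherence(c):
        # Score 2: recycle each few-shot example as a held-out check; a good
        # candidate, used as added context, should keep its answer likely.
        ctx = f"Q: {query}\nA: {c}"
        return sum(logprob(f"{ctx}\nQ: {q}\nA:", a) for q, a in shots) / len(shots)

    return max(candidates, key=lambda c: confidence(c) + coherence(c))
```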


WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Li, Kuan, Zhang, Zhongwang, Yin, Huifeng, Ye, Rui, Zhao, Yida, Zhang, Liwen, Ou, Litu, Zhang, Dingchu, Wu, Xixi, Wu, Jialong, Wang, Xinyu, Qiao, Zile, Zhang, Zhen, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhou, Jingren

arXiv.org Artificial Intelligence

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.