8bb0d291acd4acf06ef112099c16f326-Supplemental-Conference.pdf

Neural Information Processing Systems

Last Letters F 500 15.0 - Coin Flip Y 500 37.0 -

A.2.2 Dataset creation

Regarding "Last Letter Concatenation" and "Coin Flip", datasets are not publicly available so we created the datasets following Wei et al. [2022] with a minor rephrasing of the question template. As for Coin Flip, we use the following template.

A.5 Prompts For Answer Extraction

Table 9 and Table 10 summarizes a list of answer extraction prompts used for the experiments at Table 1.

Number: Pick up the first number encountered in the text.
Multiple Choice: Pick up the first large letter encountered in the text.
Yes or No: Pick up the first "yes" or "no" encountered in the text after removing unnecessary letters.

Table 13 lists example texts generated by Zero-shot-CoT for each reasoning extraction template (See Table 4).

Dataset Question Answer

SingleEq
Q: A spaceship traveled 0.5 of a light-year from Earth to Planet X and 0.1 of a light-year from Planet X to Planet Y.
A: Let's think step by step. So the total distance the spaceship traveled is 0.5 + 0.1 + 0.1 = 0.7 light-years. Therefore, the answer (arabic numerals) is: 0.7 light-years

Q: While making desserts for a bake sale, Victor used 0.625 of a scoop of brown sugar as well as 0.25 of a scoop of white sugar. How much more brown sugar did Victor use?
A: Let's think step by step.
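The three answer-cleansing rules quoted above (first number, first large letter, first "yes" or "no") could be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names, the default choice set `ABCDE`, the comma stripping, and the handling of a leading minus sign are all assumptions made for the example.

```python
import re
from typing import Optional


def extract_number(text: str) -> Optional[str]:
    """Return the first number found in the text, or None.

    Assumption: thousands separators are dropped and an optional
    leading minus sign and decimal part are accepted.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return match.group(0) if match else None


def extract_choice(text: str, choices: str = "ABCDE") -> Optional[str]:
    """Return the first capital letter that is a valid choice, or None.

    Assumption: the valid choices are the letters in `choices`.
    """
    match = re.search(f"[{choices}]", text)
    return match.group(0) if match else None


def extract_yes_no(text: str) -> Optional[str]:
    """Return the first standalone "yes" or "no", or None.

    Everything except lowercase letters and spaces is removed first,
    mirroring "after removing unnecessary letters".
    """
    cleaned = re.sub(r"[^a-z ]", " ", text.lower())
    for word in cleaned.split():
        if word in ("yes", "no"):
            return word
    return None
```

Applied to the SingleEq example above, `extract_number("0.7 light-years")` would pick out `"0.7"`. Note that the first-match rule is deliberately simple, so a capital letter earlier in the text (such as the "A" in "A: ...") would be picked up before the intended choice.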


Large Language Models are Zero-Shot Reasoners

Neural Information Processing Systems

Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer.
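As a rough sketch of how this prompting might be wired up, the snippet below builds a reasoning prompt ending in "Let's think step by step." and then a second prompt that appends the generated reasoning plus an answer trigger. The trigger wording follows the example text in the supplemental excerpt; the function names and the `generate` callable (a stand-in for whatever text-completion model is used) are hypothetical and introduced only for illustration.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# Answer trigger for numeric tasks, as it appears in the example output.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """First prompt: the question followed by the reasoning trigger."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Second prompt: first prompt, generated reasoning, answer trigger."""
    first = build_reasoning_prompt(question)
    return f"{first} {reasoning.strip()}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied completion function.

    `generate` is a hypothetical callable mapping a prompt to the
    model's continuation; no specific model API is assumed.
    """
    reasoning = generate(build_reasoning_prompt(question))
    return generate(build_answer_prompt(question, reasoning)).strip()
```

The raw string returned by the second call would then typically be passed through an answer-cleansing step before scoring.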



WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Li, Kuan, Zhang, Zhongwang, Yin, Huifeng, Ye, Rui, Zhao, Yida, Zhang, Liwen, Ou, Litu, Zhang, Dingchu, Wu, Xixi, Wu, Jialong, Wang, Xinyu, Qiao, Zile, Zhang, Zhen, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhou, Jingren

arXiv.org Artificial Intelligence

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.




CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

Yuan, Shuzhou, LaCroix, William, Ghoshal, Hardik, Nie, Ercong, Färber, Michael

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.


P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs

Jang, Dongjun, Ahn, Youngchae, Shin, Hyopil

arXiv.org Artificial Intelligence

This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.


Graceful Forgetting in Generative Language Models

Jiang, Chunyang, Chan, Chi-min, Cai, Yiyang, Liu, Yulong, Xue, Wei, Guo, Yike

arXiv.org Artificial Intelligence

Recently, the pretrain-finetune paradigm has become a cornerstone in various deep learning areas. While in general the pre-trained model promotes both the effectiveness and efficiency of fine-tuning on downstream tasks, studies have shown that not all knowledge acquired during pre-training is beneficial. Some of the knowledge may actually bring detrimental effects to the fine-tuning tasks, which is also known as negative transfer. To address this problem, graceful forgetting has emerged as a promising approach. The core principle of graceful forgetting is to enhance the learning plasticity of the target task by selectively discarding irrelevant knowledge. However, this approach remains underexplored in the context of generative language models, and it is often challenging to migrate existing forgetting algorithms to these models due to architecture incompatibility. To bridge this gap, in this paper we propose a novel framework, Learning With Forgetting (LWF), to achieve graceful forgetting in generative language models. With Fisher Information Matrix weighting the intended parameter updates, LWF computes forgetting confidence to evaluate self-generated knowledge regarding the forgetting task, and consequently, knowledge with high confidence is periodically unlearned during fine-tuning. Our experiments demonstrate that, although thoroughly uncovering the mechanisms of knowledge interaction remains challenging in pre-trained language models, applying graceful forgetting can contribute to enhanced fine-tuning performance.