AITopics | multi-step reasoning

multi-step reasoning

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Learning to Reason and Memorize with Self-Notes

Neural Information Processing Systems, Dec-24-2025, 06:46:08 GMT

Large language models have been shown to struggle with multi-step reasoning, and do not retain previous reasoning steps for future use. We propose a simple method for solving both of these problems by allowing the model to take Self-Notes. Unlike recent chain-of-thought or scratchpad approaches, the model can deviate from the input context at any time to explicitly think and write down its thoughts. This allows the model to perform reasoning on the fly as it reads the context and even integrate previous reasoning steps, thus enhancing its memory with useful information and enabling multi-step reasoning. Experiments across a wide variety of tasks demonstrate that our method can outperform chain-of-thought and scratchpad methods by taking Self-Notes that interleave the input text.

name change, reason and memorize, self-note, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.62)

Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Zhang, Yiran, Lin, Mingyang, Dras, Mark, Naseem, Usman

arXiv.org Artificial Intelligence, Nov-14-2025

Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct "what-if" analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model's step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

artificial intelligence, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2511.10182

Country: Europe > Austria > Vienna (0.15)

Genre: Research Report (0.65)

Industry: Transportation > Air (0.41)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)

Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning

Lyu, Tianwen, Zhuang, Xiang, Ding, Keyan, Cao, Xinzhe, Liang, Lei, Zhao, Wei, Zhang, Qiang, Chen, Huajun

arXiv.org Artificial Intelligence, Nov-12-2025

Understanding complex biomolecular mechanisms requires multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. While large language models (LLMs) show promise in such tasks, their application to biomolecular problems is hindered by logical inconsistencies and the lack of grounding in domain knowledge. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. To address these challenges, we propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains. The framework constructs mechanistic chains via guided multi-hop traversal and pruning on the knowledge graph; these chains are then incorporated into supervised fine-tuning to improve factual grounding and further refined with reinforcement learning to enhance reasoning reliability and consistency. Furthermore, to overcome the shortcomings of existing benchmarks, which are often restricted in scale and scope and lack annotations for deep reasoning chains, we introduce PrimeKGQA, a comprehensive benchmark for biomolecular question answering. Experimental results on both PrimeKGQA and existing datasets show that although larger closed-source models still perform well on relatively simple tasks, our method has clear advantages as reasoning depth increases, achieving state-of-the-art performance on multi-hop tasks that demand traversal of structured biological knowledge. These findings highlight the effectiveness of combining structured knowledge with advanced reasoning strategies for reliable and interpretable biomolecular reasoning.
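To make "multi-hop traversal and pruning on the knowledge graph" a little more concrete, here is a minimal sketch, not the authors' method. It assumes the graph is a plain dictionary of (relation, neighbour) edges, uses a depth-first search bounded by a hop limit, and prunes with a caller-supplied relation filter. The function name `multi_hop_chains`, the `keep` parameter, and the toy entities are all hypothetical.

```python
def multi_hop_chains(graph, start, goal, max_hops, keep=lambda relation: True):
    """Enumerate loop-free chains from start to goal using at most max_hops edges.

    graph maps a node to a list of (relation, neighbour) pairs.
    keep(relation) returning False prunes that edge (an assumed pruning rule).
    """
    chains = []

    def dfs(node, path, visited):
        if node == goal and path:
            chains.append(list(path))  # record a copy of the finished chain
            return
        if len(path) == max_hops:
            return  # hop budget exhausted
        for relation, neighbour in graph.get(node, []):
            if neighbour in visited or not keep(relation):
                continue  # skip cycles and pruned relations
            path.append((node, relation, neighbour))
            dfs(neighbour, path, visited | {neighbour})
            path.pop()

    dfs(start, [], {start})
    return chains


# Toy graph with made-up entities, for illustration only.
toy_graph = {
    "drugA": [("inhibits", "proteinX"), ("binds", "proteinY")],
    "proteinX": [("regulates", "pathwayP")],
    "proteinY": [("regulates", "pathwayP")],
}
all_chains = multi_hop_chains(toy_graph, "drugA", "pathwayP", max_hops=2)
pruned = multi_hop_chains(toy_graph, "drugA", "pathwayP", max_hops=2,
                          keep=lambda r: r != "binds")
```

Each returned chain is a list of (head, relation, tail) triples, which is one plausible shape for feeding a mechanistic chain into a fine-tuning example.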

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.08024

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs

Zhang, Ben, Yu, LuLu, Gao, Lei, Guo, QuanJiang, Liu, Jing, Gao, Hui

arXiv.org Artificial Intelligence, Nov-6-2025

During reasoning in vision-language models (VLMs), false positive (FP) reasoning occurs when a model produces the correct answer but follows an incorrect reasoning path, undermining reasoning reliability. Existing approaches mainly rely on prompt engineering, knowledge distillation, or reinforcement learning to improve reasoning reliability, all of which require large amounts of high-quality data and thus limit practical applicability. Few approaches have focused on directly detecting and correcting FPs. To address these issues, we propose ViFP, a framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs. ViFP builds effective reasoning paths through multi-turn QA and dynamically analyzes the consistency of the reasoning path to identify potential FPs. It also introduces a targeted reasoning chain correction mechanism to modify FP reasoning, thereby improving logical consistency and accuracy. Finally, we introduce a reliability evaluation metric, VoC, which integrates answer accuracy and the FP rate, providing a quantitative tool to assess whether a VLM not only answers correctly but also reasons reliably. Our experiments on closed-source VLMs show that ViFP consistently improves performance across three datasets: A-OKVQA, OK-VQA, and FVQA. On A-OKVQA, ViFP improves accuracy by up to 5.4%, surpassing the previous state-of-the-art by 4.3%, and significantly reduces the number of FPs, validating its benefits in enhancing reasoning reliability.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.04201

Country:

North America > United States (0.14)
Asia > China (0.14)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.91)

Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

Zhai, Mingliang, Liang, Hansheng, Fan, Xiaomeng, Gao, Zhi, Li, Chuanhao, Sun, Che, Bin, Xu, Wu, Yuwei, Jia, Yunde

arXiv.org Artificial Intelligence, Oct-28-2025

Figure 1: Overview of the proposed ToolEQA for Embodied Question Answering (EQA). ToolEQA decomposes questions into structured plans, reasons to select tools, and invokes tools to explore and answer. ToolEQA achieves the highest accuracy with fewer reasoning steps.

Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtain additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically [...]. Based on the pipeline, we collect the EQA-RT dataset that contains about 18K tasks, divided into a training set EQA-RT-Train, and two test sets EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2% to 20.2% over state-of-the-art baselines, while outperforming the zero-shot ToolEQA by 10% in success rate.

large language model, machine learning, question answering, (19 more...)

arXiv.org Artificial Intelligence

2510.2031

Country: Asia > China (0.69)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.92)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

Li, Li, Wang, Ziyi, Wu, Yongliang, Cai, Jianfei, Yang, Xu

arXiv.org Artificial Intelligence, Oct-2-2025

Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.00579

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use

Zhang, Yabo, Zeng, Yihan, Li, Qingyun, Hu, Zhen, Han, Kavin, Zuo, Wangmeng

arXiv.org Artificial Intelligence, Sep-17-2025

Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.12867

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Rodkin, Ivan, Orel, Daniil, Smirnov, Konstantin, Bolatov, Arman, Elbouardi, Bilal, Hassan, Besher, Kuratov, Yuri, Bulatov, Aydar, Nakov, Preslav, Baldwin, Timothy, Shelmanov, Artem, Burtsev, Mikhail

arXiv.org Artificial Intelligence, Aug-26-2025

Reasoning is a core capability of large language models, yet understanding how they learn and perform multi-step reasoning remains an open problem. In this study, we explore how different architectures and training methods affect model multi-step reasoning capabilities within a cellular automata framework. By training on state sequences generated with random Boolean functions for random initial conditions to exclude memorization, we demonstrate that most neural architectures learn to abstract the underlying rules. While models achieve high accuracy in next-state prediction, their performance declines sharply if multi-step reasoning is required. We confirm that increasing model depth plays a crucial role for sequential computations. We demonstrate that an extension of the effective model depth with recurrence, memory, and test-time compute scaling substantially enhances reasoning capabilities.
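The data setup described above can be illustrated with a small sketch. It assumes a one-dimensional binary automaton with a three-cell neighbourhood and wrap-around boundaries, which may differ from the paper's configuration; the function names `random_rule`, `step`, and `rollout` are hypothetical.

```python
import random


def random_rule(seed):
    """Draw a random Boolean function of 3 inputs as an 8-entry lookup table."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(8)]


def step(state, rule):
    """Apply the rule once to every cell, using wrap-around neighbours."""
    n = len(state)
    next_state = []
    for i in range(n):
        left, mid, right = state[(i - 1) % n], state[i], state[(i + 1) % n]
        next_state.append(rule[(left << 2) | (mid << 1) | right])
    return next_state


def rollout(state, rule, k):
    """Apply the rule k times: k = 1 is next-state prediction, k > 1 is multi-step."""
    for _ in range(k):
        state = step(state, rule)
    return state


# A fresh random rule and random initial condition per sample, so the
# mapping has to be inferred from the sequence rather than memorized.
rule = random_rule(seed=0)
rng = random.Random(1)
initial = [rng.randint(0, 1) for _ in range(16)]
one_step_target = rollout(initial, rule, 1)
four_step_target = rollout(initial, rule, 4)
```

Predicting `four_step_target` directly from `initial` requires composing the rule four times, which is the kind of sequential computation the abstract says benefits from greater effective depth.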

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2508.16745

Country:

North America > United States (0.68)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Large Language Models are Zero-Shot Reasoners

Neural Information Processing Systems, Aug-16-2025, 20:00:53 GMT

While these successes are often attributed to LLMs'

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)

Genre: Research Report (0.69)

Industry: Health & Medicine > Consumer Health (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)

ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

Yin, Shaofeng, Lei, Ting, Liu, Yang

arXiv.org Artificial Intelligence, Aug-6-2025

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K samples, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs image-guided Depth-First Search (DFS) with a Longest Common Subsequence (LCS)-based example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse domains, with an average inference length of 2.78 reasoning steps per sample. The LLaVA-7B model fine-tuned on ToolVQA not only achieves impressive performance on the ToolVQA test set, but also surpasses the large closed-source model GPT-3.5-turbo on five out-of-distribution (OOD) datasets, showing strong generalizability in real-world tool-use scenarios. Code is available at https://github.com/Fugtemypt123/T
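For readers unfamiliar with LCS-based matching, the sketch below shows one way such a matcher could work; it is an assumption for illustration, not ToolEngine's implementation. It scores stored examples by the length of the longest common subsequence between their tool sequence and the current one. The helper `best_match` and the tool names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (standard DP table)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1  # extend a common subsequence
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop one element
    return dp[len(a)][len(b)]


def best_match(query, examples):
    """Return the example tool sequence sharing the longest subsequence with query."""
    return max(examples, key=lambda example: lcs_length(query, example))


# Made-up tool sequences, for illustration only.
current_path = ["ocr", "search", "calculator"]
example_bank = [["caption"], ["ocr", "calculator"], ["search"]]
closest = best_match(current_path, example_bank)
```

Because a subsequence need not be contiguous, `["ocr", "calculator"]` still matches well against a path that inserts a `"search"` step in between, which is the property that makes LCS a natural similarity measure for ordered tool calls.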

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2508.03284

Country: Asia (0.46)

Genre:

Research Report > New Finding (0.74)
Research Report > Promising Solution (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
