AITopics

2509.18316

Country: North America > United States > Colorado (0.46)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.88)

Industry:

Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area (0.68)
Health & Medicine > Pharmaceuticals & Biotechnology (0.66)
Health & Medicine > Health Care Technology > Medical Record (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

de Araújo, George Corrêa, Maia, Helena de Almeida, Pedrini, Helio

A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts

arXiv.org Artificial IntelligenceSep-24-2025

In this paper, we present the Scrapbook framework, a novel methodology designed to generate extensive datasets for probing the learned concepts of artificial intelligence (AI) models. The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification. By generating datasets with a large number of questions about individual concepts and a wide linguistic variation, the Scrapbook framework aims to validate the model's understanding of these basic elements before tackling more complex tasks. Our experimental findings reveal that, while contemporary models demonstrate proficiency in recognizing and enumerating objects, they encounter challenges in comprehending positional information and addressing inquiries with additional constraints. Specifically, the MobileVLM-V2 model showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions involving geometric shapes and positional information, indicating areas for improvement in understanding and consistency. The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets, which can be utilized to systematically assess and enhance the performance of AI models.

large language model, machine learning, natural language, (15 more...)

2509.18177

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry: Health & Medicine (0.92)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

arXiv.org Artificial IntelligenceSep-24-2025

Mitigating Strategy-Selection Bias in Reasoning for More Effective Test-Time Scaling

Wu, Zongqian, Xu, Baoduo, Li, Tianyu, Sun, Zhu, Zhu, Xiaofeng, Feng, Lei

Test-time scaling (TTS) has been shown to improve the performance of large language models (LLMs) by sampling and aggregating diverse reasoning paths. However, existing research has overlooked a critical issue: selection bias of reasoning strategies during scaling. Specifically, when generating reasoning processes, LLMs tend to follow certain strategies (e.g., algebraic solutions for math problems) while neglecting other valid alternatives (e.g., geometric solutions), resulting in insufficient exploration of the solution space. To further understand the impact of this bias, we present a theoretical analysis that reveals when it undermines the effectiveness of test-time scaling. Motivated by this theoretical insight, we introduce TTS-Uniform, a framework designed to mitigate the selection bias of reasoning strategies. It (i) identifies potential strategies, (ii) uniformly allocates the sampling budget across them, and (iii) filters out unstable strategies prior to aggregation. Experimental results show that TTS-Uniform significantly enhances scaling effectiveness across multiple mainstream LLMs and benchmark datasets. Code is available at https://github.com/zongqianwu/Uniform-TTS. Chain-of-thought (CoT) (Wei et al., 2022; Kojima et al., 2022) enhances the reasoning capabilities of large language models (LLMs) by explicitly unfolding intermediate steps (i.e., reasoning paths) before arriving at the final answer. Building on CoT, test-time scaling (TTS) (Zhang et al., 2025; Ji et al., 2025) further improves performance by sampling and aggregating diverse paths. However, existing TTS research (Wang et al., 2022; Snell et al., 2024) overlooks a critical limitation of CoT, which in turn constrains the effectiveness of scaling. Specifically, when tackling a problem, CoT reasoning tends to follow certain strategies while neglecting other valid alternatives.

large language model, machine learning, natural language, (16 more...)

2509.17905

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Wei, Yana, Zhao, Liang, Sun, Jianjian, Lin, Kangheng, Yin, Jisheng, Hu, Jingcheng, Zhang, Yinmin, Yu, En, Lv, Haoran, Weng, Zejia, Wang, Jia, Han, Chunrui, Peng, Yuang, Han, Qi, Ge, Zheng, Zhang, Xiangyu, Jiang, Daxin, Patel, Vishal M.

The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

2507.05255

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Anugraha, David, Tang, Zilu, Miranda, Lester James V., Zhao, Hanyang, Farhansyah, Mohammad Rifqi, Kuwanto, Garry, Wijaya, Derry, Winata, Genta Indra

R3: Robust Rubric-Agnostic Reward Models

Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce $\shortmethodname$, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. $\shortmethodname$ enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

2505.13388

Country:

Asia (0.45)
North America (0.28)

Genre: Research Report > New Finding (0.92)

Industry:

Media (0.46)
Leisure & Entertainment (0.46)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs

Zhang, Bonan, Chen, Zhongqi, Song, Bowen, Li, Qinya, Wu, Fan, Chen, Guihai

Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations: First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards potentially lead to vanishing gradients. Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model's own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

2509.1773

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)

Pan, Tsung-Hsuan, Chen, Chung-Chi, Huang, Hen-Hsen, Chen, Hsin-Hsi

Diagnosing Model Editing via Knowledge Spectrum

Model editing, the process of efficiently modifying factual knowledge in pre-trained language models, is critical for maintaining their accuracy and relevance. However, existing editing methods often introduce unintended side effects, degrading model performance in unpredictable ways. While much research has focused on improving editing algorithms, the role of the target knowledge's intrinsic properties remains a significant, underexplored factor. This paper addresses this gap by first proposing the ``Knowledge Spectrum,'' a systematic framework for categorizing knowledge based on its real-world popularity, the model's pre-edit familiarity, and the linguistic structure of the eliciting question. Our empirical analysis reveals that these characteristics are strong predictors of editing success and stability. Informed by these findings, we introduce the ``Knowledge-Diagnostic Framework,'' an adaptive strategy that tailors editing intensity to the diagnosed difficulty of a knowledge item. We demonstrate that this framework significantly improves success rates for challenging edits while optimizing computational resources. Our work provides a more comprehensive understanding of the factors governing model editing.

large language model, machine learning, natural language, (18 more...)

2509.17482

Country: Asia (0.29)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.46)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)

Correlation or Causation: Analyzing the Causal Structures of LLM and LRM Reasoning Process

FU, Zhizhang, Bao, Guangsheng, Zhang, Hongbo, Hu, Chenkai, Zhang, Yue

LLMs suffer from critical reasoning issues such as unfaithfulness, bias, and inconsistency, since they lack robust causal underpinnings and may rely on superficial correlations rather than genuine understanding. Successive LRMs have emerged as a promising alternative, leveraging advanced training techniques such as reinforcement learning (RL) and distillation to improve task accuracy. However, the impact of these training methods on causality remains largely unexplored. In this study, we conduct a systematic causal analysis on LLMs and LRMs, examining structural causal models (SCMs) of four key variables: problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y). Our findings reveal that RLVR-trained LRMs exhibit enhanced causal reasoning capabilities, aligning more closely with ideal causal structures, while LLMs and distilled LRMs fail to address causality-related deficiencies. Our further investigation indicates that RLVR reduces spurious correlations and strengthens genuine causal patterns, thereby mitigating unfaithfulness and bias. In addition, our inspection on the dynamics of the RLVR training process observes a high correlation between reduced spurious features and improved causal structures, where the causal relationships consistently improve in the training process. This study contributes to the understanding of causality in reasoning models, highlights the critical role of RLVR in enhancing causal reasoning, and provides insights for designing future AI systems with stronger causal foundations. We release our code and data at https://github.com/Harryking1999/CoT_Causal_Analysis.

large language model, machine learning, natural language, (19 more...)

2509.1738

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

Heng, Liang, Xu, Jiadong, Wang, Yiwen, Li, Xiaoqi, Cai, Muhe, Shen, Yan, Zhu, Juan, Ren, Guanghui, Dong, Hao

Relational object rearrangement (ROR) tasks (e.g., insert flower to vase) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints or generate goal-state observations to capture semantic and geometric knowledge, but fail to explicitly couple object transformation with action prediction, resulting in errors due to generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns predicted end-effector motion with generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies. More visualizations can be found at https://sites.google.com/view/imagine2act.

large language model, machine learning, natural language, (17 more...)

2509.17125

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.71)
(2 more...)

How Good are Foundation Models in Step-by-Step Embodied Reasoning?

Dissanayake, Dinura, Heakl, Ahmed, Thawakar, Omkar, Ahsan, Noor, Thawkar, Ritesh, More, Ketan, Lahoud, Jean, Anwer, Rao, Cholakkal, Hisham, Laptev, Ivan, Khan, Fahad Shahbaz, Khan, Salman

Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.

large language model, machine learning, natural language, (20 more...)

2509.15293

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)