AITopics

ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks -- text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing -- thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.

large language model, pattern recognition, pixel, (22 more...)

2511.15717

Genre:

Research Report > New Finding (1.00)
Workflow (0.93)

Industry: Education (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Graph-Memoized Reasoning: Foundations Structured Workflow Reuse in Intelligent Systems

Singh, Yash Raj

Modern large language model-based reasoning systems frequently recompute similar reasoning steps across tasks, wasting computational resources, inflating inference latency, and limiting reproducibility. These inefficiencies underscore the need for persistent reasoning mechanisms that can recall and reuse prior computational traces. We introduce Graph-Memoized Reasoning, a formal framework for representing, storing, and reusing reasoning workflows as graph-structured memory. By encoding past decision graphs and retrieving them through structural and semantic similarity, our approach enables compositional reuse of subgraphs across new reasoning tasks. We formulate an optimization objective that minimizes total reasoning cost regularized by inconsistency between stored and generated workflows, providing a theoretical foundation for efficiency-consistency trade-offs in intelligent systems. We outline a conceptual evaluation protocol aligned with the proposed optimization objective. This framework establishes the groundwork for interpretable, cost-efficient, and self-improving reasoning architectures, offering a step toward persistent memory in large-scale agentic systems.

artificial intelligence, graph, reuse, (15 more...)

2511.15715

Genre: Workflow (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Shojaee, Parshin, Mirzadeh, Iman, Alizadeh, Keivan, Horton, Maxwell, Bengio, Samy, Farajtabar, Mehrdad

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrates advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.

large language model, machine learning, natural language, (21 more...)

2506.06941

Country: Asia (0.49)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

PoE-World: Compositional World Modeling with Products of Programmatic Experts

Piriyakulkij, Wasu Top, Liang, Yichao, Tang, Hao, Weller, Adrian, Kryven, Marta, Ellis, Kevin

Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.

machine learning, natural language, world model, (18 more...)

2505.10819

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (1.00)
Health & Medicine (0.92)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Neural Information Processing SystemsNov-20-2025, 19:31:59 GMT

Norm-Ranging LSH for Maximum Inner Product Search

Xiao Yan, Jinfeng Li, Xinyan Dai, Hongzhi Chen, James Cheng

MIPS is a challenging problem as modern datasets often have high dimensionality and large cardinality. Initially, tree-based methods [Ram and Gray, 2012, Koenigstein et al., 2012] were proposed for MIPS, which use the idea of branch and bound similar to k-d tree [Friedman and Tukey, 1974].

artificial intelligence, information retrieval, natural language, (19 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
North America > Canada > Quebec > Montreal (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.48)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.46)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.34)

Neural Information Processing SystemsNov-20-2025, 16:57:46 GMT

Learning Loop Invariants for Program Verification

Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, Le Song

The problem is undecidable and even practical instances are challenging.

logic & formal reasoning, machine learning, programming language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Workflow (0.68)

Technology:

Information Technology > Software Engineering (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.96)
(4 more...)

David Ha, Jürgen Schmidhuber

Recurrent World Models Facilitate Policy Evolution

Neural Information Processing SystemsNov-20-2025, 15:24:28 GMT

The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(6 more...)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Health & Medicine > Therapeutic Area > Neurology (0.68)
Leisure & Entertainment > Sports (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)

arXiv.org Artificial IntelligenceNov-20-2025

DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Yin, Cheng, Lin, Yankai, Xu, Wang, Tam, Sikyuen, Zeng, Xiangrui, Liu, Zhiyuan, Yin, Zhouping

Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance. Vision-Language-Action (VLA) models have driven notable progress in robotic manipulation, enabling tasks like stacking blocks, opening drawers, and arranging household objects (Huang et al., 2023; Zitkovich et al., 2023; Y ang et al., 2024; Cadene et al., 2024). The dominant paradigm learns a reactive, end-to-end policy that directly maps high-level goals and sensory inputs to low-level motor commands (Chi et al., 2023; Kim et al., 2024; Bjorck et al., 2025).

artificial intelligence, arxiv preprint arxiv, machine learning, (16 more...)

2511.15669

Country: Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.64)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.46)

arXiv.org Artificial IntelligenceNov-20-2025

C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models

Oh, Nayoung, Kim, Dohyun, Bang, Junhyeong, Paul, Rohan, Park, Daehyung

Space grounding refers to localizing a set of spatial references described in natural language instructions. Traditional methods often fail to account for complex reasoning -- such as distance, geometry, and inter-object relationships -- while vision-language models (VLMs), despite strong reasoning abilities, struggle to produce a fine-grained region of outputs. To overcome these limitations, we propose C2F-Space, a novel coarse-to-fine space-grounding framework that (i) estimates an approximated yet spatially consistent region using a VLM, then (ii) refines the region to align with the local environment through superpixelization. For the coarse estimation, we design a grid-based visual-grounding prompt with a propose-validate strategy, maximizing VLM's spatial understanding and yielding physically and semantically valid canonical region (i.e., ellipses). For the refinement, we locally adapt the region to surrounding environment without over-relaxed to free space. We construct a new space-grounding benchmark and compare C2F-Space with five state-of-the-art baselines using success rate and intersection-over-union. Our C2F-Space significantly outperforms all baselines. Our ablation study confirms the effectiveness of each module in the two-step process and their synergistic effect of the combined framework. We finally demonstrate the applicability of C2F-Space to simulated robotic pick-and-place tasks.

artificial intelligence, large language model, natural language, (20 more...)

2511.15333

Country: Asia (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)

arXiv.org Artificial IntelligenceNov-20-2025

Towards High-Consistency Embodied World Model with Multi-View Trajectory Videos

Su, Taiyi, Zhu, Jian, Li, Yaxuan, Ma, Chong, Huang, Zitai, Wang, Hanli, Xu, Yi

Embodied world models aim to predict and interact with the physical world through visual observations and actions. However, existing models struggle to accurately translate low-level actions (e.g., joint positions) into precise robotic movements in predicted frames, leading to inconsistencies with real-world physical interactions. To address these limitations, we propose MTV-World, an embodied world model that introduces Multi-view Trajectory-Video control for precise visuomotor prediction. Specifically, instead of directly using low-level actions for control, we employ trajectory videos obtained through camera intrinsic and extrinsic parameters and Cartesian-space transformation as control signals. However, projecting 3D raw actions onto 2D images inevitably causes a loss of spatial information, making a single view insufficient for accurate interaction modeling. To overcome this, we introduce a multi-view framework that compensates for spatial information loss and ensures high-consistency with physical world. MTV-World forecasts future frames based on multi-view trajectory videos as input and conditioning on an initial frame per view. Furthermore, to systematically evaluate both robotic motion precision and object interaction accuracy, we develop an auto-evaluation pipeline leveraging multimodal large models and referring video object segmentation models. To measure spatial consistency, we formulate it as an object location matching problem and adopt the Jaccard Index as the evaluation metric. Extensive experiments demonstrate that MTV-World achieves precise control execution and accurate physical interaction modeling in complex dual-arm scenarios.

artificial intelligence, arxiv preprint arxiv, spatial reasoning, (16 more...)

2511.12882

Genre: Research Report (0.82)

Industry: Media > Television (0.79)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.85)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (0.54)