ARCTraj: A Dataset and Benchmark of Human Reasoning Trajectories for Abstract Problem Solving

Kim, Sejin, Choi, Hayan, Lee, Seokki, Kim, Sundong

arXiv.org Artificial Intelligence

We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input--output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object-level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC-AGI-1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence.


Simulating Society Requires Simulating Thought

Li, Chance Jiajie, Wu, Jiayi, Mo, Zhenze, Qu, Ao, Tang, Yuhan, Zhao, Kaiya Ivy, Gan, Yulu, Fan, Jie, Yu, Jiangbo, Zhao, Jinhua, Liang, Paul, Alonso, Luis, Larson, Kent

arXiv.org Artificial Intelligence

Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior; it demands cognitively grounded reasoning that is structured, revisable, and traceable. LLM-based agents are increasingly used to emulate individual and group behavior, primarily through prompting and supervised fine-tuning. Yet current simulations remain grounded in a behaviorist "demographics in, behavior out" paradigm, focusing on surface-level plausibility. As a result, they often lack internal coherence, causal reasoning, and belief traceability, making them unreliable for modeling how people reason, deliberate, and respond to interventions. To address this, we present a conceptual modeling paradigm, Generative Minds (GenMinds), which draws from cognitive science to support structured belief representations in generative agents. To evaluate such agents, we introduce the RECAP (REconstructing CAusal Paths) framework, a benchmark designed to assess reasoning fidelity via causal traceability, demographic grounding, and intervention consistency. These contributions advance a broader shift: from surface-level mimicry to generative agents that simulate thought, not just language, for social simulations.


The Universal Landscape of Human Reasoning

Chen, Qiguang, Liu, Jinhao, Qin, Libo, Zhang, Yimeng, Liang, Yihao, Ren, Shangxu, Luan, Chengyu, Peng, Dengyun, Li, Hanjing, Guan, Jiannan, Yan, Zheng, Wang, Jiaqi, Hu, Mengkang, Du, Yantao, Chen, Zhi, Chen, Xie, Che, Wanxiang

arXiv.org Artificial Intelligence

Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applying IF-Track to debates in psychological theory, we reconcile single- versus dual-process theories, discover alignment between artificial and human cognition, and examine how LLMs reshape the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.
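The per-step quantities described here can be sketched in a few lines: treat each step's predictive distribution as a probabilistic encoding, and log its entropy plus the information gain (entropy reduction) relative to the previous step. The distributions below are toy stand-ins for real LLM token probabilities, not IF-Track's implementation.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a categorical distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def information_flow(step_dists: list[list[float]]) -> list[tuple[float, float]]:
    """Map a reasoning chain to (entropy, gain) points in a shared metric space."""
    points = []
    prev_h = None
    for dist in step_dists:
        h = entropy(dist)
        gain = 0.0 if prev_h is None else prev_h - h  # uncertainty resolved this step
        points.append((h, gain))
        prev_h = h
    return points
```

Plotting many chains as trajectories through this (entropy, gain) space is one way a "single metric space" for reasoning behaviors could be visualized.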


Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction

Guo, Xu

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how much that reasoning actually contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on the MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary), with the context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation for future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
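The entropy-trend signal described above can be sketched as follows: score the answer distribution after each reasoning step and flag chains whose conditional entropy fails to decrease. The probability tables are toy stand-ins for a real LM's answer-span likelihoods, and the monotone-decrease check is one simple reading of the paper's observed pattern, not its exact criterion.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy H(Y | context) of an answer distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_trend(step_dists: list[list[float]]) -> list[float]:
    """Conditional entropy after each reasoning step is appended to the context."""
    return [entropy(d) for d in step_dists]

def looks_productive(entropies: list[float], tol: float = 1e-9) -> bool:
    """Heuristic from the finding: decreasing entropy correlates with correctness."""
    return all(b <= a + tol for a, b in zip(entropies, entropies[1:]))
```

An early-stopping policy could call `looks_productive` on the trend so far and prune the chain as soon as entropy plateaus or rises.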


Aristotle's Original Idea: For and Against Logic in the era of AI

Kakas, Antonis C.

arXiv.org Artificial Intelligence

The ideas that Aristotle raised in his study of logical reasoning carried the development of science over the centuries. Any scientific theory's mathematical formalization is one that falls under his idea of Demonstrative Science. Today, in the era of AI, this title of the fatherhood of logic has a renewed significance. Behind it lies his original idea that human reasoning could be studied as a process and that perhaps there exist universal systems of reasoning that underlie all human reasoning, irrespective of the content of what we are reasoning about. This is a daring idea, as it essentially says that the human mind can study itself and indeed that it has the capacity to unravel its own self. Irrespective of whether this is possible or not, it is a thought that is a prerequisite for the existence and development of Artificial Intelligence. In this article, we look into Aristotle's work on human thought, his work on reasoning itself but also on how it relates to science and human endeavour more generally, from a modern perspective of Artificial Intelligence, and ask if this can help enlighten our understanding of AI and Science more generally.


Giving AI Personalities Leads to More Human-Like Reasoning

Nighojkar, Animesh, Moydinboyev, Bekhzodbek, Duong, My, Licato, John

arXiv.org Artificial Intelligence

In computational cognitive modeling, capturing the full spectrum of human judgment and decision-making processes, beyond just optimal behaviors, is a significant challenge. This study explores whether Large Language Models (LLMs) can emulate the breadth of human reasoning by predicting both intuitive, fast System 1 and deliberate, slow System 2 processes. We investigate the potential of AI to mimic diverse reasoning behaviors across a human population, addressing what we call the "full reasoning spectrum problem". We designed reasoning tasks using a novel generalization of the Natural Language Inference (NLI) format to evaluate LLMs' ability to replicate human reasoning. The questions were crafted to elicit both System 1 and System 2 responses. Human responses were collected through crowd-sourcing and the entire distribution was modeled, rather than just the majority of the answers. We used personality-based prompting inspired by the Big Five personality model to elicit AI responses reflecting specific personality traits, capturing the diversity of human reasoning, and exploring how personality traits influence LLM outputs. Combined with genetic algorithms to optimize the weighting of these prompts, this method was tested alongside traditional machine learning models. The results show that LLMs can mimic human response distributions, with open-source models like Llama and Mistral outperforming proprietary GPT models. Personality-based prompting, especially when optimized with genetic algorithms, significantly enhanced LLMs' ability to predict human response distributions, suggesting that capturing suboptimal, naturalistic reasoning may require modeling techniques incorporating diverse reasoning styles and psychological profiles. The study concludes that personality-based prompting combined with genetic algorithms is promising for enhancing AI's 'human-ness' in reasoning.
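The weighting idea in this abstract can be sketched as follows: blend per-personality answer distributions into one predicted human distribution, and search the mixture weights with a tiny evolutionary loop. The persona distributions are toy stand-ins for LLM outputs under Big Five-style prompts, and the (1+1) mutation-only loop is a deliberately minimal stand-in for the paper's genetic algorithm.

```python
import random

def mix(persona_dists: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of per-persona answer distributions."""
    total = sum(weights)
    return [sum(w * d[i] for w, d in zip(weights, persona_dists)) / total
            for i in range(len(persona_dists[0]))]

def l1_error(p: list[float], q: list[float]) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))

def evolve_weights(persona_dists, human_dist, gens=200, seed=0):
    """Mutation-only (1+1) evolutionary search over mixture weights."""
    rng = random.Random(seed)
    best = [1.0] * len(persona_dists)
    best_err = l1_error(mix(persona_dists, best), human_dist)
    for _ in range(gens):
        cand = [max(1e-6, w + rng.gauss(0, 0.1)) for w in best]
        err = l1_error(mix(persona_dists, cand), human_dist)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```

Fitting the full human response distribution, rather than the majority answer, is what makes a mixture like this meaningful: each persona contributes a different slice of the response spectrum.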


xAI launches Grok 3 AI, claiming it is capable of 'human reasoning'

Engadget

Meanwhile, the Grok 3 Reasoning and Grok 3 mini Reasoning models are capable of mimicking human-like reasoning when it comes to analyzing information the user needs. Other examples of AI models capable of reasoning tasks are DeepSeek's R1 and OpenAI's o3-mini. According to TechCrunch, xAI claimed during the event that Grok 3 Reasoning performed better than the best version of o3-mini on several benchmarks. Grok 3's features will initially be available to subscribers paying for X's Premium tier, which now costs $40 a month in the US. They will also be available through an upcoming separate subscription option for the standalone Grok app and Grok on the web.


A Beautiful Mind: Principles and Strategies for AI-Augmented Human Reasoning

Koon, Sean

arXiv.org Artificial Intelligence

The past century has witnessed incredible technological change. The many benefits and conveniences of technology are accompanied by new complexities and human challenges that affect work, home, social, and civic realms. There is a widening gap "between a growing complexity of our own making and a lagging development of our own capacities" (Botkin et al., 1998). Now, artificial intelligence promises to increase the rate of scientific discovery and innovation exponentially, creating new changes and potential complexities to which humans must adapt (Friedman, 2017). On the other hand, new AI tools, especially generative AI models, may help people to engage with the growing volume and complexity of information in their reasoning tasks such as decision-making and problem solving.


Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger's Philosophy

Zhang, Jianqiu

arXiv.org Artificial Intelligence

In the rapidly evolving field of Large Language Models (LLMs), there is a critical need to thoroughly analyze their capabilities and risks. Central to our investigation are two novel elements. The first is the innovative parallel between the statistical patterns of word relationships within LLMs and Martin Heidegger's concepts of "ready-to-hand" and "present-at-hand," which encapsulate the utilitarian and scientific attitudes humans employ in interacting with the world. This comparison lays the groundwork for positioning LLMs as the digital counterpart to the Faculty of Verbal Knowledge, shedding light on their capacity to emulate certain facets of human reasoning. The second is a structural analysis of human reasoning, viewed through Heidegger's notion of truth as "unconcealment". This foundational principle enables us to map out the inputs and outputs of the reasoning system and divide reasoning into four distinct categories. Respective cognitive faculties are delineated, allowing us to place LLMs within the broader schema of human reasoning, thus clarifying their strengths and inherent limitations. Our findings reveal that while LLMs possess the capability for Direct Explicative Reasoning and Pseudo Rational Reasoning, they fall short in authentic rational reasoning and have no creative reasoning capabilities, due to the current lack of many analogous AI models such as the Faculty of Judgement. The potential and risks of LLMs when they are augmented with other AI technologies are also evaluated. The results indicate that although LLMs have achieved proficiency in some reasoning abilities, the aspiration to match or exceed human intellectual capabilities is yet unattained. This research not only enriches our comprehension of LLMs but also propels forward the discourse on AI's potential and its bounds, paving the way for future explorations into AI's evolving landscape.


The Future of Censorship Is AI-Generated

TIME - Tech

The brave new world of Generative AI has become the latest battleground for U.S. culture wars. Google issued an apology after anti-woke X-users, including Elon Musk, shared examples of Google's chatbot Gemini refusing to generate images of white people--including historical figures--even when specifically prompted to do so. Gemini's insistence on prioritizing diversity and inclusion over accuracy is likely a well-intentioned attempt to stamp out bias in early GenAI datasets that tended to create stereotypical images of Africans and other minority groups as well as women, causing outrage among progressives. But there is much more at stake than the selective outrage of U.S. conservatives and progressives. How the "guardrails" of GenAI are defined and deployed is likely to have a significant and increasing impact on shaping the ecosystem of information and ideas that most humans engage with.