 confirmation


HypoBootstrap: A Bootstrapping Framework for Inductive Reasoning

Neural Information Processing Systems

Inductive reasoning, the inference of general rules from observed evidence, is one of the most critical abilities of intelligence. Previous work has succeeded with formal languages but suffers from onerous and error-prone conversions between a particular formal language and the working language. With the emergence of large language models (LLMs), it has become feasible to reason directly in various kinds of languages, especially natural languages, without involving a formal language. However, existing LLM-based inductive reasoning usually relies on the LLM's intrinsic generation ability, which is prone to hallucination and lacks systematic guidance grounded in the nature of inductive reasoning. To this end, we propose HypoBootstrap, an integrated framework for inductive reasoning that both generates and confirms hypotheses in a bootstrapping manner. For hypothesis generation, we propose a novel bootstrapping strategy that successively bootstraps object hypotheses, relational hypotheses, and functional hypotheses, which helps the LLM observe the evidence from trivial patterns to non-trivial ones. For hypothesis confirmation, we use Glymour's theory of bootstrap confirmation, a theory from the philosophy of science that can confirm a set of hypotheses, and apply its principles to confirm the object, relational, and functional hypotheses. Empirical studies on four inductive reasoning scenarios of different natures (causal induction, concept learning, grammar learning, and abstract reasoning) demonstrate that HypoBootstrap significantly outperforms existing methods.


Why Real-Life Disclosure Day Will Look Nothing Like Steven Spielberg's New Movie

WIRED

Previous landmark scientific discoveries like the Higgs boson provide a better template for what it will take to confirm whether aliens have made contact with Earth. Steven Spielberg's new film imagines the moment 8 billion humans find out that we are not alone in the universe. The movie, which opens in US theaters on June 12, is a fictional account of the government cover-up and subsequent "disclosure" of evidence that aliens have contacted Earth. The UFO community has been chasing that type of cinematic big reveal for 80 years. But monumental scientific discoveries, like the detection of the Higgs boson in 2012 and the confirmation of gravitational waves in 2016, are probably a better guide to how real-world disclosure would play out: through long-running research and with verifiable results.


AskDB: An LLM Agent for Natural Language Interaction with Relational Databases

arXiv.org Artificial Intelligence

Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general-purpose database interaction. We introduce AskDB, a large language model powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema-aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi-step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real-time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text-to-SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.


Beyond Hallucinations: The Illusion of Understanding in Large Language Models

arXiv.org Artificial Intelligence

Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton's observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model's limitations and the user's assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.


SGM: A Statistical Godel Machine for Risk-Controlled Recursive Self-Modification

arXiv.org Artificial Intelligence

Recursive self-modification has often been discussed as a cornerstone for building continually improving ML systems (Yampolskiy, 2015). Modern ML already hints at this trend: reinforcement learning agents tune hyperparameters online, AutoML loops search over training recipes, and optimization pipelines reconfigure code and settings during runs. Yet these procedures often adopt changes on the basis of noisy gains, creating the risk of harmful edits: modifications that seem beneficial in finite trials but ultimately degrade true performance. Such risks are especially concerning in high-stakes scientific domains such as drug design, protein engineering, or climate modeling, where spurious gains can misdirect costly pipelines. Gödel machines (Schmidhuber, 2007) offer a conceptually clean answer: an agent rewrites its code only when it can prove the rewrite increases expected utility. But in stochastic, high-dimensional ML, such formal proofs are unattainable. At the other extreme, practical AutoML and RL systems adopt edits using heuristics such as rolling averages, best-of-seeds, or bandit rules, which lack guarantees and may silently accumulate regressions.


Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates

arXiv.org Artificial Intelligence

Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR-SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $\Theta(N_0 \log_\eta N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $\eta>1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
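To make the stated cost shape concrete, here is a minimal sketch of a generic successive-halving race in Python. It is not the paper's LR-SC controller: it omits width-aware thresholds, repeat-to-confirm, and short-circuit promotion. It also assumes, as in textbook successive halving, that the per-candidate probe budget grows by the culling factor at each rung. The names `successive_halving` and `probe` are hypothetical and chosen only for illustration.

```python
def successive_halving(candidates, probe, eta=2, base_budget=1):
    """Generic successive halving (illustrative, not LToT's LR-SC).

    Each rung probes every survivor, keeps the top 1/eta fraction,
    and multiplies the per-candidate budget by eta.
    Returns (winner, total_cost, rungs).
    """
    survivors = list(candidates)
    budget = base_budget
    total_cost = 0
    rungs = 0
    while len(survivors) > 1:
        # Score each survivor with the current per-candidate budget.
        scored = [(probe(c, budget), c) for c in survivors]
        total_cost += budget * len(survivors)
        # Sort by score only, highest first, then cull.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
        rungs += 1
    return survivors[0], total_cost, rungs


# Toy probe (assumption): a candidate's utility is its own value,
# independent of the budget spent on it.
winner, cost, rungs = successive_halving(range(64), lambda c, b: c, eta=2)
print(winner, cost, rungs)  # 63 384 6
```

With an initial width of 64 and a culling factor of 2, the race runs 6 rungs (log base 2 of 64) and each rung costs 64 probe units, so the total is 384, which matches the $N_0 \log_\eta N_0$ shape quoted in the abstract under these assumptions.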


Astronomers Have Found 6,000 Planets Outside the Solar System

WIRED

From lava worlds to gas giants, NASA says the variety of these worlds is staggering, and that signs of a further 8,000 distant planets are awaiting confirmation. The number of confirmed planets outside of our solar system, known as exoplanets, has risen to 6,000, NASA has said. There is huge variety across these distant worlds, the space agency says, with discoveries including rocky planets, lava worlds, and gas giants enveloping their stars. Plenty more discoveries are likely on the way. As a result of continued monitoring by NASA's Exoplanet Science Institute (NExScI), there are more than 8,000 potential planets that have been identified and are awaiting confirmation.


Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus

arXiv.org Artificial Intelligence

First submitted: 30 Oct 2023. The final version will be available open access via the journal. Abstract: This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices.
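For readers unfamiliar with the two agreement metrics reported above, the following sketch computes Cohen's kappa and a support-weighted F1-score from scratch. The label names and the eight toy annotations are invented for illustration and are not taken from the CIMA corpus or the study's results; the function names are likewise hypothetical.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the two annotators' label marginals.
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    n = len(y_true)
    total = 0.0
    for label, count in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count / n
    return total


# Toy data (assumption): four hypothetical categories A-D.
human = ["A", "A", "B", "B", "C", "C", "D", "D"]
model = ["A", "A", "B", "C", "C", "C", "D", "A"]

kappa = cohens_kappa(human, model)
f1 = weighted_f1(human, model)
print(round(kappa, 3), round(f1, 3))  # 0.667 0.733
```

In this toy case the raw agreement is 0.75, but kappa is lower (about 0.667) because a quarter of the agreement would be expected by chance given the label marginals.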


MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

arXiv.org Artificial Intelligence

With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent's tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable autonomous learning of models to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent -- outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.