AskDB: An LLM Agent for Natural Language Interaction with Relational Databases
Phan, Xuan-Quang, Mai, Tan-Ha, Dinh, Thai-Duy, Nguyen, Minh-Thuan, Lê, Lam-Son
Interacting with relational databases remains challenging for users across different expertise levels, particularly when composing complex analytical queries or performing administrative tasks. Existing systems typically address either natural language querying or narrow aspects of database administration, lacking a unified and intelligent interface for general-purpose database interaction. We introduce AskDB, a large language model (LLM)-powered agent designed to bridge this gap by supporting both data analysis and administrative operations over SQL databases through natural language. Built on Gemini 2, AskDB integrates two key innovations: a dynamic schema-aware prompting mechanism that effectively incorporates database metadata, and a task decomposition framework that enables the agent to plan and execute multi-step actions. These capabilities allow AskDB to autonomously debug derived SQL, retrieve contextual information via real-time web search, and adaptively refine its responses. We evaluate AskDB on a widely used Text-to-SQL benchmark and a curated set of DBA tasks, demonstrating strong performance in both analytical and administrative scenarios. Our results highlight the potential of AskDB as a unified and intelligent agent for relational database systems, offering an intuitive and accessible experience for end users.
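The "dynamic schema-aware prompting" idea — including only the metadata relevant to the current question — can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the keyword-overlap ranking, and the table/column names are hypothetical, not AskDB's actual interface.

```python
def build_schema_prompt(schema, question, max_tables=3):
    """Rank tables by naive keyword overlap with the question and include
    only the most relevant ones in the prompt (a stand-in for a dynamic
    schema-aware prompting mechanism; a real system would use embeddings
    or the LLM itself for relevance)."""
    q_words = set(question.lower().split())

    def score(item):
        name, cols = item
        words = set(name.lower().split("_")) | {c.lower() for c in cols}
        return len(q_words & words)

    ranked = sorted(schema.items(), key=score, reverse=True)[:max_tables]
    lines = [f"TABLE {name}({', '.join(cols)})" for name, cols in ranked]
    return "\n".join(lines) + f"\n-- Question: {question}"

# Hypothetical schema for illustration only.
schema = {
    "orders": ["id", "customer_id", "total", "created_at"],
    "customers": ["id", "name", "email"],
    "audit_log": ["id", "event", "ts"],
}
prompt = build_schema_prompt(schema, "total orders per customer", max_tables=2)
```

The point of the sketch is the budget: rather than dumping the whole catalog into the prompt, only `max_tables` tables survive the relevance ranking.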
Beyond Hallucinations: The Illusion of Understanding in Large Language Models
Rosenbacke, Rikard, Rosenbacke, Carl, Rosenbacke, Victor, McKee, Martin
Large language models (LLMs) are becoming deeply embedded in human communication and decision-making, yet they inherit the ambiguity, bias, and lack of direct access to truth inherent in language itself. While their outputs are fluent, emotionally resonant, and coherent, they are generated through statistical prediction rather than grounded reasoning. This creates the risk of hallucination, responses that sound convincing but lack factual validity. Building on Geoffrey Hinton's observation that AI mirrors human intuition rather than reasoning, this paper argues that LLMs operationalize System 1 cognition at scale: fast, associative, and persuasive, but without reflection or falsification. To address this, we introduce the Rose-Frame, a three-dimensional framework for diagnosing cognitive and epistemic drift in human-AI interaction. The three axes are: (i) Map vs. Territory, which distinguishes representations of reality (epistemology) from reality itself (ontology); (ii) Intuition vs. Reason, drawing on dual-process theory to separate fast, emotional judgments from slow, reflective thinking; and (iii) Conflict vs. Confirmation, which examines whether ideas are critically tested through disagreement or simply reinforced through mutual validation. Each dimension captures a distinct failure mode, and their combination amplifies misalignment. Rose-Frame does not attempt to fix LLMs with more data or rules. Instead, it offers a reflective tool that makes both the model's limitations and the user's assumptions visible, enabling more transparent and critically aware AI deployment. It reframes alignment as cognitive governance: intuition, whether human or artificial, must remain governed by human reason. Only by embedding reflective, falsifiable oversight can we align machine fluency with human understanding.
- Europe > Sweden (0.04)
- North America > United States (0.04)
- Europe > United Kingdom (0.04)
SGM: A Statistical Gödel Machine for Risk-Controlled Recursive Self-Modification
Wu, Xuening, Yin, Shenqin, Kang, Yanlan, Zhang, Xinhang, Xu, Qianya, Chen, Zeping, Zhang, Wenqiang
Recursive self-modification has often been discussed as a cornerstone for building continually improving ML systems (Yampolskiy, 2015). Modern ML already hints at this trend: reinforcement learning agents tune hyperparameters online, AutoML loops search over training recipes, and optimization pipelines reconfigure code and settings during runs. Yet these procedures often adopt changes on the basis of noisy gains, creating the risk of harmful edits: modifications that seem beneficial in finite trials but ultimately degrade true performance. Such risks are especially concerning in high-stakes scientific domains such as drug design, protein engineering, or climate modeling, where spurious gains can misdirect costly pipelines. Gödel machines (Schmidhuber, 2007) offer a conceptually clean answer: an agent rewrites its code only when it can prove the rewrite increases expected utility. But in stochastic, high-dimensional ML, such formal proofs are unattainable. At the other extreme, practical AutoML and RL systems adopt edits using heuristics such as rolling averages, best-of-seeds, or bandit rules, which lack guarantees and may silently accumulate regressions.
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
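The middle ground between a formal proof and a noisy heuristic is a statistical acceptance test. A minimal sketch in that spirit: accept a proposed self-modification only when a Hoeffding lower confidence bound on its mean utility improvement is positive. The function name and bound choice are assumptions for illustration; the paper's actual test may differ.

```python
import math

def accept_edit(deltas, delta_range, alpha=0.05):
    """Accept a proposed self-modification only when a Hoeffding lower
    confidence bound on the mean observed utility improvement exceeds
    zero at level alpha. `deltas` are per-trial improvements, assumed
    bounded in magnitude by `delta_range`."""
    n = len(deltas)
    mean = sum(deltas) / n
    # Hoeffding: P(mean - true_mean >= eps) <= exp(-2 n eps^2 / range^2)
    eps = delta_range * math.sqrt(math.log(1 / alpha) / (2 * n))
    return mean - eps > 0

# A large, consistent gain clears the bound; a small, noisy gain does not.
strong = accept_edit([0.5] * 100, delta_range=1.0)
weak = accept_edit([0.1, -0.05, 0.02, 0.0], delta_range=1.0)
```

This is what distinguishes risk control from a rolling-average heuristic: the weak edit has a positive empirical mean (0.0175), yet is rejected because four trials cannot rule out a true regression.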
Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR-SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $\Theta(N_0 \log_\eta N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $\eta > 1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
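The LR-SC control loop — cheap probes over a wide lateral set, culling by a factor $\eta$ per rung, with immediate promotion once a candidate clears the mainline bar — can be sketched as a small racing routine. Names, the probe interface, and the give-up rule are hypothetical; this omits LToT's width-aware thresholds and repeat-to-confirm.

```python
def lr_sc(laterals, probe, mainline_bar, eta=2, budget0=1, max_budget_factor=8):
    """Capped successive-halving race with short-circuit (an LR-SC sketch):
    probe every surviving lateral at the current budget, promote any
    candidate whose probed score clears the mainline bar, otherwise cull
    all but the top 1/eta fraction and re-probe with eta times the budget."""
    survivors = list(laterals)
    budget = budget0
    while survivors:
        scored = [(probe(c, budget), c) for c in survivors]
        scored.sort(key=lambda t: t[0], reverse=True)
        if scored[0][0] >= mainline_bar:
            return scored[0][1]          # short-circuit promotion
        keep = max(1, len(scored) // eta)
        survivors = [c for _, c in scored[:keep]]
        if keep == 1 and budget > budget0 * max_budget_factor:
            return None                  # race exhausted: no promotion
        budget *= eta
    return None

# Toy probe: a candidate's score is revealed gradually as budget grows.
probe = lambda c, b: c * min(b, 4) / 4
promoted = lr_sc([0.1, 0.5, 0.9, 0.3], probe, mainline_bar=0.8)
rejected = lr_sc([0.1, 0.2], probe, mainline_bar=0.9)
```

The toy run illustrates depth myopia being avoided: the 0.9 candidate scores only 0.225 at the first rung, yet survives culling and is promoted once deeper probes reveal its payoff.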
Astronomers Have Found 6,000 Planets Outside the Solar System
From lava worlds to gas giants, NASA says the variety of these worlds is staggering, and that signs of a further 8,000 distant planets are awaiting confirmation. The number of confirmed planets outside of our solar system, known as exoplanets, has risen to 6,000, NASA has said. There is huge variety across these distant worlds, the space agency says, with discoveries including rocky planets, lava worlds, and gas giants enveloping their stars. Plenty more discoveries are likely on the way. As a result of continued monitoring by NASA's Exoplanet Science Institute (NExScI), there are more than 8,000 potential planets that have been identified and are awaiting confirmation.
- South America (0.05)
- Oceania > Australia (0.05)
- North America > United States > California (0.05)
- (6 more...)
- Government > Space Agency (1.00)
- Government > Regional Government > North America Government > United States Government (0.98)
Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight
Tang, Jingyu, Chen, Chaoran, Li, Jiawen, Zhang, Zhiping, Guo, Bingcan, Khalilov, Ibrahim, Gebreegziabher, Simret Araya, Yao, Bingsheng, Wang, Dakuo, Ye, Yanfang, Li, Tianshi, Xiao, Ziang, Yao, Yaxing, Li, Toby Jia-Jun
Dark patterns, deceptive interface designs that manipulate user behavior, have been extensively studied for their effects on human decision-making and autonomy. Yet, with the rising prominence of LLM-powered GUI agents that automate tasks from high-level intents, understanding how dark patterns affect agents is increasingly important. We present a two-phase empirical study examining how agents, human participants, and human-AI teams respond to 16 types of dark patterns across diverse scenarios. Phase 1 highlights that agents often fail to recognize dark patterns, and even when aware, prioritize task completion over protective action. Phase 2 reveals divergent failure modes: humans succumb due to cognitive shortcuts and habitual compliance, while agents falter from procedural blind spots. Human oversight improved avoidance but introduced costs such as attentional tunneling and cognitive load. Our findings show neither humans nor agents are uniformly resilient, and collaboration introduces new vulnerabilities, suggesting design needs for transparency, adjustable autonomy, and oversight.
- Europe > Austria > Vienna (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New York > New York County > New York City (0.06)
- (15 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.92)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (0.94)
- Law (0.92)
- Government (0.67)
Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
First submitted: 30 Oct 2023. The final version will be available open access via the journal. Abstract: This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices.
- North America > United States > Maine (0.04)
- Europe > United Kingdom (0.04)
- Europe > Netherlands (0.04)
- (2 more...)
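The abstract's point about task-specific label definitions and context can be made concrete with a prompt builder. The label names and definitions below are hypothetical placeholders, not the CIMA corpus's actual four DA categories, and no API call is made here.

```python
# Illustrative labels only -- CIMA's four DA categories may differ.
DA_LABELS = {
    "Question": "the tutor asks the student something",
    "Hint": "the tutor gives a partial clue without revealing the answer",
    "Correction": "the tutor fixes an error in the student's response",
    "Confirmation": "the tutor affirms the student's response",
}

def build_da_prompt(tutor_turn, context):
    """Build a tailored classification prompt that supplies explicit label
    definitions and dialogue context, the two ingredients the study found
    important for annotation quality."""
    defs = "\n".join(f"- {name}: {desc}" for name, desc in DA_LABELS.items())
    return (
        "Classify the tutor's dialogue act using exactly one label.\n"
        f"Label definitions:\n{defs}\n"
        f"Dialogue context: {context}\n"
        f"Tutor turn: {tutor_turn}\n"
        "Answer with the label only."
    )

prompt = build_da_prompt(
    "Great job!", "Student translated the phrase correctly."
)
```

The resulting string would be sent as the user message to a chat model such as GPT-4; constraining the answer to the label alone keeps the output machine-checkable.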
MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use
Zhao, Weikang, Wang, Xili, Ma, Chengdi, Kong, Lingbin, Yang, Zhaohua, Tuo, Mingxiang, Shi, Xiaowei, Zhai, Yitao, Cai, Xunliang
With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent's tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use, integrates LLM-simulated users into the reinforcement learning loop. MUA-RL aims to enable autonomous learning of models to communicate with users efficiently and use various tools to solve practical problems in dynamic multi-turn interactions. Evaluations are done on several multi-turn tool-using benchmarks (see Figure 1). Specifically, MUA-RL-32B achieves 67.3 on TAU2 Retail, 45.4 on TAU2 Airline, 28.3 on TAU2 Telecom, 28.4 on BFCL-V3 Multi Turn, and 82.5 on ACEBench Agent -- outperforming or matching the performance of larger open-source models such as DeepSeek-V3-0324 and Qwen3-235B-A22B in non-thinking settings.
- North America > United States > Texas > Harris County > Houston (0.14)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.60)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
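The core loop MUA-RL describes — an agent alternating tool calls and dialogue turns with a simulated user inside the training rollout — can be sketched schematically. The `agent`, `sim_user`, and `tools` callables, the action dictionary, and the `"<done>"` convention are all hypothetical stand-ins, not the paper's API.

```python
def rollout(agent, sim_user, tools, max_turns=8):
    """One multi-turn episode with an LLM-simulated user in the loop.
    Returns the interaction history and a terminal reward: 1.0 when the
    simulated user signals the task is complete, else 0.0."""
    history = [("user", sim_user(history=[]))]
    for _ in range(max_turns):
        action = agent(history)                       # {"type": ..., ...}
        if action["type"] == "tool":
            result = tools[action["name"]](**action["args"])
            history.append(("tool", result))
        else:                                         # "say"
            history.append(("agent", action["text"]))
            reply = sim_user(history=history)
            if reply == "<done>":                     # user signals success
                return history, 1.0
            history.append(("user", reply))
    return history, 0.0

# Toy episode: the user asks for the weather; the agent calls a tool,
# relays the result, and the simulated user confirms completion.
def sim_user(history):
    return "What's the weather in Paris?" if not history else "<done>"

def agent(history):
    role, content = history[-1]
    if role == "user":
        return {"type": "tool", "name": "weather", "args": {"city": "Paris"}}
    return {"type": "say", "text": f"It is {content}."}

tools = {"weather": lambda city: "sunny"}
hist, reward = rollout(agent, sim_user, tools)
```

In actual RL training the terminal reward would drive a policy update of the agent model; the sketch only shows the simulated user closing the interaction loop.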
MemGuide: Intent-Driven Memory Selection for Goal-Oriented Multi-Session LLM Agents
Du, Yiming, Wang, Bingbing, He, Yang, Liang, Bin, Wang, Baojun, Li, Zhongyang, Gui, Lin, Pan, Jeff Z., Xu, Ruifeng, Wong, Kam-Fai
Modern task-oriented dialogue (TOD) systems increasingly rely on large language model (LLM) agents, leveraging Retrieval-Augmented Generation (RAG) and long-context capabilities for long-term memory utilization. However, these methods are primarily based on semantic similarity, overlooking task intent and reducing task coherence in multi-session dialogues. To address this challenge, we introduce MemGuide, a two-stage framework for intent-driven memory selection. (1) Intent-Aligned Retrieval matches the current dialogue context with stored intent descriptions in the memory bank, retrieving QA-formatted memory units that share the same goal. (2) Missing-Slot Guided Filtering employs a chain-of-thought slot reasoner to enumerate unfilled slots, then uses a fine-tuned LLaMA-8B filter to re-rank the retrieved units by marginal slot-completion gain. The resulting memory units inform a proactive strategy that minimizes conversational turns by directly addressing information gaps. Based on this framework, we introduce MS-TOD, the first multi-session TOD benchmark comprising 132 diverse personas, 956 task goals, and annotated intent-aligned memory targets, supporting efficient multi-session task completion. Evaluations on MS-TOD show that MemGuide raises the task success rate by 11% (88% -> 99%) and reduces dialogue length by 2.84 turns in multi-session settings, while maintaining parity with single-session benchmarks.
- North America > United States > California > Los Angeles County > Los Angeles (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Asia > China > Hong Kong (0.04)
- (9 more...)
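The two-stage selection MemGuide describes can be sketched with exact matching in place of the learned components. The memory-unit format, field names, and scoring are illustrative assumptions; the paper uses an LLM retriever and a fine-tuned LLaMA-8B filter where this sketch uses equality checks and set overlap.

```python
def memguide_select(memory, intent, unfilled_slots, k=3):
    """Two-stage memory selection sketch:
    (1) intent-aligned retrieval keeps units whose stored intent matches
        the current task intent;
    (2) missing-slot guided filtering re-ranks survivors by how many
        still-unfilled slots each unit would complete."""
    # Stage 1: intent-aligned retrieval
    candidates = [m for m in memory if m["intent"] == intent]
    # Stage 2: re-rank by marginal slot-completion gain
    gain = lambda m: len(set(m["slots"]) & set(unfilled_slots))
    return sorted(candidates, key=gain, reverse=True)[:k]

# Hypothetical QA-formatted memory units.
memory = [
    {"intent": "book_flight", "slots": ["date"], "qa": "Q: When? A: May 3"},
    {"intent": "book_hotel", "slots": ["city"], "qa": "Q: Where? A: Rome"},
    {"intent": "book_flight", "slots": ["city", "date"],
     "qa": "Q: Where and when? A: Rome, May 3"},
]
selected = memguide_select(memory, "book_flight", ["city", "date"])
```

Because the third unit fills two missing slots and the first fills one, the re-ranking surfaces it first, which is what lets the proactive strategy close information gaps in fewer turns.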
State-Inference-Based Prompting for Natural Language Trading with Game NPCs
Kim, Minkyung, Kim, Junsik, Bae, Hwidong, Yang, Woongcheol, Park, Sangdon, Bae, Sohee
Large Language Models enable dynamic game interactions but struggle with rule-governed trading systems. Current implementations suffer from rule violations, such as item hallucinations and calculation errors, that erode player trust. Here, State-Inference-Based Prompting (SIBP) enables reliable trading through autonomous dialogue state inference and context-specific rule adherence. The approach decomposes trading into six states within a unified prompt framework, implementing context-aware item referencing and placeholder-based price calculations. Evaluation across 100 trading dialogues demonstrates >97% state compliance, >95% referencing accuracy, and 99.7% calculation precision. SIBP maintains computational efficiency while outperforming baseline approaches, establishing a practical foundation for trustworthy NPC interactions in commercial games.
- Asia > South Korea > Daejeon > Daejeon (0.04)
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- (2 more...)
- Leisure & Entertainment > Games > Computer Games (1.00)
- Information Technology > Software (1.00)
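The two mechanisms the SIBP abstract names, decomposing trading into discrete dialogue states and computing prices outside the model via placeholders, can be sketched as a tiny state machine. The six state names, the transition table, and the catalog format are illustrative assumptions, not the paper's actual labels.

```python
# Six illustrative trading states (SIBP's actual state names may differ).
STATES = ["greet", "browse", "negotiate", "confirm", "pay", "close"]

TRANSITIONS = {
    ("greet", "show items"): "browse",
    ("browse", "ask price"): "negotiate",
    ("negotiate", "accept"): "confirm",
    ("confirm", "pay"): "pay",
    ("pay", "done"): "close",
}

def step(state, player_utterance, catalog):
    """Advance the trading dialogue one turn. Unrecognized utterances
    leave the state unchanged, enforcing rule adherence per state."""
    nxt = TRANSITIONS.get((state, player_utterance), state)
    if nxt == "negotiate":
        # Placeholder-based price calculation: the total is computed in
        # code from catalog data rather than by the model, avoiding both
        # item hallucination and arithmetic errors.
        item, qty = catalog["item"], catalog["qty"]
        price = catalog["unit_price"] * qty
        return nxt, f"{qty}x {item} costs {price} gold."
    return nxt, f"[{nxt}]"

state, npc_line = step(
    "browse", "ask price", {"item": "potion", "qty": 3, "unit_price": 50}
)
```

In the full system an LLM infers the state and fills a state-specific template; the sketch shows why calculation precision can approach 100% when arithmetic never passes through the model.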