Snow isn't actually white
Winter wonderlands are only possible thanks to a sparkly trick of the light. When someone says "as white as snow," it's easy to envision what they're talking about. We often think of snow as a dazzling white, the same way we immediately conjure up a color when someone says "blood red" or "ocean blue." Few languages have as many distinct words for snow as Japanese, which includes miyuki, or "beautiful snow."
RefineBench: Evaluating Refinement Capability of Language Models via Checklists
Lee, Young-Jun; Kim, Seungone; Lee, Byung-Kwan; Moon, Minkyeong; Hwang, Yechan; Kim, Jong Myoung; Neubig, Graham; Welleck, Sean; Choi, Ho-Jin
Can language models (LMs) self-refine their own responses? This question is increasingly relevant as a wide range of real-world user interactions involve refinement requests. However, prior studies have largely tested LMs' refinement abilities on verifiable tasks such as competition math or symbolic reasoning with simplified scaffolds, whereas users often pose open-ended queries and provide varying degrees of feedback on what they desire. The recent advent of reasoning models that exhibit self-reflection patterns in their chains-of-thought further motivates this question. To analyze this, we introduce RefineBench, a benchmark of 1,000 challenging problems across 11 domains paired with a checklist-based evaluation framework. We evaluate two refinement modes: (1) guided refinement, where an LM is provided natural language feedback, and (2) self-refinement, where LMs attempt to improve without guidance. In the self-refinement setting, even frontier LMs such as Gemini 2.5 Pro and GPT-5 achieve modest baseline scores of 31.3% and 29.1%, respectively, and most models fail to consistently improve across iterations (e.g., Gemini 2.5 Pro gains only +1.8%, while DeepSeek-R1 declines by 0.1%). By contrast, in guided refinement, both proprietary LMs and large open-weight LMs (>70B) can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses, and that RefineBench provides a valuable testbed for tracking progress.
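A checklist-based evaluation of the kind described can be sketched in a few lines. The checklist items, response strings, and scoring function below are illustrative stand-ins, not the benchmark's actual implementation:

```python
# Hypothetical sketch of checklist-based scoring (not the actual RefineBench code).
# A checklist is a list of (description, predicate) pairs; the score is the
# fraction of checklist items the response satisfies.

def checklist_score(response, checklist):
    passed = sum(1 for _, check in checklist if check(response))
    return passed / len(checklist)

# Toy checklist for a response to "Explain the quadratic formula".
checklist = [
    ("mentions the discriminant", lambda r: "discriminant" in r.lower()),
    ("states the formula",        lambda r: "b^2 - 4ac" in r),
    ("notes the two roots",       lambda r: "+/-" in r or "±" in r),
]

draft   = "x = (-b ± sqrt(b^2 - 4ac)) / 2a gives the two roots."
revised = draft + " The discriminant b^2 - 4ac determines how many real roots exist."

print(checklist_score(draft, checklist))    # 2 of 3 items pass
print(checklist_score(revised, checklist))  # all 3 items pass
```

Under this scheme, a refinement turn succeeds exactly when it raises the fraction of satisfied items, which makes per-iteration progress directly measurable.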
BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering
Zhang, Taolin; Li, Dongyang; Chen, Qizhou; Wang, Chengyu; He, Xiaofeng
Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting-edge methods for multi-hop QA: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an ''operator'' by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined ''operators'' to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines on various datasets. Additionally, BELLE is more cost-effective than single models in more complex multi-hop QA scenarios.
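The type-to-operator correspondence at the heart of BELLE can be illustrated with a toy dispatcher. The classifier heuristics and operator stubs below are hypothetical; only the operator names follow the paper:

```python
# Illustrative sketch only: routing question types to reasoning "operators",
# in the spirit of BELLE's type-method correspondence. A real system would
# classify with an LLM and run full prompting pipelines, not string checks.

def classify(question):
    # Toy type classifier (hypothetical heuristics).
    if " and " in question:
        return "parallel"          # independent sub-questions
    if "then" in question or "after" in question:
        return "sequential"        # chained hops
    return "simple"

OPERATORS = {
    "simple":     lambda q: f"[CoT] {q}",
    "parallel":   lambda q: f"[Sub-step] {q}",
    "sequential": lambda q: f"[Iterative-step] {q}",
}

def plan(question):
    qtype = classify(question)
    return OPERATORS[qtype](question)

print(plan("Who directed Inception and who composed its score?"))
```

The point of the sketch is the dispatch structure: once questions are typed, each type is handled by the method it is most sensitive to, rather than by one method for all questions.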
An Algebraic Framework for Hierarchical Probabilistic Abstraction
Upreti, Nijesh; Belle, Vaishak
Abstraction is essential for reducing the complexity of systems across diverse fields, yet designing effective abstraction methodologies for probabilistic models is inherently challenging due to stochastic behaviors and uncertainties. Current approaches often distill detailed probabilistic data into higher-level summaries to support tractable and interpretable analyses, though they typically struggle to fully represent the relational and probabilistic hierarchies through single-layered abstractions. We introduce a hierarchical probabilistic abstraction framework aimed at addressing these challenges by extending a measure-theoretic foundation for hierarchical abstraction. The framework enables modular problem-solving via layered mappings, facilitating both detailed layer-specific analysis and a cohesive system-wide understanding. This approach bridges high-level conceptualization with low-level perceptual data, enhancing interpretability and allowing layered analysis. Our framework provides a robust foundation for abstraction analysis across AI subfields, particularly in aligning System 1 and System 2 thinking, thereby supporting the development of diverse abstraction methodologies.
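One way to picture a single abstraction layer is as a pushforward of a probability distribution along a state mapping. The sketch below is a minimal finite-state illustration of that idea, not the paper's measure-theoretic construction:

```python
# Minimal sketch of one abstraction layer as a pushforward: a map from
# low-level states to high-level states induces the abstract distribution
# by summing the probabilities of each preimage. (Illustrative only; the
# paper's framework is measure-theoretic and stacks multiple such layers.)

def pushforward(dist, mapping):
    abstract = {}
    for state, p in dist.items():
        high = mapping(state)
        abstract[high] = abstract.get(high, 0.0) + p
    return abstract

# Low-level: exact temperatures; high-level: "cold" vs "warm".
low = {-3: 0.2, 1: 0.3, 12: 0.4, 25: 0.1}
high = pushforward(low, lambda t: "cold" if t < 10 else "warm")
print(high)  # probability mass is preserved under the coarsening
```

Stacking several such mappings gives the layered structure the abstract describes: each layer can be analyzed on its own, while the pushforwards keep the layers probabilistically consistent.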
Zenless Zone Zero review – stylish, enchanting and seductive
One of the biggest revolutions in the modern video game industry has taken place almost out of sight of your average console gamer. The rise of the free-to-play gacha game, in which you pay either real or in-game money for randomised bundles of characters and weapons, has been meteoric in the Chinese market, dominated by publishers such as miHoYo, NetEase and Yostar. The most successful such games, including Genshin Impact, Arknights and Another Eden, have tens of millions of players, mostly on smartphones, and draw vast incomes from those willing to pay to complete their collections of in-game items. Recently, the genre has been expanding beyond mobile, and Zenless Zone Zero is the latest example. Created by HoYoverse, this is a sprawling anime-styled action role-playing adventure set in a chaotic sci-fi dystopia.
Using Abstraction for Interpretable Robot Programs in Stochastic Domains
A robot's actions are inherently stochastic, as its sensors are noisy and its actions do not always have the intended effects. For this reason, the agent language Golog has been extended to models with degrees of belief and stochastic actions. While this allows more precise robot models, the resulting programs are much harder to comprehend, because they need to deal with the noise, e.g., by looping until some desired state has been reached with certainty, and because the resulting action traces consist of a large number of actions cluttered with sensor noise. To alleviate these issues, we propose to use abstraction. We define a high-level and nonstochastic model of the robot and then map the high-level model into the lower-level stochastic model. The resulting programs are much easier to understand, often do not require belief operators or loops, and produce much shorter action traces.
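The high-to-low mapping idea can be illustrated with a toy example (plain Python, not Golog): a single high-level action is refined into a loop over a noisy low-level action, so the noise disappears from the high-level trace.

```python
# Toy illustration of the abstraction: the high-level model sees one
# nonstochastic action "grasp"; the low-level model loops over a noisy
# action until success, as the abstract describes.
import random

random.seed(0)  # fixed seed so the noisy run is reproducible

def noisy_grasp():
    # Low-level action: succeeds with probability 0.7.
    return random.random() < 0.7

def grasp():
    # High-level action, refined into a retry loop over the noisy action.
    attempts = 1
    while not noisy_grasp():
        attempts += 1
    return attempts  # length of the low-level action trace

print(grasp())  # high-level trace: one action; low-level: several attempts
```

The high-level program is just `grasp()`: no loop, no belief operator. All the retrying lives one level down, which is exactly the comprehension benefit the abstract claims.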
Belle
In a seminal paper, Lin and Reiter introduced the progression of basic action theories in the situation calculus. In this paper, we study the progression of knowledge in multiagent settings, where after actions, an agent updates her beliefs but also updates what she believes other agents know given what has occurred. By appealing to the notion of only knowing, we are able to avoid limitations of earlier work on multiagent progression, and obtain a new general account: we show that after an action, knowledge bases are updated in a Lin and Reiter fashion at every nesting of modalities.
Belle
Generalized plans, such as plans with loops, are widely used in AI. Among other things, they are straightforward to execute, they allow action repetition, and they solve multiple problem instances. However, the correctness of such plans is non-trivial to define, making it difficult to provide a clear specification of what we should be looking for. Proposals in the literature, such as strong planning, are universally adopted by the community, but were initially formulated for finite state systems. There is yet to emerge a study on the sensitivity of such correctness notions to the structural assumptions of the underlying plan framework. In this paper, we are interested in the applicability and correctness of generalized plans in domains that are possibly unbounded, and/or stochastic, and/or continuous. To that end, we introduce a generic controller framework to capture different types of planning domains. Using this framework, we then study a number of termination and goal satisfaction criteria from first principles, relate them to existing proposals, and show plans that meet these criteria in the different types of domains.
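A minimal instance of a generalized plan with a loop, together with a termination bound, might look like the following sketch; the domain and step limit are invented for illustration.

```python
# Illustrative controller sketch: a one-loop plan ("while blocks remain,
# unstack one") that solves every instance of a simple unbounded domain,
# echoing how a single generalized plan covers infinitely many problems.

def loop_plan(n_blocks, max_steps=1000):
    trace = []
    while n_blocks > 0:
        if len(trace) >= max_steps:
            return None  # termination criterion violated within the bound
        trace.append("unstack")
        n_blocks -= 1
    return trace  # goal reached: the tower is fully unstacked

# The same plan solves instances of any size.
for n in (1, 5, 50):
    print(n, len(loop_plan(n)))
```

The sketch also shows why correctness is subtle: on a bounded horizon the plan "fails" for large enough instances, so whether it counts as correct depends on the termination criterion adopted, which is the kind of sensitivity the paper studies.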
Belle
Weighted model counting (WMC) on a propositional knowledge base is an effective and general approach to probabilistic inference in a variety of formalisms, including Bayesian and Markov Networks. However, an inherent limitation of WMC is that it only admits the inference of discrete probability distributions. In this paper, we introduce a strict generalization of WMC called weighted model integration that is based on annotating Boolean and arithmetic constraints, and combinations thereof. This methodology is shown to capture discrete, continuous and hybrid Markov networks. We then consider the task of parameter learning for a fragment of the language. An empirical evaluation demonstrates the applicability and promise of the proposal.
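Plain weighted model counting is easy to sketch by enumeration. The formula and literal weights below are a toy example, not the paper's weighted model integration algorithm (which additionally integrates over arithmetic constraints):

```python
# Weighted model counting (WMC) by brute-force enumeration: sum, over all
# models of the formula, the product of the weights of the literals true
# in that model.
from itertools import product

w = {("a", True): 0.3, ("a", False): 0.7,
     ("b", True): 0.6, ("b", False): 0.4}

def wmc(formula, variables):
    total = 0.0
    for values in product([True, False], repeat=len(variables)):
        model = dict(zip(variables, values))
        if formula(model):
            weight = 1.0
            for v in variables:
                weight *= w[(v, model[v])]
            total += weight
    return total

# P(a or b) for independent a ~ Bernoulli(0.3), b ~ Bernoulli(0.6):
print(wmc(lambda m: m["a"] or m["b"], ["a", "b"]))  # 1 - 0.7*0.4 ≈ 0.72
```

Weighted model integration generalizes this by letting variables range over the reals and replacing the weighted sum over models with an integral over the region the constraints carve out; the discrete sketch above is the special case the paper starts from.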
Belle
High-level programming languages are an influential control paradigm for building agents that are purposeful in an incompletely known world. GOLOG, for example, allows us to write programs, with loops, whose constructs refer to an explicit world model axiomatized in the expressive language of the situation calculus. Over the years, GOLOG has been extended to deal with many other features, the claim being that these would be useful in robotic applications. Unfortunately, when robots are actually deployed, effectors and sensors are noisy, typically characterized over continuous probability distributions, none of which is supported in GOLOG, its dialects or its cousins. This paper presents ALLEGRO, a belief-based programming language for stochastic domains, that refashions GOLOG to allow for discrete and continuous initial uncertainty and noise. It is fully implemented and experiments demonstrate that ALLEGRO could be the basis for bridging high-level programming and probabilistic robotics technologies in a general way.
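The flavor of belief-based programming over continuous noise can be conveyed with a standard Gaussian belief update (a textbook Bayesian step in plain Python, not ALLEGRO's semantics):

```python
# Plain-Python illustration (not ALLEGRO itself) of belief-based control
# with continuous noise: a Gaussian belief over the robot's position is
# sharpened by noisy measurements until the agent is confident enough to act.

def gaussian_update(mean, var, z, sensor_var):
    # Standard Bayesian update for a Gaussian prior and Gaussian likelihood.
    k = var / (var + sensor_var)
    return mean + k * (z - mean), (1 - k) * var

mean, var = 0.0, 4.0          # broad initial belief about position
readings = [2.1, 1.9, 2.05]   # noisy sensor readings, sensor_var = 0.5

for z in readings:
    mean, var = gaussian_update(mean, var, z, 0.5)

print(round(mean, 2), round(var, 3))  # belief has tightened around ~1.94
```

A belief-based language lets the program branch on quantities like `var` directly (e.g., "sense until the variance is below a threshold, then move"), which is the kind of construct the abstract says plain GOLOG lacks.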