AITopics | Problem Solving

Collaborating Authors

Problem Solving

News Overviews Instructional Materials AI-Alerts Classics

Accelerating Model-Based Reinforcement Learning with State-Space World Models

Krinner, Maria, Aljalbout, Elie, Romero, Angel, Scaramuzza, Davide

arXiv.org Machine LearningFeb-27-2025

Reinforcement learning (RL) is a powerful approach for robot learning. However, model-free RL (MFRL) requires a large number of environment interactions to learn successful control policies. This is due to the noisy RL training updates and the complexity of robotic systems, which typically involve highly non-linear dynamics and noisy sensor signals. In contrast, model-based RL (MBRL) not only trains a policy but simultaneously learns a world model that captures the environment's dynamics and rewards. The world model can either be used for planning, for data collection, or to provide first-order policy gradients for training. Leveraging a world model significantly improves sample efficiency compared to model-free RL. However, training a world model alongside the policy increases the computational complexity, leading to longer training times that are often intractable for complex real-world scenarios. In this work, we propose a new method for accelerating model-based RL using state-space world models. Our approach leverages state-space models (SSMs) to parallelize the training of the dynamics model, which is typically the main computational bottleneck. Additionally, we propose an architecture that provides privileged information to the world model during training, which is particularly relevant for partially observable environments. We evaluate our method in several real-world agile quadrotor flight tasks, involving complex dynamics, for both fully and partially observable environments. We demonstrate a significant speedup, reducing the world model training time by up to 10 times, and the overall MBRL training time by up to 4 times. This benefit comes without compromising performance, as our method achieves similar sample efficiency and task rewards to state-of-the-art MBRL methods.

international conference, sample efficiency, world model, (12 more...)

arXiv.org Machine Learning

2502.20168

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

WorldModelBench: Judging Video Generation Models As World Models

Li, Dacheng, Fang, Yunhao, Chen, Yukang, Yang, Shuo, Cao, Shiyi, Wong, Justin, Luo, Michael, Wang, Xiaolong, Yin, Hongxu, Gonzalez, Joseph E., Stoica, Ion, Han, Song, Lu, Yao

arXiv.org Artificial IntelligenceFeb-27-2025

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality, ignoring important factors to world models such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Against to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law - issues overlooked by prior benchmarks. (2) Aligned with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure, achieving 8.6% higher average accuracy in predicting world modeling violations than GPT-4o with 2B parameters. In addition, we demonstrate that training to align human annotations by maximizing the rewards from the judger noticeably improve the world modeling capability. The website is available at https://worldmodelbench-team.github.io.

arxiv preprint arxiv, video generation model, worldmodelbench, (12 more...)

arXiv.org Artificial Intelligence

2502.20694

Country:

North America > United States > Massachusetts > Middlesex County > Newton (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.50)

Industry: Information Technology (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing

Xiang, Xiang, Xu, Zhuo, Deng, Yao, Zhou, Qinhao, Liang, Yifan, Chen, Ke, Zheng, Qingfang, Wang, Yaowei, Chen, Xilin, Gao, Wen

arXiv.org Artificial IntelligenceFeb-27-2025

In open-world remote sensing, deployed models must continuously adapt to a steady influx of new data, which often exhibits various shifts compared to what the model encountered during the training phase. To effectively handle the new data, models are required to detect semantic shifts, adapt to covariate shifts, and continuously update themselves. These challenges give rise to a variety of open-world tasks. However, existing open-world remote sensing studies typically train and test within a single dataset to simulate open-world conditions. Currently, there is a lack of large-scale benchmarks capable of evaluating multiple open-world tasks. In this paper, we introduce OpenEarthSensing, a large-scale fine-grained benchmark for open-world remote sensing. OpenEarthSensing includes 189 scene and objects categories, covering the vast majority of potential semantic shifts that may occur in the real world. Additionally, OpenEarthSensing encompasses five data domains with significant covariate shifts, including two RGB satellite domians, one RGB aerial domian, one MS RGB domian, and one infrared domian. The various domains provide a more comprehensive testbed for evaluating the generalization performance of open-world models. We conduct the baseline evaluation of current mainstream open-world tasks and methods on OpenEarthSensing, demonstrating that it serves as a challenging benchmark for open-world remote sensing.

dataset, detection, learning, (14 more...)

arXiv.org Artificial Intelligence

2502.20668

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > British Columbia > Vancouver (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

ATLAS Navigator: Active Task-driven LAnguage-embedded Gaussian Splatting

Ong, Dexter, Tao, Yuezhan, Murali, Varun, Spasojevic, Igor, Kumar, Vijay, Chaudhari, Pratik

arXiv.org Artificial IntelligenceFeb-27-2025

The module also clusters features based on geometry and semantics in the map. The hierarchical mapper [B] runs bottom-up, ingesting the RGB and depth images and the odometric path from the robot to build a map. The top level of the map contains the submaps, the middle level the regions, and the bottom level the objects. The local map compsises the loaded submaps. The other submaps are unloaded to save memory (shown here in gray). The planning module [C] consists of a discrete planner that operates on the sparse map and generates a reference path, while the dense Gaussians in the local map are used to find the trajectory to be executed on the robot. Abstract --We address the challenge of task-oriented navigation in unstructured and unknown environments, where robots must incrementally build and reason on rich, metric-semantic maps in real time. Since tasks may require clarification or re-specification, it is necessary for the information in the map to be rich enough to enable generalization across a wide range of tasks. T o effectively execute tasks specified in natural language, we propose a hierarchical representation built on language-embedded Gaussian splatting that enables both sparse semantic planning that lends itself to online operation and dense geometric representation for collision-free navigation. We validate the effectiveness of our method through real-world robot experiments conducted in both cluttered indoor and kilometer-scale outdoor environments, with a competitive ratio of about 60% against privileged baselines. Experiment videos and more details can be found on our project page: https://atlasnav.github.io This, in turn, requires robots to autonomously perceive their surroundings, gather relevant information, and make safe and efficient decisions - capabilities crucial for a variety of open-world tasking approaches over kilometer-scale environments with sparse semantics . To enable these capabilities on-board robots with privacy & compute constraints, we develop a framework to efficiently store and plan on hierarchical metric-semantic maps with visual and inertial sensors only. An overview of our method is shown in Figure 1. A cornerstone of autonomous navigation is the creation of actionable maps that effectively represent the environment and support diverse navigation and task-specific operations. These properties collectively ensure that the proposed map is not only manageable but also capable of supporting large-scale autonomous navigation to complete tasks provided in natural language. To achieve these goals, we propose an agglomerative data structure that is consistent across both geometric and semantic scales built upon 3D Gaussian Splatting [5] (3DGS).

gaussian, representation, robot, (12 more...)

arXiv.org Artificial Intelligence

2502.20386

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Pennsylvania (0.04)
Europe > Italy (0.04)
Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
(3 more...)

Add feedback

Do Retrieval-Augmented Language Models Adapt to Varying User Needs?

Wu, Peilin, Zhang, Xinlu, Yu, Wenhao, Liu, Xingyu, Du, Xinya, Chen, Zhiyu Zoey

arXiv.org Artificial IntelligenceFeb-27-2025

Recent advancements in Retrieval-Augmented Language Models (RALMs) have demonstrated their efficacy in knowledge-intensive tasks. However, existing evaluation benchmarks often assume a single optimal approach to leveraging retrieved information, failing to account for varying user needs. This paper introduces a novel evaluation framework that systematically assesses RALMs under three user need cases-Context-Exclusive, Context-First, and Memory-First-across three distinct context settings: Context Matching, Knowledge Conflict, and Information Irrelevant. By varying both user instructions and the nature of retrieved information, our approach captures the complexities of real-world applications where models must adapt to diverse user requirements. Through extensive experiments on multiple QA datasets, including HotpotQA, DisentQA, and our newly constructed synthetic URAQ dataset, we find that restricting memory usage improves robustness in adversarial retrieval conditions but decreases peak performance with ideal retrieval results and model family dominates behavioral differences. Our findings highlight the necessity of user-centric evaluations in the development of retrieval-augmented systems and provide insights into optimizing model performance across varied retrieval contexts. We will release our code and URAQ dataset upon acceptance of the paper.

accuracy curve, computational linguistic, dataset, (15 more...)

arXiv.org Artificial Intelligence

2502.19779

Country:

North America > United States > Florida > Miami-Dade County > Miami (0.04)
North America > Canada > Ontario > Toronto (0.04)
Asia > Middle East > Jordan (0.04)
(11 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.72)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)

Add feedback

Knowledge representation and scalable abstract reasoning for simulated democracy in Unity

Katsiri, Eleftheria, Gazis, Alexandros, Protopapas, Angelos

arXiv.org Artificial IntelligenceFeb-26-2025

We present a novel form of scalable knowledge representation about agents in a simulated democracy, e-polis, where real users respond to social challenges associated with democratic institutions, structured as Smart Spatial Types, a new type of Smart Building that changes architectural form according to the philosophical doctrine of a visitor. At the end of the game players vote on the Smart City that results from their collective choices. Our approach uses deductive systems in an unusual way: by integrating a model of democracy with a model of a Smart City we are able to prove quality aspects of the simulated democracy in different urban and social settings, while adding ease and flexibility to the development. Second, we can infer and reason with abstract knowledge, which is a limitation of the Unity platform; third, our system enables real-time decision-making and adaptation of the game flow based on the player's abstract state, paving the road to explainability. Scalability is achieved by maintaining a dual-layer knowledge representation mechanism for reasoning about the simulated democracy that functions in a similar way to a two-level cache. The lower layer knows about the current state of the game by continually processing a high rate of events produced by the in-built physics engine of the Unity platform, e.g., it knows of the position of a player in space, in terms of his coordinates x,y,z as well as their choices for each challenge. The higher layer knows of easily-retrievable, user-defined abstract knowledge about current and historical states, e.g., it knows of the political doctrine of a Smart Spatial Type, a player's philosophical doctrine, and the collective philosophical doctrine of a community players with respect to current social issues.

philosophical doctrine, sid, smart spatial type, (12 more...)

arXiv.org Artificial Intelligence

2503.05783

Country:

North America > United States > Hawaii (0.04)
Europe > Spain > Galicia > Madrid (0.04)
Europe > Greece (0.04)
(6 more...)

Genre:

Research Report (0.50)
Overview (0.46)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
Information Technology (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

Add feedback

XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs

He, Linyang, Nie, Ercong, Dindar, Sukru Samet, Firoozi, Arsalan, Florea, Adrian, Nguyen, Van, Puffay, Corentin, Shimizu, Riki, Ye, Haotian, Brennan, Jonathan, Schmid, Helmut, Schütze, Hinrich, Mesgarani, Nima

arXiv.org Artificial IntelligenceFeb-26-2025

We introduce XCOMPS in this work, a multilingual conceptual minimal pair dataset covering 17 languages. Using this dataset, we evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing. By comparing base, instruction-tuned, and knowledge-distilled models, we find that: 1) LLMs exhibit weaker conceptual understanding for low-resource languages, and accuracy varies across languages despite being tested on the same concept sets. 2) LLMs excel at distinguishing concept-property pairs that are visibly different but exhibit a marked performance drop when negative pairs share subtle semantic similarities. 3) Instruction tuning improves performance in concept understanding but does not enhance internal competence; knowledge distillation can enhance internal competence in conceptual understanding for low-resource languages with limited gains in explicit task performance. 4) More morphologically complex languages yield lower concept understanding scores and require deeper layers for conceptual reasoning.

arxiv preprint arxiv, computational linguistic, language model, (14 more...)

arXiv.org Artificial Intelligence

2502.19737

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
North America > United States > Michigan (0.04)
(8 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis

Rose, Daniel, Hung, Chia-Chien, Lepri, Marco, Alqassem, Israa, Gashteovski, Kiril, Lawrence, Carolin

arXiv.org Artificial IntelligenceFeb-26-2025

Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.

agent, diagnosis, gtpa, (16 more...)

arXiv.org Artificial Intelligence

2502.19175

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada (0.04)
Asia > China (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area > Dermatology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Multi-LLM Collaborative Search for Complex Problem Solving

Yang, Sen, Li, Yafu, Lam, Wai, Cheng, Yu

arXiv.org Artificial IntelligenceFeb-26-2025

Large language models (LLMs) often struggle with complex reasoning tasks due to their limitations in addressing the vast reasoning space and inherent ambiguities of natural language. We propose the Mixture-of-Search-Agents (MoSA) paradigm, a novel approach leveraging the collective expertise of multiple LLMs to enhance search-based reasoning. MoSA integrates diverse reasoning pathways by combining independent exploration with iterative refinement among LLMs, mitigating the limitations of single-model approaches. Using Monte Carlo Tree Search (MCTS) as a backbone, MoSA enables multiple agents to propose and aggregate reasoning steps, resulting in improved accuracy. Our comprehensive evaluation across four reasoning benchmarks demonstrates MoSA's consistent performance improvements over single-agent and other multi-agent baselines, particularly in complex mathematical and commonsense reasoning tasks.

llm, multi-llm collaborative search, zhang, (15 more...)

arXiv.org Artificial Intelligence

2502.18873

Country:

Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)
(4 more...)

Genre: Research Report > Promising Solution (0.48)

Industry: Leisure & Entertainment (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback

Medical Hallucinations in Foundation Models and Their Impact on Healthcare

Kim, Yubin, Jeong, Hyewon, Chen, Shan, Li, Shuyue Stella, Lu, Mingyu, Alhamoud, Kumail, Mun, Jimin, Grau, Cristina, Jung, Minseok, Gameiro, Rodrigo, Fan, Lizhou, Park, Eugene, Lin, Tristan, Yoon, Joonsik, Yoon, Wonjin, Sap, Maarten, Tsvetkov, Yulia, Liang, Paul, Xu, Xuhai, Liu, Xin, McDuff, Daniel, Lee, Hyeonhoon, Park, Hae Won, Tulebaev, Samir, Breazeal, Cynthia

arXiv.org Artificial IntelligenceFeb-25-2025

Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical hallucination.

data mining, large language model, machine learning, (23 more...)

arXiv.org Artificial Intelligence

2503.05777

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Colorado (0.04)
North America > United States > California (0.04)
(16 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
(13 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
(9 more...)

Add feedback