Inferring Implicit Goals Across Differing Task Models

arXiv.org Artificial Intelligence

... value-aligned behavior is to not only account for the specified user objectives but also any implicit or unspecified user requirements. The existence of such implicit requirements could be particularly common in settings where the user's understanding of the task model may differ from the agent's estimate of the model. Under this scenario, the user may incorrectly expect some agent behavior to be inevitable or guaranteed. This paper addresses such expectation mismatch in the presence of differing models by capturing the possibility of unspecified ...

This should be all well and good, provided the human bottleneck states are also bottleneck states for the agent. Otherwise, the agent must make an effort to figure out what the user's underlying subgoals may be. To see how such problems may arise, consider an agent tasked with guiding a tourist to a famous art museum. The tourist simply says, "Get me a plan to get to the art museum," unaware of the city's metro system and expecting an above-ground route passing certain landmarks. The agent, however, might plan a route using the metro system. For the agent's metro route, bottlenecks might include entering the metro, making transfers, and exiting at the correct station.
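
The tourist example can be made concrete with a small sketch. It is not the paper's method. It assumes each task model is a directed graph of states, and it treats a bottleneck as a state that every start-to-goal path must visit. The graphs, the state names, and the `bottlenecks` helper are all hypothetical and used only for illustration.

```python
from collections import deque


def reachable(graph, start, goal, blocked=None):
    """Breadth-first search from start to goal that never enters `blocked`."""
    if blocked in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def bottlenecks(graph, start, goal):
    """States other than start and goal that lie on every start-to-goal path."""
    if not reachable(graph, start, goal):
        return set()
    nodes = set(graph) | {n for succs in graph.values() for n in succs}
    return {
        n for n in nodes - {start, goal}
        if not reachable(graph, start, goal, blocked=n)
    }


# Hypothetical user model: the tourist only knows the above-ground route.
user_model = {
    "hotel": ["plaza"],
    "plaza": ["old_bridge"],
    "old_bridge": ["museum"],
}

# Hypothetical agent model: the same streets plus a metro line.
agent_model = {
    "hotel": ["plaza", "metro_entrance"],
    "plaza": ["old_bridge"],
    "old_bridge": ["museum"],
    "metro_entrance": ["transfer_station"],
    "transfer_station": ["museum_station"],
    "museum_station": ["museum"],
}

user_bn = bottlenecks(user_model, "hotel", "museum")
agent_bn = bottlenecks(agent_model, "hotel", "museum")

# States the user expects to pass through but the agent's model does not force.
possibly_implicit = user_bn - agent_bn
```

In this toy setup the user's bottlenecks (the plaza and the old bridge) are not bottlenecks in the agent's model, because the metro offers a path that avoids them. That gap is the kind of mismatch the passage describes: the agent cannot assume the user's expected waypoints will be visited unless it plans for them.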


Instantiation-based Formalization of Logical Reasoning Tasks using Language Models and Logical Solvers

arXiv.org Artificial Intelligence

Robustness of reasoning remains a significant challenge for large language models, and addressing it is essential for the practical applicability of AI-driven reasoning systems. We introduce Semantic Self-Verification (SSV), a novel approach that addresses the key challenge in combining language models with the rigor of logical solvers: to accurately formulate the reasoning problem from natural language to the formal language of the solver. SSV uses a consistency-based approach to produce strong abstract formalizations of problems using concrete instantiations that are generated by the model and verified by the solver. In addition to significantly advancing the overall reasoning accuracy over the state-of-the-art, a key novelty that this approach presents is a feature of verification that has near-perfect precision over a significant coverage of cases, as we demonstrate on open reasoning benchmarks. We propose such *near-certain reasoning* as a new approach to reduce the need for manual verification in many cases, taking us closer to more dependable and autonomous AI reasoning systems.


Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning

arXiv.org Artificial Intelligence

Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at https://github.com/ZyGan1999/Snowball-Errors-and-Probability.


PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

arXiv.org Artificial Intelligence

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.


Reviews: Modeling Conceptual Understanding in Image Reference Games

Neural Information Processing Systems

Summary --- Consider a speaker agent and many listeners where listeners perceive differently (e.g., some know what cat fur looks like and others don't). This paper proposes an image reference game and develops a speaker that performs better at the reference game by modeling listener abilities. For example, one person might be able to visually classify many specific dog breeds whereas another person might not know anything about what dogs look like. The speaker utters image attributes which the listener uses to distinguish between the two images. Reference Game Flow: There are two stages of interaction analogous to meta-learning setups: practice and evaluation.


Reviews: Modeling Conceptual Understanding in Image Reference Games

Neural Information Processing Systems

Reviewers all voted to accept this submission and had their concerns generally addressed by the rebuttal. They were impressed by the clarity of the experimental setting and empirical results.


Reviews: Bayesian Optimization with Unknown Search Space

Neural Information Processing Systems

Applying Bayesian optimization to expensive black-box problems requires specifying the bounds of the search space. However, when tackling a completely new problem, there is no prior knowledge to guarantee that the specified search space contains the global optimum. The paper proposes an approach to deal with this situation. In the approach, the user first specifies an initial search space; the bounds of the search space then expand automatically as the iterations proceed; finally, the algorithm returns a solution achieving ε-accuracy. The key is how to expand the search space.


Reviews: Bayesian Optimization with Unknown Search Space

Neural Information Processing Systems

This paper proposes an algorithm to expand the search space for Bayesian optimization. The reviewers thought the work tackles an important problem and would be of interest to the community. The claims are well supported by empirical evidence and the paper is clearly written. There were concerns about the practicality of the method and that the work is a combination of well-known techniques. Because the paper presents a relatively novel approach and substantiates the claims with strong supporting evidence, it seems to be above the bar of acceptance.


Smart Cubing for Graph Search: A Comparative Study

arXiv.org Artificial Intelligence

Propositional satisfiability (SAT) solvers based on conflict-driven clause learning can solve huge instances with millions of variables and clauses [Fichte et al., 2023a]. However, for hard instances, particularly in combinatorial problems, parallelization becomes necessary. The cube-and-conquer technique has proven highly effective for such problems, most notably in resolving the Pythagorean triples conjecture [Heule et al., 2016]. In cube-and-conquer, a look-ahead solver first partitions the search space into disjoint subproblems via cubes (partial assignments), which are then solved independently by CDCL solvers. This independence enables efficient parallel solving. When encoding combinatorial problems into SAT, particularly those involving graphs, we often encounter highly symmetric search spaces. Many mutually isomorphic graphs satisfy the same constraints, but a solver needs to check only one representative, the canonical element, from each isomorphism class. Standard CDCL solvers cannot leverage these symmetries, and static symmetry breaking methods cannot break all symmetries [Codish et al., 2019]. SAT Modulo Symmetries (SMS) [Kirchweger and Szeider, 2021; Kirchweger and Szeider, 2024] addresses this limitation through dynamic symmetry breaking, using a custom propagator that learns symmetry-breaking predicates during the search.
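
The split-then-solve structure of cube-and-conquer can be sketched in a few lines. This is a toy illustration under stated assumptions, not the solvers discussed above: the split variables are passed in by hand rather than chosen by a look-ahead solver, and a brute-force search stands in for a CDCL solver. All function names here are hypothetical. Clauses use the DIMACS convention, where a positive integer is a variable and a negative integer is its negation.

```python
from itertools import product


def satisfies(clauses, assignment):
    """True if every clause contains a literal made true by the assignment."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def make_cubes(split_vars):
    """Yield every partial assignment over split_vars.

    The cubes are pairwise disjoint and together cover the whole search space.
    """
    for values in product([False, True], repeat=len(split_vars)):
        yield dict(zip(split_vars, values))


def solve_cube(clauses, variables, cube):
    """Stand-in for the conquer step: brute force over the free variables."""
    free = [v for v in variables if v not in cube]
    for values in product([False, True], repeat=len(free)):
        assignment = {**cube, **dict(zip(free, values))}
        if satisfies(clauses, assignment):
            return assignment
    return None


def cube_and_conquer(clauses, variables, split_vars):
    """Return a satisfying assignment, or None if every cube is unsatisfiable."""
    for cube in make_cubes(split_vars):
        model = solve_cube(clauses, variables, cube)
        if model is not None:
            return model
    return None


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
model = cube_and_conquer(clauses, variables=[1, 2, 3], split_vars=[1, 2])

# x1 and not x1: no cube can succeed.
unsat = cube_and_conquer([[1], [-1]], variables=[1], split_vars=[1])
```

Because each call to `solve_cube` depends only on its own cube, the loop in `cube_and_conquer` could be handed to a pool of workers without any coordination between them, which is the independence the passage refers to.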


ACT-JEPA: Joint-Embedding Predictive Architecture Improves Policy Representation Learning

arXiv.org Artificial Intelligence

Learning efficient representations for decision-making policies is a challenge in imitation learning (IL). Current IL methods require expert demonstrations, which are expensive to collect. Consequently, they often have underdeveloped world models. Self-supervised learning (SSL) offers an alternative by allowing models to learn from diverse, unlabeled data, including failures. However, SSL methods often operate in raw input space, making them inefficient. In this work, we propose ACT-JEPA, a novel architecture that integrates IL and SSL to enhance policy representations. We train a policy to predict (1) action sequences and (2) abstract observation sequences. The first objective uses action chunking to improve action prediction and reduce compounding errors. The second objective extends this idea of chunking by predicting abstract observation sequences. We utilize Joint-Embedding Predictive Architecture to predict in abstract representation space, allowing the model to filter out irrelevant details, improve efficiency, and develop a robust world model. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics. Additionally, the model's ability to predict abstract observation sequences results in representations that effectively generalize to action sequence prediction. ACT-JEPA performs on par with established baselines across a range of decision-making tasks.