Goto

Collaborating Authors

 correct hypothesis


Hypothesis Search: Inductive Reasoning with Language Models

Wang, Ruocheng, Zelikman, Eric, Poesia, Gabriel, Pu, Yewen, Haber, Nick, Goodman, Noah D.

arXiv.org Artificial Intelligence

Inductive reasoning is a core problem-solving capacity: humans can identify underlying principles from a few examples, which can then be robustly generalized to novel scenarios. Recent work has evaluated large language models (LLMs) on inductive reasoning tasks by directly prompting them yielding "in context learning." This can work well for straightforward inductive tasks, but performs very poorly on more complex tasks such as the Abstraction and Reasoning Corpus (ARC). In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction: we prompt the LLM to propose multiple abstract hypotheses about the problem, in natural language, then implement the natural language hypotheses as concrete Python programs. These programs can be directly verified by running on the observed examples and generalized to novel inputs. Because of the prohibitive cost of generation with state-of-the-art LLMs, we consider a middle step to filter the set of hypotheses that will be implemented into programs: we either ask the LLM to summarize into a smaller set of hypotheses, or ask human annotators to select a subset of the hypotheses. We verify our pipeline's effectiveness on the ARC visual inductive reasoning benchmark, its variant 1D-ARC, and string transformation dataset SyGuS. On a random 40-problem subset of ARC, our automated pipeline using LLM summaries achieves 27.5% accuracy, significantly outperforming the direct prompting baseline (accuracy of 12.5%). With the minimal human input of selecting from LLM-generated candidates, the performance is boosted to 37.5%. (And we argue this is a lower bound on the performance of our approach without filtering.) Our ablation studies show that abstract hypothesis generation and concrete program representations are both beneficial for LLMs to perform inductive reasoning tasks.


Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms

Johnson, Nari, Cabrera, Ángel Alexander, Plumb, Gregory, Talwalkar, Ameet

arXiv.org Artificial Intelligence

Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets (i.e. "slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about where an object detection model underperforms. Our results provide positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when designing and evaluating new tools for slice discovery.


Hybrid Belief Pruning with Guarantees for Viewpoint-Dependent Semantic SLAM

Lemberg, Tuvy, Indelman, Vadim

arXiv.org Artificial Intelligence

Semantic simultaneous localization and mapping is a subject of increasing interest in robotics and AI that directly influences the autonomous vehicles industry, the army industries, and more. One of the challenges in this field is to obtain object classification jointly with robot trajectory estimation. Considering view-dependent semantic measurements, there is a coupling between different classes, resulting in a combinatorial number of hypotheses. A common solution is to prune hypotheses that have a sufficiently low probability and to retain only a limited number of hypotheses. However, after pruning and renormalization, the updated probability is overconfident with respect to the original probability. This is especially problematic for systems that require high accuracy. If the prior probability of the classes is independent, the original normalization factor can be computed efficiently without pruning hypotheses. To the best of our knowledge, this is the first work to present these results. If the prior probability of the classes is dependent, we propose a lower bound on the normalization factor that ensures cautious results. The bound is calculated incrementally and with similar efficiency as in the independent case. After pruning and updating based on the bound, this belief is shown empirically to be close to the original belief.


Interactive Model with Structural Loss for Language-based Abductive Reasoning

Li, Linhao, Xu, Ming, Dong, Yongfeng, Li, Xin, Wang, Ao, Hu, Qinghua

arXiv.org Artificial Intelligence

The abductive natural language inference task ($\alpha$NLI) is proposed to infer the most plausible explanation between the cause and the event. In the $\alpha$NLI task, two observations are given, and the most plausible hypothesis is asked to pick out from the candidates. Existing methods model the relation between each candidate hypothesis separately and penalize the inference network uniformly. In this paper, we argue that it is unnecessary to distinguish the reasoning abilities among correct hypotheses; and similarly, all wrong hypotheses contribute the same when explaining the reasons of the observations. Therefore, we propose to group instead of ranking the hypotheses and design a structural loss called ``joint softmax focal loss'' in this paper. Based on the observation that the hypotheses are generally semantically related, we have designed a novel interactive language model aiming at exploiting the rich interaction among competing hypotheses. We name this new model for $\alpha$NLI: Interactive Model with Structural Loss (IMSL). The experimental results show that our IMSL has achieved the highest performance on the RoBERTa-large pretrained model, with ACC and AUC results increased by about 1\% and 5\% respectively.


Top Program Construction and Reduction for polynomial time Meta-Interpretive Learning

Patsantzis, Stassa, Muggleton, Stephen H.

arXiv.org Artificial Intelligence

Meta-Interpretive Learners, like most ILP systems, learn by searching for a correct hypothesis in the hypothesis space, the powerset of all constructible clauses. We show how this exponentially-growing search can be replaced by the construction of a Top program: the set of clauses in all correct hypotheses that is itself a correct hypothesis. We give an algorithm for Top program construction and show that it constructs a correct Top program in polynomial time and from a finite number of examples. We implement our algorithm in Prolog as the basis of a new MIL system, Louise, that constructs a Top program and then reduces it by removing redundant clauses. We compare Louise to the state-of-the-art search-based MIL system Metagol in experiments on grid world navigation, graph connectedness and grammar learning datasets and find that Louise improves on Metagol's predictive accuracy when the hypothesis space and the target theory are both large, or when the hypothesis space does not include a correct hypothesis because of "classification noise" in the form of mislabelled examples. When the hypothesis space or the target theory are small, Louise and Metagol perform equally well.


Delegative Reinforcement Learning: learning to avoid traps with a little help

Kosoy, Vanessa

arXiv.org Machine Learning

Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning.) The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions.