constrainedness
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Abdin, Marah I, Gunasekar, Suriya, Chandrasekaran, Varun, Li, Jerry, Yuksekgonul, Mert, Peshawaria, Rahee Ghosh, Naik, Ranjita, Nushi, Besmira
We study the ability of state-of-the art models to answer constraint satisfaction queries for information retrieval (e.g., 'a list of ice cream shops in San Diego'). In the past, such queries were considered to be tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which exacerbate as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open source our contributions to foster further research on improving constraint satisfaction abilities of future models.
A Continuum of Generation Tasks for Investigating Length Bias and Degenerate Repetition
Language models suffer from various degenerate behaviors. These differ between tasks: machine translation (MT) exhibits length bias, while tasks like story generation exhibit excessive repetition. Recent work has attributed the difference to task constrainedness, but evidence for this claim has always involved many confounding variables. To study this question directly, we introduce a new experimental framework that allows us to smoothly vary task constrainedness, from MT at one end to fully open-ended generation at the other, while keeping all other aspects fixed. We find that: (1) repetition decreases smoothly with constrainedness, explaining the difference in repetition across tasks; (2) length bias surprisingly also decreases with constrainedness, suggesting some other cause for the difference in length bias; (3) across the board, these problems affect the mode, not the whole distribution; (4) the differences cannot be attributed to a change in the entropy of the distribution, since another method of changing the entropy, label smoothing, does not produce the same effect.
Understanding Boolean Function Learnability on Deep Neural Networks
Tavares, Anderson R., Avelar, Pedro, Flach, João M., Nicolau, Marcio, Lamb, Luis C., Vardi, Moshe
Computational learning theory states that many classes of boolean formulas are learnable in polynomial time. This paper addresses the understudied subject of how, in practice, such formulas can be learned by deep neural networks. Specifically, we analyse boolean formulas associated with the decision version of combinatorial optimisation problems, model sampling benchmarks, and random 3-CNFs with varying degrees of constrainedness. Our extensive experiments indicate that: (i) regardless of the combinatorial optimisation problem, relatively small and shallow neural networks are very good approximators of the associated formulas; (ii) smaller formulas seem harder to learn, possibly due to the fewer positive (satisfying) examples available; and (iii) interestingly, underconstrained 3-CNF formulas are more challenging to learn than overconstrained ones. Source code and relevant datasets are publicly available (https://github.com/machine-reasoning-ufrgs/mlbf).
Fat- and Heavy-Tailed Behavior in Satisficing Planning
Cohen, Eldan (University of Toronto) | Beck, J. Christopher (University of Toronto)
In this work, we study the runtime distribution of satisficing planning in ensembles of random planning problems and in multiple runs of a randomized heuristic search on a single planning instance. Using common heuristic functions (such as FF) and six benchmark problem domains from the IPC, we find a heavy-tailed behavior, similar to that found in CSP and SAT. We investigate two notions of constrainedness, often used in the modeling of planning problems, and show that the heavy-tailed behavior tends to appear in relatively relaxed problems, where the required effort is, on average, low. Finally, we show that as with randomized restarts in CSP and SAT solving, recent search enhancements that incorporate randomness in the search process can help mitigate the effect of the heavy tail.
Phase Transition and Network Structure in Realistic SAT Problems
Kambhampati, Soumya C., Liu, Thomas
A fundamental question in Computer Science is understanding when a specific class of problems go from being computationally easy to hard. Because of its generality and applications, the problem of Boolean Satisfiability (aka SAT) is often used as a vehicle for investigating this question. A signal result from these studies is that the hardness of SAT problems exhibits a dramatic easy-to-hard phase transition with respect to the problem constrainedness. Past studies have however focused mostly on SAT instances generated using uniform random distributions, where all constraints are independently generated, and the problem variables are all considered of equal importance. These assumptions are unfortunately not satisfied by most real problems. Our project aims for a deeper understanding of hardness of SAT problems that arise in practice. We study two key questions: (i) How does easy-to-hard transition change with more realistic distributions that capture neighborhood sensitivity and rich-get-richer aspects of real problems and (ii) Can these changes be explained in terms of the network properties (such as node centrality and small-worldness) of the clausal networks of the SAT problems. Our results, based on extensive empirical studies and network analyses, provide important structural and computational insights into realistic SAT problems. Our extensive empirical studies show that SAT instances from realistic distributions do exhibit phase transition, but the transition occurs sooner (at lower values of constrainedness) than the instances from uniform random distribution. We show that this behavior can be explained in terms of their clausal network properties such as eigenvector centrality and small-worldness (measured indirectly in terms of the clustering coefficients and average node distance).
Improving Local Search for Resource-Constrained Planning
Nakhost, Hootan (University of Alberta) | Hoffmann, Jörg (INRIA) | Müller, Martin (University of Alberta)
A ubiquitous feature of planning problems — problems involving the automatic generation of action sequences for attaining a given goal — is the need to economize limited resources such as fuel or money. While heuristic search, mostly based on standard algorithms such as A*, is currently the superior method for most varieties of planning, its ability to solve critically resource-constrained problems is limited: current planning heuristics are bad at dealing with this kind of structure. To address this, one can try to devise better heuristics. An alternative approach is to change the nature of the search instead. Local search has received some attention in planning, but not with a specific focus on how to deal with limited resources. We herein begin to fill this gap. We highlight the limitations of previous methods, and we devise a new improvement (smart restarts) to the local search method of a previously proposed planner (Arvand). Systematic experiments show how performance depends on problem structure and search parameters. In particular, we show that our new method can outperform previous planners by a large margin.