bongard problem
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Pawlonka, Szymon, Małkiński, Mikołaj, Mańdziuk, Jacek
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts fromjust a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, albeit the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
Some things to know about achieving artificial general intelligence
Current and foreseeable GenAI models are not capable of achieving artificial general intelligence because they are burdened with anthropogenic debt. They depend heavily on human input to provide well-structured problems, architecture, and training data. They cast every problem as a language pattern learning problem and are thus not capable of the kind of autonomy needed to achieve artificial general intelligence. Current models succeed at their tasks because people solve most of the problems to which these models are directed, leaving only simple computations for the model to perform, such as gradient descent. Another barrier is the need to recognize that there are multiple kinds of problems, some of which cannot be solved by available computational methods (for example, "insight problems"). Current methods for evaluating models (benchmarks and tests) are not adequate to identify the generality of the solutions, because it is impossible to infer the means by which a problem was solved from the fact of its solution. A test could be passed, for example, by a test-specific or a test-general method. It is a logical fallacy (affirming the consequent) to infer a method of solution from the observation of success.
Reasoning Limitations of Multimodal Large Language Models. A case study of Bongard Problems
Małkiński, Mikołaj, Pawlonka, Szymon, Mańdziuk, Jacek
Abstract visual reasoning (AVR) encompasses a suite of tasks whose solving requires the ability to discover common concepts underlying the set of pictures through an analogy-making process, similarly to human IQ tests. Bongard Problems (BPs), proposed in 1968, constitute a fundamental challenge in this domain mainly due to their requirement to combine visual reasoning and verbal description. This work poses a question whether multimodal large language models (MLLMs) inherently designed to combine vision and language are capable of tackling BPs. To this end, we propose a set of diverse MLLM-suited strategies to tackle BPs and examine four popular proprietary MLLMs: GPT-4o, GPT-4 Turbo, Gemini 1.5 Pro, and Claude 3.5 Sonnet, and four open models: InternVL2-8B, LLaVa-1.6 Mistral-7B, Phi-3.5-Vision, and Pixtral 12B. The above MLLMs are compared on three BP datasets: a set of original BP instances relying on synthetic, geometry-based images and two recent datasets based on real-world images, i.e., Bongard-HOI and Bongard-OpenWorld. The experiments reveal significant limitations of MLLMs in solving BPs. In particular, the models struggle to solve the classical set of synthetic BPs, despite their visual simplicity. Though their performance ameliorates on real-world concepts expressed in Bongard-HOI and Bongard-OpenWorld, the models still have difficulty in utilizing new information to improve their predictions, as well as utilizing a dialog context window effectively. To capture the reasons of performance discrepancy between synthetic and real-world AVR domains, we propose Bongard-RWR, a new BP dataset consisting of real-world images that translates concepts from hand-crafted synthetic BPs to real-world concepts. The MLLMs' results on Bongard-RWR suggest that their poor performance on classical BPs is not due to domain specificity but rather reflects their general AVR limitations.
Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?
Wüst, Antonia, Tobiasch, Tim, Helff, Lukas, Dhami, Devendra S., Rothkopf, Constantin A., Kersting, Kristian
Visual reasoning, the ability to understand, interpret, and reason about the visual world, is a fundamental aspect of human intelligence [27]. It allows us to navigate our environment, interact with objects, and make sense of complex visual scenes. In recent years, the field of artificial intelligence (AI) has advanced rapidly toward replicating aspects of this visual reasoning, with significant focus placed on Vision-Language Models (VLMs) [5, 24, 25]. These models integrate visual and textual information to generate descriptive content, aiming to mimic how humans comprehend and reason about the world. Because of their human-like responses, VLMs often create the illusion of possessing human-like perception and intelligence. However, as recent work shows, VLMs and the Large Language Models (LLM) on which they are based have dramatic shortcomings in the case of reasoning [30] and visual perception [12, 13, 19, 34] or their combination [39, 47, 48]. Bongard problems (BPs), a class of visual puzzles that require the identification of underlying rules based on a limited set of images, provide a unique and challenging benchmark for assessing visual reasoning abilities in AI systems [4]. Conceived by Russian scientist Mikhail Bongard in 1967, these visual puzzles test cognitive abilities in pattern recognition and abstract reasoning, posing a formidable challenge even to advanced AI systems [15].
Neural networks for abstraction and reasoning: Towards broad generalization in machines
Bober-Irizar, Mikel, Banerjee, Soumya
For half a century, artificial intelligence research has attempted to reproduce the human qualities of abstraction and reasoning - creating computer systems that can learn new concepts from a minimal set of examples, in settings where humans find this easy. While specific neural networks are able to solve an impressive range of problems, broad generalisation to situations outside their training data has proved elusive.In this work, we look at several novel approaches for solving the Abstraction & Reasoning Corpus (ARC), a dataset of abstract visual reasoning tasks introduced to test algorithms on broad generalization. Despite three international competitions with $100,000 in prizes, the best algorithms still fail to solve a majority of ARC tasks and rely on complex hand-crafted rules, without using machine learning at all. We revisit whether recent advances in neural networks allow progress on this task. First, we adapt the DreamCoder neurosymbolic reasoning solver to ARC. DreamCoder automatically writes programs in a bespoke domain-specific language to perform reasoning, using a neural network to mimic human intuition. We present the Perceptual Abstraction and Reasoning Language (PeARL) language, which allows DreamCoder to solve ARC tasks, and propose a new recognition model that allows us to significantly improve on the previous best implementation.We also propose a new encoding and augmentation scheme that allows large language models (LLMs) to solve ARC tasks, and find that the largest models can solve some ARC tasks. LLMs are able to solve a different group of problems to state-of-the-art solvers, and provide an interesting way to complement other approaches. We perform an ensemble analysis, combining models to achieve better results than any system alone. Finally, we publish the arckit Python library to make future research on ARC easier.
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
Wu, Rujie, Ma, Xiaojian, Li, Qing, Wang, Wei, Zhang, Zhenliang, Zhu, Song-Chun, Wang, Yizhou
We introduce Bongard-OpenWorld, a new benchmark for evaluating real-world few-shot reasoning for machine vision. It originates from the classical Bongard Problems (BPs): Given two sets of images (positive and negative), the model needs to identify the set that query images belong to by inducing the visual concepts, which is exclusively depicted by images from the positive set. Our benchmark inherits the few-shot concept induction of the original BPs while adding the two novel layers of challenge: 1) open-world free-form concepts, as the visual concepts in Bongard-OpenWorld are unique compositions of terms from an open vocabulary, ranging from object categories to abstract visual attributes and commonsense factual knowledge; 2) real-world images, as opposed to the synthetic diagrams used by many counterparts. In our exploration, Bongard-OpenWorld already imposes a significant challenge to current few-shot reasoning algorithms. We further investigate to which extent the recently introduced Large Language Models (LLMs) and Vision-Language Models (VLMs) can solve our task, by directly probing VLMs, and combining VLMs and LLMs in an interactive reasoning scheme. We even designed a neuro-symbolic reasoning approach that reconciles LLMs & VLMs with logical reasoning to emulate the human problem-solving process for Bongard Problems. However, none of these approaches manage to close the human-machine gap, as the best learner achieves 64% accuracy while human participants easily reach 91%. We hope Bongard-OpenWorld can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
Cross-Image Context Matters for Bongard Problems
Raghuraman, Nikhil, Harley, Adam W., Guibas, Leonidas
Current machine learning methods struggle to solve Bongard problems, which are a type of IQ test that requires deriving an abstract "concept" from a set of positive and negative "support" images, and then classifying whether or not a new query image depicts the key concept. On Bongard-HOI, a benchmark for natural-image Bongard problems, existing methods have only reached 66% accuracy (where chance is 50%). Low accuracy is often attributed to neural nets' lack of ability to find human-like symbolic rules. In this work, we point out that many existing methods are forfeiting accuracy due to a much simpler problem: they do not incorporate information contained in the support set as a whole, and rely instead on information extracted from individual supports. This is a critical issue, because unlike in few-shot learning tasks concerning object classification, the "key concept" in a typical Bongard problem can only be distinguished using multiple positives and multiple negatives. We explore a variety of simple methods to take this cross-image context into account, and demonstrate substantial gains over prior methods, leading to new state-of-the-art performance on Bongard-LOGO (75.3%) and Bongard-HOI (72.45%) and strong performance on the original Bongard problem set (60.84%).
Towards a Solution to Bongard Problems: A Causal Approach
Youssef, Salahedine, Zečević, Matej, Dhami, Devendra Singh, Kersting, Kristian
Even though AI has advanced rapidly in recent years displaying success in solving highly complex problems, the class of Bongard Problems (BPs) yet remain largely unsolved by modern ML techniques. In this paper, we propose a new approach in an attempt to not only solve BPs but also extract meaning out of learned representations. This includes the reformulation of the classical BP into a reinforcement learning (RL) setting which will allow the model to gain access to counterfactuals to guide its decisions but also explain its decisions. Since learning meaningful representations in BPs is an essential sub-problem, we further make use of contrastive learning for the extraction of low level features from pixel data. Several experiments have been conducted for analyzing the general BP-RL setup, feature extraction methods and using the best combination for the feature space analysis and its interpretation.