Collaborating Authors

 Sun, Kexuan


MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

arXiv.org Artificial Intelligence

While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks consider only a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even to count the panels in the puzzle (<45%), hindering their ability for abstract reasoning. We release our entire code and dataset.


The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

arXiv.org Artificial Intelligence

While large language models (LLMs) are still being adapted to new domains and utilized in novel applications, we are experiencing an influx of the new generation of foundation models, namely multi-modal large language models (MLLMs). These models integrate verbal and visual information, opening new possibilities to demonstrate more complex reasoning abilities at the intersection of the two modalities. However, despite the revolutionary prospect of MLLMs, our understanding of their reasoning abilities is limited. In this study, we assess the nonverbal abstract reasoning abilities of open-source and closed-source MLLMs using variations of Raven's Progressive Matrices. Our experiments expose the difficulty of solving such problems while showcasing the immense gap between open-source and closed-source models. We also reveal critical shortcomings with individual visual and textual modules, subjecting the models to low-performance ceilings. Finally, to improve MLLMs' performance, we experiment with various methods, such as Chain-of-Thought prompting, resulting in a significant (up to 100%) boost in performance.


Table-based Fact Verification with Salience-aware Learning

arXiv.org Artificial Intelligence

Tables provide valuable knowledge that can be used to verify textual statements. While a number of works have considered table-based fact verification, direct alignments of tabular data with tokens in textual statements are rarely available. Moreover, training a generalized fact verification model requires abundant labeled training data. In this paper, we propose a novel system to address these problems. Inspired by counterfactual causality, our system identifies token-level salience in the statement with probing-based salience estimation. Salience estimation allows enhanced learning of fact verification from two perspectives. From one perspective, our system conducts masked salient token prediction to enhance the model for alignment and reasoning between the table and the statement. From the other perspective, our system applies salience-aware data augmentation to generate a more diverse set of training instances by replacing non-salient terms. Experimental results on TabFact show that the proposed salience-aware learning techniques yield effective improvements, leading to new state-of-the-art performance on the benchmark. Our code is publicly available at https://github.com/luka-group/Salience-aware-Learning.
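The idea of probing for token-level salience can be illustrated with a small sketch. This is not the authors' implementation: the `score` callable stands in for a trained verification model, and `token_salience`, the `[MASK]` placeholder, and the toy scorer are all hypothetical names chosen for illustration. The sketch masks one token at a time and records how much the score changes.

```python
from typing import Callable, List


def token_salience(
    tokens: List[str],
    score: Callable[[List[str]], float],
    mask: str = "[MASK]",
) -> List[float]:
    """Estimate salience as the score change when each token is masked.

    `score` is a hypothetical stand-in for a verifier that maps a
    tokenized statement to a confidence value.
    """
    base = score(tokens)
    salience = []
    for i in range(len(tokens)):
        probed = tokens[:i] + [mask] + tokens[i + 1:]
        salience.append(abs(base - score(probed)))
    return salience


def toy_score(tokens: List[str]) -> float:
    # Toy verifier (assumption): confident only when the year is present.
    return 1.0 if "2010" in tokens else 0.0


statement = ["released", "in", "2010"]
salience = token_salience(statement, toy_score)
```

Under this toy scorer only the year token changes the score when masked, so it would be the one treated as salient; the remaining tokens would be candidates for replacement during augmentation.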


Retrieving Complex Tables with Multi-Granular Graph Representation Learning

arXiv.org Artificial Intelligence

The task of natural language table retrieval (NLTR) seeks to retrieve semantically relevant tables based on natural language queries. Existing learning systems for this task often treat tables as plain text based on the assumption that tables are structured as dataframes. However, tables can have complex layouts which indicate diverse dependencies between subtable structures, such as nested headers. As a result, queries may refer to different spans of relevant content that is distributed across these structures. Moreover, such systems fail to generalize to novel scenarios beyond those seen in the training set. Prior methods are still distant from a generalizable solution to the NLTR problem, as they fall short in handling complex table layouts or queries over multiple granularities. To address these issues, we propose Graph-based Table Retrieval (GTR), a generalizable NLTR framework with multi-granular graph representation learning. In our framework, a table is first converted into a tabular graph, with cell nodes, row nodes and column nodes to capture content at different granularities. Then the tabular graph is input to a Graph Transformer model that can capture both table cell content and the layout structures. To enhance the robustness and generalizability of the model, we further incorporate a self-supervised pre-training task based on graph-context matching. Experimental results on two benchmarks show that our method leads to significant improvements over the current state-of-the-art systems. Further experiments demonstrate promising performance of our method on cross-dataset generalization, and enhanced capability of handling complex tables and fulfilling diverse query intents. Code and data are available at https://github.com/FeiWang96/GTR.
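As a rough illustration of the tabular graph described above, the following sketch builds a graph with cell, row, and column nodes from a plain rectangular table. The node encoding and the choice to link each cell to its row and column node are assumptions made for this example; the abstract does not specify the edge set, and nested headers are not modeled here.

```python
from typing import Dict, List, Set, Tuple

# (kind, row index, column index); -1 marks an index that does not apply.
Node = Tuple[str, int, int]


def build_tabular_graph(table: List[List[str]]) -> Dict[Node, Set[Node]]:
    """Return an undirected adjacency map with cell, row, and column nodes."""
    graph: Dict[Node, Set[Node]] = {}

    def connect(a: Node, b: Node) -> None:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    for r, row in enumerate(table):
        for c, _ in enumerate(row):
            cell = ("cell", r, c)
            connect(cell, ("row", r, -1))  # cell belongs to its row
            connect(cell, ("col", -1, c))  # cell belongs to its column
    return graph


table = [["city", "population"], ["Oslo", "700000"]]
graph = build_tabular_graph(table)
```

Row and column nodes give a model a place to aggregate content at a coarser granularity than individual cells, which is the motivation the abstract gives for including them.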


Moving Agents in Formation in Congested Environments

AAAI Conferences

In this paper, we formalize and study the Moving Agents in Formation (MAiF) problem, which combines the tasks of finding short collision-free paths for multiple agents and keeping them in close adherence to a desired formation. Previous work includes controller-based algorithms, swarm-based algorithms, and potential-field-based algorithms. They usually focus on only one or the other of these tasks, solve the problem greedily without systematic search, and thus generate costly solutions or even fail to find solutions in congested environments. In this paper, we develop a two-phase search algorithm, called SWARM-MAPF, whose first phase is inspired by swarm-based algorithms (in open regions) and whose second phase is inspired by multi-agent path-finding (MAPF) algorithms (in congested regions). In the first phase, SWARM-MAPF selects a leader among the agents and finds a path for it that is sufficiently far away from the obstacles so that the other agents can preserve the desired formation around it. It also identifies the critical segments of the leader's path where the other agents cannot preserve the desired formation and whose refinement must therefore be delegated to the second phase. In the second phase, SWARM-MAPF refines these segments. Theoretically, we prove that SWARM-MAPF is complete. Empirically, we show that SWARM-MAPF scales well and is able to find close-to-optimal solutions.


Decision Tree Learning-Inspired Dynamic Variable Ordering for the Weighted CSP

AAAI Conferences

The weighted constraint satisfaction problem (WCSP) is a powerful mathematical framework for combinatorial optimization. The branch and bound search paradigm is very successful in solving the WCSP but critically depends on the ordering in which variables are instantiated. In this paper, we introduce a new framework for dynamic variable ordering for solving the WCSP. This framework is inspired by regression decision tree learning. Variables are ordered dynamically based on samples of random assignments of values to variables as well as their corresponding total weights. Within this framework, we propose four variable ordering heuristics (sdr, inv-sdr, rr and inv-rr). We compare them with many other state-of-the-art dynamic variable ordering heuristics, and show that sdr and rr outperform them on many real-world and random benchmark instances.
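To make the sampling idea concrete, here is a minimal sketch of a regression-tree-style score computed from random assignments and their total weights. It is not the paper's sdr or rr heuristic, whose exact definitions are not given in the abstract: the standard-deviation-reduction criterion below is the usual regression tree split score, used here as an assumption, and all function names and the constraint encoding are hypothetical.

```python
import random
import statistics
from typing import Dict, List, Tuple

# A constraint is (scope, table): the table maps value tuples to weights.
Constraint = Tuple[Tuple[str, ...], Dict[Tuple[int, ...], float]]
Sample = Tuple[Dict[str, int], float]


def total_weight(assignment: Dict[str, int], constraints: List[Constraint]) -> float:
    """Sum the weights that a complete assignment incurs over all constraints."""
    return sum(table[tuple(assignment[v] for v in scope)] for scope, table in constraints)


def draw_samples(domains, constraints, n: int, rng: random.Random) -> List[Sample]:
    """Draw n uniformly random assignments together with their total weights."""
    samples = []
    for _ in range(n):
        a = {v: rng.choice(d) for v, d in domains.items()}
        samples.append((a, total_weight(a, constraints)))
    return samples


def sd_reduction(var: str, samples: List[Sample]) -> float:
    """Regression-tree split score: overall spread minus spread within value groups."""
    weights = [w for _, w in samples]
    groups: Dict[int, List[float]] = {}
    for a, w in samples:
        groups.setdefault(a[var], []).append(w)
    within = sum(len(g) / len(weights) * statistics.pstdev(g) for g in groups.values())
    return statistics.pstdev(weights) - within


def pick_variable(domains, constraints, n: int = 200, seed: int = 0) -> str:
    """Choose the variable whose value best explains the sampled total weights."""
    samples = draw_samples(domains, constraints, n, random.Random(seed))
    return max(domains, key=lambda v: sd_reduction(v, samples))


domains = {"x": [0, 1], "y": [0, 1]}
constraints = [(("x",), {(0,): 0.0, (1,): 10.0}), (("y",), {(0,): 0.0, (1,): 1.0})]
```

In this toy instance the value of `x` accounts for most of the variation in total weight, so a score of this kind would branch on `x` first. In a dynamic ordering the samples would be redrawn or filtered after each instantiation, which this sketch omits.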


Top K Hypotheses Selection on a Knowledge Graph

AAAI Conferences

A Knowledge Graph (KG), popularly used in both industry and academia, is an effective representation of knowledge. It consists of a collection of knowledge elements, each of which in turn is extracted from the web or other sources. Information extractors that use natural language processing techniques or other complex algorithms are usually noisy. That is, the vast number of knowledge elements extracted from the web may not only be associated with different confidence values but may also be inconsistent with each other. Many applications such as question answering systems that are built on top of large-scale KGs are required to reason efficiently about these confidence values and inconsistencies. In addition, they are required to incorporate ontological constraints in their reasoning. One way to do this is to extract a subgraph of a KG that is consistent with the ontological constraints and is of maximum total confidence value. Such a subgraph is referred to as the top hypothesis and is combinatorially hard to find. In this paper, we introduce an algorithmic framework for efficiently addressing the combinatorial hardness and selecting the top K hypotheses. Our approach is based on powerful algorithmic techniques recently invented in the context of the Weighted Constraint Satisfaction Problem (WCSP).
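The notion of a top hypothesis can be illustrated with a brute-force sketch on a tiny example. This is deliberately not the WCSP-based approach the abstract refers to: it enumerates every subset, which is exponential and only workable for a handful of elements. Modeling ontological constraints as pairs of mutually exclusive elements, taking total confidence as a sum, and the names used below are all assumptions made for illustration.

```python
import heapq
from itertools import combinations
from typing import Dict, List, Tuple


def top_k_hypotheses(
    confidence: Dict[str, float],
    conflicts: List[Tuple[str, str]],
    k: int,
) -> List[Tuple[float, Tuple[str, ...]]]:
    """Return the k consistent subsets with the highest total confidence."""
    elements = sorted(confidence)
    consistent = []
    for size in range(len(elements) + 1):
        for subset in combinations(elements, size):
            chosen = set(subset)
            # Skip subsets that contain both members of a conflicting pair.
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue
            consistent.append((sum(confidence[e] for e in subset), subset))
    return heapq.nlargest(k, consistent)


confidence = {"a": 0.9, "b": 0.8, "c": 0.3}
conflicts = [("a", "b")]  # hypothetical: "a" and "b" cannot both hold
ranked = top_k_hypotheses(confidence, conflicts, k=2)
```

Here the best hypothesis keeps `a` and `c`, and the runner-up keeps `b` and `c`, because `a` and `b` exclude each other.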