ground-truth
Supplementary Information: Acausalviewofcompositionalzero-shotrecognition
Next, we introduce two additional approximations we use to apply Eq. (S.9). An SCM matches a set of assignments to a causal graph. This implies that the error of the approximation Eq. (S.13) is mainly dominated by the gradients of g at hao, and the variance ofnao. Specifically, we use a positive differentiable measure of the statistical dependence, denoted by I. PIDA measures disentanglement of representations for models that are trained from unsupervised data. As a result, we have the following: Minimizing Eq. (S.21) leads topdo(a,o)(ˆφa0) approaching p(ˆφa0|a), which as we have just shown, leads top(ˆφa0|a) approaching pdo(a)(ˆφa0).
MP-GUI: Modality Perception with MLLMs for GUI Understanding
Wang, Ziwei, Chen, Weizhi, Yang, Leyang, Zhou, Sheng, Zhao, Shengchu, Zhan, Hanbei, Jin, Jiongchao, Li, Liangcheng, Shao, Zirui, Bu, Jiajun
Graphical user interface (GUI) has become integral to modern society, making it crucial to be understood for human-centric systems. However, unlike natural images or documents, GUIs comprise artificially designed graphical elements arranged to convey specific semantic meanings. Current multi-modal large language models (MLLMs) already proficient in processing graphical and textual components suffer from hurdles in GUI understanding due to the lack of explicit spatial structure modeling. Moreover, obtaining high-quality spatial structure data is challenging due to privacy issues and noisy environments. To address these challenges, we present MP-GUI, a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen as GUI-tailored visual clues, with spatial structure refinement strategy and adaptively combined via a fusion gate to meet the specific preferences of different GUI understanding tasks. To cope with the scarcity of training data, we also introduce a pipeline for automatically data collecting. Extensive experiments demonstrate that MP-GUI achieves impressive results on various GUI understanding tasks with limited data.
Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation
Li, Mufei, Miao, Siqi, Li, Pan
Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval effectiveness and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs and leverages LLMs for reasoning and answer prediction. Our approach innovatively integrates a lightweight multilayer perceptron with a parallel triple-scoring mechanism for efficient and flexible subgraph retrieval while encoding directional structural distances to enhance retrieval effectiveness. The size of retrieved subgraphs can be flexibly adjusted to match the query's need and the downstream LLM's capabilities. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller LLMs like Llama3.1-8B-Instruct deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve state-of-the-art accuracy compared with previous baselines -- all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG's strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.
Mitigating Selection Bias with Node Pruning and Auxiliary Options
Choi, Hyeong Kyu, Xu, Weijie, Xue, Chi, Eckman, Stephanie, Reddy, Chandan K.
To mitigate this selection bias problem, previous solutions utilized debiasing methods to adjust the model's input and/or output. Our work, in contrast, investigates the model's internal representation of the selection bias. Specifically, we introduce a novel debiasing approach, Bias Node Pruning (BNP), which eliminates the linear layer parameters that contribute to the bias. Furthermore, we present Auxiliary Option Injection (AOI), a simple yet effective input modification technique for debiasing, which is compatible even with black-box LLMs. To provide a more systematic evaluation of selection bias, we review existing metrics and introduce Choice Kullback-Leibler Divergence (CKLD), which addresses the insensitivity of the commonly used metrics to imbalance in choice labels. Experiments show that our methods are robust and adaptable across various datasets when applied to three LLMs. The advent of large language models (LLMs) has revolutionized artificial intelligence applications, particularly in the domain of natural language processing. These models have demonstrated outstanding performance across a variety of use cases, including chatbots, machine translation, text generation, data annotation, etc. Their ability to answer questions with high precision has opened up new avenues for automated systems. Despite their remarkable abilities, LLMs suffer from the selection bias problem that often occurs in answering multiplechoice questions (MCQs). When selecting the answer for an MCQ, many LLMs prefer the choices in a given position (e.g., the last choice), or with a specific choice symbol (e.g., (A) or (3)) (Zheng et al., 2024; Wei et al., 2024; Pezeshkpour & Hruschka, 2024). Many previous works have attempted to explain this phenomenon and/or propose diverse ways to mitigate selection bias. While there are a few works focused on either modifying the input format (Li et al., 2023b; Robinson et al., 2023) or calibrating the output probabilities (Zheng et al., 2024; Reif Figure 1: We propose BNP and & Schwartz, 2024; Wei et al., 2024), to the best of our knowledge, AOI to reduce selection bias for no embedding or parameter-level investigation has been white-box and black-box models. Because selection bias originates from internal The CKLD metric is also proposed parameter-level computations, it is crucial to explore how the to encourage a more standardized LLM embeddings contribute to the bias in their output responses. Understanding the internal representation of selection bias can help us combat it. By scrutinizing the interaction between the internal representation and the LLM parameters, we develop a novel approach to debias the model. Specifically, we propose Bias Node Pruning (BNP), which eliminates nodes in the final linear layer that contribute to selection bias. By dropping as few as 32 out of 4096 nodes in the final layer, we can significantly reduce selection bias and improve question-answering performance.
Reactzyme: A Benchmark for Enzyme-Reaction Prediction
Hua, Chenqing, Zhong, Bozitao, Luan, Sitao, Hong, Liang, Wolf, Guy, Precup, Doina, Zheng, Shuangjia
Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation.
Image Restoration Using Deep Regulated Convolutional Networks
Liu, Peng, Zhou, Xiaoxiao, Li, Yangjunyi, D, El Basha Mohammad, Fang, Ruogu
While the depth of convolutional neural networks has attracted substantial attention in the deep learning research, the width of these networks has recently received greater interest. The width of networks, defined as the size of the receptive fields and the density of the channels, has demonstrated crucial importance in low-level vision tasks such as image denoising and restoration. However, the limited generalization ability, due to the increased width of networks, creates a bottleneck in designing wider networks. In this paper, we propose the Deep Regulated Convolutional Network (RC-Net), a deep network composed of regulated sub-network blocks cascaded by skip-connections, to overcome this bottleneck. Specifically, the Regulated Convolution block (RC-block), featured by a combination of large and small convolution filters, balances the effectiveness of prominent feature extraction and the generalization ability of the network. RC-Nets have several compelling advantages: they embrace diversified features through large-small filter combinations, alleviate the hazy boundary and blurred details in image denoising and super-resolution problems, and stabilize the learning process. Our proposed RC-Nets outperform state-of-the-art approaches with significant performance gains in various image restoration tasks while demonstrating promising generalization ability. The code is available at https://github.com/cswin/RC-Nets.
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Li, Yunxin, Chen, Xinyu, Hu, Baotian, Wang, Longyue, Shi, Haoyuan, Zhang, Min
Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Min, Juhong, Buch, Shyamal, Nagrani, Arsha, Cho, Minsu, Schmid, Cordelia
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).