Goto

Collaborating Authors

 text information




RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios

Zhao, Fei, Lu, Chengqiang, Shen, Yufan, Wang, Qimeng, Qian, Yicheng, Zhang, Haoxin, Gao, Yan, Wu, Yi, Hu, Yao, Wu, Zhen, Xing, Shangyu, Dai, Xinyu

arXiv.org Artificial Intelligence

While various multimodal multi-image evaluation datasets have been emerged, but these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8\% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.


Graph-MLLM: Harnessing Multimodal Large Language Models for Multimodal Graph Learning

Liu, Jiajin, Fan, Dongzhe, Shen, Jiacheng, Ji, Chuanhao, Zha, Daochen, Tan, Qiaoyu

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in representing and understanding diverse modalities. However, they typically focus on modality alignment in a pairwise manner while overlooking structural relationships across data points. Integrating multimodality with structured graph information (i.e., multimodal graphs, MMGs) is essential for real-world applications such as social networks, healthcare, and recommendation systems. Existing MMG learning methods fall into three paradigms based on how they leverage MLLMs: Encoder, Aligner, and Predictor. MLLM-as-Encoder focuses on enhancing graph neural networks (GNNs) via multimodal feature fusion; MLLM-as-Aligner aligns multimodal attributes in language or hidden space to enable LLM-based graph reasoning; MLLM-as-Predictor treats MLLMs as standalone reasoners with in-context learning or fine-tuning. Despite their advances, the MMG field lacks a unified benchmark to fairly evaluate across these approaches, making it unclear what progress has been made. To bridge this gap, we present Graph-MLLM, a comprehensive benchmark for multimodal graph learning by systematically evaluating these three paradigms across six datasets with different domains. Through extensive experiments, we observe that jointly considering the visual and textual attributes of the nodes benefits graph learning, even when using pre-trained text-to-image alignment models (e.g., CLIP) as encoders. We also find that converting visual attributes into textual descriptions further improves performance compared to directly using visual inputs. Moreover, we observe that fine-tuning MLLMs on specific MMGs can achieve state-of-the-art results in most scenarios, even without explicit graph structure information. We hope that our open-sourced library will facilitate rapid, equitable evaluation and inspire further innovative research in this field.


PASemiQA: Plan-Assisted Agent for Question Answering on Semi-Structured Data with Text and Relational Information

Yang, Hansi, Zhang, Qi, Jiang, Wei, Li, Jianguo

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown impressive abilities in answering questions across various domains, but they often encounter hallucination issues on questions that require professional and up-to-date knowledge. To address this limitation, retrieval-augmented generation (RAG) techniques have been proposed, which retrieve relevant information from external sources to inform their responses. However, existing RAG methods typically focus on a single type of external data, such as vectorized text database or knowledge graphs, and cannot well handle real-world questions on semi-structured data containing both text and relational information. To bridge this gap, we introduce PASemiQA, a novel approach that jointly leverages text and relational information in semi-structured data to answer questions. PASemiQA first generates a plan to identify relevant text and relational information to answer the question in semi-structured data, and then uses an LLM agent to traverse the semi-structured data and extract necessary information. Our empirical results demonstrate the effectiveness of PASemiQA across different semi-structured datasets from various domains, showcasing its potential to improve the accuracy and reliability of question answering systems on semi-structured data.


A Thorough Investigation into the Application of Deep CNN for Enhancing Natural Language Processing Capabilities

Weng, Chang, Rood, Scott, Ramezani, Mehdi Ali, Aslani, Amir, Zarrab, Reza, Zwuo, Wang, Salimans, Sanjeev, Satheesh, Tim

arXiv.org Artificial Intelligence

Natural Language Processing (NLP) is widely used in fields like machine translation and sentiment analysis. However, traditional NLP models struggle with accuracy and efficiency. This paper introduces Deep Convolutional Neural Networks (DCNN) into NLP to address these issues. By integrating DCNN, machine learning (ML) algorithms, and generative adversarial networks (GAN), the study improves language understanding, reduces ambiguity, and enhances task performance. The high-performance NLP model shows a 10% improvement in segmentation accuracy and a 4% increase in recall rate compared to traditional models. This integrated approach excels in tasks such as word segmentation, part-of-speech tagging, machine translation, and text classification, offering better recognition accuracy and processing efficiency.


FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks

Wang, Jiongxiao, Wu, Fangzhou, Li, Wendi, Pan, Jinsheng, Suh, Edward, Mao, Z. Morley, Chen, Muhao, Xiao, Chaowei

arXiv.org Artificial Intelligence

Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. However, integrating external information into LLM-integrated applications raises significant security concerns. Among these, prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. While both training-time and test-time defense methods have been developed to mitigate such attacks, the unaffordable training costs associated with training-time methods and the limited effectiveness of existing test-time methods make them impractical. This paper introduces a novel test-time defense strategy, named Formatting AuThentication with Hash-based tags (FATH). Unlike existing approaches that prevent LLMs from answering additional instructions in external text, our method implements an authentication system, requiring LLMs to answer all received instructions with a security policy and selectively filter out responses to user instructions as the final output. To achieve this, we utilize hash-based authentication tags to label each response, facilitating accurate identification of responses according to the user's instructions and improving the robustness against adaptive attacks. Comprehensive experiments demonstrate that our defense method can effectively defend against indirect prompt injection attacks, achieving state-of-the-art performance under Llama3 and GPT3.5 models across various attack methods. Our code is released at: https://github.com/Jayfeather1024/FATH


Do We Trust What They Say or What They Do? A Multimodal User Embedding Provides Personalized Explanations

Ren, Zhicheng, Xiao, Zhiping, Sun, Yizhou

arXiv.org Artificial Intelligence

With the rapid development of social media, the importance of analyzing social network user data has also been put on the agenda. User representation learning in social media is a critical area of research, based on which we can conduct personalized content delivery, or detect malicious actors. Being more complicated than many other types of data, social network user data has inherent multimodal nature. Various multimodal approaches have been proposed to harness both text (i.e. post content) and relation (i.e. inter-user interaction) information to learn user embeddings of higher quality. The advent of Graph Neural Network models enables more end-to-end integration of user text embeddings and user interaction graphs in social networks. However, most of those approaches do not adequately elucidate which aspects of the data - text or graph structure information - are more helpful for predicting each specific user under a particular task, putting some burden on personalized downstream analysis and untrustworthy information filtering. We propose a simple yet effective framework called Contribution-Aware Multimodal User Embedding (CAMUE) for social networks. We have demonstrated with empirical evidence, that our approach can provide personalized explainable predictions, automatically mitigating the impact of unreliable information. We also conducted case studies to show how reasonable our results are. We observe that for most users, graph structure information is more trustworthy than text information, but there are some reasonable cases where text helps more. Our work paves the way for more explainable, reliable, and effective social media user embedding which allows for better personalized content delivery.


Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery

Wang, Enguang, Peng, Zhimao, Xie, Zhengyuan, Yang, Fei, Liu, Xialei, Cheng, Ming-Ming

arXiv.org Artificial Intelligence

Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes, leveraging the class concepts learned from labeled samples. Current GCD methods only use a single visual modality of information, resulting in poor classification of visually similar classes. As a different modality, text information can provide complementary discriminative information, which motivates us to introduce it into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of the CLIP's text encoder to generate pseudo text embeddings. Besides, we employ a dual-branch framework, through the joint learning and instance consistency of different modality branches, visual and semantic information mutually enhance each other, promoting the interaction and fusion of visual and text knowledge. Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks, achieving new state-of-the-art. The code will be released at https://github.com/enguangW/GET .


Cradle: Empowering Foundation Agents Towards General Computer Control

Tan, Weihao, Zhang, Wentao, Xu, Xinrun, Xia, Haochong, Ding, Ziluo, Li, Boyu, Zhou, Bohan, Yue, Junpeng, Jiang, Jiechuan, Li, Yewen, An, Ruyi, Qin, Molei, Zong, Chuqiao, Zheng, Longtao, Wu, Yujie, Chai, Xiaoqiang, Bi, Yifei, Xie, Tianbao, Gu, Pengjie, Li, Xiyun, Zhang, Ceyao, Tian, Long, Wang, Chaojie, Wang, Xinrun, Karlsson, Börje F., An, Bo, Yan, Shuicheng, Lu, Zongqing

arXiv.org Artificial Intelligence

Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any built-in APIs. Experimental results show that Cradle exhibits remarkable generalizability and impressive performance across four previously unexplored commercial video games, five software applications, and a comprehensive benchmark, OSWorld. Cradle is the first to enable foundation agents to follow the main storyline and complete 40-minute-long real missions in the complex AAA game Red Dead Redemption 2 (RDR2). Cradle can also create a city of a thousand people in Cities: Skylines, farm and harvest parsnips in Stardew Valley, and trade and bargain with a maximal weekly total profit of 87% in Dealer's Life 2. Cradle can not only operate daily software, like Chrome, Outlook, and Feishu, but also edit images and videos using Meitu and CapCut. Cradle greatly extends the reach of foundation agents by enabling the easy conversion of any software, especially complex games, into benchmarks to evaluate agents' various abilities and facilitate further data collection, thus paving the way for generalist agents.