multimodal agent
- Asia > China > Hong Kong (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- (3 more...)
- Information Technology > Software (0.93)
- Law (0.92)
- Information Technology > Services (0.68)
- Information Technology > Software (1.00)
- Information Technology > Information Management (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- (8 more...)
- Law (1.00)
- Information Technology (1.00)
- Leisure & Entertainment > Games > Computer Games (0.46)
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Li, Yanjie, Cao, Yiming, Wang, Dong, Xiao, Bin
Abstract: Multimodal agents built on large vision-language models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that mounts adaptive typographic prompt injection by embedding optimized text into webpage images. Our automatic typographic prompt injection (ATPI) algorithm maximizes prompt reconstruction by substitute captioners while minimizing human detectability via a stealth loss, with a Tree-structured Parzen Estimator guiding black-box optimization over text placement, size, and color. To further enhance attack strength, we develop AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieves successful past examples for continual learning. Effective prompts are abstracted into generalizable strategies and stored in a strategy repository, enabling progressive knowledge accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark across Classifieds, Shopping, and Reddit scenarios show that AgentTypo significantly outperforms the latest image-based attacks such as AgentAttack. On GPT-4o agents, our image-only attack raises the success rate from 23% to 45%, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and Claude 3 Opus. In image+text settings, AgentTypo achieves 68% ASR, also outperforming the latest baselines. Our findings reveal that AgentTypo poses a practical and potent threat to multimodal agents and highlight the urgent need for effective defenses.
As the reasoning capabilities of large vision-language models (LVLMs) [1]-[5] continue to advance, increasingly powerful agents have been constructed on these models [6]-[12]. These multimodal agents incorporate both textual and visual information, such as webpage screenshots, into agent frameworks, significantly enhancing their performance across various tasks and transforming LVLMs from conversational assistants into autonomous production tools. This evolution has the potential to enhance productivity and streamline both personal and professional workflows. However, recent research has highlighted that agents built on LLMs and LVLMs are susceptible to prompt injection attacks, particularly through their interactions with open-world data such as untrusted web pages [13]-[16].
- Information Technology > Security & Privacy (1.00)
- Transportation > Air (0.82)
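The ATPI optimization described above can be sketched as a black-box search over typographic parameters (placement, size, color) against an objective that trades off prompt reconstruction against a stealth penalty. The sketch below substitutes plain random search for the paper's Tree-structured Parzen Estimator to stay self-contained, and `attack_score` is an invented stand-in: a real attack would query captioning models and measure reconstruction of the injected prompt.

```python
import random

def attack_score(x, y, font_size, color):
    """Hypothetical stand-in for AgentTypo's objective: reward prompt
    reconstruction, penalize visual salience (the stealth loss).
    x and y (placement) would feed the captioner query in a real attack."""
    reconstruction = 1.0 - abs(font_size - 14) / 50       # toy: mid-size text is read best
    salience = (font_size / 72) + sum(color) / (3 * 255)  # toy: big, bright text is conspicuous
    return reconstruction - 0.5 * salience

def optimize_injection(trials=200, seed=0):
    """Black-box search over text placement, size, and color.
    (The paper uses a Tree-structured Parzen Estimator; plain random
    search is used here only to keep the sketch dependency-free.)"""
    rng = random.Random(seed)
    best, best_params = float("-inf"), None
    for _ in range(trials):
        params = dict(
            x=rng.randint(0, 1280),        # horizontal placement in the screenshot
            y=rng.randint(0, 720),         # vertical placement
            font_size=rng.randint(8, 48),  # typographic size
            color=tuple(rng.randint(0, 255) for _ in range(3)),  # RGB
        )
        score = attack_score(**params)
        if score > best:
            best, best_params = score, params
    return best, best_params

score, params = optimize_injection()
print(score, params)
```

Swapping the random sampler for a TPE-based one (e.g. via an optimization library) keeps the same query-only interface, which is what makes the attack black-box.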
Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations
Cheng, Pengzhou, Dong, Lingzhong, Wu, Zeng, Wu, Zongru, Tang, Xiangru, Qin, Chengwei, Zhang, Zhuosheng, Liu, Gongshen
Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUI), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks involving 18 multimodal agents, our results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
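The probing idea above needs no access to model internals: compare an agent's answers on original and perturbed inputs and measure how often they match. The sketch below is a minimal, invented illustration of a text-guided perturbation; `memorizing_agent`, `probe_sensitivity`, and `rephrase` are hypothetical names, not the paper's API.

```python
def probe_sensitivity(agent, tasks, perturb):
    """Fraction of tasks on which the agent's answer survives a perturbation.
    For meaning-preserving perturbations, a low rate suggests the agent keys
    on surface form (memorization) rather than the instruction's semantics."""
    unchanged = 0
    for task in tasks:
        if agent(task) == agent(perturb(task)):
            unchanged += 1
    return unchanged / len(tasks)

# Toy agent that keys on a memorized keyword rather than meaning
memorizing_agent = lambda t: "click_wifi" if "Wi-Fi" in t else "unknown"
tasks = ["Turn on Wi-Fi", "Open Wi-Fi settings"]
rephrase = lambda t: t.replace("Wi-Fi", "wireless network")  # text-guided, meaning-preserving

rate = probe_sensitivity(memorizing_agent, tasks, rephrase)
print(rate)  # 0.0 -- the agent breaks under a meaning-preserving rewording
```

Visual-guided and structure-guided probes follow the same pattern, with `perturb` editing the screenshot or the GUI layout instead of the instruction text.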
Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Yan, Junfeng, Wu, Biao, Fang, Meng, Chen, Ling
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
- North America > United States (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
- Automobiles & Trucks (0.68)
- Transportation > Ground > Road (0.46)
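Geo-aware action adjustment of the kind ASURADA performs can be pictured as a gating layer between the agent's proposed GUI action and its execution. The rules and thresholds below are invented for illustration and are not the paper's implementation.

```python
def adjust_action(action: str, region: str, speed_kmh: float) -> str:
    """Gate a proposed in-vehicle GUI action on driving context.
    Thresholds and region rules are illustrative assumptions only."""
    if speed_kmh > 5 and action.startswith("type_"):
        return "defer_until_stopped"   # safety-aware: no free-text entry while moving
    if region == "JP" and action == "set_units":
        return "set_units_metric"      # regional driving norm (illustrative)
    return action                      # otherwise pass the action through unchanged

print(adjust_action("type_address", "US", 60.0))  # defer_until_stopped
```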
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Wu, Zongru, Mao, Rui, Tian, Zhiyuan, Cheng, Pengzhou, Ju, Tianjie, Wu, Zheng, Dong, Lingzhong, Sheng, Haiyue, Zhang, Zhuosheng, Liu, Gongshen
The advent of multimodal agents facilitates effective interaction within graphical user interfaces (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
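The perceive-analyze-act pattern StaR trains for reduces, at its core, to comparing the observed toggle state with the instructed one before acting. A minimal sketch of that decision, with an invented function name and action vocabulary:

```python
def toggle_action(current_state: bool, desired_state: bool) -> str:
    """State-aware toggle handling in the spirit of StaR: perceive the
    current toggle state, compare it with the state the instruction asks
    for, and tap only when they differ. (Interface is illustrative.)"""
    if current_state == desired_state:
        return "no_op"        # already satisfied; tapping would undo it
    return "tap_toggle"       # flip the switch to reach the desired state

print(toggle_action(True, True))   # no_op
print(toggle_action(False, True))  # tap_toggle
```

The failure mode the benchmark exposes is exactly the first branch: agents that always tap will break instructions whose desired state already holds.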
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task.
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence (0.82)