multimodal agent
- Asia > China > Hong Kong (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- (3 more...)
- Information Technology > Software (0.93)
- Law (0.92)
- Information Technology > Services (0.68)
- Information Technology > Software (1.00)
- Information Technology > Information Management (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- (8 more...)
- Law (1.00)
- Information Technology (1.00)
- Leisure & Entertainment > Games > Computer Games (0.46)
AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Li, Yanjie, Cao, Yiming, Wang, Dong, Xiao, Bin
Abstract: Multimodal agents built on large vision-language models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that mounts adaptive typographic prompt injection by embedding optimized text into webpage images. Our automatic typographic prompt injection (ATPI) algorithm maximizes prompt reconstruction by substitute captioners while minimizing human detectability via a stealth loss, with a Tree-structured Parzen Estimator guiding black-box optimization over text placement, size, and color. To further enhance attack strength, we develop AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieves successful past examples for continual learning. Effective prompts are abstracted into generalizable strategies and stored in a strategy repository, enabling progressive knowledge accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark across Classifieds, Shopping, and Reddit scenarios show that AgentTypo significantly outperforms the latest image-based attacks such as AgentAttack. On GPT-4o agents, our image-only attack raises the success rate from 23% to 45%, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and Claude 3 Opus. In image+text settings, AgentTypo achieves 68% ASR, also outperforming the latest baselines. Our findings reveal that AgentTypo poses a practical and potent threat to multimodal agents and highlight the urgent need for effective defenses.
As the reasoning capabilities of large vision-language models (LVLMs) [1]-[5] continue to advance, increasingly powerful agents have been constructed on these models [6]-[12]. These multimodal agents incorporate both textual and visual information, such as webpage screenshots, into agent frameworks, significantly enhancing their performance across various tasks and transforming LVLMs from conversational assistants into autonomous production tools. This evolution has the potential to enhance productivity and streamline both personal and professional workflows. However, recent research has highlighted that agents built on LLMs and LVLMs are susceptible to prompt injection attacks, particularly through their interactions with open-world data such as untrusted web pages [13]-[16].
- Information Technology > Security & Privacy (1.00)
- Transportation > Air (0.82)
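The ATPI optimization described above can be sketched as a black-box search over typographic parameters (placement, size, color) against an objective that trades off prompt reconstruction against a stealth penalty. The sketch below substitutes plain random search for the paper's Tree-structured Parzen Estimator to stay self-contained, and `attack_score` is an invented stand-in: a real attack would query captioning models and measure reconstruction of the injected prompt.

```python
import random

def attack_score(x, y, font_size, color):
    """Hypothetical stand-in for AgentTypo's objective: reward prompt
    reconstruction, penalize visual salience (the stealth loss).
    x and y (placement) would feed the captioner query in a real attack."""
    reconstruction = 1.0 - abs(font_size - 14) / 50       # toy: mid-size text is read best
    salience = (font_size / 72) + sum(color) / (3 * 255)  # toy: big, bright text is conspicuous
    return reconstruction - 0.5 * salience

def optimize_injection(trials=200, seed=0):
    """Black-box search over text placement, size, and color.
    (The paper uses a Tree-structured Parzen Estimator; plain random
    search is used here only to keep the sketch dependency-free.)"""
    rng = random.Random(seed)
    best, best_params = float("-inf"), None
    for _ in range(trials):
        params = dict(
            x=rng.randint(0, 1280),        # horizontal placement in the screenshot
            y=rng.randint(0, 720),         # vertical placement
            font_size=rng.randint(8, 48),  # typographic size
            color=tuple(rng.randint(0, 255) for _ in range(3)),  # RGB
        )
        score = attack_score(**params)
        if score > best:
            best, best_params = score, params
    return best, best_params

score, params = optimize_injection()
print(score, params)
```

Swapping the random sampler for a TPE-based one (e.g. via an optimization library) keeps the same query-only interface, which is what makes the attack black-box.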
Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations
Cheng, Pengzhou, Dong, Lingzhong, Wu, Zeng, Wu, Zongru, Tang, Xiangru, Qin, Chengwei, Zhang, Zhuosheng, Liu, Gongshen
Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUI), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. Across five publicly available GUI benchmarks involving 18 multimodal agents, our results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
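The probing idea above needs no access to model internals: compare an agent's answers on original and perturbed inputs and measure how often they match. The sketch below is a minimal, invented illustration of a text-guided perturbation; `memorizing_agent`, `probe_sensitivity`, and `rephrase` are hypothetical names, not the paper's API.

```python
def probe_sensitivity(agent, tasks, perturb):
    """Fraction of tasks on which the agent's answer survives a perturbation.
    For meaning-preserving perturbations, a low rate suggests the agent keys
    on surface form (memorization) rather than the instruction's semantics."""
    unchanged = 0
    for task in tasks:
        if agent(task) == agent(perturb(task)):
            unchanged += 1
    return unchanged / len(tasks)

# Toy agent that keys on a memorized keyword rather than meaning
memorizing_agent = lambda t: "click_wifi" if "Wi-Fi" in t else "unknown"
tasks = ["Turn on Wi-Fi", "Open Wi-Fi settings"]
rephrase = lambda t: t.replace("Wi-Fi", "wireless network")  # text-guided, meaning-preserving

rate = probe_sensitivity(memorizing_agent, tasks, rephrase)
print(rate)  # 0.0 -- the agent breaks under a meaning-preserving rewording
```

Visual-guided and structure-guided probes follow the same pattern, with `perturb` editing the screenshot or the GUI layout instead of the instruction text.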
Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Yan, Junfeng, Wu, Biao, Fang, Meng, Chen, Ling
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
- North America > United States (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (4 more...)
- Automobiles & Trucks (0.68)
- Transportation > Ground > Road (0.46)
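Geo-aware action adjustment of the kind ASURADA performs can be pictured as a gating layer between the agent's proposed GUI action and its execution. The rules and thresholds below are invented for illustration and are not the paper's implementation.

```python
def adjust_action(action: str, region: str, speed_kmh: float) -> str:
    """Gate a proposed in-vehicle GUI action on driving context.
    Thresholds and region rules are illustrative assumptions only."""
    if speed_kmh > 5 and action.startswith("type_"):
        return "defer_until_stopped"   # safety-aware: no free-text entry while moving
    if region == "JP" and action == "set_units":
        return "set_units_metric"      # regional driving norm (illustrative)
    return action                      # otherwise pass the action through unchanged

print(adjust_action("type_address", "US", 60.0))  # defer_until_stopped
```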
See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Wu, Zongru, Mao, Rui, Tian, Zhiyuan, Cheng, Pengzhou, Ju, Tianjie, Wu, Zheng, Dong, Lingzhong, Sheng, Haiyue, Zhang, Zhuosheng, Liu, Gongshen
The advent of multimodal agents facilitates effective interaction within graphical user interfaces (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
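The perceive-analyze-act pattern StaR trains for reduces, at its core, to comparing the observed toggle state with the instructed one before acting. A minimal sketch of that decision, with an invented function name and action vocabulary:

```python
def toggle_action(current_state: bool, desired_state: bool) -> str:
    """State-aware toggle handling in the spirit of StaR: perceive the
    current toggle state, compare it with the state the instruction asks
    for, and tap only when they differ. (Interface is illustrative.)"""
    if current_state == desired_state:
        return "no_op"        # already satisfied; tapping would undo it
    return "tap_toggle"       # flip the switch to reach the desired state

print(toggle_action(True, True))   # no_op
print(toggle_action(False, True))  # tap_toggle
```

The failure mode the benchmark exposes is exactly the first branch: agents that always tap will break instructions whose desired state already holds.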
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task.
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence (0.82)