AITopics

Country:

North America > Canada > Ontario (0.05)
South America > Chile (0.04)
North America > United States > Texas (0.04)

Genre: Collection (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Neural Information Processing SystemsFeb-13-2026, 17:37:19 GMT

563991b5c8b45fe75bea42db738223b2-Paper-Datasets_and_Benchmarks_Track.pdf

large language model, machine learning, natural language, (21 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Ontario (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(9 more...)

Genre: Research Report (0.67)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games > Computer Games (0.67)
Information Technology (0.67)
Leisure & Entertainment > Games > Chess (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Neural Information Processing SystemsOct-10-2025, 03:05:37 GMT

Part I Appendix Table of Contents

Table 9) and identifying reflections (Error #20 in Table 13) are also noted.

ext prompt, failure case, image error, (17 more...)

Country:

North America > Canada > Ontario (0.05)
South America > Chile (0.04)
North America > United States > Texas (0.04)

Genre: Collection (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games > Computer Games (0.68)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

Neural Information Processing SystemsOct-10-2025, 03:05:35 GMT

Evaluating Vision-Language Models in the Wild with Human Preferences

Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs.

ext prompt, wang, zhang, (15 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Ontario (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(9 more...)

Genre: Research Report (0.67)

Industry:

Media > Film (1.00)
Leisure & Entertainment > Games > Computer Games (0.67)
Information Technology (0.67)
Leisure & Entertainment > Games > Chess (0.47)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

arXiv.org Artificial IntelligenceJun-10-2025

Contextual Experience Replay for Self-Improvement of Language Agents

Liu, Yitao, Si, Chenglei, Narasimhan, Karthik, Yao, Shunyu

Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also gets a competitive average success rate of 36.7%, relatively improving the success rate of the GPT-4o agent baseline by 51.0%. We also conduct a comprehensive analysis on it to prove its efficiency, validity and understand it better.

large language model, machine learning, trajectory, (20 more...)

2506.06698

Country: Asia (0.28)

Genre:

Workflow (1.00)
Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

arXiv.org Artificial IntelligenceMar-6-2025

SafeArena: Evaluating the Safety of Autonomous Web Agents

Tur, Ada Defne, Meade, Nicholas, Lù, Xing Han, Zambrano, Alejandra, Patel, Arkil, Durmus, Esin, Gella, Spandana, Stańczak, Karolina, Reddy, Siva

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io

afe, agent, harmful task, (15 more...)

2503.04957

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > Quebec > Montreal (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
(6 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Criminal Law (1.00)
Information Technology > Security & Privacy (1.00)
Government > Immigration & Customs (0.93)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceSep-13-2024

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Bonatti, Rogerio, Zhao, Dan, Bonacci, Francesco, Dupont, Dillon, Abdali, Sara, Li, Yinheng, Lu, Yadong, Wagle, Justin, Koishida, Kazuhito, Bucker, Arthur, Jang, Lawrence, Hui, Zack

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena

agent, computer, indow, (14 more...)

2409.08264

Country:

Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > India > Maharashtra > Mumbai (0.04)
North America > United States > Michigan > Jackson County > Jackson (0.04)
(3 more...)

Genre:

Research Report (0.82)
Workflow (0.68)

Industry:

Information Technology > Software (0.67)
Health & Medicine > Therapeutic Area (0.46)
Leisure & Entertainment > Sports > Baseball (0.45)
Health & Medicine > Consumer Health (0.45)

Technology:

Information Technology > Software (1.00)
Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(4 more...)

arXiv.org Artificial IntelligenceAug-1-2024

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Yang, Xuemeng, Wen, Licheng, Ma, Yukai, Mei, Jianbiao, Li, Xin, Wei, Tiantian, Lei, Wenjie, Fu, Daocheng, Cai, Pinlong, Dou, Min, Shi, Botian, He, Liang, Liu, Yong, Qiao, Yu

This paper presented DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating in real scenarios. DriveArena features a flexible, modular architecture, allowing for the seamless interchange of its core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any worldwide street map, and World Dreamer, a high-fidelity conditional generative model with infinite autoregression. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena's simulated environment. The agent perceives its surroundings through images generated by World Dreamer and output trajectories. These trajectories are fed into Traffic Manager, achieving realistic interactions with other vehicles and producing a new scene layout. Finally, the latest scene layout is relayed back into World Dreamer, perpetuating the simulation cycle. This iterative process fosters closed-loop exploration within a highly realistic environment, providing a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena signifies a substantial leap forward in leveraging generative image data for the driving simulation platform, opening insights for closed-loop autonomous driving. Code will be available soon on GitHub: https://github.com/PJLab-ADG/DriveArena

agent, rive, traffic manager, (15 more...)

2408.00415

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Transportation > Ground > Road (1.00)
Information Technology (1.00)
Energy > Renewable > Geothermal > Geothermal Energy Systems and Facilities > Geothermal System for Power Generation > Advanced Geothermal System (AGS) (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
(3 more...)

arXiv.org Artificial IntelligenceJul-18-2024

RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

Han, Rujun, Zhang, Yuhao, Qi, Peng, Xu, Yumo, Wang, Jenyuan, Liu, Lan, Wang, William Yang, Min, Bonan, Castelli, Vittorio

Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.

frqa, information, query, (16 more...)

2407.13998

Country:

North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > United States > New York (0.04)
(5 more...)

Genre:

Research Report > Experimental Study (0.68)
Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceJun-16-2024

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences

Lu, Yujie, Jiang, Dongfu, Chen, Wenhu, Wang, William Yang, Choi, Yejin, Lin, Bill Yuchen

Recent breakthroughs in vision-language models (VLMs) emphasize the necessity of benchmarking human preferences in real-world multimodal interactions. To address this gap, we launched WildVision-Arena (WV-Arena), an online platform that collects human preferences to evaluate VLMs. We curated WV-Bench by selecting 500 high-quality samples from 8,000 user submissions in WV-Arena. WV-Bench uses GPT-4 as the judge to compare each VLM with Claude-3-Sonnet, achieving a Spearman correlation of 0.94 with the WV-Arena Elo. This significantly outperforms other benchmarks like MMVet, MMMU, and MMStar. Our comprehensive analysis of 20K real-world interactions reveals important insights into the failure cases of top-performing VLMs. For example, we find that although GPT-4V surpasses many other models like Reka-Flash, Opus, and Yi-VL-Plus in simple visual recognition and reasoning tasks, it still faces challenges with subtle contextual cues, spatial reasoning, visual imagination, and expert domain knowledge. Additionally, current VLMs exhibit issues with hallucinations and safety when intentionally provoked. We are releasing our chat and feedback data to further advance research in the field of VLMs.

text prompt, wang, zhang, (15 more...)

2406.11069

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > Ontario (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)
(9 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Games (0.94)
Media > Film (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)