LongReasonArena: A Long Reasoning Benchmark for Large Language Models
Ding, Jiayu, Ma, Shuming, Cui, Lei, Zheng, Nanning, Wei, Furu
Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark specifically designed to assess the long reasoning capabilities of LLMs. Our tasks require models to solve problems by executing multi-step algorithms that reflect key aspects of long reasoning, such as retrieval and backtracking. By controlling the inputs, the required reasoning length can be arbitrarily scaled, reaching up to 1 million tokens of reasoning for the most challenging tasks. Extensive evaluation results demonstrate that LongReasonArena presents a significant challenge for both open-source and proprietary LLMs. For instance, DeepSeek-R1 achieves only 7.5% accuracy on our task. Further analysis also reveals that accuracy exhibits a linear decline with respect to the logarithm of the expected number of reasoning steps. Our code and data are available at https://github.com/LongReasonArena/LongReasonArena.
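The reported trend (accuracy falling linearly in the logarithm of the expected number of reasoning steps) can be sketched as below. The coefficients are illustrative placeholders, not the paper's fitted values:

```python
import math

def predicted_accuracy(steps, a=1.0, b=0.15):
    """Toy model of the reported trend: accuracy ~ a - b * log(steps).

    a and b are made-up coefficients for illustration; the clamp keeps the
    prediction in the valid [0, 1] accuracy range.
    """
    return max(0.0, min(1.0, a - b * math.log(steps)))

acc_short = predicted_accuracy(10)     # fewer reasoning steps -> higher accuracy
acc_long = predicted_accuracy(10_000)  # more reasoning steps -> lower accuracy
```

Under such a fit, doubling the number of reasoning steps costs a constant amount of accuracy, which is why scaling tasks to very long reasoning chains is so punishing.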
Contextual Experience Replay for Self-Improvement of Language Agents
Liu, Yitao, Si, Chenglei, Narasimhan, Karthik, Yao, Shunyu
Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also achieves a competitive average success rate of 36.7%, relatively improving the success rate of the GPT-4o agent baseline by 51.0%. We also conduct a comprehensive analysis of CER to demonstrate its efficiency and validity and to better understand its behavior.
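The dynamic memory buffer described above can be sketched as follows. This is a hypothetical illustration, not CER's actual API; the class and method names (`ExperienceBuffer`, `add`, `retrieve`) and the exact-match retrieval are stand-ins for the paper's synthesis and retrieval machinery:

```python
class ExperienceBuffer:
    """Accumulates distilled experiences and retrieves relevant ones per task."""

    def __init__(self, capacity=50):
        self.capacity = capacity
        self.entries = []  # list of (environment, insight) pairs

    def add(self, environment, insight):
        # Store a short, reusable insight synthesized from a past trajectory.
        self.entries.append((environment, insight))
        self.entries = self.entries[-self.capacity:]  # keep the buffer bounded

    def retrieve(self, environment, k=3):
        # Naive relevance: same environment, most recent first. A real system
        # would score relevance (e.g. with embeddings) rather than exact-match.
        matches = [ins for env, ins in reversed(self.entries) if env == environment]
        return matches[:k]

buffer = ExperienceBuffer()
buffer.add("webarena", "Search results paginate; check page 2 before giving up.")
buffer.add("webarena", "The admin panel requires login before navigation.")
context = buffer.retrieve("webarena", k=2)  # prepended to the agent's prompt
```

The retrieved insights would then be injected into the agent's context window for the new task, which is what makes the approach training-free.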
SafeArena: Evaluating the Safety of Autonomous Web Agents
Tur, Ada Defne, Meade, Nicholas, Lù, Xing Han, Zambrano, Alejandra, Patel, Arkil, Durmus, Esin, Gella, Spandana, Stańczak, Karolina, Reddy, Siva
LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Bonatti, Rogerio, Zhao, Dan, Bonacci, Francesco, Dupont, Dillon, Abdali, Sara, Li, Yinheng, Lu, Yadong, Wagle, Justin, Koishida, Kazuhito, Bucker, Arthur, Jang, Lawrence, Hui, Zack
Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS), where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena. Webpage: https://microsoft.github.io/WindowsAgentArena Code: https://github.com/microsoft/WindowsAgentArena
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Yang, Xuemeng, Wen, Licheng, Ma, Yukai, Mei, Jianbiao, Li, Xin, Wei, Tiantian, Lei, Wenjie, Fu, Daocheng, Cai, Pinlong, Dou, Min, Shi, Botian, He, Liang, Liu, Yong, Qiao, Yu
This paper presents DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating in real scenarios. DriveArena features a flexible, modular architecture, allowing for the seamless interchange of its core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any worldwide street map, and World Dreamer, a high-fidelity conditional generative model with infinite autoregression. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena's simulated environment. The agent perceives its surroundings through images generated by World Dreamer and outputs trajectories. These trajectories are fed into Traffic Manager, achieving realistic interactions with other vehicles and producing a new scene layout. Finally, the latest scene layout is relayed back into World Dreamer, perpetuating the simulation cycle. This iterative process fosters closed-loop exploration within a highly realistic environment, providing a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena signifies a substantial leap forward in leveraging generative image data for driving simulation platforms, offering insights into closed-loop autonomous driving. Code will be available soon on GitHub: https://github.com/PJLab-ADG/DriveArena
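The closed loop described above (World Dreamer renders a layout into an image, the agent produces a trajectory, Traffic Manager updates the layout) can be sketched as a minimal control loop. The function names are hypothetical stand-ins for DriveArena's actual components:

```python
def run_closed_loop(agent_policy, world_dreamer_render, traffic_manager_step,
                    initial_layout, steps=100):
    """One closed-loop rollout: layout -> image -> trajectory -> new layout."""
    layout = initial_layout
    for _ in range(steps):
        image = world_dreamer_render(layout)               # World Dreamer output
        trajectory = agent_policy(image)                   # driving agent decision
        layout = traffic_manager_step(layout, trajectory)  # Traffic Manager update
    return layout

# Toy stubs just to exercise the loop shape; the real components are a
# generative image model, a driving agent, and a traffic simulator.
final = run_closed_loop(
    agent_policy=lambda image: image + 1,
    world_dreamer_render=lambda layout: layout,
    traffic_manager_step=lambda layout, traj: traj,
    initial_layout=0,
    steps=5,
)
```

The key property is that each component only sees the previous component's output, so any agent that consumes images and emits trajectories can be dropped into the loop unchanged.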
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Han, Rujun, Zhang, Yuhao, Qi, Peng, Xu, Yumo, Wang, Jenyuan, Liu, Lan, Wang, William Yang, Min, Bonan, Castelli, Vittorio
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization. To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA's answers using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly correlated. Moreover, only 41.3% of the most competitive LLM's answers are preferred to LFRQA's answers, demonstrating RAG-QA Arena as a challenging evaluation platform for future research.
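The pairwise evaluation described above (an LLM judge comparing a model's answer against the human-written LFRQA answer for each query) reduces to a win-rate computation. This is a hedged sketch; `judge` is a hypothetical callable standing in for the paper's LLM evaluators:

```python
def win_rate(model_answers, reference_answers, judge):
    """Fraction of queries where the judge prefers the model's answer
    over the human-written reference answer."""
    wins = sum(
        1 for model, ref in zip(model_answers, reference_answers)
        if judge(model, ref)
    )
    return wins / len(model_answers)

# Toy judge that prefers the longer answer, purely to exercise the function;
# the real judge is an LLM scoring answer quality.
toy_judge = lambda model, ref: len(model) > len(ref)
rate = win_rate(
    ["short", "a much longer answer"],
    ["medium one", "tiny"],
    toy_judge,
)
# rate == 0.5
```

The paper's headline figure (41.3% of the best model's answers preferred to LFRQA's) is exactly such a win rate, with LLM evaluators in place of the toy judge.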