Autonomous Evaluation and Refinement of Digital Agents

Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, Suhr, Alane

Apr-10-2024–arXiv.org Artificial Intelligence

We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models on several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.
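The abstract reports two kinds of numbers: evaluator agreement with an oracle metric, and relative improvement in agent performance. The sketch below shows one common way such quantities are computed. It is an illustration only: the function names `agreement_rate` and `relative_improvement` are hypothetical, the inputs are toy values rather than results from the paper, and the paper's exact definitions may differ.

```python
from typing import Sequence


def agreement_rate(evaluator: Sequence[bool], oracle: Sequence[bool]) -> float:
    """Fraction of episodes where the evaluator's success/failure
    judgment matches the oracle's judgment (assumed definition)."""
    if len(evaluator) != len(oracle):
        raise ValueError("evaluator and oracle must judge the same episodes")
    if len(oracle) == 0:
        raise ValueError("at least one episode is required")
    matches = sum(e == o for e, o in zip(evaluator, oracle))
    return matches / len(oracle)


def relative_improvement(baseline: float, improved: float) -> float:
    """Improvement expressed as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Toy judgments for four episodes (not data from the paper).
    evaluator_labels = [True, False, True, True]
    oracle_labels = [True, True, True, False]
    print(f"agreement: {agreement_rate(evaluator_labels, oracle_labels):.2f}")

    # Toy success rates (not data from the paper).
    print(f"relative improvement: {relative_improvement(0.20, 0.35):.2f}")
```

Under this reading, a relative improvement is measured against the baseline score, so it is distinct from an absolute gain in percentage points.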

large language model, machine learning, natural language

Country:
- Asia (1.00)
- North America > United States (0.93)

Genre:
- Research Report > New Finding (0.67)

Industry:
- Consumer Products & Services > Restaurants (1.00)
- Information Technology > Services (0.93)
- Retail (1.00)

Technology:
- Information Technology
  - Artificial Intelligence
    - Machine Learning > Neural Networks
      - Deep Learning (0.69)
    - Natural Language
      - Chatbot (0.69)
      - Large Language Model (1.00)
    - Representation & Reasoning > Agents (0.93)
  - Communications
    - Mobile (1.00)
    - Social Media (1.00)
  - Information Management > Search (0.93)
