 contact app


CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

arXiv.org Artificial Intelligence

The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language in GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, their lack of detailed and generalized evaluation methods, and the complexity of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 100 tasks in computer desktop and mobile phone environments. We evaluated four advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 35.26%. All framework code, agent code, and task datasets are publicly available at https://github.com/camel-ai/crab.
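As a rough illustration of what a graph-based, fine-grained evaluation with a completion ratio could look like, here is a minimal Python sketch. It is not Crab's actual API: the function name `completion_ratio`, the checker dictionary, and the dependency format are all hypothetical, and the rule that a sub-goal only counts once its prerequisites are complete is an assumption made for this example.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Set

State = Dict[str, Any]
Checker = Callable[[State], bool]


def completion_ratio(
    checkers: Dict[str, Checker],
    deps: Dict[str, Set[str]],
    state: State,
) -> float:
    """Return the fraction of sub-goal nodes that are complete.

    Assumption (hypothetical, not taken from the paper): a node counts as
    complete only if its own checker passes AND all of its prerequisite
    nodes are already complete.
    """
    if not checkers:
        return 0.0
    # Map every node to its prerequisites; static_order() yields
    # prerequisites before the nodes that depend on them.
    graph = {node: deps.get(node, set()) for node in checkers}
    done: Set[str] = set()
    for node in TopologicalSorter(graph).static_order():
        if graph[node] <= done and checkers[node](state):
            done.add(node)
    return len(done) / len(checkers)


# Hypothetical three-step task: open an app, add a contact, send a message.
checkers = {
    "open_contacts": lambda s: s.get("app") == "contacts",
    "add_entry": lambda s: "Alice" in s.get("contacts", []),
    "send_message": lambda s: bool(s.get("message_sent", False)),
}
deps = {"add_entry": {"open_contacts"}, "send_message": {"add_entry"}}
state = {"app": "contacts", "contacts": ["Alice"], "message_sent": False}

ratio = completion_ratio(checkers, deps, state)  # two of three sub-goals met
```

Scoring partial progress this way separates an agent that completes most of a task from one that completes none of it, which a single pass/fail check cannot do.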


Autonomous Evaluation and Refinement of Digital Agents

arXiv.org Artificial Intelligence

We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4% and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve a 75% relative improvement in a challenging domain transfer scenario.
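The two ideas in this abstract, agreement with an oracle and evaluator-driven guidance at inference time, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: `agreement_rate` and `run_with_guidance` are hypothetical names, the retry loop is a simplified stand-in for whatever guidance scheme the authors use, and the verdicts below are made-up toy data, not results from the paper.

```python
from typing import Callable, Optional, Sequence


def agreement_rate(
    evaluator_verdicts: Sequence[bool],
    oracle_verdicts: Sequence[bool],
) -> float:
    """Fraction of trajectories on which the evaluator matches the oracle."""
    if len(evaluator_verdicts) != len(oracle_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not oracle_verdicts:
        raise ValueError("at least one trajectory is required")
    matches = sum(e == o for e, o in zip(evaluator_verdicts, oracle_verdicts))
    return matches / len(oracle_verdicts)


def run_with_guidance(
    attempt: Callable[[int], str],
    evaluate: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry the agent until the evaluator accepts a trajectory.

    Hypothetical simplification: the agent is a function of the attempt
    index and the evaluator returns a plain success/failure verdict.
    """
    for index in range(max_attempts):
        trajectory = attempt(index)
        if evaluate(trajectory):
            return trajectory
    return None


# Toy data only: the evaluator and oracle disagree on two of four cases.
rate = agreement_rate([True, False, True, True], [True, True, True, False])

# Toy agent that "succeeds" on its second try.
accepted = run_with_guidance(lambda i: f"try-{i}", lambda t: t == "try-1")
```

The retry loop needs no ground-truth labels at inference time, which is consistent with the abstract's claim of improvement without additional supervision. How useful it is depends on how often the evaluator agrees with the oracle.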


Your iPhone's Contacts App Is More Powerful Than You Realize. Here Are 5 Ways to Get the Most Out of It

TIME - Tech

You're not the only one who silently laments spending time searching through the Contacts app on your iPhone or other iOS device, hunting for that one person you barely remember yet need to get in touch with for whatever reason. It only gets worse when you realize their information is either incorrect, outdated, or not where you thought you saved it. Whether you're looking for a co-worker, a client, an acquaintance, or a long-lost friend you bumped into at a party, it's helpful to keep who's who in order in your Contacts app. And you just might find that the Contacts app is far more powerful when you take the time to get the most out of it. Filling out contact information beyond a person's name, email, and phone number might seem like overkill, but doing so can make Siri a more powerful tool when it comes to connecting with people.