InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction

Lei, Bin, Kang, Weitai, Zhang, Zijian, Chen, Winson, Xie, Xi, Zuo, Shan, Xie, Mimi, Payani, Ali, Hong, Mingyi, Yan, Yan, Ding, Caiwen

May-27-2025–arXiv.org Artificial Intelligence

This paper introduces \textsc{InfantAgent-Next}, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. Our generality is demonstrated by our ability to evaluate not only pure vision-based real-world benchmarks (i.e., OSWorld), but also more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve $\mathbf{7.27\%}$ accuracy on OSWorld, higher than Claude-Computer-Use. Codes and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

May-27-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States
  - Texas (0.04)
  - Minnesota (0.04)
  - Connecticut (0.04)
  - Illinois > Cook County
    - Chicago (0.04)
- Asia > Myanmar
  - Tanintharyi Region > Dawei (0.04)

Genre:
- Workflow (0.86)

Industry:
- Information Technology (0.68)

Technology:
- Information Technology
  - Communications (0.93)
  - Information Management > Search (0.68)
  - Artificial Intelligence
    - Natural Language > Large Language Model (1.00)
    - Representation & Reasoning > Agents (0.94)
    - Machine Learning > Neural Networks
      - Deep Learning (0.48)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found