InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Neural Information Processing Systems
This paper introduces INFANTAGENT-NEXT, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or only provide workflow modularity, our agent integrates tool-based and pure vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks in a step-by-step manner. We demonstrate this generality by evaluating the agent not only on pure vision-based real-world benchmarks (i.e., OSWorld), but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve a 7.27% accuracy gain over Claude-Computer-Use on OSWorld.
Jun-16-2026, 02:42:02 GMT
- Country:
  - North America > United States (0.67)
  - Asia (0.46)
- Genre:
  - Research Report > Experimental Study (1.00)
  - Workflow (0.68)
- Industry:
  - Information Technology (1.00)
  - Banking & Finance (0.68)
- Technology:
  - Information Technology
    - Communications (1.00)
    - Information Management > Search (0.68)
    - Artificial Intelligence
      - Natural Language > Large Language Model (1.00)
      - Vision (0.94)
      - Representation & Reasoning > Agents (0.93)
      - Machine Learning > Neural Networks
        - Deep Learning (0.47)