Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights

Andreux, Mathieu, Skuk, Breno Baldas, Benchekroun, Hamza, Biré, Emilien, Bonnet, Antoine, Bordie, Riaz, Bout, Nathan, Brunel, Matthias, Cedoz, Pierre-Louis, Chassang, Antoine, Chen, Mickaël, Constantinou, Alexandra D., d'Andigné, Antoine, de La Jonquière, Hubert, Delfosse, Aurélien, Denoyer, Ludovic, Deprez, Alexis, Derupti, Augustin, Eickenberg, Michael, Federico, Mathïs, Kantor, Charles, Koegler, Xavier, Labbé, Yann, Lee, Matthew C. H., de Kergaradec, Erwan Le Jumeau, Mahla, Amir, Manevich, Avshalom, Maret, Adrien, Masson, Charles, Maurin, Rafaël, Mena, Arturo, Modard, Philippe, Moyal, Axel, Kerbel, Axel Nguyen, Revelle, Julien, Richter, Mats L., Santos, María, Sifre, Laurent, Theillard, Maxime, Thibault, Marc, Thiry, Louis, Tronchon, Léo, Usunier, Nicolas, Wu, Tony

arXiv.org Artificial Intelligence 

Building AI agents requires designing systems capable of acting in and adapting to dynamic digital environments in real time. In this context, Large Language Models (LLMs) have made remarkable progress in reasoning and problem solving, rivaling or even surpassing human experts in domain-specific tasks [12, 32]. However, in their most fundamental form, LLMs are confined to a static, pre-trained world: they cannot act, verify, or access up-to-date information. For instance, they cannot answer questions about current events, book a restaurant table, or avoid hallucination [30, 35]. To circumvent their limitations, research has focused on enhancing LLMs with tool-use capabilities, enabling them to execute code snippets [7, 29], query Application Programming Interfaces (APIs) [18, 31], or retrieve information at scale with multi-step reasoning [33, 38, 24, 26]. These systems, often referred to as agents, extend LLMs into more capable virtual assistants [36]. However, their real-world utility remains bounded by the available predefined tools and the engineering effort required to expand them [13]. Approaching this problem from another angle, computer use agents have recently emerged as a new paradigm in which agents interact with software directly through Graphical User Interfaces (GUIs) [1, 8, 11, 15, 17, 23, 39], i.e., using the same interface humans are presented with. This approach avoids relying on custom integrations or APIs, opening the door to more adaptable general-purpose agents with higher potential and broader real-world utility.