BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Wei, Jason, Sun, Zhiqing, Papay, Spencer, McKinney, Scott, Han, Jeffrey, Fulford, Isa, Chung, Hyung Won, Passos, Alex Tachard, Fedus, William, Glaese, Amelia
–arXiv.org Artificial Intelligence
Although the internet has transformed the way we access information, human navigation of the internet to find information is clunky for several reasons: (1) our memory and world knowledge are limited; (2) our browsing abilities are hindered by distraction and fatigue; and (3) human brains can only attend to one thing at a time and cannot be parallelized. Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted. A sufficiently capable machine intelligence should, in principle, be able to retrieve any well-specified piece of information from the open web, even if retrieving it would require browsing thousands of web pages. As AI progresses from chatbots to reasoners and to agents, there has been increased interest in models that can browse the internet beyond simple queries (Google, 2024; OpenAI, 2025b, a; perplexity.AI, 2025; x.AI, 2025). While past benchmarks have measured the ability to retrieve information (Joshi et al., 2017; Yang et al., 2018; Thorne et al., 2018; Dinan et al., 2019; Fan et al., 2019; Mialon et al., 2023), most of these benchmarks focus on retrieving information that can be found easily, and hence have become saturated by recent language models. Here we introduce a new benchmark called BrowseComp, which stands for "Browsing Competition" and comprises 1,266 challenging problems that require browsing a large number of websites to solve. Three example questions are shown below.
Apr-18-2025