WebWalker: Benchmarking LLMs in Web Traversal
Wu, Jialong; Yin, Wenbiao; Jiang, Yong; Wang, Zhenglin; Xi, Zekun; Fang, Runnan; Zhang, Linhai; He, Yulan; Zhou, Deyu; Xie, Pengjun; Huang, Fei
arXiv.org Artificial Intelligence
Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker, through horizontal and vertical integration, in real-world scenarios.
Jan-14-2025