WebWalker: Benchmarking LLMs in Web Traversal

Wu, Jialong, Yin, Wenbiao, Jiang, Yong, Wang, Zhenglin, Xi, Zekun, Fang, Runnan, Zhang, Linhai, He, Yulan, Zhou, Deyu, Xie, Pengjun, Huang, Fei

Jan-14-2025–arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address it, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrates the effectiveness of RAG combined with WebWalker, through the horizontal and vertical integration in real-world scenarios.

information, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

Jan-14-2025

arXiv.org PDF

Add feedback

Country:
- North America > United States (0.46)

Genre:
- Research Report > New Finding (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
  - Natural Language > Large Language Model (1.00)