BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

arXiv.org Artificial Intelligence

Although the internet has transformed the way we access information, human navigation of the internet to find information is clunky for several reasons: (1) our memory and world knowledge are limited; (2) our browsing abilities are hindered by distraction and fatigue; and (3) human brains can only attend to one thing at a time and cannot be parallelized. Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted. A sufficiently capable machine intelligence should be able to, in principle, retrieve any well-specified piece of information from the open web, even if retrieving it would require browsing thousands of web pages. As AI progresses from chatbots to reasoners and to agents, there has been increased interest in models that can browse the internet beyond simple queries (Google, 2024; OpenAI, 2025b, a; perplexity.AI, 2025; x.AI, 2025). While past benchmarks have measured the ability to retrieve information (Joshi et al., 2017; Yang et al., 2018; Thorne et al., 2018; Dinan et al., 2019; Fan et al., 2019; Mialon et al., 2023), most of these benchmarks focus on retrieving information that can be found easily, and hence have become saturated by recent language models. Here we introduce a new benchmark called BrowseComp, which stands for "Browsing Competition" and comprises 1,266 challenging problems that require browsing a large number of websites to solve. Three example questions are shown below.


DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding

arXiv.org Artificial Intelligence

Social intelligence is essential for understanding and reasoning about human expressions, intents and interactions. One representative benchmark for its study is Social Intelligence Queries (Social-IQ), a dataset of multiple-choice questions on videos of complex social interactions. We define a comprehensive methodology to study the soundness of Social-IQ, as the soundness of such benchmark datasets is crucial to the investigation of the underlying research problem. Our analysis reveals that Social-IQ contains substantial biases, which can be exploited by a moderately strong language model to learn spurious correlations to achieve perfect performance without being given the context or even the question. We introduce DeSIQ, a new challenging dataset, constructed by applying simple perturbations to Social-IQ. Our empirical analysis shows DeSIQ significantly reduces the biases in the original Social-IQ dataset. Furthermore, we examine and shed light on the effect of model size, model style, learning settings, commonsense knowledge, and multi-modality on the new benchmark performance. Our new dataset, observations and findings open up important research questions for the study of social intelligence.


A Challenging Benchmark for Low-Resource Learning

arXiv.org Artificial Intelligence

With promising yet saturated results in high-resource settings, low-resource datasets have gradually become popular benchmarks for evaluating the learning ability of advanced neural networks (e.g., BigBench, SuperGLUE). Some models even surpass humans according to benchmark test results. However, we find that there exists a set of hard examples in low-resource settings that challenge neural networks but are not well evaluated, which leads to over-estimated performance. We first give a theoretical analysis of which factors make low-resource learning difficult. This analysis then motivates us to propose a challenging benchmark, hardBench, to better evaluate learning ability. It covers 11 datasets, including 3 computer vision (CV) datasets and 8 natural language processing (NLP) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, suffer sharp performance drops on our benchmark, demonstrating its effectiveness at exposing the weaknesses of neural networks. On NLP tasks, we surprisingly find that, despite better results on traditional low-resource benchmarks, pre-trained networks do not show performance improvements on our benchmark. These results demonstrate that there is still a large robustness gap between existing models and human-level performance.


WHU-Stereo: A Challenging Benchmark for Stereo Matching of High-Resolution Satellite Images

arXiv.org Artificial Intelligence

Stereo matching of high-resolution satellite images (HRSI) is still a fundamental but challenging task in the field of photogrammetry and remote sensing. Recently, deep learning (DL) methods, especially convolutional neural networks (CNNs), have demonstrated tremendous potential for stereo matching on public benchmark datasets. However, datasets for stereo matching of satellite images are scarce. To facilitate further research, this paper creates and publishes a challenging dataset, termed WHU-Stereo, for training and testing DL stereo matching networks. The dataset is created using airborne LiDAR point clouds and high-resolution stereo imagery taken from the Chinese GaoFen-7 satellite (GF-7). The WHU-Stereo dataset contains more than 1700 epipolar rectified image pairs, which cover six areas in China and include various kinds of landscapes. We have assessed the accuracy of the ground-truth disparity maps and show that our dataset achieves precision comparable to that of existing state-of-the-art stereo matching datasets. To verify its feasibility, the hand-crafted SGM stereo matching algorithm and recent deep learning networks have been tested on the WHU-Stereo dataset. Experimental results show that deep learning networks can be trained well and achieve higher performance than the hand-crafted SGM algorithm, and that the dataset has great potential in remote sensing applications. The WHU-Stereo dataset can serve as a challenging benchmark for stereo matching of high-resolution satellite images and for performance evaluation of deep learning models. Our dataset is available at https://github.com/Sheng029/WHU-Stereo