AITopics | challenging benchmark

challenging benchmark

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Wei, Jason, Sun, Zhiqing, Papay, Spencer, McKinney, Scott, Han, Jeffrey, Fulford, Isa, Chung, Hyung Won, Passos, Alex Tachard, Fedus, William, Glaese, Amelia

arXiv.org Artificial Intelligence Apr-18-2025

Although the internet has transformed the way we access information, human navigation of the internet to find information is clunky for several reasons: (1) our memory and world knowledge are limited; (2) our browsing abilities are hindered by distraction and fatigue; and (3) human brains can only attend to one thing at a time and cannot be parallelized. Machine intelligence, on the other hand, has much more extensive recall and can operate tirelessly without getting distracted. A sufficiently capable machine intelligence should, in principle, be able to retrieve any well-specified piece of information from the open web, even if retrieving it would require browsing thousands of web pages. As AI progresses from chatbots to reasoners and to agents, there has been increased interest in models that can browse the internet beyond simple queries (Google, 2024; OpenAI, 2025b, a; perplexity.AI, 2025; x.AI, 2025). While past benchmarks have measured the ability to retrieve information (Joshi et al., 2017; Yang et al., 2018; Thorne et al., 2018; Dinan et al., 2019; Fan et al., 2019; Mialon et al., 2023), most of these benchmarks focus on retrieving information that can be found easily, and hence have become saturated by recent language models. Here we introduce a new benchmark called BrowseComp, which stands for "Browsing Competition" and comprises 1,266 challenging problems that require browsing a large number of websites to solve. Three example questions are shown below.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2504.12516

Genre: Research Report (0.50)

Industry: Leisure & Entertainment (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.55)

DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding

Guo, Xiao-Yu, Li, Yuan-Fang, Haffari, Gholamreza

arXiv.org Artificial Intelligence Oct-24-2023

Social intelligence is essential for understanding and reasoning about human expressions, intents and interactions. One representative benchmark for its study is Social Intelligence Queries (Social-IQ), a dataset of multiple-choice questions on videos of complex social interactions. We define a comprehensive methodology to study the soundness of Social-IQ, as the soundness of such benchmark datasets is crucial to the investigation of the underlying research problem. Our analysis reveals that Social-IQ contains substantial biases, which can be exploited by a moderately strong language model to learn spurious correlations to achieve perfect performance without being given the context or even the question. We introduce DeSIQ, a new challenging dataset, constructed by applying simple perturbations to Social-IQ. Our empirical analysis shows DeSIQ significantly reduces the biases in the original Social-IQ dataset. Furthermore, we examine and shed light on the effect of model size, model style, learning settings, commonsense knowledge, and multi-modality on the new benchmark performance. Our new dataset, observations and findings open up important research questions for the study of social intelligence.

challenging benchmark, desiq, unbiased

arXiv.org Artificial Intelligence

2310.18359

Genre: Research Report (0.89)

Technology: Information Technology > Artificial Intelligence (1.00)

A Challenging Benchmark for Low-Resource Learning

Wang, Yudong, Ma, Chang, Dong, Qingxiu, Kong, Lingpeng, Xu, Jingjing

arXiv.org Artificial Intelligence Mar-9-2023

With promising yet saturated results in high-resource settings, low-resource datasets have gradually become popular benchmarks for evaluating the learning ability of advanced neural networks (e.g., BigBench, superGLUE). Some models even surpass humans according to benchmark test results. However, we find that there exists a set of hard examples in low-resource settings that challenge neural networks but are not well evaluated, which leads to overestimated performance. We first give a theoretical analysis of the factors that make low-resource learning difficult. This motivates us to propose a challenging benchmark, hardBench, to better evaluate learning ability; it covers 11 datasets, including 3 computer vision (CV) datasets and 8 natural language processing (NLP) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, have sharp performance drops on our benchmark, demonstrating its effectiveness in exposing the weaknesses of neural networks. On NLP tasks, we surprisingly find that, despite better results on traditional low-resource benchmarks, pre-trained networks do not show performance improvements on our benchmark. These results demonstrate that there is still a large robustness gap between existing models and human-level performance.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2303.0384

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Education (0.46)
Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

WHU-Stereo: A Challenging Benchmark for Stereo Matching of High-Resolution Satellite Images

Li, Shenhong, He, Sheng, Jiang, San, Jiang, Wanshou, Zhang, Lin

arXiv.org Artificial Intelligence Jun-6-2022

Stereo matching of high-resolution satellite images (HRSI) is still a fundamental but challenging task in the field of photogrammetry and remote sensing. Recently, deep learning (DL) methods, especially convolutional neural networks (CNNs), have demonstrated tremendous potential for stereo matching on public benchmark datasets. However, datasets for stereo matching of satellite images are scarce. To facilitate further research, this paper creates and publishes a challenging dataset, termed WHU-Stereo, for training and testing DL stereo matching networks. The dataset is created using airborne LiDAR point clouds and high-resolution stereo imagery taken from the Chinese GaoFen-7 satellite (GF-7). The WHU-Stereo dataset contains more than 1700 epipolar rectified image pairs, which cover six areas in China and include various kinds of landscapes. We have assessed the accuracy of the ground-truth disparity maps, and the results show that our dataset achieves precision comparable to existing state-of-the-art stereo matching datasets. To verify its feasibility, the hand-crafted SGM stereo matching algorithm and recent deep learning networks have been tested on the WHU-Stereo dataset. Experimental results show that deep learning networks can be well trained and achieve higher performance than the hand-crafted SGM algorithm, and that the dataset has great potential in remote sensing applications. The WHU-Stereo dataset can serve as a challenging benchmark for stereo matching of high-resolution satellite images and for performance evaluation of deep learning models. Our dataset is available at https://github.com/Sheng029/WHU-Stereo

artificial intelligence, high-resolution satellite image, machine learning, (4 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TGRS.2023.3245205

2206.02342

Country: Asia > China (0.24)

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
