human performance
Avalon: A Benchmark for RL Generalization Using Procedurally Generated Worlds
Despite impressive successes, deep reinforcement learning (RL) systems still fall short of human performance on generalization to new tasks and environments that differ from their training. As a benchmark tailored for studying RL generalization, we introduce Avalon, a set of tasks in which embodied agents in highly diverse procedural 3D worlds must survive by navigating terrain, hunting or gathering food, and avoiding hazards. Avalon is unique among existing RL benchmarks in that the reward function, world dynamics, and action space are the same for every task, with tasks differentiated solely by altering the environment; its 20 tasks, ranging in complexity from eat and throw to hunt and navigate, each create worlds in which the agent must perform specific skills in order to survive. This setup enables investigations of generalization within tasks, between tasks, and to compositional tasks that require combining skills learned from previous tasks. Avalon includes a highly efficient simulator, a library of baselines, and a benchmark with scoring metrics evaluated against hundreds of hours of human performance, all of which are open-source and publicly available. We find that standard RL baselines make progress on most tasks but are still far from human performance, suggesting Avalon is challenging enough to advance the quest for generalizable RL.
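The abstract mentions scoring metrics evaluated against hundreds of hours of human play. As a rough, hedged illustration of what referencing agent scores against a human baseline can look like, here is a minimal sketch of a human-normalized score; the function name and the example numbers are hypothetical and may not match Avalon's exact metric.

```python
# Hedged sketch: one plausible way to report agent scores against human play,
# in the spirit of a human-referenced benchmark metric. The per-task baseline
# values below are placeholders, not Avalon's published numbers.

def human_normalized_score(agent_score: float,
                           random_score: float,
                           human_score: float) -> float:
    """Map a raw per-task score onto a scale where 0 = random policy
    and 1 = average human performance (values above 1 exceed humans)."""
    denom = human_score - random_score
    if denom == 0:
        return 0.0
    return (agent_score - random_score) / denom

# Example with made-up numbers for a single task:
print(human_normalized_score(agent_score=0.42, random_score=0.05, human_score=0.90))
```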
Evaluating Small Vision-Language Models on Distance-Dependent Traffic Perception
Theodoridis, Nikos, Brophy, Tim, Mohandas, Reenu, Sistu, Ganesh, Collins, Fiachra, Scanlan, Anthony, Eising, Ciaran
Vision-Language Models (VLMs) are becoming increasingly powerful, demonstrating strong performance on a variety of tasks that require both visual and textual understanding. Their strong generalisation abilities make them a promising component for automated driving systems, which must handle unexpected corner cases. However, to be trusted in such safety-critical applications, a model must first possess a reliable perception system. Moreover, since critical objects and agents in traffic scenes are often at a distance, we require systems that are not "shortsighted", i.e., systems with strong perception capabilities at both close (up to 20 meters) and long (30+ meters) range. With this in mind, we introduce Distance-Annotated Traffic Perception Question Answering (DTPQA), the first Visual Question Answering (VQA) benchmark focused solely on perception-based questions in traffic scenes, enriched with distance annotations. By excluding questions that require reasoning, we ensure that model performance reflects perception capabilities alone. Since automated driving hardware has limited processing power and cannot support large VLMs, our study centers on smaller VLMs. More specifically, we evaluate several state-of-the-art (SOTA) small VLMs on DTPQA and show that, despite the simplicity of the questions, these models significantly underperform compared to humans (~60% average accuracy for the best-performing small VLM versus ~85% human performance). However, it is important to note that the human sample size was relatively small, which imposes statistical limitations. We also identify specific perception tasks, such as distinguishing left from right, that remain particularly challenging for these models.
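To make the close-range versus long-range comparison concrete, here is a minimal sketch of distance-bucketed accuracy under an assumed record schema (distance_m, prediction, answer); DTPQA's actual data format and bucket boundaries may differ.

```python
# Hedged sketch: computing per-distance-bucket accuracy for a VQA-style
# benchmark. The record format and the bucket boundaries are assumptions,
# not the benchmark's actual schema.
from collections import defaultdict

def bucket(distance_m: float) -> str:
    if distance_m <= 20:
        return "close (<=20 m)"
    if distance_m >= 30:
        return "long (30+ m)"
    return "mid (20-30 m)"

def accuracy_by_distance(records):
    """records: iterable of dicts with 'distance_m', 'prediction', 'answer'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        b = bucket(r["distance_m"])
        total[b] += 1
        correct[b] += int(r["prediction"] == r["answer"])
    return {b: correct[b] / total[b] for b in total}

# Example with toy records:
print(accuracy_by_distance([
    {"distance_m": 12, "prediction": "yes", "answer": "yes"},
    {"distance_m": 45, "prediction": "left", "answer": "right"},
]))
```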
- Europe > Ireland > Munster > County Limerick > Limerick (0.04)
- Europe > Ireland > Connaught > County Galway > Galway (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- (3 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Transportation > Ground > Road (1.00)
- Information Technology (1.00)
- Automobiles & Trucks (1.00)
HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks
Assadi, Adnan El, Chung, Isaac, Solomatin, Roman, Muennighoff, Niklas, Enevoldsen, Kenneth
Comparing human and model performance offers a valuable perspective for understanding the strengths and limitations of embedding models, highlighting where they succeed and where they fail to capture meaning and nuance. However, such comparisons are rarely made, as human performance on embedding tasks is difficult to measure. To fill this gap, we introduce HUME: Human Evaluation Framework for Text Embeddings. While frameworks like MTEB provide broad model evaluation, they lack reliable estimates of human performance, limiting the interpretability of model scores. We measure human performance across 16 MTEB datasets spanning reranking, classification, clustering, and semantic textual similarity across linguistically diverse high- and low-resource languages. Humans achieve an average performance of 77.6% compared to 80.1% for the best embedding model, though with substantial variation: models reach high performance on some datasets while struggling on notably low-resource languages. Our human annotations also reveal multiple dataset issues. We additionally benchmark nine LLMs as annotators on reranking, classification, and STS tasks, finding that they fall short of human performance (76.1% vs. 81.2%) despite offering scalability advantages. We provide human performance baselines, insights into task difficulty patterns, and an extensible evaluation framework that enables a more meaningful interpretation of results and informs the development of both models and benchmarks. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
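A minimal sketch of the kind of aggregation behind a human-versus-model comparison like the 77.6% vs. 80.1% figures above; the score dictionaries and function name are illustrative, not HUME's actual pipeline (which lives in the linked mteb repository).

```python
# Hedged sketch: averaging per-dataset scores for humans and a model over the
# datasets they share, then reporting the gap. Dataset names and scores below
# are invented for illustration.
from statistics import mean

def performance_gap(human_scores: dict, model_scores: dict) -> dict:
    """Average both score sets over their shared datasets and report the gap."""
    shared = sorted(set(human_scores) & set(model_scores))
    human_avg = mean(human_scores[d] for d in shared)
    model_avg = mean(model_scores[d] for d in shared)
    return {"human": human_avg, "model": model_avg, "gap": model_avg - human_avg}

# Toy example over two datasets:
print(performance_gap({"sts-en": 0.81, "cluster-xx": 0.70},
                      {"sts-en": 0.85, "cluster-xx": 0.72}))
```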
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- South America > Suriname > Marowijne District > Albina (0.04)
- South America > Paraguay > Asunción > Asunción (0.04)
- (14 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- North America > United States > New York (0.76)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Puerto Rico (0.04)
- (3 more...)
- Research Report (0.69)
- Personal (0.46)
- Education (0.68)
- Leisure & Entertainment > Games > Computer Games (0.47)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
- Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Matta, Shiho, Pereira, Lis Kanashiro, Han, Peitao, Cheng, Fei, Kitazawa, Shigeru
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), i.e., whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
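A minimal sketch of how forward/backward judgments could be scored against the 50% chance level the abstract refers to; the clip identifiers, label strings, and the exact-binomial check are assumptions, not the benchmark's published protocol.

```python
# Hedged sketch: scoring a model's forward/backward judgments against chance.
# Clip IDs and the answer format are invented for illustration.
from math import comb

def above_chance_p(correct: int, n: int) -> float:
    """One-sided exact binomial p-value for doing better than guessing (p=0.5)."""
    return sum(comb(n, k) for k in range(correct, n + 1)) / 2 ** n

def score_aot(predictions: dict, labels: dict):
    """Both dicts map clip_id -> 'forward' or 'backward'."""
    shared = set(predictions) & set(labels)
    correct = sum(predictions[c] == labels[c] for c in shared)
    n = len(shared)
    return {"accuracy": correct / n, "p_above_chance": above_chance_p(correct, n)}

# Toy example with three clips:
print(score_aot({"c1": "forward", "c2": "backward", "c3": "forward"},
                {"c1": "forward", "c2": "forward", "c3": "forward"}))
```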
- Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Mexico > Puebla (0.04)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance.

Summary: This very strong paper proposes a rational model for algorithm selection based on problem features and Bayesian regression. The model is shown to be effective computationally and to better predict human performance than comparable models. This paper is the epitome of a strong NIPS paper. The paper is clearly written and addresses an interesting problem. There is both a nice computational result about the algorithm and a cognitive model that is tested with a brief experiment.
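For readers unfamiliar with the setup the review summarizes, here is a hedged sketch of feature-based algorithm selection via Bayesian regression, using scikit-learn's BayesianRidge; the algorithms, features, and data are invented for illustration, and this is not the reviewed paper's actual model.

```python
# Hedged sketch: the general shape of feature-based algorithm selection with
# Bayesian regression. Everything below (algorithm names, features, synthetic
# costs) is invented for illustration only.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 3))             # problem features for 50 instances
runtimes = {                                     # observed cost of each algorithm
    "algo_a": features @ [1.0, 0.2, 0.0] + rng.normal(scale=0.1, size=50),
    "algo_b": features @ [0.1, 1.5, 0.3] + rng.normal(scale=0.1, size=50),
}

# Fit one Bayesian linear regressor per candidate algorithm.
models = {name: BayesianRidge().fit(features, y) for name, y in runtimes.items()}

def select_algorithm(problem_features):
    """Pick the algorithm with the lowest predicted cost for a new problem."""
    x = np.asarray(problem_features).reshape(1, -1)
    preds = {name: float(m.predict(x)[0]) for name, m in models.items()}
    return min(preds, key=preds.get), preds

print(select_algorithm([0.5, -1.0, 0.2]))
```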